Scouts who work for the club use data captured about opposing clubs to build a more thorough analysis of those opponents and gain an on-field competitive advantage. Those same scouts may also take part in a completely separate data collection effort that has nothing to do with the professional club's on-field activity. Many scouts are dedicated to gathering and analyzing as much data as possible on young amateur players to help the big club make investment decisions about those non-affiliated players. The persona of this data requires near-real-time availability, but the analysis it supports is mostly strategic. On occasion, however, the same data demands ultra-fast analysis. In a professional player draft, conditions may change very close to the moment (often within a 60-second interval) at which a club must make its draft decision. If new data is injected into the analysis algorithm, the results must be recomputed very quickly, because once the draft decision is made that data becomes worthless.
Another group of interested professionals uses that same data to move billions of dollars daily. The odds-makers determine the probability of success or failure of teams and individuals based on essentially the same data. The professional teams usually use the data as part of a long-term analytical project, so their version of the data can be managed and manipulated in a more deliberate fashion, but protecting the resulting data is paramount. The gamblers' situation, however, varies from moment to moment, and the odds are altered almost continuously. The gamblers need the immediate, real-time analytical capabilities that only ultra-low-latency applications can deliver. They need in-memory data management systems. It is clear that the exact same data, when dispersed to different groups, takes on different personas that are context dependent. Every individual dataset or subset has a unique persona.
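The idea that one record carries a different persona for each consumer can be sketched as a small data structure. This is only an illustration: the `Persona` class, the `Latency` levels, the protection labels, and the specific values assigned to each group are assumptions made for the example, chosen to mirror the contrast drawn above between scouts, club analysts, and odds-makers.

```python
from dataclasses import dataclass
from enum import Enum


class Latency(Enum):
    """Hypothetical latency classes for a dataset's persona."""
    REAL_TIME = "real-time"
    NEAR_REAL_TIME = "near-real-time"
    BATCH = "batch"


@dataclass(frozen=True)
class Persona:
    """Context-dependent requirements attached to one consumer's view of the data."""
    consumer: str
    latency: Latency
    protection: str   # illustrative label only: "high", "medium" or "low"
    in_memory: bool   # whether the consumer needs an in-memory data store


def persona_for(consumer: str) -> Persona:
    """Return an illustrative persona for a consumer group (values are assumptions)."""
    table = {
        "scout": Persona("scout", Latency.NEAR_REAL_TIME, "high", False),
        "club_analyst": Persona("club_analyst", Latency.BATCH, "high", False),
        "odds_maker": Persona("odds_maker", Latency.REAL_TIME, "medium", True),
    }
    return table[consumer]


# One made-up record, shared unchanged by every consumer.
pitch_record = {"pitcher": "P. Example", "speed_mph": 94.5}

# The same record paired with a different persona for each group.
views = {
    name: (pitch_record, persona_for(name))
    for name in ("scout", "club_analyst", "odds_maker")
}

for name, (record, persona) in views.items():
    print(name, persona.latency.value, persona.protection, persona.in_memory)
```

The record itself is never copied or altered; only the requirements attached to it change, which is the sense in which the persona is context dependent.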
Throughout a single baseball game, countless photographs are taken by professional photographers as well as amateurs. The same can be said of videos. The pseudo-scientific crowd participates as well. Every pitch speed is recorded, weather data is logged, the running speed of individual players on the field is measured and recorded, and the various producers of nutritional supplements are present collecting their own data. Even the capability of the equipment used is scrutinized scientifically.
During and after the game, newspaper articles are written and blogs are updated. Each of these examples, along with many others not mentioned here and those yet to be conceived, constitutes a massive amount of data. These examples may appear very different from the data mentioned above, but nonetheless they are all forms of data. This is the realm of unstructured data, or Big Data. Hadoop plays an important role here, because such massive amounts of data are practically impossible to put to use without some implementation of a map-reduce methodology.
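To make the map-reduce methodology concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop and uses no Hadoop API; it only mimics the three conceptual steps (map, shuffle, reduce) on two made-up article snippets. The function names and sample text are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word.strip(".,;:!?\"'"), 1


def shuffle(pairs):
    """Shuffle step: group the emitted values by key, skipping empty keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        if key:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the grouped values to get one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up stand-ins for newspaper articles and blog entries.
articles = [
    "The pitcher threw a fastball.",
    "A fastball beat the batter.",
]

all_pairs = chain.from_iterable(map_phase(article) for article in articles)
counts = reduce_phase(shuffle(all_pairs))
print(counts)
```

In a real cluster the map and reduce steps would run in parallel across many machines, each working on its own slice of the text, which is what makes the approach practical at this scale.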
Some pictures are important and need to be stored permanently. Most are discarded. Videos become video clips that are made available to others for various purposes. Some video is selected for a more focused product set and may later be used as a highlight on a sports news program or within a marketable highlight film. Ultimately, all of it can be made available to every potential user for every conceivable purpose, and each set of this data will have its own persona.
Some datasets will exist to be analyzed; some will exist as official archives. Some datasets will exist to be consumed immediately, and most will languish forever in anonymity. The same points can be made about all text created about the game. An endless variety of word searches will be executed on each PDF and blog entry, and each search result may have a completely unique persona because each was generated with distinct criteria. A simple conclusion can be drawn from these usage models: the “moment of attention” described earlier is very different for each group of interested data users. The gambler has a very short attention span, whereas the fantasy player has a longer interest. The baseball historian, however, has a lasting interest, as that data defines the very essence of the sport.
Much of this data may be obscure and seemingly lost forever, but it exists so that it can someday be used. A long-lost video of Babe Ruth and his mythical “Home Run pointing moment” in the 1932 World Series in Chicago surfaced a few years ago. Many thought that the moment would finally be explained unambiguously, but the footage only led to more controversy. Still, that footage, along with unlimited amounts of varied data in a plethora of different forms, exists and can be accessed and managed according to each dataset's persona, because it is all now part of the Dataverse. Through a Unified Data Strategy, the value of this data will be revealed and leveraged on an ongoing basis.
The processes mentioned above (capture, define, filter, store and disperse) each produce sets of data with potentially unique personas, especially when the protection requirements of the individual datasets are considered. The most notable is the dispersal process, because this is where the consumers of the data apply their own innovative processes to it, creating new datasets with even more distinct personas.
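The five processes can be pictured as stages of a simple pipeline. The sketch below is a toy illustration under stated assumptions: the record layout, the stage functions, and the two consumers are all hypothetical, and a real system would place storage, security, and streaming infrastructure behind each stage.

```python
def capture():
    """Capture: stand-in for sensors and scorers, returning made-up pitch records."""
    return [
        {"pitch": 1, "speed_mph": 94.5},
        {"pitch": 2, "speed_mph": None},  # a failed reading
        {"pitch": 3, "speed_mph": 88.1},
    ]


def define(records):
    """Define: label each record with the kind of data it holds."""
    return [{**record, "kind": "pitch_speed"} for record in records]


def filter_valid(records):
    """Filter: drop records that have no usable measurement."""
    return [record for record in records if record["speed_mph"] is not None]


def store(records, archive):
    """Store: append the records to an archive and pass them along."""
    archive.extend(records)
    return records


def disperse(records, consumers):
    """Disperse: let each consumer derive its own new dataset from the records."""
    return {name: transform(records) for name, transform in consumers.items()}


archive = []
consumers = {
    # Hypothetical consumers, each applying its own process to the same data.
    "odds_maker": lambda rs: max(r["speed_mph"] for r in rs),
    "historian": lambda rs: len(rs),
}

derived = disperse(store(filter_valid(define(capture())), archive), consumers)
print(derived)
```

Each consumer leaves the dispersal stage with a different derived dataset, which is where new personas arise.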
A baseball game is a wonderful and massive producer of data. The common fan sitting in the upper deck does not think of it as such, but the data scientist recognizes the innate value of the many forms of data being produced continuously. Much of this data is being used now, but it will take a true Unified Data Strategy to fully exploit the data as a whole.