Creating Analytical Data Models

Designing a data model that supports the reporting and analytical functions is no different, initially, than any other modeling effort.  Understanding the data is crucial.  The data architect or modeler needs to feel comfortable with dimensional modeling techniques and needs to obtain a working knowledge of the universe of discourse for the subject-at-hand.  A good start in gathering this knowledge begins by reviewing the operational data structures containing the identified source elements.  The challenge in designing analytical solutions is found in applying best practices for analytics simply and effectively. 

The desired solution will likely incorporate star schemas; actual implementation might be accomplished via relational tables, or as cubes in a proprietary structure like that used inside Microsoft's Analysis Services, and another such analytics-specific tools.  For a star schema approach, everything is simplified and data points are either metrics or dimensional attributes.  Metrics are numeric in type and fall onto a fact structure; attributes are grouped together into various dimension groupings.  Each fact represents a business-definable event, an order, an invoice, a sale, etc.  It is these business events that are being reported and analyzed when the stars or cubes are used.  Unlike an operational database, an analytics database deals directly with the known.  New metrics may be defined for use, but the newness is in how a calculation is established; the raw numbers feeding into those metrics should be clear and acknowledged business concepts based on available information.  Everything feeding into the analytics database structures comes from existing operational systems, or from existing, known, industry-relevant third party data sources.  The data modeler for an analytics initiative never starts from a blank slate.  The base ideas, the functions, the data transitions, all are happening already.  Even within organizations lacking formal documentation, expertise relating to existing data is available, even if only inside the heads of certain very busy people. 

Analytics and reporting are not data mining.  True data mining likely involves establishment of data structures unique to a specific set of questions or interrelationships under exploration.  Data mining structures are likely best implemented as "flat files" that mix all the needed data elements together in every single row.  A star schema is, or a set of star schemas are, useful for simplified reporting, trending and varying levels of analysis.  While "star schemas" and "data mining" are both terms involved in analytics, these ideas do not have a symbiotic relationship.  Star schemas do not enable data mining, and data mining does not mean star schemas are in use.  When the user community engages the term data mining, the data modeler or solution architect should tread lightly and gather further details.  Actual data mining often will involve statisticians engaged in exposing and understanding certain interrelationships.  The data mining focus is on processing sets of data against varying statistical formula.  If, instead, the user focus appears to be on trends and nice graphics, a star schema is fine.

In many ways, the multiple stars that may be defined for a given solution represent a logical description of part of the organization.  Since the business and its data already exist, the focus becomes how to channel the needed data items and derived calculations into a proper set of facts and related dimensions that both accurately describe the business and provide the right level of detail for use in the desired analyses.  In this sense, analytics and reporting are passive functions.  New and never-before-seen business events are not being created out of thin air.  If requests from the user community seem to imply the creation of new business events, then for that part of the project at least, it appears an operational component is necessary.