Planning the Data Flow for Combinatorial Dimensions: Utility Versus Possibility

Bookmark and Share

Dimensions are the workhorses of a multidimensional design.  They are used to manage the numeric content being analyzed.  It is through the use of dimensions that the metrics can be sliced, diced, drilled-down, filtered and sorted.  Many people relate to dimensions by thinking of them as reference tables.  Such thoughts aren't exactly accurate.  A dimension groups together the textual/descriptor columns within a rationalized business category.  Therefore, much of the content coming from relational tables may be sourced from reference tables, but the relationship between each source reference table and the targeted dimension is unlikely to be one-for-one.  These grouped-format dimensions often contain one or more hierarchies of related data items used within the OLAP queries supported by the structures.  

In merging the elements within a business category, things may not always be direct.  Certainly, it is an ideal circumstance when one can join together a few simple source tables under a single straight-forward query and arrive at the proper content of a dimension.  At other times, bringing everything together requires more effort.  These more awkward circumstances often arise when one or more items within the business category comprising the dimension do not have a tight relationship to the other items within the dimension.  This circumstance results in having pieces of a dimension that fit together more as a combinatorial result rather than any more direct relational dependency.  It is the difference between stating an explicit relationship like, "Product X ships from Warehouse B" versus an open arrangement of "Any Product may ship from any Warehouse."  When planning the data flow for populating a combinatorial dimension, there are two options to consider.  The first is to represent within the dimension everything possible and simply perform the logical equivalent of a Cartesian product, matching everything on one side with each and every thing on the other side.  The other alternative is to add to the dimension everything actually used as each used combination is encountered.

In building up the dimension using the “Possibility” technique containing every possible combination, the algorithm is very simple.  Implementation may be very easy.  A downside may arise if the nature of the data leaves a large percentage of the created dimension entries as never referenced.  The non-reference situation may be made further awkward should there be a very large number of entries, and more so should those entries drive drop-down lists on one's OLAP/reporting tools.  It could be hard for a user to select actual valid combinations having facts associated with them. 

Following the “Utility” technique, as fact details are processed only the actual valid combinations found within the fact data get employed to populate the dimension.  This causes the dimension to have only content that is known to be used while leaving the dimension table with the fewest number of rows.  Everything is optimized; even the OLAP/reporting tool drop-down lists are as small as possible.  But there is a trade-off for this optimization; the logic built into the ETL generating the dimension may be very tedious, requiring long timeframes to code and test.  Also, the dimension may only be populated concurrently with processing the fact details, which increases the necessary cycle to get through the full workload. 

As a third option, one also may consider splitting off the combinatorial-causing elements into their own separate dimension.  However, that separate dimension may have issues as well; otherwise, it likely would have been split off naturally from the very beginning of the design's creation.  Both of these approaches, Possibility and Utility, are valid for supporting one's dimensions.  The option chosen for implementation should be the one that best fits your circumstances.  As with many things in the data world, the most important choice is always the one best fit for use.