Divide and Conquer

Jul 11, 2012

By Todd Schraml

The whole world can be divided into two groups: splitters and lumpers. Design battles are waged across conference rooms as debates rage over whether to split or to lump. Splitters take a group of items and divide it into sub-groups and sub-sub-groups, occasionally going so far that each lowest level becomes a group of one. On the other side of the design fence, lumpers combine items until everything is abstracted into group objects covering very broad territory, such as a “Party” construct or, ultimately, an “Object” object. Within data modeling, arguments arise over whether to sub-type an entity, or lumping comes up as the grain of a multidimensional fact is proposed. This debate underlies much of the decision-making involved in determining what domains to create within a data model. The split-versus-lump issue is ubiquitous. Beyond the entity definition, table grain, and domain grain already mentioned, it is at the heart of deliberations about establishing functions, overriding methods, or composing an organizational structure.

Einstein once stated that “Reality is merely an illusion, albeit a very persistent one.” If reality is but an illusion, then whether one splits or one lumps really is of little concern. And in practice, proper and useful solutions are built using one or the other or, more often, a hybrid of splitting and lumping decisions along the way. Stasis is an unlikely goal, as it does not seem possible that anyone can ever convincingly state that “Splitting is always best” or that “Lumping is eternally the most useful choice.”

However, circumstances certainly arise where, inside an application, one approach or the other is more effective. Fact table grain is likely the best case for observing these choices. Suppose users need the facts at both a detailed and an aggregate level, but the users of the detailed data are few, storage costs are a concern, and dividing the aggregated data back into details is relatively easy. Then one can consider lumping and keeping the fact table at an aggregate level. But even under this scenario, if the calculated details are later re-aggregated, rounding errors may creep in, causing the calculated aggregate to no longer match the actual aggregate, so that leaving things unlumped may prove optimal.
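
The rounding problem can be illustrated with a minimal sketch. It assumes a fact stored only at the aggregate level, with detail values derived by an even split and rounded to cents; the function name `allocate_evenly`, the amount, and the three-way split are hypothetical choices for illustration, not taken from any particular system.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def allocate_evenly(aggregate: Decimal, parts: int) -> list[Decimal]:
    """Derive detail rows by dividing a stored aggregate (hypothetical helper).

    Each derived detail value is rounded to cents, as it would be
    if it were written back to a currency column.
    """
    share = (aggregate / parts).quantize(CENT, rounding=ROUND_HALF_UP)
    return [share] * parts


# The "lumped" fact table stores only the aggregate.
stored_aggregate = Decimal("100.00")

# Details are calculated on demand by dividing the aggregate.
details = allocate_evenly(stored_aggregate, 3)

# Someone later sums the calculated details back up.
reaggregated = sum(details, Decimal("0"))

# The difference between the actual and the recalculated aggregate.
drift = stored_aggregate - reaggregated

print(details)       # three rows of 33.33
print(reaggregated)  # 99.99
print(drift)         # 0.01
```

A one-cent drift on a single row looks trivial, but it repeats on every row that is divided and summed again, which is why the recalculated totals can stop matching the stored ones.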

The real question is which issue is more painful to the organization: storage costs or the bad practice of re-aggregating? This concern over rounding errors is a driving issue behind the general advice to keep a fact table at the lowest grain possible, that is, non-lumped or split. The choices made in practice should be driven by the specific uses to be made of the results of the design. In normalized database designs, abstractions are very much an act of lumping, and within the practice of data modeling, abstraction is a powerful tool. But when abstraction is abused, the resulting data model does not help explain the business. Therefore, abstraction should be used sparingly, for instance when the organization truly requires a level of flexibility or purposeful ambiguity.

Yes, the world can be looked at as being composed of splitters and lumpers, as it often seems that people tend to lean one way more than the other. As discussions unfold about almost any solution’s design issues, if one listens closely, one may recognize the threads of splitting and lumping that trace through the disagreement. In recognition comes the power to rise above and look for ways of determining which option is more effective for the circumstance. The only thing that seems uncertain, however, is whether dividing the world into two groups is an act of splitting or an act of lumping.
