The ‘GROUP BY’ Jitterbug: All-Too-Common SQL Development Dance Step

It is a sad fact that many places understand their data and systems less today than ever before. It is a natural outcome as more consultants are used and—whether they are using a consultant or not—far more talent migrates between IT organizations than ever before. It is a drain on precious resources to have talent spending excessive periods constantly relearning an organization’s data and the systems that create it. The maintenance of  more  detailed  metadata might help relieve some of this drain, but documentation is still  viewed  as its own too costly use of resources. And since good, clear, concise writing is a dying art, many places that have documents available have documents that still do not offer clarity.

One symptom of an organization in the middle of a  knowledge vacuum is evidenced by SQL that often includes what appears to be extravagant usage of the GROUP BY clause. This usage is extravagant in that GROUP BYs appear everywhere, and they appear unexpectedly, in places that cause people to furrow their brows. GROUP BYs normally are used to aggregate data. Alternatively, they are used as a mechanism to remove duplication. Ubiquitous GROUP BYs may be found in businesses where source data contains duplication of content with no apparent attempt to track down the reasons for content duplication. Rather than investigating and discovering the ultimate cause of data duplication—inappropriate source system activities or responses, inappropriate joins between data objects, or simply flawed systems—a GROUP BY is blindly slammed into an SQL statement to remove the duplicates that are causing problems, and those making this choice simply move on just addressing a symptom but leaving the disease untouched.

Sometimes, even if there are experts who can resolve the issues, it is a commonly known fact that touching the code is going to be awkward/ time-consuming/costly. While a half-step up  on the evolutionary scale might be organizations that understand the cause of the duplication, but have chosen for whatever reason to let it remain unaddressed, ultimately the GROUP BY usage becomes officially endorsed policy. At least they do know the cause and that is a good thing.

If there is an isolated system, small, with data that isn’t really shared, and making the needed changes would be messy, time-consuming, and awkward for the organization, then forcing developers into dancing the GROUP BY jitterbug may be a necessary evil. But when data is shared, unhandled duplication is bad; when source systems are shared and large, such duplication is even worse. When the GROUP BY clause usage exists to support aggregate functions, one actually needs to sum numbers. Outside of that circumstance, GROUP BY can be a problem. How can one say with certainty that only “bad” duplicates have been removed? If there are “good” duplicates that one needs to be very concerned about, the organization cannot know because all duplicates vanish in the dance.

Accepting poor systems, accepting poor results, living with poor data all present an attitude that “poor is acceptable,” which infects an organization and starts showing up in other places. And aren’t we better than that? We shouldn’t accept cheap work- arounds when it is known what needs to be done. Duplicates in the operational systems are really a big deal. When will the “powers that be” man up  and address the things that need to be addressed?