When there is a desire to process and truly transform a large dataset, to create new datapoints from old, an expansion of one’s horizons is in order. Moving data from point A to point B is a fairly straightforward process, and opportunities for speeding up performance are known. But once one is messing around with the data, especially joining pieces to one place or another and adding in layers of complexity, flexibility in one’s thoughts is an asset. Options that optimize and speed up things may not be so obvious. Many people get stuck in a rut of the usual step-by-step thinking, linearly transforming each row inside the set. They may not necessarily acknowledge that they are scheming in a strict sequential fashion, but it is possible their processing might as well involve cursors reading one record at a time.
Often, circumstances may exist where the process can benefit from coding anew to try an alternative that may or may not bear fruit. Jumping around between temporary tables, common table expressions, or predefined permanent table structures can only be properly tested by sometimes writing and executing a full-dataset test against each version of the process. Therefore, a potential solution may require a run-through multiple times, as this approach will take time to work through and evaluate the options. About the only speedy thing that can be brought to bear is writing code quickly to get to the testing part. Creativity focuses in painting the solution canvas. How might the various elements need to be broken apart into sets following similar rule groupings? Each subset may need to calculate similar resulting datapoints in drastically differing ways. What rules apply to the majority? How many exceptions exist? How many kinds of exceptions must be employed? How many different join patterns can be leveraged to gather all the data together? Once the whole is broken into pieces, how quickly should they merge back together? These are the kinds of questions that demand examination in order to start composing the steps across the process under consideration.
An effective approach to processing and transforming large datasets is likely comprised of multiple steps. The large data will likely be split apart into several smaller sets, maybe even in a couple of differing fashions with a common and understandable theme. But there should not be too many split-apart variants; rather, as with the three bears, it should be just the right number of smaller datasets. And then, similar to solving a Rubik’s Cube, a twist or two at the very end brings all the new and old datapoints together in a complete and organized fashion.
There is no guaranteed best approach. All one can really rely on is knowing that enough options have been explored and tested so that a level of confidence exists that the chosen option seems to be the best. In many cases, an infinite number of variations could exist, but only a handful of approaches will be tested. In order to test quickly, it is best to test each piece in isolation, with care taken so that the resulting rows, if they vary across each step, still make sense, while differences are expected, known, and understood. Often, gratuitous “DISTINCTs” added into a query are a sign of a lack of understanding; so avoid the use of DISTINCT unless it is expected for good reason. Don’t go overboard and create many hundreds of lines of SQL trying to resolve everything on the context of one monster query. Such mega-queries become un-maintainable over time. Similarly, do not create a script of many hundreds of tiny queries with everything split up to a very low level of granularity. Finding a balance is where development becomes an art form. Express yourself wisely.
Todd Schraml has more than 20 years of IT management, project ?development, business analysis, and database design experience across many industries from telecommunications to healthcare. He can be reached at TWSchraml@gmail.com.