Correlation Versus Causation: Why Should a DBA Care?

Aug 15, 2018

By Rob Mandeville

Shark attacks and ice cream sales follow the very same trajectory when charted. When ice cream sales go up, so do shark attacks. When ice cream sales decrease, shark attacks decrease. With a correlation coefficient very close to 1, we can deduce that sharks like to attack after we’ve recently had ice cream!

You may find the deduction funny, silly, or scary, but this kind of reasoning happens a lot. A more reasonable explanation is a common cause, rising temperature: people buy more ice cream when it’s warmer and are also more likely to go in the water, which increases the chances of a shark incident.
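To make the hidden-cause idea concrete, here is a minimal Python sketch using synthetic data (the numbers are invented for illustration, not real sales or incident figures). Temperature drives both series, and neither series affects the other. The helper names `pearson` and `partial_corr` are my own; the second uses the standard first-order partial correlation formula to ask how much of the relationship remains once temperature is accounted for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data: assumed coefficients and noise levels, chosen only for the demo.
rng = random.Random(42)
temps = [rng.uniform(10, 35) for _ in range(1000)]       # daily temperature
ice_cream = [20 * t + rng.gauss(0, 40) for t in temps]   # sales depend on temperature only
sharks = [0.2 * t + rng.gauss(0, 0.5) for t in temps]    # incident index depends on temperature only

r_raw = pearson(ice_cream, sharks)
r_partial = partial_corr(ice_cream, sharks, temps)

print(f"ice cream vs. sharks:              {r_raw:.2f}")
print(f"same, controlling for temperature: {r_partial:.2f}")
```

In this setup the raw correlation comes out strong, while the partial correlation falls close to zero: the apparent link between ice cream and sharks is carried almost entirely by temperature.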

I still run into plenty of situations where the difference between correlation and causation is fuzzy for some people. So, what is the distinction, and what are some pitfalls to look out for?

Definitions (for this discussion’s purposes)

Correlation: In statistics, any relationship (dependence or association) between two variables, whether causal or not

Causation: A relationship between two variables where a change in one variable produces a change in the other

Who Cares? Aren’t They Close Enough?

Some would certainly argue yes, that the two can be read as synonymous when referring to statistics. Statistically speaking, you’d rarely talk about causation—you’d talk about the correlation and how good or bad it is. To talk about causation pretty much implies that you’ve proven the connection between two variables, and that when you do X, then Y happens—100% of the time.

Where Do Databases Fit In?

Good question. Some direct ways:

Data science
Graph databases
Machine learning
Performance

We model our universe for competitive advantages. Modeling in this way helps us understand our customers, business, and systems better. With enough data, we can even become predictive and identify anomalous behavior. All of that sounds great (and for the most part it is).

The Downside

What if we’re wrong? What if our model is inherently flawed? What if we’ve introduced a bias into the equation that’s affecting outcomes? What if sharks don’t attack because we consumed ice cream?

Without going into great detail here, I’d like to introduce the idea of statistical accuracy not being equivalent to fair or even desired behavior. Feel like I just turned left? What do fairness and desired behavior have to do at all with statistical modeling?

Consider this scenario: You are the CEO of a major corporation and want to start a program to foster future leaders in your company. You task your data scientists with coming up with an algorithm that determines traits and profiles, with the intent of handing it off to HR to comb through personnel files and identify likely candidates. The data scientists look back at the last 30 years of data (people who have been promoted multiple times, been recognized, gotten higher salary bumps, etc.). Would some of the traits identified using this model be:

Graduated from university
Male
White

Statistically, it may be accurate, but idealistically, it is wrong. Modeling can perpetuate previous bad behavior.
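A toy Python sketch of how that can happen, using entirely made-up records: two hypothetical groups, "A" and "B", have identical skill distributions, but the historical promotion bar was lower for group A. A naive "model" that just learns past promotion rates per group reproduces the old bias faithfully. The field names, groups, and thresholds are all assumptions for illustration.

```python
# Made-up history: skill scores 1..10 in each group, but the promotion
# bar was 6 for group A and 9 for group B (the historical bias).
history = []
for skill in range(1, 11):
    history.append({"group": "A", "skill": skill, "promoted": skill >= 6})
    history.append({"group": "B", "skill": skill, "promoted": skill >= 9})


def promotion_rate(records, group):
    """Share of people in `group` who were promoted in the past."""
    members = [r for r in records if r["group"] == group]
    return sum(r["promoted"] for r in members) / len(members)


def mean_skill(records, group):
    """Average skill score within `group`."""
    members = [r for r in records if r["group"] == group]
    return sum(r["skill"] for r in members) / len(members)


# The "model": score future candidates by their group's past promotion rate.
learned_score = {g: promotion_rate(history, g) for g in ("A", "B")}

print("mean skill:   ", {g: mean_skill(history, g) for g in ("A", "B")})
print("learned score:", learned_score)
```

The learned scores accurately describe the past, yet they favor group A even though both groups are equally skilled. The model is statistically correct and still encodes the behavior we would not want to repeat.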

Back to Databases

Statistical modeling is really at the heart of most of the items listed above. Performance is a bit different, though. Historically, we’ve used correlation to determine health and make judgment calls. Think back to when you got a support ticket because systems were slow. You likely looked at the system from a resource vantage point to at least determine health (am I maxed out on anything that could be causing the slowness?). What if you weren’t maxed out, but CPU was at 95% of capacity? Some data professionals would call that out as the culprit. However, what if the real cause of the slowness was a session holding a lock that blocked the application? We’d throw CPU at our system to “fix” the issue, and the blocking would still be there. The danger here is (potentially) increased licensing costs, missing the mark, wasting time chasing false leads, and not resolving the problem at all, just because of an observed correlation.
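One way to guard against that trap is to check for direct evidence, such as blocked sessions, before blaming a resource that merely looks busy. The Python sketch below illustrates the ordering only; `Snapshot`, `rank_suspects`, and the thresholds are hypothetical names and values, and in practice the numbers would come from whatever monitoring views or tools your database platform provides.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time health reading for one database server."""
    cpu_pct: float                # CPU utilization, 0-100
    blocked_sessions: int         # sessions currently waiting on a lock
    longest_block_seconds: float  # longest current lock wait


def rank_suspects(snap, cpu_busy_pct=90.0, block_seconds=30.0):
    """Return suspects in the order they should be investigated.

    Direct evidence (sessions actually waiting on locks) is listed before
    circumstantial evidence (a resource that happens to be busy).
    The thresholds are illustrative assumptions, not recommendations.
    """
    suspects = []
    if snap.blocked_sessions > 0 and snap.longest_block_seconds >= block_seconds:
        suspects.append("blocking")
    if snap.cpu_pct >= cpu_busy_pct:
        suspects.append("cpu")
    return suspects or ["unknown"]


# The scenario from the text: CPU at 95%, but sessions are also blocked.
snap = Snapshot(cpu_pct=95.0, blocked_sessions=3, longest_block_seconds=120.0)
suspects = rank_suspects(snap)
print(suspects)
```

High CPU still shows up as a suspect here, but only after the blocking chain has been ruled in or out, so nobody buys more cores to fix a lock.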

Have a Method

Correlation and causation are not synonymous. Make sure you have a method for determining causation that minimizes risk and, when modeling, think about how bias could be introduced (by internal or external sources) and how to prevent it. Modeling and pattern recognition can be great tools when applied correctly. And, finally, sharks like ice cream.
