NEXT-GEN DATA MANAGEMENT - Dangers of Statistical Modeling

As we enter a world of machine learning and data science, are there any gotchas or negatives? It sounds as if it is all sunshine and rainbows, but, as the title of this post suggests, I believe there are. Here are some of the dangers I thought of or came across while researching this post:

  • Normal does not equal good.
  • Beware statistical bias.
  • We stop thinking and just believe the model is right.
  • Statistical models can suffer from the “boiling frog” syndrome.

Let’s take a closer look and see how each one could ultimately work against us.

Normal Does Not Equal Good.

Regardless of which statistical model is used for forecasting, all of them learn from a series of observed data (whatever you want to predict). But just because something is what we observe doesn’t mean it is what we want, or what we would call “good.” For example, say management wants to be made aware of any anomalies in worker behavior, so we decide to model some activity data, including login and logout times. The employee who always comes in late to work and continues to do so will raise no alarms, but I’m fairly certain that’s not the desired behavior.
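Here is a minimal sketch of the problem, using a simple z-score detector on made-up clock-in times (all numbers are hypothetical). Because the habitual latecomer defines the norm, the model never flags the behavior:

```python
# Sketch: a z-score anomaly detector applied to clock-in times.
# The employee below clocks in ~40 minutes late every day, so "late"
# IS the statistical norm and is never flagged. Numbers are invented.
import statistics

# Minutes past 9:00 AM for the last two weeks of clock-ins.
clock_ins = [42, 38, 45, 40, 44, 39, 41, 43, 37, 46]

mean = statistics.mean(clock_ins)
stdev = statistics.stdev(clock_ins)

def is_anomaly(minutes_late, threshold=2.0):
    """Flag a clock-in more than `threshold` standard deviations from the mean."""
    return abs(minutes_late - mean) / stdev > threshold

print(is_anomaly(44))  # 44 minutes late is "normal" here -> False
print(is_anomaly(0))   # arriving on time would be flagged -> True
```

The detector is doing exactly what it was asked to do; the problem is that “deviation from the observed norm” was never the same thing as “bad.”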

Beware Statistical Bias.

There are many types of statistical bias that can affect forecasting accuracy. Statistical bias simply means that the sample of data you are forecasting with is not representative of the general population of data. In other words, your datapoints are skewed in some way. One of the more memorable examples of sampling bias in a survey came from the 2016 Democratic primary in Michigan. Hillary Clinton was supposed to run away with the state, with polls giving Bernie Sanders less than a 1% chance of winning1. The sampled data was wrong in a big way, partially due to sampling bias: the surveys called landline phones, which the younger generation is much less likely to have.
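The effect is easy to simulate. The sketch below invents a population where younger voters favor one candidate, then surveys it through a “landline” frame that mostly reaches older voters. All proportions are hypothetical, not the real Michigan numbers:

```python
# Sketch: how a biased sampling frame skews a survey estimate.
# All proportions below are invented for illustration.
import random

random.seed(0)

# Hypothetical population: 40% young voters, 60% older voters.
# Young voters favor candidate A 70/30; older voters favor A 40/60.
population = (
    [("young", random.random() < 0.7) for _ in range(40_000)]
    + [("older", random.random() < 0.4) for _ in range(60_000)]
)

true_support = sum(votes_a for _, votes_a in population) / len(population)

# A landline survey reaches only ~10% of young voters it dials.
landline_sample = [
    votes_a
    for age, votes_a in random.sample(population, 5000)
    if age == "older" or random.random() < 0.1
]
surveyed_support = sum(landline_sample) / len(landline_sample)

print(f"true support:    {true_support:.1%}")
print(f"landline survey: {surveyed_support:.1%}")  # understates candidate A
```

The arithmetic inside the survey is flawless; the sampling frame alone is enough to move the estimate by roughly ten points.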

We Trust Too Much in the Model.

Keeping things mathematical attempts to take personal bias out of the equation. However, just because things always were a certain way doesn’t mean they always will be, the way a model would predict. In other words, we have intuition and reasoning skills that take into account predictive information beyond “what were the datapoints previously.” Ignore those skills and intuition at your peril. By the time most models take large swings or adjustments into account, it could be too late to react. A good example of this is predicting storage needs for our databases. A trend fitted to historical data growth doesn’t take into account the huge data migration onto our platform scheduled for next week. When it happens, the model will let you know it was anomalous behavior, but that doesn’t remedy the downtime from running out of capacity.
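To make that concrete, here is a sketch of a naive linear trend fitted to historical storage growth. The growth rate and the 500 GB migration figure are both invented; the point is only that no trend fitted to the past can see a planned one-off event coming:

```python
# Sketch: a least-squares linear trend on 30 days of storage history.
# The model extrapolates organic growth perfectly -- and is still
# blindsided by a known, planned migration. Numbers are hypothetical.

history = [100 + 5 * day for day in range(30)]  # ~5 GB/day organic growth

# Fit a simple linear trend (least squares on the day index).
n = len(history)
xs = range(n)
x_mean = sum(xs) / n
y_mean = sum(history) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) \
    / sum((x - x_mean) ** 2 for x in xs)

# Forecast a week past the end of the history (day 37).
forecast_day_37 = y_mean + slope * (37 - x_mean)

# Reality: a 500 GB migration (known to the humans!) lands that week.
actual_day_37 = history[-1] + 5 * 7 + 500

print(f"model forecast: {forecast_day_37:.0f} GB")
print(f"actual usage:   {actual_day_37:.0f} GB")  # far above the forecast
```

The humans planning the migration could have adjusted the capacity plan weeks in advance; the model only finds out when the data arrives.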

The ‘Boiling Frog’ Syndrome.

The anecdote goes: if you put a frog into a pot of boiling water, it will jump right out to save itself. On the other hand, if you put a frog into a pot of room-temperature water and slowly turn up the heat, the frog will not notice until it is too late. (I have not personally tried this experiment.) Let’s use storage capacity as the example again. You monitor and forecast storage needs over time and want to be notified of anomalies (large increases or drops in storage usage between polling intervals). If your warnings are set to one standard deviation from the norm, a gradual upward trend in storage consumption can reach maximum capacity without ever raising an anomaly alarm, because no single step looks unusual.
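A small simulation makes the trap visible (capacity and growth numbers are invented). Every polling interval’s change stays within one standard deviation of the recent norm, so the one-sigma rule never fires, while total usage sails past capacity:

```python
# Sketch: a steady creep in storage usage that never trips a one-sigma
# "change since last poll" alarm, yet walks straight past capacity.
# All capacity and growth numbers are invented.
import statistics

capacity_gb = 1000
usage = 600.0
recent_deltas = [9, 11, 10, 10, 12, 8, 10, 11, 9, 10]  # GB added per poll

alarms = 0
for _ in range(50):
    delta = 10.0  # the creep: perfectly ordinary-looking growth
    mean = statistics.mean(recent_deltas)
    stdev = statistics.stdev(recent_deltas)
    if abs(delta - mean) > stdev:  # one-standard-deviation anomaly rule
        alarms += 1
    recent_deltas = recent_deltas[1:] + [delta]  # slide the window
    usage += delta

print(f"alarms raised: {alarms}")                 # nothing ever looked anomalous
print(f"usage: {usage} GB of {capacity_gb} GB")   # but capacity is long gone
```

Like the frog, the detector compares each moment only to the recent past, so a slow enough trend is invisible to it by construction.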

Insert Humanness Into the Equation.

Machine learning and data science can offer a lot to businesses wanting to do more with the data they have. However, be careful of potential pitfalls when using statistical modeling for predictive purposes. If you are using predictions to alert you to anomalies, you might want to consider some hard-stop limits as well. Surveys used to obtain datapoints can fall victim to statistical bias, and bias can come in many forms. Words matter. As always, don’t forget or hesitate to insert your humanness into the equation.
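As a sketch of what pairing a relative anomaly check with an absolute hard-stop limit might look like (thresholds and names are invented for illustration):

```python
# Sketch: a relative anomaly check plus an absolute hard-stop limit.
# The hard stop catches the gradual trend the anomaly rule misses.
# Capacity and threshold values are hypothetical.

CAPACITY_GB = 1000
HARD_STOP_PCT = 0.85  # alert unconditionally above 85% of capacity

def check(usage_gb, delta_gb, delta_mean, delta_stdev):
    """Return the list of alert reasons for this polling interval."""
    alerts = []
    if delta_stdev and abs(delta_gb - delta_mean) > 2 * delta_stdev:
        alerts.append("anomalous change since last poll")
    if usage_gb > CAPACITY_GB * HARD_STOP_PCT:
        alerts.append("hard-stop: capacity threshold crossed")
    return alerts

# The gradual creep looks "normal," but the hard stop still fires:
print(check(usage_gb=870, delta_gb=10, delta_mean=10, delta_stdev=1))
```

Neither rule replaces the other: the statistical check catches sudden swings, and the dumb fixed limit is the humanness that catches the boiling frog.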

1 “Why the Polls Missed Bernie Sanders’s Michigan Upset,” June 25, 2018.