Anomalies—Predicting the Past

Dec 4, 2018

Definition of anomaly: something that deviates from what is standard, normal, or expected.¹

If you are a modern database professional, you have likely heard about or looked into anomalies and how to detect them. At a very simple level, anomaly detection looks at historical values and predicts future values based on them. An anomaly occurs when an observed value misses the prediction by a significant margin.

In a very literal sense, we are predicting the past.

Innate Anomaly Detection

We come with an innate ability to pick out anomalies. In fact, part of our success as a species can be attributed to anomaly detection. The idea in this case is to identify outliers for more rigorous inspection or analysis.

That spider doesn’t look similar to other ones I’ve encountered.
The branch wasn’t broken before.
This creek is much lower than usual for this time of year.

Noticing these things can help with survival. It could be argued that’s a good thing. So, the fact that one of the hot focuses of data science is anomaly detection should come as no surprise. However, anomalies by themselves are not good or bad; they are just different from the norm. No judgment. Additionally, the absence of an anomaly doesn’t imply good or bad—it just implies normal. There still has to be the human element to help define the good and bad.

The Science

If we are going to automate anomaly detection, we need to choose the algorithm that fits our data best. This is likely one that predicts future data points from historically observed data points and gets better over time as more data arrives. The solution would also likely account for seasonality (same day of the week, same month of the year, etc.). On the surface, anomaly detection sounds great, and it can be, as long as you don’t base too much judgment on it without investigating and analyzing why the anomaly occurred.
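As a rough illustration of what accounting for seasonality can mean, here is a minimal Python sketch that predicts the next value for a given day of the week as the average of past values seen on that same weekday. The function name, the sample dates, and the sample values are all hypothetical, and a per-weekday average is only one simple choice among many possible prediction methods.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_baseline(observations):
    """Map each weekday (0 = Monday) to the mean of the values seen on it."""
    by_weekday = defaultdict(list)
    for day, value in observations:
        by_weekday[day.weekday()].append(value)
    return {weekday: mean(values) for weekday, values in by_weekday.items()}


# Made-up history: busy Mondays, quiet Saturdays.
observations = [
    (date(2018, 11, 19), 200),  # Monday
    (date(2018, 11, 24), 50),   # Saturday
    (date(2018, 11, 26), 220),  # Monday
    (date(2018, 12, 1), 70),    # Saturday
]

baseline = weekday_baseline(observations)

# Expected value for the next Monday, based only on past Mondays.
expected_monday = baseline[date(2018, 12, 3).weekday()]
print(expected_monday)
```

Comparing a new Monday against past Mondays, rather than against every day, keeps a normal weekday peak from being flagged just because weekends are quiet.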

Before we get to judgment, there’s still work to be done: we’re not quite finished with how our algorithm detects anomalies. We have to define how far from the expected value a new data point must be before we call it significant (a multiple of the standard deviation being a popular choice). Once we have put some math behind how our algorithm behaves and how far off the next value needs to be before we call it abnormal, we can start looking for anomalies in an automated fashion.
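To make the standard deviation idea concrete, here is a minimal Python sketch that flags a new value when it falls more than k standard deviations from the mean of the history. The function name, the sample data, and the default of k = 3 are assumptions chosen for illustration, not a recommendation, and using the plain historical mean as the prediction is the simplest possible stand-in for a real forecasting algorithm.

```python
from statistics import mean, stdev


def is_anomaly(history, new_value, k=3.0):
    """Return True if new_value is more than k sample standard
    deviations away from the mean of history."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    expected = mean(history)
    spread = stdev(history)
    if spread == 0:
        # History never varied, so any different value is a deviation.
        return new_value != expected
    return abs(new_value - expected) > k * spread


# Made-up history with mean 100 and sample standard deviation 2.
history = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomaly(history, 101))  # False: within 3 standard deviations
print(is_anomaly(history, 130))  # True: far outside the usual spread
```

A smaller k flags more points and a larger k flags fewer, so the choice of k is where we decide how much deviation counts as significant.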

Example

Consider two workers whose jobs both start at 8:00 a.m. One always arrives a few minutes before 8:00 a.m. and the other always arrives right at 8:30 a.m. As long as their arrival patterns do not vary, there will be no anomaly. If the 8:00 a.m. worker arrived at 8:05 a.m., would we call that an anomaly? Or how about 8:15 a.m. or 8:30 a.m.? What if the second worker arrived at 8:00 a.m. one day? 7:30 a.m.? Would we call those anomalies even though the behavior is better than the norm? In the strictest sense of anomaly detection, we would call out everything that was considered an outlier. Then, depending on whether the outlier was below or above our expected value, we would be ready to place a judgment.

The Judgment

At this point, we are ready to assign a good or bad determination to our anomaly. In our example, the second worker comes in late each and every day. The fact that one day she came in at 8:00 a.m. or earlier was a good thing. In fact, it was the normal that was bad. Now, a measurement this simplistic can of course be corrected by also comparing data values with an absolute value or threshold, but using anomaly detection alone might create a flawed model. In fact, check out this post (www.dbta.com/Columns/Next-Gen-Data-Management/NEXT-GEN-DATA-MANAGEMENT?---Dangers-of-Statistical-Modeling-127922.aspx), where I discuss statistical bias and how it can impact results in quite an unintentional, yet negative, way.
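The second worker’s case can be sketched in a few lines of Python that keep the two questions separate: is the arrival unusual for this worker, and is it late against the absolute 8:00 a.m. start? The function name and the fixed 10-minute tolerance are hypothetical. The tolerance stands in for a statistical measure such as standard deviation only because this worker’s history never varies.

```python
from statistics import mean

START = 8 * 60  # 8:00 a.m., expressed as minutes after midnight


def assess(history, arrival, tolerance=10, start=START):
    """Return (is_anomaly, is_late) for one arrival time in minutes.

    is_anomaly: the arrival deviates from this worker's own norm.
    is_late:    the arrival is after the absolute start time.
    """
    expected = mean(history)
    is_anomaly = abs(arrival - expected) > tolerance
    is_late = arrival > start
    return is_anomaly, is_late


# The second worker has arrived at 8:30 a.m. (510 minutes) every day.
second_worker = [510] * 5

print(assess(second_worker, 510))  # (False, True): normal, but late
print(assess(second_worker, 480))  # (True, False): anomalous, but on time
```

The anomaly flag alone would stay quiet on every late day and fire on the one on-time day, which is why the absolute comparison is needed alongside it.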

The Wrap Up

Using anomaly detection to surface deviations from the norm can be a powerful tool when automating monitoring for just about any statistical value. However, it is only part of the story: it will not uncover bad behavior that happens all the time. Combining anomaly detection with data that can distinguish good behavior from bad behavior, even when that behavior is the norm, starts to get into more intelligent analysis. The hype is legit. Anomaly detection can be a very powerful tool, just not in complete isolation.

I intentionally did not cover the concept of confidence intervals; I am saving that for a later date.

¹Cited from Google Dictionary using search term anomaly definition