If I were to ask if your database load profile is good, bad, or ugly, how would you reply? If you are unsure about how to answer that, you are not alone. There are a lot of factors that come into play—one of the biggest being, “Compared to what?”
Most people would agree that having a historic baseline or profile of database activity is important. So, let’s move beyond that and get into why it’s important and what we can do with those historic datapoints.
We keep historical performance data or baselines to detect when the behavior deviates from what’s normal. What we’re talking about now is anomaly detection or variances. If presented in the right manner, it would be easy to identify anomalies by viewing a human-friendly graph or chart (though this would be subjective depending on the observer and tolerances). However, I don’t believe any of us want to sit in front of our monitor waiting for an anomaly to occur. We’d want to programmatically detect it and get a notification. This is where it starts to get fun.
Thus far, there’s probably a consensus on all of these thoughts. Let’s introduce some controversy: the mathematical model that will raise the alarm. Using statistics, there are two popular methods for detecting anomalies: standard deviation and percentiles.
Think of a bell curve. How datapoints vary from the mean directly affects standard deviation. If you have wildly fluctuating datapoints, the standard deviations will be larger. With tightly grouped datapoints, your standard deviations will be smaller. This means that if the datapoints are all over the place or skewed to the outside, using standard deviation will be less sensitive. If datapoints are more tightly grouped, standard deviations shrink and become more sensitive. In terms of anomaly detection, think of standard deviation as establishing a threshold that will trigger when a datapoint varies too far from the average. Note: Extreme datapoints can skew the average unless discarded.
Think of percentiles as buckets we can throw datapoints into. Take all of your datapoints and divide them into four equal buckets by value (lowest, lower, higher, and highest). Each bucket should have the same number of datapoints. We now have our percentiles.
- Lowest = 25th percentile
- Lower = 50th percentile
- Higher = 75th percentile
- Highest = 100th percentile
These buckets can be made as granular or coarse as you like. This method uses a median as our north star, not to be confused with average (mean) or mode. Anomaly detection works by seeing what percentile the next datapoint lands in—for example, if it hits in the 100th percentile bucket, we can call it an anomaly and alert accordingly.
The trick is to get valid notifications of anomalies without false positives. You don’t want to sound the alarm if there really isn’t an issue. I’m sure this brings up memories of installing and configuring monitoring software, getting a deluge of notifications that things are melting down, and finally setting up a rule to throw all notifications into a folder that is emptied out without scrutiny. What is the use of setting up notifications if they will be ignored? Getting anomaly detection done correctly is essential to avoiding “noise.”
So, which is the correct way to measure anomalies? I love the phrase “It depends,” because it really does. Both methods have their pros and cons, and it likely depends on expected/observed datapoints (dispersion of values). If we think we have a fairly tight grouping of values, then standard deviations might lend the right sensitivity to sound the alarm (a good rule of thumb is that if the standard deviation is more than half of the average, the dispersion is high and normal distribution of values is suspect). If we have a wild variety of values, then using percentiles might be the right way to go.
In conclusion, if you are getting false alarms with your monitoring, dig in and understand which method of anomaly detection is being used. Make sure it is the right way to be watching your environment. You don’t want correct anomaly detection to be … well, an anomaly.