Data center downtime and outages are undoubtedly costly—with 40% costing between $100,000 and $1 million each according to a 2020 Uptime Institute report. The good news is data center failures are largely avoidable. The bad news is most operators aren’t employing the data center analytics solutions needed to proactively predict and prevent downtime.
In fact, most maintenance that could prevent outages has traditionally been handled as an arbitrary process and still involves human intervention, which comes with the potential for human error. Thus far, there hasn’t been a way for operators to proactively identify failure points, despite the massive step forward this would offer for the industry. Ironically, data center management hasn’t always been driven by much data.
There are other industries that leverage data far more effectively to predict potential failure, and the data center industry can follow their lead. Most notably, the airline industry has replaced manual inspections with data-driven maintenance, helping cut the fatal accident rate to a fraction of what it was in the 1940s. Plane maintenance is no longer based on standard, arbitrary maintenance cycles. Rather, it uses analysis of the 5TB to 8TB of data generated on every flight. This allows airlines to focus their maintenance operations where they're truly needed, servicing the planes at the most immediate risk of failure and streamlining the process by providing insight into potential issues.
The data center industry, on the other hand, struggles to look at data—let alone create value from it. Until now, analytics have been reactive and underutilized in most mission-critical environments, but this doesn't need to be the case. Instead, operators need to take a systematic approach to predictive analytics that treats the data center as a complex system rather than a collection of individual assets. This allows potential failures to be identified proactively and asset performance degradation to be predicted before a component failure can impact overall system uptime.
Looking Beyond Individual Assets
The data center needs to be looked at as a system in which each individual part is just as important as the next. Thus, the data collected from each segment can be viewed as part of a collective whole, helping to drive predictability. For example, failure and degradation modes are currently viewed independently when they should be examined holistically to gauge how the entire system behaves during a failure.
By using as many data points as possible, operators can develop rules-based models to streamline and improve operations. After all, what isn't measured is very difficult to improve. These models can help answer a range of questions operators face, such as how temperature changes (for instance, in data center cooling) will impact performance, or what the impact of a gear failure would be. For example, data analytics can correlate attributes such as internal asset temperature with operating conditions such as percent load, and be trained to distinguish normal temperature variations within the system from anomalies that may predict a component failure.
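As a rough illustration of that idea, the sketch below fits a simple load-versus-temperature baseline from historical readings and flags deviations from it. It is a minimal stand-in for a trained model: the threshold, sample data, and function names are all assumptions, not part of any real analytics platform.

```python
# Sketch of a rules-based anomaly check, assuming historical telemetry
# pairs of (percent_load, internal_temp_c) for a single asset. Names,
# data, and the 5 degC tolerance are illustrative assumptions.
from statistics import mean

def fit_baseline(history):
    """Fit temp = a * load + b by least squares from historical pairs."""
    loads = [l for l, _ in history]
    temps = [t for _, t in history]
    ml, mt = mean(loads), mean(temps)
    a = sum((l - ml) * (t - mt) for l, t in history) / \
        sum((l - ml) ** 2 for l in loads)
    b = mt - a * ml
    return a, b

def is_anomaly(load, temp, baseline, tolerance_c=5.0):
    """Flag a reading whose temperature deviates from the load-adjusted
    baseline by more than the tolerance."""
    a, b = baseline
    return abs(temp - (a * load + b)) > tolerance_c

# Normal behavior: internal temperature tracks load roughly linearly.
history = [(20, 30.0), (40, 36.0), (60, 42.0), (80, 48.0)]
baseline = fit_baseline(history)
print(is_anomaly(50, 39.0, baseline))  # in line with load -> False
print(is_anomaly(50, 55.0, baseline))  # running hot for its load -> True
```

A production system would replace the linear fit with a model trained on far richer data, but the principle is the same: judge each reading against what the operating conditions say it should be, not against a fixed limit.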
Creating a data infrastructure allows enough data to be collected to build high-performance machine learning models, the precursor to artificial intelligence (AI). This process starts with the cloud, followed by instrumenting assets and ensuring telemetry is in place. A consistent asset model across the system will deliver high-value analytics and give operators better control of the context from which they're sourcing insights. An ideal system would unify all the data streams into a single gateway and feed them to the cloud through a cyber-secure portal, where subject-matter experts can develop and train a rules-based predictive analytics platform.
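One way to picture a "consistent asset model" is a single record shape that every vendor-specific feed is normalized into before it reaches the cloud. The field names and the normalization step below are hypothetical, chosen only to illustrate the pattern.

```python
# Illustrative sketch: heterogeneous device payloads normalized into one
# shared record shape at the gateway. All field names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AssetReading:
    asset_id: str    # unique ID within the facility
    asset_type: str  # e.g. "ups", "pdu", "crah"
    metric: str      # e.g. "internal_temp_c", "percent_load"
    value: float
    timestamp: str   # ISO 8601, UTC

def normalize(raw: dict) -> AssetReading:
    """Map one vendor-specific payload into the shared asset model."""
    return AssetReading(
        asset_id=raw["id"],
        asset_type=raw.get("type", "unknown"),
        metric=raw["metric"],
        value=float(raw["value"]),
        timestamp=raw.get("ts")
                  or datetime.now(timezone.utc).isoformat(),
    )

reading = normalize({"id": "ups-07", "type": "ups",
                     "metric": "percent_load", "value": "62.5",
                     "ts": "2021-03-01T12:00:00+00:00"})
print(reading.metric, reading.value)  # percent_load 62.5
```

Once every stream lands in the same shape, analytics built for one asset class can be reused across the fleet, which is what makes the cross-site benchmarking described later possible.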
The Redundancy Dilemma
Adding to system complexity is the fact that data centers involve a large footprint of equipment that is typically redundant—sometimes triply so. As data center footprints continue to expand over time, this redundancy will only increase. Predictive analytics is the only tool that can handle and improve this redundancy using the concept of failure modes and effects analysis (FMEA). It’s an approach that’s already common in the aerospace industry, which has mastered the use of predictive analytics.
FMEA looks at every component within a system, analyzes how it would behave in a particular failure mode, and determines what the effect would be on the overall system. Typically, only a handful of points in any system are truly critical to failure, which remains a novel idea for the data center industry. There's a real need to change this perception and use live data to understand how individual assets are performing within a system. This allows each individual component to be re-rated, creating a risk hierarchy within the system that protects against the overall potential for failure.
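A common way to build that risk hierarchy in classic FMEA is the risk priority number (RPN): severity, occurrence, and detection are each scored, and their product ranks failure modes. The sketch below shows the mechanics; the components and scores are hypothetical, and a live system would update the occurrence and detection scores from telemetry rather than fix them by hand.

```python
# Minimal FMEA-style ranking sketch. Severity, occurrence, and detection
# are scored 1-10; RPN = S * O * D orders failure modes into a risk
# hierarchy. Component names and scores are illustrative assumptions.
def rpn(severity, occurrence, detection):
    return severity * occurrence * detection

failure_modes = [
    ("UPS battery string", "open cell",    9, 4, 6),
    ("CRAH fan",           "bearing wear", 6, 5, 3),
    ("Rack PDU",           "breaker trip", 7, 2, 2),
]

hierarchy = sorted(
    ((name, mode, rpn(s, o, d)) for name, mode, s, o, d in failure_modes),
    key=lambda row: row[2], reverse=True,
)
for name, mode, score in hierarchy:
    print(f"{score:4d}  {name}: {mode}")
```

The ranking makes the point from the text concrete: a few modes dominate the risk, so maintenance attention can be concentrated where it actually reduces the chance of a system-level outage.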
Even though most industrial analytic applications leave some room to fail, redundancy often obscures how individual assets are performing, so the only thing operators can see is overall performance degradation. Leveraging data and predictive analytics can provide the needed line of sight into assets and potentially reduce redundancy, which in turn reduces capex. Reduced redundancy also brings sustainability benefits, as a streamlined operation uses less energy.
The Future of Predictive Analytics
Employing a system-based approach to predictive analytics will bring a host of benefits for data center operators. By discovering failure points and maintenance needs before they become issues—whether it’s for a rack PDU, UPS power supply, battery backup, or beyond—operators can save on upfront costs and longer-term investments in failed assets. Analytics also decrease the number of failures and need for interventions, leading to higher uptime rates and longer meantime between failures.
Two key areas that also directly impact data center operations are the ability to prolong the life of mission-critical assets, or the consumables within those assets such as batteries and capacitors, and the ability to optimize service interventions so that the minimum number of human interventions is required. With many new data center facilities being deployed in remote "edge" applications, and with skilled labor hard to find in most locations, the ability to streamline service interventions is becoming increasingly important to the overall strategy of deploying these mission-critical systems.
Recent events have shown that relying on human intervention can be difficult during crises. Building systems that are connected, predictive, and prognostic will become the standard for future data center deployments. Technologies such as AI, machine learning, and augmented reality will significantly reduce the skill set needed for mission-critical first responders. The prognostic abilities of predictive systems will mean quicker time to repair, more issues resolved in a single intervention, and reduced cost to serve, since less specialized knowledge is required of first responders.
Even with all these benefits, predictive analytics in the data center context still has plenty of room for growth and improvement. There are four core areas the industry should focus on to reach a breakthrough point. First, include more domain expertise in analytic models; only then will they become truly predictive. Second, leverage AI and machine learning at a higher rate, so the systems and analytics can get smarter over time as they learn; the more data a model consumes, the better it becomes at deriving conclusions. Third, operators must ensure that any asset-based model they build can also scale. Lastly, we must do this together as an industry: no single company can build this model successfully alone. Industry leaders, value chain players, and manufacturers must work together to achieve success and provide a model that others in the industry can follow.
Ultimately, the bigger goal is to be able to replicate a predictive analytics system across multiple data center locations and geographies. The full value of analytics becomes clear when operators can benchmark mission-critical environments around the world against one another. The more data available, the better these insights will be.