IT Needs to Stop Firefighting and Get Down to Business with Preemptive Tactics

Nov 22, 2017

By Mike Flannagan

To sustain and drive viable businesses now, and into the rapidly changing future, IT needs to stop firefighting. Reacting when things go wrong—after an incident has caused damage—incurs costs. Taking a proactive, preventative approach using predictive knowhow moves IT out of the proverbial hot seat and gives them much-needed control to keep critical business systems up and running. For decades IT proved its worth by responding quickly when disaster struck. Happily, front-line response is no longer the only option; predictive resources can help tamp down fires before they happen.

Seeing the future to prevent and avoid disaster has long been a goal for many, and it is becoming a reality for those managing IT landscapes. Technologies now exist to help identify and prevent critical incidents before they occur, and these predictive technologies are getting better every day. When IT teams are proactive, they can save time, resources, money, and reputations, and they can ensure greater system uptime.

In our digital world, IT operations need to be prophetic seers. Our new business reality, which has IT and the network at its core, needs oracles who can see around the corner and invoke proactive measures to avert or minimize impending disasters.

Consider the advantages that IoT and preventative maintenance are introducing to manufacturing and automotive fleets. These groups are focusing on preventing breakdowns and unexpected downtime that disrupts the supply chain, cuts into revenue, and adds emergency repair expenses. Having a lorry sitting at the side of the road or a failed machine bottlenecking a manufacturing flow has become unacceptable because these breakdowns are preventable. Companies are proactively avoiding the unexpected by applying predictive skill, technology, and knowhow to sensor and log data. Why should IT, the cornerstone of modern business, be any different?

Sensor and log data abound in IT, and that data can be transformed into departmental savings. Unexpected downtime is costly. A datacenter outage, for example, can cost $9,000 per minute on average, according to a Ponemon Institute study. Can you afford a $9,000/minute mishap? With predictive analytics, potential problems that could cost huge amounts of time and money, potentially even bringing down the datacenter, can be warded off. An IT problem anticipated is an IT problem you can prevent.

Anticipate, Predict, Avert

Predictive capabilities can revolutionize IT with huge time and operational savings. They include time-series forecasting, which can require little or no data preparation; advanced models that can predict server failures; and confusion matrices and causal analysis. Predictive time-series forecasting can be used for capacity planning and management. Growth, or lack thereof, and usage trends and patterns can justify budget requests and ensure that resources are optimally placed and allocated. IT is most likely doing this in some shape or form already, but predictive forecasting can also transform alerting and allow for automated measures.

When using predictive forecasting and detecting outliers, you can be alerted when something is out of the norm instead of whenever a static threshold is crossed. This makes alerting more effective and efficient. Instead of getting an alert every Monday morning at 9am when the authentication servers spike over 80% CPU utilization as everyone logs on, none will be sent, because that spike is expected. If the same spike comes on Sunday morning at 9am, you need to know about that change, and an alert will be sent.

Using predictive forecasting also means you are alerted to deviations from anticipated results so that the team can investigate and take action before a system or the network moves to incident stage. Consider a server registering 30% CPU utilization at a point in time when a 70% CPU load is expected. What is causing this aberration? Is the problem the server, an application, hardware, or a switch?

Static indicators can’t provide the information you need, as the server load varies over time and it is difficult to select a meaningful alert threshold. Predictive forecasting with outlier detection can do this easily. Not only will prediction ease alert storms and warn you of situations you would not otherwise be notified of, it can change the way you react through automated actions.
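To make the contrast with static thresholds concrete, here is a minimal sketch of baseline-aware alerting. It assumes CPU samples tagged with weekday and hour, and it swaps in a simple per-slot mean and standard deviation as the "forecast"; the function names, the z-score threshold, and the sample data are all hypothetical, and a production system would use a proper time-series model.

```python
from statistics import mean, stdev

def build_baseline(history):
    """history: iterable of (weekday, hour, cpu_percent) samples.
    Returns {(weekday, hour): (mean, stdev)} for slots with at least 2 samples."""
    slots = {}
    for weekday, hour, cpu in history:
        slots.setdefault((weekday, hour), []).append(cpu)
    return {slot: (mean(v), stdev(v)) for slot, v in slots.items() if len(v) >= 2}

def is_outlier(baseline, weekday, hour, cpu, z_threshold=3.0, min_sigma=2.0):
    """True when a reading deviates from what is expected for this time slot
    by more than z_threshold standard deviations, in either direction."""
    expected, sigma = baseline[(weekday, hour)]
    sigma = max(sigma, min_sigma)  # floor so tiny noise does not trigger alerts
    return abs(cpu - expected) / sigma > z_threshold

# Eight weeks of made-up readings: busy Monday 9am, quiet Sunday 9am.
history = []
for week in range(8):
    history.append((0, 9, 82 + (week % 3)))  # Monday = 0
    history.append((6, 9, 10 + (week % 2)))  # Sunday = 6
baseline = build_baseline(history)

print(is_outlier(baseline, 0, 9, 85))  # False: the Monday login spike is normal
print(is_outlier(baseline, 6, 9, 85))  # True: the same load on Sunday is not
print(is_outlier(baseline, 0, 9, 30))  # True: Monday load far below expectation
```

Note that the third case, a load well below the expected value, would never trip a static "above 80%" threshold, yet it is exactly the kind of deviation worth investigating.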

The old IT adage of “have you tried turning it off and then on again” has value. Sometimes stopping and restarting a service, or rebooting an entire server, solves the problem. This fix can be automated to happen when the appropriate trigger comes in. When using predictive forecasting, traffic load and other critical aspects can be anticipated, and actions can be automated to avert what would otherwise become issues.

In a virtualized environment, a growing and very common characteristic of the current IT landscape, resources can be added and removed as needed. Foreseeing that a resource on a virtual server will reach a critical level means you can augment that resource to avert an issue. Better yet, the process can be automated. For example, if predictive forecasting shows that a virtual server will run out of storage, management software can be triggered to add storage to the virtual machine before an application runs out of quota and performance suffers. Likewise, virtual resources can be automatically reclaimed and reallocated when demand is projected to drop below a predetermined threshold. A crisis predicted is a crisis that can be averted.
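The storage scenario can be sketched in a few lines. This is an illustration under stated assumptions, not a real management API: it fits a straight-line trend to daily usage samples, estimates how many days remain before capacity is reached, and returns a hypothetical action tuple that orchestration software could act on. The names `days_until_full` and `plan_action`, the lead time, and the step size are all invented for the example.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def days_until_full(used_gb_by_day, capacity_gb):
    """Days from the last sample until the trend line reaches capacity.
    Returns None when usage is flat or shrinking."""
    slope, intercept = fit_line(used_gb_by_day)
    if slope <= 0:
        return None
    last_day = len(used_gb_by_day) - 1
    return max(0.0, (capacity_gb - intercept) / slope - last_day)

def plan_action(used_gb_by_day, capacity_gb, lead_days=7, step_gb=50):
    """Decide whether to grow the volume before the forecast says it fills up."""
    remaining = days_until_full(used_gb_by_day, capacity_gb)
    if remaining is not None and remaining <= lead_days:
        return ("add_storage", step_gb)
    return ("no_action", 0)

usage = [400, 410, 420, 430, 440, 450, 460]  # made-up data, growing 10 GB/day
print(days_until_full(usage, 500))  # 4.0
print(plan_action(usage, 500))      # ('add_storage', 50)
print(plan_action(usage, 1000))     # ('no_action', 0)
```

A linear trend is the simplest possible forecast; usage with weekly or seasonal cycles would call for a richer model, but the trigger logic stays the same.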

Building Models to Identify Failures Before They Happen

More advanced predictive techniques such as server failure models, confusion matrices, and causal analysis are key to contemporary state-of-the-art IT operations. These require some data preparation and input from specialized talent to get things going, but they can take IT operations to an entirely different level.

Gathering a history of data and building models to predict potential server issues and failures brings the reality of preventative maintenance to IT. You can identify which components are most likely to fail and apply confidence levels to them based on the models and on patterns in the events that preceded past failures, coupled with elapsed time and average time to failure. Being able to see which servers or components in the IT landscape have a particular propensity for failing means you can investigate and take proactive action before an incident occurs, any damage is done, or any revenue is lost or compromised.
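One simple way to turn elapsed time and average time to failure into a ranked list is a lifetime distribution. The sketch below assumes a Weibull model, which is a choice made for this example rather than a method named in the article; the shape parameter, the 30-day horizon, and the fleet data are all hypothetical. It computes the probability that a component fails within the horizon, given that it has survived to its current age.

```python
import math

def weibull_scale_from_mttf(mttf_days, shape):
    """Weibull scale parameter that yields the given mean time to failure."""
    return mttf_days / math.gamma(1.0 + 1.0 / shape)

def failure_probability(age_days, horizon_days, mttf_days, shape=1.5):
    """P(fails within horizon | survived to age) under a Weibull lifetime model.
    shape > 1 means wear-out: the failure rate rises with age."""
    scale = weibull_scale_from_mttf(mttf_days, shape)

    def survival(t):
        return math.exp(-((t / scale) ** shape))

    return 1.0 - survival(age_days + horizon_days) / survival(age_days)

# Made-up fleet: name -> (age in days, mean time to failure in days)
fleet = {"db-01": (900, 1000), "web-07": (100, 1000), "cache-03": (400, 600)}

ranked = sorted(
    ((name, failure_probability(age, 30, mttf)) for name, (age, mttf) in fleet.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, p in ranked:
    print(f"{name}: {p:.1%} chance of failure in the next 30 days")
```

In practice the model parameters would be estimated from the failure history the paragraph describes, and precursor events in the logs would adjust the estimate for individual machines.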

Not only can you monitor your IT landscape and be alerted when an element crosses a given failure-probability threshold, but you can also actively investigate potential concerns on key elements when business-critical events are occurring, such as the preparation of end-of-year financial statements. Looking at servers that are crucial to quarter-end or year-end activity can ensure that no impending IT problem prevents business-critical activities from happening.

Can You Set Priorities Based on Business Value and Impact?

Enriching IT data with master data can bring a whole new dimension to IT operations management. Activities can now be viewed and prioritized with an eye to business value and impact. Knowing things like the SLA, the business unit, and the business and revenue impact shines a whole new light on IT management and makes prioritization possible. Receiving an alert that three servers have a 10% chance of a serious event, and knowing the SLA levels, associated business units, how the servers fit into business operations, and revenue impacts, gives IT the information it needs to know where to focus its resources and which servers to address first.
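The three-server scenario can be illustrated with a small sketch. The record fields, the server names, and the revenue figures are hypothetical, and the scoring rule (failure probability multiplied by revenue at risk, with SLA tier as a tie-breaker) is one plausible assumption rather than a prescribed formula.

```python
from dataclasses import dataclass

@dataclass
class ServerRisk:
    name: str
    failure_probability: float  # from a predictive model
    business_unit: str          # from master data
    sla_tier: int               # 1 = strictest SLA
    revenue_per_hour: float     # revenue at risk while the server is down

def expected_hourly_loss(server):
    """Probability-weighted revenue at risk per hour of outage."""
    return server.failure_probability * server.revenue_per_hour

def prioritize(servers):
    """Highest expected loss first; stricter SLA tier breaks ties."""
    return sorted(servers, key=lambda s: (-expected_hourly_loss(s), s.sla_tier))

# Three servers with the same 10% risk but very different business impact.
alerts = [
    ServerRisk("hr-02", 0.10, "Human Resources", 3, 500.0),
    ServerRisk("web-01", 0.10, "E-commerce", 1, 50_000.0),
    ServerRisk("erp-03", 0.10, "Finance", 1, 20_000.0),
]

for server in prioritize(alerts):
    print(f"{server.name} ({server.business_unit}): "
          f"expected loss ${expected_hourly_loss(server):,.0f}/hour")
```

Without the master data, the three alerts look identical; with it, the order in which to address them becomes clear.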

Using a confusion matrix to assess correct and incorrect classifications (true positives, true negatives, false positives, and false negatives) can determine a model's accuracy and allows for more detailed analysis. It can answer questions like: Which classifications were strong and which were weak? What is the cost of one type of misclassification versus another?
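As a small illustration, the sketch below tallies a confusion matrix for a failure-prediction model and weighs the two kinds of error differently. The labels and the per-error costs are made up for the example: a false positive is assumed to cost an unnecessary inspection, while a false negative is assumed to cost a missed outage.

```python
def confusion_counts(actual, predicted):
    """Tally outcomes for binary labels (1 = failure, 0 = no failure)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1   # failure correctly predicted
        elif p:
            counts["fp"] += 1   # false alarm
        elif a:
            counts["fn"] += 1   # missed failure
        else:
            counts["tn"] += 1   # correctly predicted healthy
    return counts

def summarize(counts, cost_fp, cost_fn):
    """Accuracy, precision, recall, and total cost of misclassifications."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "misclassification_cost": fp * cost_fp + fn * cost_fn,
    }

actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # what really happened
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # what the model said

counts = confusion_counts(actual, predicted)
print(counts)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 6}
print(summarize(counts, cost_fp=500, cost_fn=20_000))
```

Here the model is 80% accurate, but the single missed failure accounts for almost all of the cost, which is the kind of insight accuracy alone hides.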

Causal analysis, another predictive analysis technique, is probably one of the most valuable facets for IT. When an event occurs, analyzing the available data to identify the potential causes and their associated probabilities means being able to take steps to reduce the chances of a recurrence.

Data-driven IT Leadership

When you are proactive, and not trapped in firefighting mode, you can have the right people in the right places and not waste resources. Being proactive leads to a calmer, more organized, and less stressed IT environment.

To reach a point of proactive IT, your business will need real-time access to data without latency. Analytics and predictive models run on data as it comes in. Having this insight immediately lets IT act in the moment, avoiding or remedying issues before users or customers tell you they exist. Being proactive shifts IT from out of control to in control. Predictive analytics is the key to that shift and to ongoing success.