To Avoid Costly Outages in 2016, Build IT Maturity Beyond Monitoring

Feb 24, 2016

By Per Bauer

With the holiday shopping season now behind us, it’s interesting to note that 2015 marked the 10th anniversary of Cyber Monday, a term coined in 2005 when retailers noticed a surge in online shopping on the first Monday after Thanksgiving. It’s hard to imagine now, but in those days most of us had faster internet connections at the office than at home, and we weren’t carrying 4G supercomputers around in our pockets. Cyber Monday is still a big deal, with more than 900 retailers offering online deals.

That makes the holiday season game time for data centers. Website outages during the holiday season are more costly than usual. More shoppers turned away per hour of downtime means more revenue loss, but also more disgruntled customers who might never forget how they were inconvenienced by the offending retailer.

As enterprises become ever more reliant on data centers, for both internal and customer-facing services, the costs of downtime are increasing apace. The average cost of data center downtime rose from $5,600 per minute in 2010 to $7,900 per minute in 2013, with a worldwide annual impact of $1.7 trillion. The average cost per incident is higher during the holiday season, particularly for e-commerce retailers. Depending on the scope, timing, nature, and duration of the outage, many businesses will not be able to recover, especially those without proper disaster recovery and incident response plans.
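To put those per-minute figures in perspective, here is a minimal Python sketch of the arithmetic. The per-minute averages are the ones quoted above; the 90-minute outage is a hypothetical duration chosen purely for illustration, and a real incident could cost far more or less than the average.

```python
# Average cost of data center downtime per minute, as quoted in the text.
COST_PER_MINUTE_2010 = 5_600
COST_PER_MINUTE_2013 = 7_900


def percent_increase(old: float, new: float) -> float:
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100


def outage_cost(minutes: float, cost_per_minute: float) -> float:
    """Estimate the cost of an outage as duration times average rate."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    growth = percent_increase(COST_PER_MINUTE_2010, COST_PER_MINUTE_2013)
    print(f"Increase from 2010 to 2013: {growth:.1f}%")
    # Hypothetical 90-minute outage priced at the 2013 average rate.
    print(f"90-minute outage: ${outage_cost(90, COST_PER_MINUTE_2013):,.0f}")
```

Even at the average rate, an outage of an hour or two runs well into six figures, before counting the customers who never come back.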

With the 2015 holiday season now safely in the rearview mirror, it’s time to start thinking about how to improve availability and resilience in 2016. Look ahead to 2016 with a vision for bringing your IT service operations to a new level of maturity.

Resilience is a Survival Strategy

It’s no wonder that businesses are focused on building higher levels of resilience into their IT operations. Cyber crime, weather, natural disasters, geopolitical instability, power outages, software failure, and human error: the causes of data center outages are myriad and often unpredictable. And the business impact isn’t limited to lost sales revenue. A recent survey of data managers found that a majority (54%) of enterprises have suffered productivity losses, 33% experienced loss of customer confidence and loyalty, and nearly a third cited delays in development. In this same survey, a full one-third of respondents reported low levels of satisfaction with their current data availability strategies. Enterprises need more uptime, but IT departments are struggling to meet SLAs as it is. As it is asked to do more (and more quickly) with fewer resources, IT must find a way to develop its capabilities beyond the chaotic and reactive stages of maturity. It’s concerning to note that 89% of mid-sized companies (arguably those most vulnerable to outages) remain in these early stages of IT maturity. More than half of the IT organizations surveyed were classified as “chaotic,” the lowest level of maturity.

A core attribute of IT maturity is the ability to get out in front of problems before they appear. Most enterprises have basic performance management in place, but smart analytics can take your system monitoring to a whole new level.

Proper capacity planning is crucial to an IT department’s ability to keep running smoothly. Not only can it help sustain your IT infrastructure when systemic problems crop up, it also builds pathways for better alignment between business goals and IT priorities.

For those critical, reactive moments, it’s imperative that you have a sound Data Center Infrastructure Management (DCIM) strategy that allows you to act quickly to prevent system downtime and mitigate the negative impact on your reputation and revenue stream. But how do you change your approach from reactive to proactive, reducing the chance of costly errors and crashes before they crop up? The solution requires a fundamental change in mindset and the right capacity planning tools.

The Building Blocks of Monitoring

If you want to improve your monitoring practices, it’s best to start with the basics. IT infrastructures are complex systems that require performance data measurements at the granular level. You should be able to rely heavily on your performance monitoring platform. Not only should it be data agnostic and equipped to collect information regardless of the source, but it must also allow high-frequency polling and retain data long enough to meet your requirements.
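Polling frequency and retention pull against each other, and a quick sizing estimate makes the trade-off concrete. The following is a rough sketch only: the metric count, 10-second polling interval, 400-day retention window, and 16 bytes per stored sample are all hypothetical assumptions, and real platforms typically compress or downsample older data.

```python
SECONDS_PER_DAY = 86_400


def samples_retained(metrics: int, poll_interval_s: int, retention_days: int) -> int:
    """Count the raw samples held when every metric is polled at a fixed interval."""
    samples_per_metric = retention_days * SECONDS_PER_DAY // poll_interval_s
    return metrics * samples_per_metric


def storage_bytes(samples: int, bytes_per_sample: int = 16) -> int:
    """Estimate uncompressed storage; bytes_per_sample is an assumed figure."""
    return samples * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical environment: 10,000 metrics, 10-second polling, and
    # 400 days of retention so that last year's holiday peak stays visible.
    total = samples_retained(metrics=10_000, poll_interval_s=10, retention_days=400)
    print(f"Samples retained: {total:,}")
    print(f"Approximate raw storage: {storage_bytes(total) / 1e9:,.0f} GB")
```

Halving the polling interval doubles the estimate, so it pays to work out the numbers before committing to a retention policy.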

In this age of large-scale data processing, your monitoring platform should be able to manage the breadth of your data with room to spare. The last thing you want is to be forced to choose between monitoring one area over another. If, for example, your systems are skewed heavily to analyzing potential threat vectors, but are not fully monitoring throughput, your end users will notice the suboptimal performance before you do. Your platform should be able to keep up with all of your data without breaking a sweat.

It’s important to figure out what normal performance looks like in your environment. Your management system should do this automatically as it collects data. This will provide a baseline for comparison as you move forward, which will help you react quickly to future changes in capacity requirements. Baseline performance indicators can also be used to track anomalies, important for detecting cyber security incidents, software glitches, and unexpected interdependencies. With well-rounded historical metrics, you can better plan for seasonal surges in traffic as well as one-time application upgrades or service launches.
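As a rough illustration of the baseline idea, the sketch below learns a mean and standard deviation from historical readings and flags values that stray too far from them. It is one simple technique (a z-score test) under stated assumptions, not a description of how any particular monitoring product works; the sample readings and the threshold of 3 standard deviations are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(history: list[float]) -> tuple[float, float]:
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(history), stdev(history)


def is_anomaly(value: float, base_mean: float, base_std: float,
               threshold: float = 3.0) -> bool:
    """Flag a reading more than `threshold` standard deviations from the mean."""
    if base_std == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return value != base_mean
    return abs(value - base_mean) / base_std > threshold


if __name__ == "__main__":
    # Hypothetical CPU utilization readings (percent) from a normal period.
    history = [50, 52, 48, 51, 49, 50, 53, 47]
    base_mean, base_std = build_baseline(history)
    for reading in (51, 55, 90):
        flag = "ANOMALY" if is_anomaly(reading, base_mean, base_std) else "normal"
        print(f"{reading}% -> {flag}")
```

A production system would usually keep separate baselines by hour of day and day of week, since what is normal at noon on Cyber Monday is not normal at 3 a.m. in February.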

Proactive Transitions

Once you master the fundamentals, opportunities to stay ahead of potential problems will begin to appear. Powerful capacity planning tools offer automated and insightful reports that can predict errors, help you manage your resources more effectively, and reduce the overall cost of your operation.

However, any proactive capacity planner will tell you that volume isn’t the only important consideration. In addition to determining your IT system’s normal activity levels, it’s important that you understand where the activity is coming from. Numbers are key, but without proper context, they’re largely meaningless. The ability to monitor data, extract meaningful information, and then apply it in a business-minded way is the essence of smart capacity planning.

Scenario planning and testing are central to building resilience and maturity into enterprise IT. This includes not only incident response planning, but also forecasting exercises that use predictive analytics to explore the potential results of infrastructure changes and upgrades. IT can be much more responsive to business requirements once it has integrated the ability to perform data-driven cost and performance testing before going live with a new product or service.
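A forecasting exercise can start very simply. The sketch below fits an ordinary least-squares straight line to past utilization and estimates how many more periods remain before the trend reaches a capacity limit. A straight line is a deliberately simple stand-in for the predictive analytics mentioned above, and the weekly readings and 80% limit are hypothetical values.

```python
from typing import Optional


def fit_line(ys: list[float]) -> tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return slope, y_mean - slope * x_mean


def periods_until(ys: list[float], capacity: float) -> Optional[float]:
    """Periods after the last observation until the fitted line hits capacity.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the limit.
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    x_hit = (capacity - intercept) / slope
    return max(0.0, x_hit - (len(ys) - 1))


if __name__ == "__main__":
    # Hypothetical weekly peak storage utilization, in percent.
    weekly_peaks = [40, 44, 48, 52, 56]
    remaining = periods_until(weekly_peaks, capacity=80)
    print(f"Weeks until the 80% limit on the current trend: {remaining:.1f}")
```

Rerunning the same calculation with a planned launch or upgrade added to the data is a small-scale version of the what-if testing described here.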

It’s also important to consider priorities. It’s not realistic to expect that every service or data store can be available 100% of the time. IT and business have to work together to set parameters for acceptable risk and downtime on less critical systems, and focus higher level planning, testing, and monitoring on the most valuable, mission-critical systems.
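One way to make those priorities concrete is to translate each availability target into a yearly downtime budget. The sketch below does that arithmetic; the service tiers and their targets are hypothetical examples, not recommendations.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Hypothetical tiers agreed between IT and the business.
    tiers = {
        "checkout (mission-critical)": 99.99,
        "product catalog": 99.9,
        "internal reporting": 99.0,
    }
    for service, target in tiers.items():
        budget = downtime_budget_minutes(target)
        print(f"{service}: {target}% -> {budget:,.1f} minutes per year")
```

Each additional "nine" cuts the budget by a factor of ten, which is why it makes sense to reserve the tightest targets for the systems that justify the cost.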

Emerging Challenges and Opportunities

The information technology landscape shifts, erupts, and settles again, keeping data center operators on their toes. The ongoing cycle of innovation, hype, and integration makes it hard to keep up with the day-to-day operations while assessing the Next Big Thing. The current surge in cloud adoption, the looming challenges of IoT, the ever-mutating threat vectors of organized cyber crime—each creates a unique set of challenges and opportunities that require careful planning and monitoring.

Cisco’s Global Cloud Index reports that in the five-year period from 2014 to 2019, global data center traffic will grow threefold. By 2019, 83% of all data center traffic will come from the cloud and four out of five data center workloads will be processed in the cloud. It’s no wonder, then, that we’ve heard a lot more about cloud outages in the news lately. Capacity planning for virtualized systems and cloud-ready data centers involves unique complexities, and data center operators will need powerful tools and automated analytics to keep up with the dynamics of cloud demand. After all, just because space in the cloud seems infinite doesn’t mean it’s free: efficient, cost-effective resource consumption requires careful planning and budget management based on an accurate forecast of workloads and storage needs.
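For planning purposes, it can help to restate a multi-year growth figure as an annual rate. The short sketch below converts the threefold, five-year growth quoted above into a compound annual growth rate, assuming growth is spread evenly across the period; the starting value of 100 traffic units is a hypothetical placeholder.

```python
def annual_growth_rate(total_factor: float, years: int) -> float:
    """Compound annual growth rate implied by a total growth factor."""
    return total_factor ** (1 / years) - 1


def project(start: float, rate: float, years: int) -> list[float]:
    """Project a value forward year by year at a constant compound rate."""
    return [start * (1 + rate) ** year for year in range(years + 1)]


if __name__ == "__main__":
    rate = annual_growth_rate(total_factor=3, years=5)
    print(f"Implied annual growth: {rate:.1%}")
    # Hypothetical starting point of 100 traffic units in 2014.
    for year, value in zip(range(2014, 2020), project(100, rate, 5)):
        print(f"{year}: {value:,.0f}")
```

Your own workloads will not follow the industry average, so treat a figure like this as a sanity check against forecasts built from your own monitoring data.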

A Better Future

Companies should be looking into better capacity management information systems and standardized optimization reports. Automated, advanced analytics and visibility tools can help struggling IT teams improve their data measurement techniques and manage multiple types of resources more effectively. Better reporting leads to better communication with stakeholders across the enterprise, another key component of IT maturity. Having comprehensive, real-time dashboards makes it easy to react to issues quickly before they have time to snowball and get out of hand.

Data center optimization is a moving target, and operators have to stay agile and balanced. IT maturity and advanced capacity planning are key to creating the resilience and readiness necessary to face emerging technology and user requirements. Cultivating the ability to absorb the impact of change through proactive, data-driven planning and modeling is certain to build competitive advantage, especially when closely aligned with business goals.

Look ahead through 2016 with a vision for bringing your IT service operations to a new level of maturity. Advance into proactive planning and intelligent decision-making by applying the power of analytics and automation to your performance monitoring tool set.