Operations in the New Age of IT

By Omer Trajman

Aug 25, 2016

We all know the view from the trenches when there’s an operations or security issue. Everything’s on fire, sysadmins are scrambling, information is missing, accusations are flying and no one has a cohesive view of what actually happened. What fewer people know is the view from the C-suite.

As a CIO or CISO who has risen through the ranks, your inclination is to roll up your sleeves and dive in. The only problem is you have no idea where to dive in. You have dozens of monitoring tools, each with a different sample of a different part of the infrastructure. It’s like looking for a needle in a haystack when there are fields that haven’t been hayed yet. Meanwhile your peers - the CMO, CRO, CFO - and your boss, the CEO, are largely in the dark. Unless you already built a dashboard pinpointing the exact problem you’re having, the charts are all meaningless.

IT is Changing (for better and for worse)

IT has always been a firefight but things seem to be getting worse. This isn’t because gray hairs are getting crankier or the coffee is getting weaker. Modern technology infrastructure is radically different from what it was just a few years ago, when applications were built and managed in silos. In the past, IT operators only had to manage self-contained, isolated systems. For example, a desktop-based ERP application ran on a dedicated database with attached storage, dedicated networking gear, a custom-installed and tuned application server and custom desktop applications. An admin team could specialize in this system and know every nook and cranny. Vendors knew exactly how the system was configured and other companies had similar static configurations. Life wasn’t easy but it was predictable.

Today’s infrastructure looks radically different. Not only is the network shared and increasingly dynamic (hello Software Defined Networking), now the application is running on shared compute infrastructure and the database and storage systems are shared. Over the past decade, advances in virtualization and cloud computing have finally given businesses the flexibility they need to respond to dynamic market conditions. Even though processes and policies still need to evolve, technology can now accommodate near instant access to additional capacity for new applications and to handle more users. The challenge for IT is that the server lifecycle is much more dynamic. Instead of setting up a server that would live for three to five years, virtual machines may only run for a few months and new advances in container technology mean that a container instance may just run for a few minutes. As a result the number of systems that IT needs to manage and monitor has grown from a few hundred or thousand to a few hundred thousand. On top of the unprecedented scale of compute, the mapping of application instances is much more dynamic. When an issue occurs, it’s not clear which application container was running on what hardware, making it even harder to diagnose the root cause.

If IT just had to handle shared network and elastic compute, this might be a manageable task. The most recent generation of log analytics products has given IT a valuable tool to gather data from a wide variety of systems and search for problems in one place. But new shared data management, DevOps methodologies and the proliferation of micro-services are taxing even the most advanced log analysis tools. For example, the advances in data management and storage - including technologies such as Hadoop, Cassandra, S3 - give businesses the ability to handle the increased scale of data. Shared data management is more efficient than collections of database silos, and it runs on systems that are more cost effective and more flexible than traditional data warehouse systems. The challenge is that these next generation systems are more complex and, because they are shared, they are less predictable than dedicated databases.

Impact on business

The CIO and CISO’s peers have a very different view from that of IT. As much as technology has entered our everyday lives, with smartphones permanently attached, grade schoolers using tablets and laptops the mainstay of high school and college students, most executives are largely ignorant of what it takes to make all of these systems run. Non-technology executives, from the CRO and CMO to the CFO and CEO, are largely focused on the impact that new IT technologies enable. They want to know how to directly connect with customers, how to better understand what customers want and how they are using their services, and they want to do it as cost effectively as possible. For IT this boils down to more systems, with more data, and less predictability. Put another way: omnichannel engagement, instrument everything, be elastic - i.e., everything that’s really hard to run in production.

Companies are figuring this out, IT challenges be damned. Whether by smart engineering or brute force, this vision of one stop shopping backed by elastic, scalable, efficient systems exists. B2C has led the charge: whether in search and advertising, social networking, retail or even banking, consumers have unprecedented access to self-service and businesses can track and tailor experiences to those consumers while flexing their infrastructure into the cloud and back as needed. Amazon famously doesn’t maintain systems for the peak capacity needed during holiday shopping; it bursts into its own cloud. Walmart, a company founded on cutting costs and passing the savings to consumers, does the same. Not embracing the executive vision for more nimble IT infrastructure that is more capable means getting left in the dust (or replaced).

It’s when these systems fail, or services degrade - an expected behavior of cloud architecture - that non-IT executives are up in arms. A shopping cart and catalog that doesn’t scale may kill a retailer in the long run but a system that’s down on Cyber Monday will see all the executives looking for new jobs come January. Access to online check deposit and e-bill payment makes for a better customer experience, but it’s not as important as safeguarding sensitive data. This push for more access coupled with tighter control creates a catch-22 for IT. It’s particularly difficult when IT is saddled with managing legacy systems that can’t be shut down, staffing experts in domains that are being aged out, and implementing new technologies all while reducing spend. The boardroom is an unfriendly place for IT executives when they’re behind on delivering new capabilities and it’s a firing squad when those new capabilities break.

Coping with IT Change

There are many ways to tackle this new world order of IT technologies. Web-scale companies will hire senior developers to build automation systems and carry pagers (this is where DevOps came from). Companies with dozens of monitoring systems will put collections of charts on large monitors so they can see multiple streams simultaneously. Many companies are shoving more data into search-based log analytics systems.

These solutions are all stopgaps that keep the lights on but at the cost of future planning and IT sanity. There are only so many PhD computer scientists pursuing jobs in IT operations and they command salaries well beyond what most companies can afford. Even the companies that created DevOps are realizing that staffing IT with engineers is not a panacea. There is rarely the time to rigorously develop monitoring and operations software with the quality available from vendors. As a result, groups of engineers create a continuous stream of new tools while the old ones never get updated or refactored. Instead of dozens of third-party monitoring tool silos that the vendors support, these DevOps companies have dozens of homegrown monitoring tools that don’t get any support.

Managing multiple third-party tools is more cost effective but presents its own challenges. Screens can only get so big and dashboards can only get so small to fit all of the metrics that IT needs when all layers of the stack are flexible and shared across all applications. It quickly becomes impossible to track every combination of interacting systems. IT staff don’t know where to pay attention and it is impossible to tell the relative importance of seemingly disconnected issues. When they need to get to the root cause of a particular problem, sysadmins are flooded with data and have no way to sort through it.

Log analytics tools promise to solve the data deluge problem by providing a single interface to search through log files from any system. This helps IT by giving them a Google-like interface rather than squinting at hundreds of dashboards. Brute force search works well when the search results are in the hundreds or even thousands. And advances in log analytics are helping sysadmins sift through log files even more quickly. At their core, though, these are still brute force search systems and they don’t handle full stack monitoring from hundreds of thousands of ephemeral systems. The basic search, browse and graph workflow ends up creating more work since search results can lead admins into a rat’s nest.

Taking Control

With an aging technology stack that needs to be maintained and business pushing IT into new frontiers, it’s easy to see how everyone from the operations teams to IT executives, all of whom have spent years honing their skills and crafting their careers, feel like they’re losing control. Every new project seems to bring with it a new monitoring system that purports to replace one of the existing monitoring systems except that it ends up creating yet another silo. Regaining control doesn’t actually require a lot of new tooling and it probably doesn’t involve sunsetting existing tools, at least not right away. In order for IT to manage the spread of new technologies while maintaining their existing infrastructure, the teams need better visibility. In order to regain control over their infrastructure, operations teams need an integrated view of their full stack, from applications to servers and across data centers.

Every IT system, from the oldest mainframes to the newest elastic container micro-services shared data center operating system, has specialized monitoring tools. Whether first party or third party, vendors have created specialized dashboards for every corner and for each group within IT. There are tools for network monitoring, application performance monitoring, enterprise message bus monitoring, virtualization monitoring, database monitoring, storage monitoring, pick a system monitoring. The fundamental control problem that IT has is no longer understanding the performance of any individual system. IT is struggling because they can’t clearly see how systems relate and don’t have a comprehensive history of how infrastructure interoperates.

The concept behind cross-functional visibility isn’t new to businesses. The enterprise data warehouse is one example of how organizations have addressed the need to relate the performance of different functions. Integrated supply chains and next generation customer relationship management systems have all expanded in scope since they were first introduced because business needed to understand the impact on different parts of the organization. These systems all have both operational and analytic uses. IT needs its own centralized view. The first step towards taking control is gaining visibility, which requires a single source of all IT truth. Luckily for IT, all of the building blocks are already in place. Unlike what other parts of the business had to endure in order to get access to data, IT is drowning in it. Every system that has monitoring, which is all of them, also has a data spigot. IT doesn’t have a problem getting infrastructure data. It has a problem managing it and making sense of it.

Next Steps

It has been said that the first step on the road to recovery is admitting you have a problem. Though self-deprecating humor is a staple of coping with a job where firefighting is a daily event, it’s helpful to think about how we’re used to running infrastructure as an addiction. IT gets good at brute force searching for the root cause of problems and develops a sixth sense for how issues cascade. But we know that this is stretching us thin and in the long run is not sustainable. Crawling out of the pit requires accepting that we need a new way to operate infrastructure, which means a new way to correlate the feeds from all of the systems spewing out monitoring data.

There’s a three-step process to get to sanity and it’s a road that other businesses and industries have been down. Step one is to collect everything. Not just as much as we currently collect and not a sample or aggregate of each piece of the stack. Collect everything. Second is to apply analytics that can highlight differences in behavior. Each system has a pattern and by surfacing systems that change patterns together, even the most junior admin can quickly narrow down the source of any issue. Intelligent analysis must be built in. The third step is using a visual interface that is purpose-built for operators. This is different than a general purpose business intelligence tool or even a log analytics tool. Properly visualized, a single sysadmin can comb through billions of data points in seconds and effortlessly navigate to the patterns, surfaced by the underlying analytics, that stand out. That’s not random pie charts and incomprehensible graphs, it’s purposeful visualization.
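To make "systems that change patterns together" a little more concrete, here is a minimal Python sketch. It assumes some earlier analysis has already produced a list of (system, timestamp) pairs marking when each system's behavior shifted; the function name, the 60-second bucket and the host names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def co_changing_systems(change_events, bucket_seconds=60):
    """Group systems whose pattern changes fall in the same time bucket.

    change_events: iterable of (system_name, unix_timestamp) pairs.
    Returns {bucket_start: sorted list of systems} for buckets that
    contain more than one distinct system.
    """
    buckets = defaultdict(set)
    for system, ts in change_events:
        # Round the timestamp down to the start of its bucket.
        bucket_start = int(ts) - int(ts) % bucket_seconds
        buckets[bucket_start].add(system)
    return {
        start: sorted(names)
        for start, names in sorted(buckets.items())
        if len(names) > 1
    }


# Hypothetical change events: three systems shift within the same minute.
events = [("db-1", 1000), ("app-7", 1010), ("lb-2", 1500), ("cache-3", 1015)]
grouped = co_changing_systems(events)
```

Fixed buckets are the simplest possible grouping and will miss two changes that straddle a bucket boundary; a real system would likely use overlapping or sliding windows.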

Step one: Collect Everything

Collecting everything is not trivial but it’s not impossible either. There are many solutions available for data collection at scale. The most critical part of data collection is adopting an open system. There is a lot of overlap between the types of systems used in IT but each company has specialized applications and different configurations. An open system allows you to easily customize how data is collected, how long it’s retained and who has access to what data. An open system also lets you keep the critical system specific monitoring tools but manage the data feeds that go into those systems.

In addition to an open system, data collection has to be reliable. The only thing worse than not having a global monitoring system is having a system you can’t trust. Reliable data collection guarantees at-least-once delivery of every event that happens and informs you of monitoring system health, so you know that it’s working and that you aren’t missing information. An open and reliable data collection system should be the center of how you operate, like a nervous system for the data center.
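As a rough illustration of what at-least-once delivery with a health signal can look like, here is a small Python sketch. It is not any particular product's design: the class names, the `transport` callable (which returns True when an event is acknowledged) and the integer event IDs are assumptions made for the example. The sender keeps every event until it is acknowledged, so a lost acknowledgment causes a resend, and the receiver discards duplicates by ID.

```python
from collections import deque


class AtLeastOnceSender:
    """Keeps each event buffered until the transport acknowledges it."""

    def __init__(self, transport):
        self.transport = transport  # hypothetical callable(event) -> bool (True = acked)
        self.pending = deque()
        self.next_id = 0

    def emit(self, payload):
        """Assign an ID and buffer the event; nothing is dropped on the floor."""
        self.pending.append({"id": self.next_id, "payload": payload})
        self.next_id += 1

    def flush(self):
        """Send pending events in order, stopping at the first unacknowledged one."""
        delivered = 0
        while self.pending:
            if not self.transport(self.pending[0]):
                break  # keep the event and retry on the next flush
            self.pending.popleft()
            delivered += 1
        return delivered

    def health(self):
        """Expose collector health so operators know whether data is backing up."""
        return {"pending": len(self.pending), "emitted": self.next_id}


class DedupingReceiver:
    """Accepts events, ignoring resends of IDs it has already stored."""

    def __init__(self):
        self.seen = set()
        self.events = []

    def receive(self, event):
        if event["id"] not in self.seen:
            self.seen.add(event["id"])
            self.events.append(event)
        return True
```

A quick way to exercise it is a transport that delivers every event but loses every other acknowledgment: the sender resends, the receiver ignores the duplicate, and `health()` shows the backlog draining to zero.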

Step two: Intelligent Analysis

Every large company has analysts. There are business analysts, security analysts, intelligence analysts, and most likely operations analysts. Analysts generally have the luxury of time. They are fueled by curiosity and want to investigate the myriad of possible relationships looking for some predictive indication of how the business should function. Operators don’t have time for analysis. Even when out of firefighting mode, IT needs quick answers to questions that haven’t been asked yet. Time to formulate a complex query is what vacations are for. So any system we rely on must have intelligent analysis built in.

This doesn’t mean the artificial intelligence that will someday take over our jobs and then the world. The intelligence we’re looking for is really just mass calculation. It’s a lot of work for a human to look at every feed and describe the pattern and look for changes. But a machine can do that all day long. When you’re collecting everything in an open, reliable system, you can apply intelligent algorithms that build a pattern for every feed and then continuously compare against it, flagging when the pattern changes and then repeating that process. The same system can also compute every variation on every metric you could ever need. Moving averages; 90th, 95th and 99th percentiles; latency for each service, host, application; usage in 1 min, 5 min, 10 min, 1 hour, 1 day, 1 week buckets; these should come out of the box. That’s intelligent analytics.
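The "mass calculation" described here can be sketched in a few lines of Python. What follows is one simple way to do it, not a prescribed algorithm: a rolling window stands in for the pattern of a feed, and a new value is flagged when it sits more than a chosen number of standard deviations from the window's mean. The class name, window size, warm-up length and threshold are all assumptions for illustration, and the percentile helper uses the nearest-rank method.

```python
import math
from collections import deque
from statistics import mean, pstdev


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


class FeedBaseline:
    """Rolling baseline for one feed; flags values that break the pattern."""

    def __init__(self, window=60, threshold=3.0, warmup=5):
        self.window = deque(maxlen=window)  # most recent observations
        self.threshold = threshold          # allowed deviation, in standard deviations
        self.warmup = warmup                # observations needed before flagging

    def observe(self, value):
        """Return True if value deviates from the baseline, then learn from it."""
        changed = False
        if len(self.window) >= self.warmup:
            mu = mean(self.window)
            sigma = max(pstdev(self.window), 1e-9)  # avoid a zero-width band
            changed = abs(value - mu) > self.threshold * sigma
        self.window.append(value)
        return changed

    def moving_average(self):
        return mean(self.window)


# Hypothetical latency feed in milliseconds; the last sample is a spike.
latencies = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
baseline = FeedBaseline(window=20)
flags = [baseline.observe(v) for v in latencies]
p95 = percentile(latencies, 95)
```

Running one `FeedBaseline` per feed is cheap enough to apply to every metric collected, which is the point: the machine watches everything so the operator only looks at what changed.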

Step three: Purposeful Visualization

Every stream of data can be visualized hundreds of different ways, but not every visualization explains what’s going on in the data. There are purposeful ways of looking at data depending on the job at hand. For running modern IT infrastructure, the right visualization should highlight changes, be intuitive to navigate and otherwise get out of the way. It doesn’t require training manuals, complex query languages or an explanation on how to use it. Purposeful visualization is obvious when you first look at it and makes it obvious what the next steps are. If you’ve ever stared at a monitor on the wall with no idea whether it’s describing something good or bad, you’ve seen the opposite of purposeful visualization. In contrast, if you’ve ever looked at a stock chart, you know instantly whether you should be buying or selling. The visualizations used in the data center must be as clear as those used in finance.

Get Started

The firefighting isn’t going to stop and business demands aren’t going to slow down. A new age of IT is dawning and the only solution is to start looking at your infrastructure as a whole, by collecting everything, applying intelligent analysis and using purposeful visualization. Start with the data that’s dropping on the floor, select an open and reliable system and see how these three steps will immediately benefit your team. Follow that by incorporating data from the systems you have and you will be on your way to regaining control.