When IT Operations Became a (Bigger) Big Data Problem

The world of IT operations has always had a big data problem. Instrumentation of end users, servers, application components, logs, clickstreams, generated events and incidents, executed notifications and runbooks, CMDBs, Gantt charts: you name it, people in IT operations have had to cope with mountains of data. And yet the scope of the problem has been enlarged once again, thanks to industry-wide trends such as bring-your-own-device, the Internet of Things, microservices, cloud-native applications, and social/mobile interaction.

Today’s Application Development Paradigm: Service-Oriented, in the Cloud

Traditional applications, whether in the mainframe or distributed computing era, had a relatively consistent and predictable architecture and deployment paradigm.  From the top, they had presentation layers and business-logic tiers, perhaps an integration bus in the mid-tier, databases (relational and otherwise) in the data tier, and generally similar infrastructure (O/S, server, storage, networking) underlying it all.  On the deployment side, these applications were mostly deployed in in-house data centers or using dedicated hosting (anybody remember the “colocation” business?).   And while efforts in infrastructure-based virtualization have resulted in more copies of this infrastructure (“spinning up VMs” instead of physical installations), they haven’t altered its basic architecture. 

In addition, the rate of change in these applications was relatively slow.  Many business systems of record (ERP, CRM, HCM, etc.) might push out updates once a month, once a quarter or even less often.  Customer-facing web applications might be updated more often (some of us in the industry are old enough to remember the “rapid application development” craze of the late 1990s), but even those more frequent updates were largely in the business logic, not in the architecture itself.  You might add cool new features, but you deployed them into the same app server/DB combo that you had before.  The same model holds true in newer IaaS-based cloud computing paradigms.  IaaS containers are handy to spin up, but tend to follow the virtualization model mentioned earlier, in which we are essentially using the same application architectures, just hosted in a different place. 

The one earlier meaningful departure from this model, which actually foreshadows today’s IT Operations big data problem, was Service-Oriented Architecture (SOA).  A true SOA environment consists of a massive number of small, independent, reusable components that are stitched together in various combinations to form end-to-end applications and business services.  So a given “login” SOA service might be re-used by dozens of unrelated applications, creating huge efficiencies in terms of developing new features (no recoding the “login” feature), but also a complex web of dependencies and resulting governance issues.  Today’s cloud-computing paradigm has offloaded the infrastructure portion of that challenge since there are now so many cloud providers offering platform-level services such as PaaS database and middleware containers, cloud-based microservices, data-as-a-service feeds and similar packaged, reusable components.

However, the governance challenge remains.  Line-of-business application sponsors as well as application developers are understandably excited about the potential to roll out a new business service every week consisting of microservices from this cloud provider, data feeds from that cloud provider, application and business logic hosted over at this other cloud provider, et cetera, et cetera.  But IT operations folks are terrified, looking at a potentially ungovernable mess that is going to be held to some kind of a service level … which brings us back to our big data problem.

Data, Data Everywhere

The world of IT operations has always had a big data problem.  There is the instrumentation of end users, servers, application components, logs, clickstreams, generated events and incidents, executed notifications and runbooks, CMDBs, Gantt charts. Spend time walking around any network operations center (NOC) or application support desk, and you’ll see tons of data and overwhelmed people. 

According to industry analytics, most organizations have at least six IT operations systems of record today (user monitoring, server monitoring, CMDBs, diagnostics, maybe some log monitoring, cloud provider consoles, analytics, etc.).  The saving grace of these redundant, mostly unconnected silos of operational data was that the things they were being used to monitor weren’t changing that fast, so you had a reasonable shot at knowing where to find the data when it was time to troubleshoot (though even that was not always true).

And yet the scope of the problem has been enlarged once again and IT operations now has a bigger big data problem.  Not only are they drowning in the data they are collecting today, they are drowning because of the data they are not collecting in the cloud application development paradigm.  For example, when the developers of the customer-facing service add new cloud-hosted microservices every week from a new provider that IT operations doesn’t even know about, how is IT operations supposed to provide any kind of an SLA?  For that matter, how is development supposed to provide it?  How do we even know what the SLA is anymore?

Visibility and Information

The answer has two parts. First, IT operations needs more visibility into, and more data about, these new components.  Second, IT operations needs to be able to process this mountain of new data into useful information.

For the first part: no problem, say the microservice providers. We not only have our own consoles (because everyone needs another console!), but we also have lots of logs.  Just look in them, and all of your questions will be answered.  But logs, as it turns out, are complicated animals.  They are dispersed, they can be structured or unstructured, they are voluminous, they are repetitive, and for them to be useful you need to be able to quickly look at them across time and in the context of the problem you are troubleshooting.  Log management and analysis is a big data problem, and it’s a big, big data problem since some applications can routinely generate terabytes of logs a day.
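To make the point about structured versus unstructured logs concrete, here is a minimal Python sketch. It assumes two hypothetical log formats (a JSON line with ts, level, service and message fields, and a space-separated plain-text line) and hypothetical helper names, parse_line and around_incident; none of these come from a particular product. It normalizes both formats into one record shape, then pulls out the entries for one service within a time window around an incident.

```python
import json
import re
from datetime import datetime, timedelta

# Hypothetical plain-text format: "2024-05-01T12:01:00 WARN checkout retrying payment"
PLAIN = re.compile(r"^(\S+)\s+(\w+)\s+(\S+)\s+(.*)$")


def parse_line(line):
    """Normalize one log line into a dict, or return None if it cannot be parsed."""
    line = line.strip()
    if not line:
        return None
    try:
        if line.startswith("{"):
            # Structured (JSON) entry; the field names are an assumption.
            rec = json.loads(line)
            ts, level = rec["ts"], rec["level"]
            service, message = rec["service"], rec["message"]
        else:
            # Unstructured entry; fall back to a positional pattern.
            match = PLAIN.match(line)
            if not match:
                return None
            ts, level, service, message = match.groups()
        return {
            "ts": datetime.fromisoformat(ts),
            "level": level,
            "service": service,
            "message": message,
        }
    except (ValueError, KeyError, TypeError):
        return None


def around_incident(lines, incident_ts, service, window=timedelta(minutes=5)):
    """Return parsed entries for one service within `window` of the incident, oldest first."""
    records = (parse_line(line) for line in lines)
    hits = [
        r for r in records
        if r and r["service"] == service and abs(r["ts"] - incident_ts) <= window
    ]
    return sorted(hits, key=lambda r: r["ts"])
```

A real pipeline would have to do this across many sources and far larger volumes, which is exactly why the problem outgrows a script like this one.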

But how do we know which logs we need in the first place?  Modern application performance management (APM) solutions learn about application architectures by watching the application behavior rather than by being programmed by an operator.  That means that as service topologies are altered or as new components are added to applications, the APM solution sees the change, starts to monitor the new components, and can help identify relevant logs.  Maintaining visibility across a rapidly changing topology becomes a reasonable goal, even if those changes aren’t previewed to operations personnel in advance.
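The idea of learning a topology from observed behavior can be sketched in a few lines of Python. This is only an illustration, not how any particular APM product works: it assumes call observations arrive as hypothetical (caller, callee) pairs, and the function names build_topology and new_components are invented for the example.

```python
from collections import defaultdict


def build_topology(observed_calls):
    """Build a caller -> set-of-callees map from observed (caller, callee) pairs."""
    topology = defaultdict(set)
    for caller, callee in observed_calls:
        topology[caller].add(callee)
    return topology


def new_components(known, topology):
    """Return components seen in traffic that are not in the known inventory."""
    seen = set(topology)
    for callees in topology.values():
        seen |= callees
    return seen - set(known)


# Hypothetical observations; "payments-x" stands in for a newly added microservice.
calls = [("web", "login"), ("web", "checkout"), ("checkout", "payments-x")]
topology = build_topology(calls)
unknown = new_components({"web", "login", "checkout"}, topology)
```

Here unknown contains only "payments-x", the component operations had not been told about, which is the cue to start monitoring it and looking for its logs.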

Once we are collecting the right data, we need to turn it into useful information.  Cloud-based solutions providers can unify data from APM and Log Analytics sources into a single big data lake that not only alleviates the burden of storing, correlating and indexing this data, but also provides real processing horsepower and machine learning against it.  It’s important not to trivialize this effort. Customers can in theory do this on-premises also, given unlimited resources, but most customers don’t operate with unlimited resources and data scientists for this kind of problem.

A New Vision of Success for IT Operations: Delivering the Functional Promise of SOA with the Efficiency of DevOps

IT operations has always had a tough role in application delivery organizations, often being forced into a “bottleneck” role that gates delivery until instrumentation is in place, or a reactive troubleshooting role when people work around their delivery gates and deploy to production anyway (exemplified by the rise of “shadow IT”).  Indeed, even today, much of the literature surrounding IT operations’ value to an organization focuses on an evolving “service broker” role, in which IT operations remains a controlling gate in the approval process between development and production deployment.

However, when IT operations is instead armed with a cloud-hosted big data lake that unifies all the various operational data silos, interesting possibilities start to emerge.  Since the operations organization will be able to shift effort away from low-value brute-force data collection and process gates, maybe they can work collaboratively with upstream developers on higher-value activities:

  • Instead of simply brokering a palette of services picked after-the-fact in response to developer demand, maybe they can apply predictive modeling to guide future development decisions in the first place
  • Instead of reactive war-room fire drills that result in coding band-aids, maybe they can apply wholesale analytics to identify coding best practices that correlate to the highest level of end-user adoption
  • Instead of end-of-year planning cycles informed by incomplete information, maybe they can help make real-time investment decisions about optimizing workloads across elastic infrastructure

In fact, if we do even some of these things, maybe, just maybe, the line can become so blurred between development and operations that we achieve the functional promise of SOA with the operational efficiency envisioned by the DevOps movement.  Now there’s a (bigger) big idea worth investing in.