In the AI era, we’re constantly talking about how important data is—storing data, disseminating data, and protecting data. As data specialists, we understand that bad data management leads to bad use of AI, which leads, quite frankly, to bad business outcomes.
So, what happens when we run into data management issues, or when an IT disruption threatens the data that has become so valuable to business success? When facing these problems, there’s always a temptation to believe the right technology can be a catch-all that fixes every issue. However, proper IT functions and data management in the age of AI demand something more: a strategy, a framework, and even a mindset rooted in operational resilience.
Understanding Operational Resilience
A recent study of 600 IT leaders and professionals defines operational resilience as the “ability to identify, anticipate, and mitigate risks to help prevent future issues while also accelerating responsiveness to ongoing disruptions when they do occur. It is achieved by understanding the different parts of the business and how they interact across teams, workflows, and tools, while also driving a culture of intentional learning and adaptation.”
Almost every IT leader who responded to the survey—9 in 10—described their IT function as “resilient.” However, there are certain core functions that IT team leaders and database managers know they must support for business success, and when asked about confidence in these functions, the percentages dropped well below 90%. For example, only 38% of IT leaders felt confident in their ability to support the use of AI in their organization. Fewer than half (45%) felt they could support a distributed workforce, and only slightly more than half (52%) expressed confidence in dealing with cyberthreats.
This data speaks to the need for operational resilience. An inability to manage the use of AI, deal with cyberthreats, and support a distributed workforce can lead to productivity issues. More importantly, it can threaten the data and systems necessary for a business to operate, leading to harmful system disruptions. If systems are down, customers may become frustrated, which can lead to brand damage and loss of revenue.
Tools, Teams, and Workflows
As we mentioned before, when IT teams face issues, there’s a temptation to look for a technology-based solution that can make things easier. While the right tooling is important for every IT function, technology in a silo may actually create more issues. In the survey, far more IT leaders pointed to workflows and teams than to tooling as obstacles to operational resilience: 51% said processes were making it difficult for them to respond to IT disruptions quickly, and 36% said their teams—or not having enough people on them—kept them from being as resilient as they’d like. Only 13% cited tooling as their roadblock to operational resilience.
Building the resilience necessary for today’s IT teams sits at the nexus of tools, teams, and workflows. When all three work together, it becomes much easier to prevent disruptions due to user error, cyber incidents, and/or system downtime. Achieving the right connection between these three components begins with analyzing relationships between people and technology.
Paving a Path to Operational Resilience
To properly analyze the relationships between people and technology, teams should first ensure they have a complete understanding of their IT environment. Develop a map that shows how each piece of data, IT asset, and login credential relates to the others within the system; a comprehensive observability tool can help surface these relationships.
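To make the idea concrete, here is a minimal sketch of such a map as a simple dependency graph in Python. The asset and credential names are hypothetical placeholders; in practice, an observability platform or configuration management database would supply the inventory. The same traversal shows the "blast radius" of any single asset or credential, which is exactly the kind of relationship a map like this should expose.

```python
# A minimal sketch of an IT-asset relationship map, standard library only.
# Asset and credential names below are hypothetical; a real inventory would
# come from observability or CMDB tooling.
from collections import deque

# "used_by" edges: if the key fails (or is compromised), the listed assets
# are directly affected.
used_by = {
    "orders-db":      ["orders-api", "reporting-etl"],
    "orders-api":     ["checkout-frontend"],
    "svc-login-cred": ["orders-db", "reporting-etl"],
    "reporting-etl":  ["bi-dashboard"],
}

def blast_radius(asset: str) -> set[str]:
    """Return every asset transitively affected if `asset` fails."""
    seen, queue = set(), deque([asset])
    while queue:
        for nxt in used_by.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Which systems are at risk if this shared credential is compromised?
print(blast_radius("svc-login-cred"))
```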
After mapping IT assets, review your organization chart to determine the relationships between team members. Find out who works together, who each person reports to, and how big each team is.
Once you understand the relationships between tooling and teams, begin to find out which processes are working and which are not. One of the best ways to accomplish this is with a survey—formal or informal—of current team members. As the people often closest to both the teams and the technology, they are best equipped to point out room for improvement on each team and with each tool.
After identifying issues, it’s time to address them. Fixing people issues may range from a simple discussion on work styles to a decision to restructure certain teams. It’s also possible that the team actually works well together but simply lacks the proper technology. If this is the case, it’s important to develop a detailed pitch for different tooling that will resonate with leadership and map back to business goals.
Measuring Success
Once you’ve implemented steps to improve operational resilience throughout your IT function, it’s important to measure how successful these improvements are. For many in the tech industry, the family of MTTx metrics—Mean Time to Detect (MTTD), Mean Time to Acknowledge (MTTA), and Mean Time to Resolve (MTTR)—is a great way to measure improvements in incident management and response times. If there is an uptick in IT incidents and it’s taking longer to resolve them, IT leaders may need to go back to the drawing board and review what’s not working. If the goal is operational resilience, IT leaders are driving toward not only a reduced mean time to resolve but also a drop in the number of incidents that could harm their data, assets, and overall IT systems.
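As a rough illustration, the sketch below computes these means from a handful of incident records. The field names and timestamps are hypothetical; in practice they would be exported from your incident-management or observability platform. Reporting the incident count alongside MTTR reflects the dual goal above: resolve incidents faster and have fewer of them in the first place.

```python
# A minimal sketch of MTTx reporting from incident timestamps.
# Field names and values are hypothetical examples.
from datetime import datetime
from statistics import mean

incidents = [
    {"started": "2024-05-01 02:00", "detected": "2024-05-01 02:12",
     "acknowledged": "2024-05-01 02:20", "resolved": "2024-05-01 03:05"},
    {"started": "2024-05-09 14:30", "detected": "2024-05-09 14:33",
     "acknowledged": "2024-05-09 14:45", "resolved": "2024-05-09 15:10"},
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# Mean Time to Detect, Acknowledge, and Resolve across all incidents.
mttd = mean(minutes_between(i["started"], i["detected"]) for i in incidents)
mtta = mean(minutes_between(i["detected"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes_between(i["started"], i["resolved"]) for i in incidents)

print(f"MTTD: {mttd:.1f} min, MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
print(f"Incidents this period: {len(incidents)}")
```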