Black Friday, Cyber Monday, and Beyond: How IT Operations Teams Prepare for and Prevent Outages

IT operations teams have been furiously preparing for this year's holiday season - the retail industry's busiest time of the year for web traffic. Here is a checklist of the key factors IT operations teams need to consider to ensure their IT infrastructure is ready now - and stays that way throughout the shopping season.

Retail companies, such as Gap, do more than 30% of their yearly sales within the short holiday shopping timeframe. Sales continue to increase year over year, both online and on mobile. According to the National Retail Federation, 84% of retailers expect online conversion rates to increase, and 71% expect average daily site traffic to increase.

Here’s what IT and operations teams need to consider:

  • Use an IT monitoring platform that reduces risk and finds structure in noisy, unstructured data. Web traffic increases dramatically for retailers during the holiday season, so being prepared for massive fluctuations in load on their systems is key.
  • Prepare for alert storms with an effective correlation platform. Spikes in workload can flood IT operations teams with alerts, a situation known as an alert storm. A correlation platform that makes sense of the overwhelming amount of data IT systems produce during the holiday season is mission-critical, because it lets operations teams triage issues effectively and take action.
  • Review monitoring metrics – there may be some human processes to alter. Work with developers to test applications for stability. Companies may need additional support to accommodate the peak season.
  • Test security. Make sure the necessary changes are in place, because unanticipated load and exposure to hackers are real threats. Utilize a unified search capability that supports both retrospective analysis and future planning.

The following outcomes can be expected by utilizing an alert correlation platform during this time:

  • Alert storms are reduced to a manageable number of incidents
  • Alert compression makes it much easier to separate signal from noise
  • Reduction in MTTR allows teams to triage and remediate issues before they impact the customer
  • Increase in efficiency – IT operations teams can handle the busy season much more easily and at a more manageable scale
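
To make the first two outcomes concrete, here is a minimal sketch of alert correlation. It is not how any particular platform works: it assumes a simple rule (alerts for the same service that arrive within five minutes of each other belong to one incident), and the `Alert`, `Incident`, `correlate`, and `compression_ratio` names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300.0):
    """Group alerts into incidents.

    Assumed rule: an alert joins the latest incident for its service if it
    arrives within window_seconds of that incident's most recent alert.
    """
    incidents = []
    latest = {}  # service name -> most recent incident for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if (current is not None
                and alert.timestamp - current.alerts[-1].timestamp <= window_seconds):
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest[alert.service] = current
    return incidents


def compression_ratio(alerts, incidents):
    """Fraction of raw alerts that were folded into existing incidents."""
    if not alerts:
        return 0.0
    return 1.0 - len(incidents) / len(alerts)


storm = [
    Alert(0, "checkout", "latency high"),
    Alert(30, "search", "error rate high"),
    Alert(60, "checkout", "latency high"),
    Alert(120, "checkout", "queue depth high"),
    Alert(1000, "checkout", "latency high"),
]
incidents = correlate(storm)
```

In this toy storm, five raw alerts collapse into three incidents. Real platforms correlate on far richer signals (topology, text similarity, change events), so treat the time-window rule as a stand-in.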

In addition, these retailers run a high risk of outages during peak times.

The most destructive root cause, behind 75% of outages during big online events like Black Friday and Cyber Monday, is unplanned configuration changes to a system. IT and operations teams find something they think might cause a problem and try to fix it immediately, unintentionally creating a much bigger issue for the web or mobile site.
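
One way to make unplanned changes visible before they reach production is a simple guard in the change workflow. The sketch below is an illustration under stated assumptions, not a recommendation from the text above: it assumes each change carries a `planned` flag set by some review process, and that the team has defined a peak window. `ChangeRequest` and `is_allowed` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeRequest:
    system: str
    requested_at: datetime
    planned: bool  # assumed: True if reviewed and scheduled ahead of time


def is_allowed(change, peak_start, peak_end):
    """Allow planned changes at any time; block unplanned ones inside the peak window."""
    in_peak = peak_start <= change.requested_at <= peak_end
    return change.planned or not in_peak


# Hypothetical peak window, chosen only for the example.
peak_start = datetime(2024, 11, 29, 0, 0)
peak_end = datetime(2024, 12, 2, 23, 59)

hotfix = ChangeRequest("checkout", datetime(2024, 11, 29, 14, 0), planned=False)
scheduled = ChangeRequest("checkout", datetime(2024, 11, 29, 14, 0), planned=True)
early = ChangeRequest("search", datetime(2024, 11, 20, 9, 0), planned=False)
```

A blocked change does not have to be dropped. The point is that it gets a second look from someone other than the person who spotted the problem.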

Here’s a list of top recommendations for preventing outages during next week’s big online shopping days and throughout the entire holiday shopping season:

  • Know what your mission-critical systems are. Many companies don’t know which systems are absolutely critical and try to treat their entire system as business-critical, which is a mistake.
  • Have a bulletproof plan for your critical services. Once you’ve identified your critical services, know how to keep them running with a bulletproof uptime plan. For instance, if Amazon checkout goes down, you should have a disaster recovery plan to refer to. But if the recommendation engine has problems, that is unfortunate, yet it is not at the same level of criticality.
  • Tier your services. Having three to five tiers makes prioritization and response much easier, quicker and more effective when there is a problem. And make sure you have a backup and failover plan for the highest tier of your services.
  • You don’t need failover for everything. IT and operations teams that try to build failover for every system often discover that much of it was never needed.
  • Don’t become overly focused on the components of infrastructure. Make sure you are spending more time and focus on your services.
  • Make sure you have planned for load capacity. Not planning for the sheer volume of people visiting your web or mobile site accounts for 25% of outages during big online events.
  • Use a tool that allows you to consolidate your IT data. An alert correlation platform gives IT and operations teams a consolidated view of IT incidents, so they can focus more on the customer experience and learn effectively from the past. This allows them to stop being reactive firefighters and become proactive, using the insights that alert correlation makes available.
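
The tiering and failover recommendations above can be sketched as a small service catalog. This is a minimal illustration, assuming tier 1 is the most critical. The `Service` class, the helper functions, and the example service names and tiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    tier: int  # assumed convention: 1 = most critical
    has_failover: bool


def missing_failover(services, top_tier=1):
    """Names of services at or above top_tier criticality that lack failover."""
    return [s.name for s in services if s.tier <= top_tier and not s.has_failover]


def response_order(degraded):
    """Order degraded services so the most critical tier is handled first."""
    return sorted(degraded, key=lambda s: (s.tier, s.name))


catalog = [
    Service("checkout", tier=1, has_failover=True),
    Service("payments", tier=1, has_failover=False),
    Service("search", tier=2, has_failover=False),
    Service("recommendations", tier=3, has_failover=False),
]

gaps = missing_failover(catalog)
order = [s.name for s in response_order([catalog[3], catalog[2], catalog[1]])]
```

Here only `payments` is flagged, because lower tiers are deliberately allowed to run without failover. When several services degrade at once, the response order puts tier 1 first.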