Officials at the New York Stock Exchange have confirmed that the outage on Wednesday, July 8, which halted stock trading, was caused by a configuration compatibility issue introduced by a software update.
But whatever the reasons, a series of unfortunate events added up to a lot of bad news for data security and availability in a single week.
In addition to the NYSE outage, IT issues resulted in the grounding of United Airlines planes for 2 hours on the same day, now thought to be due to a network problem. And, new revelations surfaced about the scope of the unauthorized access to data stored by the U.S. Office of Personnel Management (OPM), the government's human resources department, which is now believed to involve data on more than 21 million people, including Social Security numbers. That situation led to the resignation of OPM director Katherine Archuleta. Adding to the pile-up, at the end of the week, on Friday, July 10, TD Ameritrade revealed that it had experienced problems related to an order router supporting one of its trading platforms, but that the issues had been resolved by 10 a.m.
At the NYSE, the suspension of trading lasted nearly 4 hours and drew initial speculation that it and the problems at United Airlines might be part of a cyberattack.
Routine Events Can Have Big Consequences
“The fact that a major site such as the NYSE can be knocked offline for 4 hours by a software upgrade shows that it's often the mundane events - not cyberattacks as originally feared - that can wreak the greatest havoc and downtime,” commented Unisphere Research analyst Joe McKendrick. “Data center administrators rightfully worry about hackers and natural disasters. But it's the smaller, less-threatening blips, such as power surges, usage spikes, and, yes, software updates that can create the most headaches.”
The technology exists today to make sure outages such as the one at the NYSE never happen, said Michael Corey, president of Ntirety, which offers remote DBA, database consulting, and DBA on-demand services. Corey questioned the competence of a team that would do an upgrade without adequate testing.
Alluding to the earlier rumors of a cyberattack, Corey repeated a comment by Allan Hirt, managing partner at SQLHA and a Microsoft Cluster MVP: "You will die by incompetence well before hackers get in.”
Had a proper test of the upgrade been performed, any issues could have been resolved in advance and the NYSE outage prevented, said Corey, who is an Oracle ACE, VMware vExpert, and Microsoft SQL Server MVP.
At IT management software provider SolarWinds, Thomas LaRock, senior technical product marketing manager and head geek, observed that, “as a system administrator, you get paid for performance, but you keep your job with recovery,” and said that the underlying principle in that statement applies to what was witnessed this week at the NYSE.
“They were smart enough to know three very important things,” LaRock added. “First of all, redundancy matters. Having the ability for stocks to be traded on other exchanges helped avoid a ‘single point of failure’ for them. This is quite similar to what the cloud promises us, too, with the ability to easily move your servers and workloads to a different data center as needed. Second, always have a backup. In this case, the backup servers were located more than 50 miles away from Wall Street, but they had a backup. Finally, don't be afraid to roll back. They seem to have made a handful of attempts to move forward, but at some point decided that it was best to recover from backups. I've been part of many failed rollouts and the idea of rolling back is always seen as admitting failure, but that's the wrong way to look at it. This is especially true because, as we've seen, so much of modern business is riding on IT.”
According to LaRock, managing infrastructure performance to achieve little-to-no downtime, even if it means sometimes rolling back, is key to success. “I hope this provides an example that rolling back is an acceptable course of action, as opposed to trying to continuously fix things in production,” he noted.
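The roll-forward-or-roll-back decision LaRock describes can be sketched in a few lines. This is a hypothetical illustration, not the NYSE's actual tooling; the `health_check` probe and version names are assumptions for the example.

```python
# Minimal sketch of a deploy routine that prefers rolling back over
# repeatedly trying to fix a failed release in production.

def health_check(version: str) -> bool:
    # Placeholder probe; a real check would exercise the live service
    # (e.g., submit a test transaction and verify the response).
    return version != "broken"

def deploy(current: str, candidate: str, max_attempts: int = 3) -> str:
    """Try the candidate release; fall back to the current one if it stays unhealthy."""
    for attempt in range(1, max_attempts + 1):
        if health_check(candidate):
            return candidate  # roll forward: candidate passed its checks
    # After a handful of failed attempts, recover from the known-good version.
    return current

# A healthy candidate goes live; an unhealthy one triggers a rollback.
assert deploy("v1.0", "v1.1") == "v1.1"
assert deploy("v1.0", "broken") == "v1.0"
```

The key design point is that the rollback path is coded in from the start, so reverting is a routine branch rather than an improvised admission of failure.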
And, far too often, IT processes are viewed in isolation without considering the larger infrastructure and context, added Kevin Kline, director of engineering services at SQL Sentry, Microsoft SQL Server MVP, and a founding board member of the international Professional Association for SQL Server (PASS).
“IT today is like surgery in the 1920s. We knew about germs for many decades by that time, so surgeons and nurses paid a lot of attention to keeping the OR clean. But we didn’t know much about the things around successful surgery, like anesthesia,” said Kline. It’s the same situation today for IT, he said. “We build powerful apps, but the surrounding components like testing, integration, and maintenance are neglected. The NYSE is a perfect example of this analogy.”
The Importance of Protecting Infrastructure
Providing insight into the confluence of problematic IT missteps that occurred this week, Charles Weaver, CEO of the International Association of Cloud & Managed Service Providers (MSPAlliance), pointed out that no matter what the causes of the glitches at the NYSE, United, and the Office of Personnel Management, the issue is actually one and the same. “The question is: are there enough resources? Are these organizations capable of safeguarding their infrastructure and their intellectual property sufficiently that they can protect themselves in the future?" he asked.
"That gets to the heart of what our profession has been doing for 2 decades or more, which is enabling or augmenting what existing IT departments struggle with every day just because of lack of resources,"said Weaver. If the outage at the NYSE occurred simply due to an improper software update, he said, that is clearly a flaw in the process "because whether it was an internal mistake or not, it is a preventable error."
“We're seeing that even some of the most powerful government agencies are at the mercy of hackers and breaches," said Suni Munshani, CEO of data security provider Protegrity, referring to the OPM breach. "Ultimately, these agencies are accountable for this disaster and need to take steps to better protect the information of employees and citizens."
All data custodians should learn from the OPM breach, as well as the many other highly publicized breaches, that a traditional security model based on simple authentication and network controls alone is no longer sufficient to protect sensitive data, said Munshani. “The data itself must be protected with strong authorization controls, policy governance and real-time alerting for atypical data access. Furthermore, the compromise of a single individual's authentication should not risk the exposure of sensitive data that is extremely valuable and harmful in the wrong hands.”
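One of the controls Munshani names, real-time alerting on atypical data access, can be illustrated with a simple baseline-and-threshold check. The role baselines, the 3x multiplier, and the function names below are illustrative assumptions, not Protegrity's product behavior.

```python
# Hedged sketch: alert when a user's data-access volume is atypical
# for their role. All thresholds here are made-up illustrative values.
from collections import defaultdict

BASELINE = {"analyst": 50, "hr_clerk": 200}   # typical records per hour, by role
MULTIPLIER = 3                                # how far past baseline triggers an alert

access_counts = defaultdict(int)
alerts = []

def record_access(user: str, role: str, n_records: int) -> None:
    """Tally record accesses and alert when volume exceeds the role's limit."""
    access_counts[user] += n_records
    limit = BASELINE.get(role, 25) * MULTIPLIER
    if access_counts[user] > limit:
        alerts.append(f"ALERT: {user} accessed {access_counts[user]} records (limit {limit})")

record_access("alice", "analyst", 40)    # within the analyst baseline: no alert
record_access("alice", "analyst", 200)   # sudden spike past 3x baseline: alert fires
```

The point of the pattern is that even a valid login cannot silently bulk-export records: the data layer itself watches for access that does not fit the account's normal profile.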
More Work is Needed
In total, the string of IT vulnerabilities that were exposed across many different types of organizations, said Weaver, has resulted in “a very sobering moment, and a moment that demands a larger public discussion of what is happening to our national infrastructure – private and public – and what can we do to better protect it.”
In Ntirety’s “The Disruption Epidemic” report, Corey said, “we found 90% of the SQL Server instances failed our disaster recovery review, 88% failed a configuration review. Too many companies out there are burying their heads in the sand, and are ignoring the ticking time bombs within their organization.” This is not a technology issue, he said. “This is a people issue.”
Just as auditors review a company's financials to determine whether it is a going concern, said Corey, outside technology audits should also be performed on an annual basis to look at process, security, internal controls and a number of other objectives.