Massive AWS Outage Felt Across the Internet, Renews Calls for Decentralization


Amazon Web Services went down on October 20, with problems starting at 3 am ET in US-East-1, AWS’ main region in northern Virginia and the heart of its cloud services.

According to the website Downdetector, user reports indicated problems at sites including Amazon, Disney+, Lyft, the McDonald’s app, The New York Times, Reddit, Ring doorbells, Robinhood, Snapchat, T-Mobile, United Airlines, Venmo, Verizon, and more.

DesignRush’s quick DNS analysis of Fortune 500 websites found that around 60% rely on AWS edge or DNS infrastructure such as CloudFront and Route 53. So when one AWS region fails, it isn’t just tech users who notice; the effects spread across the global economy.

Although AWS dispatched a fix in the early hours of the morning, large swaths of the internet remained affected throughout the day.

“We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations,” the company said on its AWS Health Dashboard.

A fix ultimately succeeded at 6 pm ET. AWS listed the issue as resolved and attributed it to a DNS problem and an internal subsystem that did not recover after the DNS fix.

Last year, a similar outage occurred when CrowdStrike issued a faulty update, leading to hours of internet disruptions that affected airlines, POS systems, and even some medical services. It cost Fortune 500 companies $5.4 billion in losses and impacted a wide variety of companies globally, said Parcelhero’s head of consumer research, David Jinks.

“Businesses are past treating events like this as one-offs. If your stack depends on a single region or control path, you’ve designed a single point of business failure. Minutes of disruption cascade into checkout stalls, payments fail, ads go dark, support explodes. Production-ready today means failure-tolerant by default, so a regional event becomes a non-event for customers and cash flow,” said Spencer Kimball, CEO and co-founder, Cockroach Labs.

According to CNN, the latest outage is a reminder of how fragile the internet’s backbone can be, even when the disruption is brief, and of how reliant the world has become on these online services. Although AWS and its competitors are generally robust, the internet is a complex web of overlapping services that are only as reliable as their weakest code.

The internet was originally designed to be decentralized and resilient, yet today so much of our online ecosystem is concentrated in a small number of cloud regions, said Rob Jardin, chief digital officer at cybersecurity firm NymVPN.

“Similar outages occur almost every year, and they can be a reminder of how extensive software supply chains have become, showing how a simple issue on a handful of Amazon Data Centers caused thousands of issues to their clients,” said Aras Nazarovas, senior security researcher at Cybernews. “Clients of affected services were impacted by failing to access their resources and data hosted by AWS for 4 hours. Impact of such a failure to ensure availability can vary greatly depending on the specific business and industry that used impacted AWS services, in worst case scenarios such an outage could have had serious consequences in critical infrastructure sectors.”

Cloud outages have only highlighted the vulnerability of single-cloud deployments. Even the largest cloud hosts, including Google, Microsoft, and AWS (as we have just seen), suffer outages, said Jamil Ahmed, director and distinguished engineer at Solace.

“Having all your digital eggs in one cloud basket leaves businesses at risk of serious failure as we keep on seeing. These businesses need to build a fault tolerance into their infrastructure, a buffer that enables their business to always remain operational and to ride the outages, and that comes down to how the workload is transferred and handled by another cloud provider,” Ahmed explained. “Businesses, who proactively adopted multi-cloud strategies with the safeguard of event-driven architecture (EDA), are already reaping the benefits of stronger and more resilient infrastructures. Now, even more businesses can follow suit and unlock the potential of a multi-cloud future.”

Kevin Cochrane, CMO of Vultr, agreed, stating that the AWS outage once again exposes the fragility of today’s cloud infrastructure, where too much global traffic is concentrated in a single hyperscale provider and even a single region. He added that multi-cloud isn’t enough if it simply means multi-vendor contracts.

"Enterprises need true infrastructure resiliency with the ability to keep their systems running on an entirely separate cloud when one goes down. To protect their business continuity, companies must adopt a distributed cloud strategy that mirrors an immune system—redundant, autonomous, and always on," Cochrane concluded.

