Tech Matters: Understanding recent cloud outages
Leslie Meredith
Last Friday, I was waiting on a Microsoft Teams call for a customer to join. She messaged with a Google Meet link, and once we were on the call she explained that Microsoft was experiencing an outage and she wasn’t able to connect. It came just a week after another major outage, and both involved cloud service providers: Microsoft’s Azure and Amazon’s AWS. She blamed it on hackers taking advantage of the government shutdown. This was a good example of the rumors that can start when people don’t understand the systems behind the technology they’ve come to rely on every day.
Microsoft announced the outage was a system error, not a hack, and its Azure cloud services went back online after about eight hours. In addition to Microsoft’s 365 service, which includes Teams, Outlook and other Office products, the outage affected Alaska Airlines and sites for Heathrow and NatWest, among others. Previously, the AWS outage knocked big names offline, including Snapchat, Fortnite, Alexa, ChatGPT and the McDonald’s app. According to Microsoft, the Azure incident was triggered by a configuration fault in its content delivery network, a service called Azure Front Door.
For its outage, Amazon said the root of the problem was a bug in one of its database services. AWS posted a detailed timeline of when each service broke and how engineers brought it back, but as with Azure, it was a system error, not a hack.
We can think of cloud services as a highway system that provides lanes for delivering web pages, images and app traffic. When Microsoft pushed a bad settings update, many of its servers couldn’t start up with the new settings, so they shut themselves off. That dropped lanes from the highway. Traffic was then squeezed onto the remaining lanes, which got overloaded and started failing too, including in places that weren’t part of the original problem. Microsoft froze further changes and rolled the system back to the last working setup, then brought servers back online slowly to avoid another pileup. Microsoft said it will post a detailed post-incident report within two weeks.
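For readers who like to see the mechanics, the highway pileup described above can be sketched in a few lines of Python. This is a toy model, not Microsoft’s actual system: the function name, the capacity numbers and the rule that traffic is split evenly among surviving servers are all assumptions made for illustration.

```python
def simulate_cascade(capacities, total_traffic, initially_failed):
    """Toy model of a cascading overload (all numbers are hypothetical).

    capacities: how much traffic each server can carry.
    total_traffic: traffic that must be shared evenly by healthy servers.
    initially_failed: indexes of servers knocked out by the bad update.
    Returns the list of healthy servers after each round of failures.
    """
    healthy = [i for i in range(len(capacities)) if i not in initially_failed]
    history = [list(healthy)]
    while healthy:
        # Every surviving server takes an equal share of the traffic.
        share = total_traffic / len(healthy)
        survivors = [i for i in healthy if capacities[i] >= share]
        if len(survivors) == len(healthy):
            break  # nobody is overloaded, so the system is stable
        healthy = survivors
        history.append(list(healthy))
    return history


# Six servers comfortably share 480 units of traffic (80 each).
capacities = [100, 100, 120, 140, 160, 200]

# A bad update takes out the two biggest servers; the rest fall in waves.
for round_number, healthy in enumerate(simulate_cascade(capacities, 480, {4, 5})):
    print(f"round {round_number}: healthy servers = {healthy}")
```

In this made-up example, losing two of six servers pushes each survivor’s share from 80 to 120 units, which overloads the two smallest servers, and the last two then fail under 240 units each. It also suggests why servers are brought back gradually: sending full traffic to a handful of restarted servers would overload them all over again.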
For AWS, the first break happened the night of Oct. 19 in Northern Virginia. A DNS bug kept other AWS services from talking to the database. After engineers fixed the DNS entry in the early morning, they hit a second wave: AWS’s control systems had lost track of a lot of physical hosts while the database was unavailable. Rebuilding that state took time and created backlogs. Engineers throttled requests, restarted subsystems and cleared queues to finish recovery.
How did this become so widespread on both platforms? Cloud services are layered and interdependent. A small flaw can create thousands of identical faults at once. That’s different from a single server failing. The same is true for AWS’s DNS race condition: One empty record for a regional endpoint broke connectivity for any service that depended on it until caches expired or engineers restored the record.
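The point about caches can also be shown with a small sketch. This is a simplified illustration, not AWS’s code or a real DNS implementation: the `CachingResolver` class, the 60-second lifetime and the name `db.region.example` are hypothetical, and real DNS caching has more rules than this.

```python
class CachingResolver:
    """Toy DNS-style lookup that remembers answers, even empty ones."""

    def __init__(self, authoritative, ttl):
        self.authoritative = authoritative  # name -> list of addresses
        self.ttl = ttl                      # seconds an answer stays cached
        self.cache = {}                     # name -> (addresses, expiry time)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh, so reuse the remembered answer
        addresses = list(self.authoritative.get(name, []))
        self.cache[name] = (addresses, now + self.ttl)
        return addresses


def can_connect(resolver, name, now):
    """A service is reachable only if the lookup returns an address."""
    return len(resolver.resolve(name, now)) > 0


name = "db.region.example"           # hypothetical regional endpoint
records = {name: ["10.0.0.1"]}
resolver = CachingResolver(records, ttl=60)

print(can_connect(resolver, name, now=0))    # True: record is healthy
records[name] = []                           # the bug empties the record
print(can_connect(resolver, name, now=30))   # True: old answer still cached
print(can_connect(resolver, name, now=70))   # False: cache expired, record empty
records[name] = ["10.0.0.1"]                 # engineers restore the record
print(can_connect(resolver, name, now=100))  # False: empty answer still cached
print(can_connect(resolver, name, now=140))  # True: cache expired, record is back
```

The sketch shows two delays: a service may keep working for a short while after the record breaks, and it may keep failing for a short while after the record is fixed, until its remembered answer expires.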
Some tech experts are saying it’s time for the cloud services market to be diversified. “It’s no surprise that questions are being raised about concentrating too much in the hands of a few American tech providers,” wrote Nicole Kobie at ITPro. That argument resurfaced because the Azure outage followed the AWS outage by just days.
The cloud services market offers companies eight choices, of which six are based in the U.S. and two in China. The American companies include the three market leaders: Amazon with about 30% of the market, Microsoft at around 20% and Google in third place with 13%. They are followed by Alibaba, Oracle, Salesforce, IBM and Tencent. All of these companies offer their cloud service in the U.S. Should the sector be forced to diversify further or spread customers more evenly across the leaders?
I say no. Instead, the focus should be on designing additional safeguards so an outage is better contained. Microsoft has already blocked vulnerable points in its system related to the outage and will no doubt continue that work. AWS has published a clear post-event summary of what failed and what they’re changing. Allowing providers to improve their products to remain competitive is the way to go. Let businesses decide which providers to use.
Companies may choose to diversify by using more than one cloud service to spread the risk, but this will come at a cost if their own systems need to be modified to support a second service. For consumers, the practical takeaway is simple: Sometimes the internet breaks, and it’s not hackers. It’s complex systems doing what complex systems do. The fix is better engineering, not punishing the providers that built the tools we all use.
Leslie Meredith has been writing about technology for more than a decade. As a mom of four, she makes value, usefulness and online safety her priorities. Have a question? Email Leslie at asklesliemeredith@gmail.com.


