The cloud has long promised limitless scalability and near-perfect uptime. But if you tried to access your Microsoft 365 dashboard or recline your smart bed last week, and got nothing but a spinning icon, you weren’t alone.
In the span of 10 days, both Amazon Web Services (AWS) and Microsoft’s Azure Cloud suffered widespread outages that rippled across industries.
Banks, airlines, retailers, and gaming networks went dark for hours as engineers scrambled to reroute traffic and restore connectivity.
It was a rare one-two punch for the nearly trillion-dollar cloud industry, and a stark reminder that even the backbone of the digital economy can have single points of failure.
And because AWS and Azure are the two largest providers today, nearly every business depends on them in some way. For all of those businesses, that’s a wake-up call worth heeding.
When The Cloud’s Powerhouse Stumbled
Early on October 20, 2025, what started as rising error rates quickly spiraled into a full-scale outage affecting thousands of apps and services. AWS’s busiest region (69%), us-east-1 (Northern Virginia), became the internet’s biggest bottleneck.
The cause wasn’t a cyberattack or power failure but a software bug in the automated DNS management system behind Amazon DynamoDB.

Credit: Down Detector
When the “phonebook” that helps cloud services talk to each other failed, the impact rippled, from fintech platforms to streaming and smart home apps.
AWS restored operations within hours, but more than 2,000 companies were affected, and some, including social media platforms like Reddit, were still reporting elevated error rates and access issues throughout the first week of November.
Related read: When AWS Goes Down: What It Means For Your Cloud Costs
Then Came Microsoft’s Turn
Just days later, Microsoft’s Azure, the world’s second-largest cloud provider, had its own crisis that lasted a business day.
Thousands of users across the world began reporting outages. Websites couldn’t load. Cloud apps stalled. And enterprise dashboards, including Microsoft 365, went dark.
Airlines couldn’t process bookings, retailers saw payment systems fail, and collaboration tools like Teams briefly went offline. Kroger, NatWest’s website, and even Minecraft all had issues.
The culprit this time wasn’t deep in the data center but at the edge, although the issue was similar to the one at AWS. To be precise, it was a misconfiguration in Azure Front Door (AFD), Microsoft’s global routing and content delivery service, which disrupted traffic flow across multiple regions.


Credit: Down Detector
By the time Azure engineers rolled out a fix, the damage to uptime charts and customer confidence had already spread across continents.
The Hidden Price Tag Of Dependence
When AWS or Azure goes down, it doesn’t matter how solid your internal codebase is. If your foundation wobbles, everything above it shakes.
Downtime means lost revenue, missed transactions, and frustrated customers. SaaS platforms scrambled to explain outages they didn’t cause. And the invisible costs often run deeper.
Many organizations discovered that even if their workloads weren’t hosted on the affected provider, their vendors and partners were. A payment API here, a data analytics service there, all built atop the same cloud.
When one link broke, the chain stalled.
Also, failing over to another region or spinning up redundant capacity mid-crisis often means double infrastructure costs for that period. Cross-region data transfers and replication also spike egress fees, which can balloon during recovery.
Related read: Here’s How The Different AWS Regions Affect Your Cloud Costs
In the end, businesses with multi-region or multi-cloud architectures weathered the storm better, a finding echoed by analysts at INE and others. Those that didn’t are now left tallying the cost of ‘putting all their data eggs in one cloud basket.’
How To Bulletproof Your Cloud Resilience
The back-to-back outages have sparked an uncomfortable but necessary question for many IT and business leaders: “What’s our Plan B when the cloud goes dark?”
It turns out resilience is less about chasing better uptime and more about smarter architecture.
Many organizations are now rethinking their cloud strategy through the lens of diversification. Instead of relying on a single provider, more teams are adopting hybrid or multi-cloud models, blending AWS, Azure, Google Cloud, DigitalOcean, and other providers, and even on-prem systems. The goal is to ensure that if one fails, another can take over.
It’s not cheap, but it’s a lot less expensive than hours of global downtime.
The same logic applies to multi-region deployments. Running workloads in at least two separate regions, say, us-east-1 and us-west-2 on AWS, can prevent a regional issue from becoming a company-wide outage.
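To make the multi-region idea concrete, here is a minimal Python sketch of client-side failover: try the primary region first, then fall back to the next one. The names call_with_failover, AllRegionsFailed, and fetch_orders are hypothetical and exist only for illustration, and the regional outage is simulated rather than real. Production setups usually handle this with DNS or load-balancer health checks instead of application code, so treat this as one possible shape of the logic, not a recommended implementation.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when no region in the list could serve the request."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each region in order and return the first successful result."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return request(region)
        except Exception as exc:  # real code should catch only retryable errors
            errors[region] = exc
    raise AllRegionsFailed(f"all regions failed: {errors}")


# Hypothetical stand-in for a real service call. It simulates an outage
# in us-east-1 so the fallback path is exercised.
def fetch_orders(region: str) -> str:
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"orders served from {region}"


result = call_with_failover(["us-east-1", "us-west-2"], fetch_orders)
print(result)  # orders served from us-west-2
```

The ordered list keeps the primary region preferred while making the fallback explicit. If every region fails, the caller gets a single error that records what went wrong in each one.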
Many teams are also leaning into chaos engineering, the practice of intentionally breaking things in controlled environments to see how systems respond, so that real incidents don’t turn into hours of downtime, customer churn, and lost revenue.
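As a rough illustration of what a chaos experiment can look like at the smallest scale, the Python sketch below wraps a dependency call so that a configurable share of calls fail, then checks that the caller degrades gracefully. The functions with_fault_injection, get_price, and get_price_with_fallback are hypothetical names invented for this example. Real chaos tooling works at the infrastructure level, which this sketch does not attempt to model.

```python
import random
from typing import Callable


def with_fault_injection(
    func: Callable[[], str], failure_rate: float, rng: random.Random
) -> Callable[[], str]:
    """Wrap func so that roughly failure_rate of calls raise a simulated fault."""

    def wrapper() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func()

    return wrapper


def get_price() -> str:
    """Hypothetical call to an upstream dependency."""
    return "live price"


def get_price_with_fallback(call: Callable[[], str]) -> str:
    """Behavior under test: fall back to a cached value when the call fails."""
    try:
        return call()
    except ConnectionError:
        return "cached price"


rng = random.Random(42)  # fixed seed so the experiment is repeatable
flaky = with_fault_injection(get_price, failure_rate=0.5, rng=rng)
results = [get_price_with_fallback(flaky) for _ in range(1000)]
print(results.count("cached price"), "of", len(results), "calls used the fallback")
```

The point of the experiment is the assertion you make afterwards: every call should return either a live or a cached value, and none should surface an unhandled error to the customer.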
Dependency mapping can also help. You can’t protect what you don’t understand. So, knowing every third-party service, SaaS vendor, and API that touches your environment helps you pinpoint where single-provider risk hides.
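A dependency map does not need to start as anything fancy. The Python sketch below assumes a hand-maintained inventory that records which cloud each dependency runs on, then flags any provider hosting more than half of them. The inventory contents, the 50% threshold, and the function names are all illustrative assumptions, not data from the outages described here.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical inventory: each dependency and the cloud it runs on.
dependencies: Dict[str, str] = {
    "payments-api": "aws",
    "analytics": "aws",
    "auth": "azure",
    "email": "aws",
    "search": "gcp",
}


def group_by_provider(deps: Dict[str, str]) -> Dict[str, List[str]]:
    """Group dependency names under the provider that hosts them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, provider in deps.items():
        grouped[provider].append(name)
    return dict(grouped)


def concentration_risks(deps: Dict[str, str], threshold: float = 0.5) -> Dict[str, List[str]]:
    """Return providers that host more than `threshold` of all dependencies."""
    total = len(deps)
    return {
        provider: sorted(names)
        for provider, names in group_by_provider(deps).items()
        if len(names) / total > threshold
    }


risks = concentration_risks(dependencies)
print(risks)  # {'aws': ['analytics', 'email', 'payments-api']}
```

In this made-up inventory, three of five dependencies sit on one provider, so an outage there would take out payments, analytics, and email at once, even if your own workloads run elsewhere.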
Of course, building for resilience doesn’t mean losing grip on your costs. In fact, one of the biggest concerns we see in hybrid, multi-cloud, and multi-service setups is understanding what that resilience actually costs. So, you’ll want to architect your systems to fail over intelligently while still tracking and managing the cost impact of doing so.
Related read: How To Combine Multi-Cloud Spend Into One Single View (And Make It Make Sense)
Even Service Level Agreements (SLAs) deserve a closer look. They define what you can expect, and what you can’t, when outages strike. Knowing those limits helps you plan backup coverage and response priorities more effectively.
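It helps to translate an availability percentage into actual minutes. The short Python sketch below does that arithmetic for a 30-day month. The percentages are illustrative examples, not any specific provider’s commitment, and allowed_downtime_minutes is a hypothetical helper name.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at a given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows {minutes:.1f} minutes of downtime per 30 days")
```

At 99.9%, the allowance is about 43 minutes a month, so an outage lasting several hours far exceeds it. At that point, what matters is what the SLA actually offers in return.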
And finally, resilience is not a one-time project but a living discipline. So, schedule regular failover tests, keep runbooks updated, and run recovery drills. These can be the difference between a headline-making outage and a quick, quiet recovery.
Related read: The Outage Anxiety Test: Can You Answer These 3 Questions In Under 10 Minutes?
Make Resilience And Cloud Cost Control Go Hand In Hand
Building resilience across clouds and regions shouldn’t force you to give up cost visibility.
Yet, for most companies, that’s often the trade-off. Pay more for uptime, or risk being offline when it matters most.
But it doesn’t have to be a blind trade.
With CloudZero, you can see and manage your cloud spend across major clouds, platforms like Kubernetes and Snowflake, as well as on-premises environments. All in one place.
If you’re considering a hybrid or multi-cloud strategy to spread risk across providers, CloudZero can help you keep all that complexity under control.
From migrations and data egress to Kubernetes clusters and Snowflake workloads, CloudZero surfaces every cost driver in a single pane of glass, complete with immediately actionable insights like cost per service, per deployment, per feature, and beyond. Plus, you get real-time anomaly alerts delivered straight to your inbox.
So when the next outage inevitably hits, you’ll be online and in control.
Experience CloudZero yourself, like the leading teams at Toyota, Duolingo, Skyscanner, Malwarebytes, and Grammarly already do.


