One of the core selling points of cloud computing has always been that the Big Three hyperscalers (AWS, Azure, Google Cloud) offer virtually limitless redundancy. Their massive global networks, we are told, make outages a thing of the past and eliminate the need for expensive disaster recovery strategies.

But recent events have proven otherwise. The AWS outage of October 20th disrupted thousands of businesses worldwide due to a single DNS resolution failure in the DynamoDB API for the US-EAST-1 region. That single point of failure rippled through global operations and its effects are still being felt.

The very next week, a global outage triggered by an “inadvertent configuration change” crippled Azure Front Door and associated platforms, including Microsoft 365, Minecraft, and Xbox Network. And in June of this year, Google Cloud went down taking Spotify, Snapchat, and Fitbit with it—for hours.

Collectively, these interruptions are estimated to have cost billions.

The gold standard of system availability is the Five Nines, meaning a system or application is available and operational 99.999% of the time. That’s about five minutes and fifteen seconds of downtime per year or, if you like, roughly 26 seconds per month. But this level of availability was never a guarantee from the hyperscalers.

Recent events show that even Four Nines (less than an hour of downtime per year) may be out of reach for hyperscalers. Cloud infrastructure, after all, is rented, not owned. Customers have no control over the availability of the underlying systems; they can only trust that their cloud providers uphold the promise of resilience.
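The arithmetic behind these targets is easy to sketch. The following Python snippet (illustrative only, not from any provider's SLA tooling) converts an availability percentage into a downtime budget:

```python
# Convert an availability target ("the nines") into an allowed-downtime budget.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, including leap days

def downtime_budget(availability: float) -> dict:
    """Return the permitted downtime, in seconds, per year and per month."""
    unavailable = 1.0 - availability
    per_year = unavailable * SECONDS_PER_YEAR
    return {"per_year_s": per_year, "per_month_s": per_year / 12}

if __name__ == "__main__":
    for label, a in [("Four Nines  (99.99%) ", 0.9999),
                     ("Five Nines  (99.999%)", 0.99999)]:
        b = downtime_budget(a)
        print(f"{label}: {b['per_year_s'] / 60:5.1f} min/year, "
              f"{b['per_month_s']:5.1f} s/month")
```

Running it shows why the gap between Four and Five Nines matters: roughly 53 minutes of tolerable downtime per year versus barely five, which a single multi-hour regional outage blows through many times over.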

For many years, continuous-availability architectures (zero planned downtime plus highly resilient failover) have been the domain of mission-critical, on-prem systems like stock exchanges, telecom networks, and 911 emergency services.

Cloud computing hasn’t yet been able to reach that bar.

As more organizations move mission-critical workloads off-prem, CIOs are being forced to reevaluate risk. The assumption that cloud equals continuity has eroded. Now, the question isn’t whether to move, but how to do it safely.

The path forward starts with visibility. 

Enterprises should run an assessment of their cloud environments to uncover weaknesses in reliability, cost efficiency, and security posture. A well-executed architectural review identifies single points of failure, quantifies exposure, and helps balance performance with cost and resilience. The goal: Restore confidence in cloud operations by designing for availability, not just assuming it.

If your organization depends on continuous uptime, now is the time to take a closer look at what “resilient” really means in the cloud era. Start by assessing where your risk lives and what’s within your control.


About the Author

Mark Kujawski is a Principal Director at apiphani. He leads the company’s Advisory Strategy Practice.
