When Half the Internet Went Dark, AWS Taught Us About Leadership
The call came at 3:11 AM Eastern, October 20, 2025.
For some of you, it was your on-call engineer. For others, it was your monitoring system lighting up like a Christmas tree. A few of you were already awake, watching in real time as AWS’s Northern Virginia region, US-EAST-1, began its slow cascade into chaos.
For the next fifteen hours, you had a front-row seat to one of the most instructive failures in cloud computing history. Not because of what broke (a latent defect in DynamoDB’s automated DNS management that cascaded through dependent services) but because of what it revealed about how we’ve been thinking about resilience.
Or more accurately, how we haven’t been thinking about it.
The Question Your CEO Will Ask
In the days after October 20th, CEOs across the country asked their CTOs the same question: “Could this happen to us?”
Some of you said yes. Some said no. Some said “it depends.”
But here’s the question you should have been asking yourself: “When this happens to us, what will I tell the board?”
Because it’s not if. The pattern is clear now: 2017, 2020, 2021, 2023, 2024, 2025. Major cloud outages are becoming a predictable feature of the infrastructure landscape, not an anomaly. Your CEO knows this. Your board knows this. The question is whether you’re ready to have the conversation about what you’re doing about it.
That conversation isn’t about servers and availability zones. It’s about business continuity. It’s about the gap between the reliability your business model assumes and the reliability your architecture actually provides.
Three Companies, Three Outcomes, Three Conversations
Let me tell you about three CTOs and the conversations they had with their CEOs on October 21st.
The first CTO spent Monday morning explaining why they’d been offline for fifteen hours. Their multi-AZ (multiple availability zone) architecture, recommended by AWS and implemented by good engineers, had failed completely when the regional control plane went down. The CEO’s question was direct: “We’ve invested heavily in cloud infrastructure. Why didn’t it work?” The answer, “we followed best practices”, didn’t land well. The follow-up was worse: “How much revenue did we lose, and could this happen again?”
The second CTO reported six hours of downtime. They had a multi-region architecture and a documented failover plan, but execution took hours: the Route 53 console depended on the failed region, monitoring dashboards were unreachable, and the runbook included steps that couldn’t be executed. The CEO’s question: “We have a disaster recovery plan. Why did it take so long?” The answer, “we’d never actually tested for that kind of failure”, revealed a gap between documented readiness and operational reality.
The third CTO sent a brief report: “AWS US-EAST-1 experienced a 15-hour outage. Our automated failover executed in under 6 minutes. The failure occurred during off-peak hours. Customer impact was minimal. No action required from leadership. We are continuing to monitor the situation.” Their CEO’s response: “This is why we invest in infrastructure.”
Three CTOs. Same event. Completely different conversations about leadership, preparedness, and business continuity.
Which conversation do you want to have?
The Architecture of Business Continuity
Here’s what October 20th revealed: The gap between your architecture and business continuity isn’t technical. It’s a CTO leadership gap.
Your CEO doesn’t need to understand Kubernetes. But they do need to understand the answer to one question: “If our primary infrastructure fails, how long until we’re back online?”
That answer lives at the intersection of four capabilities that we call the four pillars of resilience. Your role as CTO isn’t to implement all of them tomorrow. It’s to understand them well enough to explain to your CEO where your gaps are and what closing them costs versus what leaving them open risks.
Pillar 1: Redundancy at the Right Level
The companies that went down on October 20th weren’t running on single servers. They had load balancers, auto-scaling, multiple availability zones, and all the standard redundancy patterns. And they still went down.
Why? Because they had redundancy implemented at the wrong level.
Multi-AZ protects you from data center failures. It doesn’t protect you from regional control plane failures. When DNS resolution for DynamoDB broke across the entire US-EAST-1 region, it didn’t matter how many availability zones you were spread across. You all shared the same failing regional control plane.
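One way to make the regional-failure case concrete: multi-region active-passive setups hinge on a health check that lives outside the affected region and decides when traffic moves. Here is a minimal sketch of that decision logic, purely illustrative and not any provider’s implementation:

```python
def failover_decision(probe_results: list[bool], threshold: int = 3) -> str:
    """Active-passive failover: cut over to the secondary region only after
    the primary fails `threshold` consecutive health checks, so one flaky
    probe doesn't cause flapping between regions."""
    consecutive_failures = 0
    for healthy in probe_results:
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return "secondary"
    return "primary"

# Three straight failed probes trip the failover:
target = failover_decision([True, False, False, False])  # -> "secondary"
```

The crucial operational detail, as October 20th showed, is that the probe and the switch must themselves run outside the region they are judging.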
The leadership question: Do you know what level of failure your redundancy actually protects against?
When you tell your CEO “we’re running across multiple availability zones,” they hear “we’re protected from outages.” But what you mean is “we’re protected from data center failures, not regional failures.” That’s a critical gap in understanding, and it’s your job to close it.
The business question your CEO needs answered: What would it cost to be protected at the regional level? What does it cost the business when we’re not?
For Company A, offline fifteen hours at $100K/day revenue, the answer was $62,500 in direct losses, plus customer support costs, plus reputational damage, plus engineering time firefighting instead of shipping features. Their monthly infrastructure bill was $50K. Going multi-region active-passive would have added $20K/month. One outage paid for an entire year of the redundancy they didn’t have.
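The back-of-envelope math here is worth writing down, because it is the number your CEO will anchor on. A sketch using Company A’s illustrative figures:

```python
def outage_cost(daily_revenue: float, outage_hours: float) -> float:
    """Direct revenue lost while offline, assuming revenue accrues evenly
    through the day. Indirect costs (support load, reputational damage,
    engineering time spent firefighting) come on top of this figure."""
    return daily_revenue * outage_hours / 24

# Company A's illustrative numbers: $100K/day revenue, 15 hours offline
loss = outage_cost(daily_revenue=100_000, outage_hours=15)
print(f"Direct loss: ${loss:,.0f}")  # Direct loss: $62,500
```

The formula is deliberately crude; its value is that it turns “we were down for a while” into a dollar figure that can be compared against the monthly cost of redundancy.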
That’s not a technical calculation. That’s a business decision. And it’s your job to frame it that way.
Pillar 2: Graceful Degradation Under Pressure
The CFO’s dashboard goes down. The payment processor gets slow. Your recommendation engine times out. In most architectures, any one of these escalates into total system failure, either because there’s no circuit breaker, or because every feature is treated as equally critical, or because the timeout strategy assumes everything should wait forever.
This is what we call “binary failure architecture”: systems that are either 100% up or 100% down, with nothing in between.
The leadership question: When something breaks, does your entire application go down, or do you fail partially?
Your CEO doesn’t need to understand circuit breakers. But they need to understand this: Company B’s payment processor got slow. Instead of their entire checkout flow timing out and failing, they queued payment requests and showed users “Payment processing temporarily delayed. We’ll complete your order within 10 minutes.” They lost the ability to process payments in real-time. They didn’t lose the ability to take orders.
That’s a 20% functionality loss instead of a 100% revenue loss. The difference is designing degradation into your architecture ahead of time.
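Company B’s behavior amounts to wrapping the payment call in a fallback. A minimal sketch of the pattern, where `charge_now`, the queue, and the customer-facing message are hypothetical stand-ins, not Company B’s actual code:

```python
import queue

deferred_payments = queue.Queue()  # charges to retry once the processor recovers

def charge_now(order: dict) -> str:
    """Stand-in for the real payment call; here it simulates a slow processor."""
    raise TimeoutError("payment processor not responding")

def checkout(order: dict) -> str:
    """Degrade instead of failing: if real-time payment is unavailable,
    queue the charge and still accept the order."""
    try:
        return charge_now(order)      # happy path: charged immediately
    except TimeoutError:
        deferred_payments.put(order)  # degraded path: charge asynchronously
        return ("Payment processing temporarily delayed. "
                "We'll complete your order within 10 minutes.")

status = checkout({"order_id": 1, "amount": 49.99})
```

The design choice worth noting: the fallback path is defined next to the happy path, in code, before the incident, not improvised in a war room during one.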
The business question your CEO needs answered: When our critical dependencies fail, what breaks? What keeps working?
If the answer is “everything breaks,” that’s a fault-tolerance conversation you need to have. If the answer is “we lose recommendations but checkout still works,” that’s a conversation about acceptable degradation. Both are business continuity discussions, not technical ones.
Pillar 3: Fault Isolation and Blast Radius
One customer runs an analytics query that scans your entire database. CPU hits 100%. Every customer starts timing out.
One team deploys a service with a memory leak. The Kubernetes node runs out of memory. Other teams’ services get evicted.
One API you depend on gets slow. Your thread pool fills up waiting for responses. Your entire application grinds to a halt.
These are all shared fate architectures where one component’s failure or misbehavior cascades to affect everything else.
The leadership question: When something goes wrong, can you contain it, or does it take down everything?
This is about isolation. Isolation ensures that failures happen in bounded contexts instead of propagating through your entire system. It’s rate limiting per customer. It’s circuit breakers on external calls. It’s resource limits on pods. It’s making it safe for non-critical components to fail without affecting critical ones.
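Per-customer rate limiting is the simplest of these isolation mechanisms to illustrate. A token-bucket sketch, with the class name and parameters chosen for illustration rather than taken from any particular system:

```python
import time
from collections import defaultdict

class PerCustomerRateLimiter:
    """Token bucket per customer: a noisy tenant exhausts only their own
    budget, not the shared capacity everyone else depends on."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec   # tokens regenerated per second
        self.burst = burst         # maximum bucket size
        self.tokens = defaultdict(lambda: float(burst))
        self.last_seen = defaultdict(time.monotonic)

    def allow(self, customer_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[customer_id]
        self.last_seen[customer_id] = now
        # Refill this customer's bucket, capped at the burst size
        self.tokens[customer_id] = min(
            self.burst, self.tokens[customer_id] + elapsed * self.rate
        )
        if self.tokens[customer_id] >= 1:
            self.tokens[customer_id] -= 1
            return True
        return False

limiter = PerCustomerRateLimiter(rate_per_sec=5, burst=10)
ok = limiter.allow("tenant-a")  # reject beyond the budget instead of
                                # letting one tenant consume shared capacity
```

Each tenant draws from its own bucket, so the analytics-heavy customer from the opening example gets throttled while everyone else’s requests keep flowing.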
The business question your CEO needs answered: If one customer misbehaves or one dependency fails, does it affect everyone else?
If yes, you’re one bad actor or one slow API away from a complete outage. That’s a business continuity risk that has nothing to do with AWS or infrastructure reliability and everything to do with architectural choices you control.
Pillar 4: Tested Recovery vs. Documented Recovery
After October 20th, the most painful conversations weren’t about companies that had no disaster recovery plan. They were about companies that had detailed plans that didn’t work.
Company C had a runbook: “Step 3: Update Route 53 to point to US-WEST-2.” Step 3 failed because the Route 53 console was in US-EAST-1 and unreachable. The runbook assumed they’d be able to access AWS tools in the failed region. They’d never tested that assumption.
The leadership question: Do you have a disaster recovery plan, or do you have a tested disaster recovery process?
Documentation is not the same as capability. Your CEO needs to understand this distinction. You can have beautiful Confluence pages detailing every step of your regional failover. But if you’ve never actually executed those steps under realistic conditions, you don’t have disaster recovery. You have disaster documentation.
The business question your CEO needs answered: When was the last time we tested our disaster recovery plan? How long did recovery actually take?
If the answer is “we’ve never tested it” or “we tested it two years ago,” you’re operating on hope, not evidence. And hope is not a business continuity strategy.
The companies that survived October 20th weren’t lucky. They had run game days. They had found the gaps in their runbooks, the consoles they couldn’t reach, the monitoring they couldn’t see, and the DNS changes they couldn’t make. They had fixed those gaps. And when the real outage came, their recovery was measured in minutes because they’d already practiced it.
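A game day doesn’t need elaborate tooling to be useful. Even encoding each runbook assumption as an executable check surfaces the gaps. A toy sketch, where every check name and result is invented for illustration (the first stub simulates the October 20th condition):

```python
def check_dns_control_plane_reachable() -> bool:
    """Can we still change DNS when the primary region's console is down?
    A real check would call the API from outside that region; this stub
    simulates the October 20th condition."""
    return False

def check_secondary_region_monitoring() -> bool:
    """Are dashboards visible from the standby region? (Stubbed as passing.)"""
    return True

def run_game_day(checks: dict) -> list:
    """Execute each runbook assumption; return the names of those that fail.
    An empty list means the documented plan survived contact with reality;
    anything else is a gap to fix before the real outage."""
    return [name for name, check in checks.items() if not check()]

gaps = run_game_day({
    "dns_control_plane_reachable": check_dns_control_plane_reachable,
    "secondary_region_monitoring": check_secondary_region_monitoring,
})
# gaps -> ["dns_control_plane_reachable"]
```

The output of a drill like this is exactly the artifact the board conversation needs: a concrete, dated list of what would have failed, and then a record of it shrinking quarter over quarter.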
The Conversation Your CEO Needs You to Initiate
Here’s what most CTOs get wrong about resilience: they think of it as an infrastructure problem. It’s not. It’s a business risk management problem that happens to be implemented in infrastructure.
Your CEO manages risk constantly: market risk, competitive risk, regulatory risk, financial risk. They make tradeoffs between the cost of mitigation and the cost of exposure. They accept some risks and invest to reduce others.
Resilience is no different. The question isn’t “should we be perfectly resilient?” The question is “what level of resilience does our business model require, and what gaps exist between that requirement and our current capability?”
Your job is to make that gap visible and the tradeoffs clear.
Framing the Business Continuity Conversation
When you sit down with your CEO to talk about resilience, don’t lead with availability zones and DNS. Lead with this:
“I want to talk about how long we can be down before it meaningfully impacts the business, and whether our current architecture can meet that requirement.”
Then walk through the four pillars as business capabilities, not technical implementations:
On Redundancy: “Our current architecture can survive individual server failures and data center issues. It cannot survive regional control plane failures like the October 20th AWS outage. If our cloud hosting region goes down, we go down with it. The question is whether that risk is acceptable given how often these regional failures are occurring. This was one of multiple major US-EAST-1 outages in recent years (2017, 2020, 2021, 2023, 2024, 2025).”
“Going multi-region would cost us approximately $X additional per month. Based on our revenue, a 15-hour outage like October 20th would cost approximately $Y. The investment pays for itself after one major incident.”
On Graceful Degradation: “Right now, when any critical component fails, the entire application tends to fail. I’d like to invest in making our architecture degrade gracefully, so that non-critical features can fail without taking down core revenue-generating functionality. Customers might lose recommendations or see slower search, but they can still check out and make purchases.”
“The tradeoff is engineering time: approximately X weeks to implement circuit breakers and feature flagging for graceful degradation. The benefit is converting what would be total outages into partial degradations.”
On Fault Isolation: “Currently, one customer’s bad behavior or one team’s deployment issue can affect all customers. I’d like to invest in better isolation: rate limiting per customer, resource limits per service, circuit breakers on external dependencies. This ensures that when something goes wrong, the blast radius is contained.”
“The cost is some additional architectural complexity and X hours of engineering time. The benefit is that one customer’s analytics query or one team’s bug doesn’t take down the entire platform for everyone.”
On Tested Recovery: “We have disaster recovery documentation, but we haven’t tested it under realistic failure conditions. I’d like to start running quarterly ‘game days’ where we intentionally break things and practice recovering. This will help us find gaps in our runbooks and build confidence in our recovery procedures.”
“The cost is 4-8 hours of engineering time per quarter for Y people. The benefit is knowing our disaster recovery actually works before we need it in a real emergency.”
The Bottom Line for Leadership
You’re not asking permission to eliminate all risk. You’re asking for alignment on which risks the business is willing to accept and which it wants to mitigate.
Some businesses can tolerate 15 hours of downtime. Many can’t. Some need instant failover. Most can live with 5 minutes. Your job is to understand where your business sits on that spectrum and whether your architecture can deliver it.
October 20th revealed a stark truth: The companies that stayed online didn’t get lucky. They had made different investment decisions months or years earlier. They had allocated engineering time to resilience. They had accepted higher infrastructure costs in exchange for multi-region redundancy. They had prioritized testing their disaster recovery over building new features.
Those weren’t technical decisions. Those were business priority decisions, made collaboratively between CTOs and CEOs who understood that infrastructure resilience directly impacts business continuity.
The Question That Matters
Three weeks ago, half the internet went dark for fifteen hours. Some CTOs spent that time helplessly watching. Some scrambled for hours executing untested failover procedures. Some monitored automated recovery with their morning coffee.
The difference wasn’t technical sophistication. It was leadership by CTOs who had framed resilience as a business continuity question and CEOs who had agreed to invest accordingly.
The next major outage is coming. It might be AWS again. It might be Google Cloud or Azure. It might be a different kind of cascade we haven’t seen yet.
When it comes, what conversation will you have with your CEO?
- Will you explain why you were offline for fifteen hours despite “following best practices”?
- Will you report that recovery took longer than expected because the runbook had gaps?
- Or will you be sending a brief note: “Major cloud outage occurred. Our systems executed automated failover. Customer impact was minimal”?
That choice starts with the conversation you have today about readiness, not the one you have during the next outage.
Your CEO is already thinking about business continuity. They think about it every time they review financial reserves, insurance policies, and succession plans. Resilience is just another form of business continuity planning, one that happens to live in infrastructure.
Make it visible. Make the tradeoffs clear. Make the gaps explicit.
Because the question your CEO will ask after the next major outage isn’t “what happened?” It’s “why weren’t we ready?”
The answer to that question is being written right now, in the investment decisions you’re making—or not making—about resilience.
Choose wisely.
Kathy Keating is co-author of the CTO Levels framework and facilitates the 7CTOs Growth Group Coaching program, where she works with startup CTOs and engineering leaders to navigate hypergrowth challenges. An experienced CTO and advisor, Kathy helps technical leaders build the systems and practices that enable companies to scale. This article is based on the Infrastructure block from CTO Levels, explored in depth during this month’s Growth Group Coaching sessions.