How do you know your country still exists?
A strange question.
It’s a little-known detail that, in the absence of any other sign that the United Kingdom still exists, the UK’s nuclear deterrent depends on submarine commanders being able to detect broadcasts of the BBC’s “Today” programme. If there is no such signal for three consecutive days, then the commander’s “Letters of Last Resort” are opened. These may, for example, require the commander to hand over control of the deterrent to the U.S. Navy, to retreat to Australia or Canada, or to launch a retaliatory strike. The Prime Minister writes these orders as one of their first tasks in office, and it’s reportedly a sobering reality check after the thrill of election victory.
I heard this tale from an acquaintance who worked on a Vanguard-class submarine (repeated here by historian Peter Hennessy), and it’s a lesson in systems resilience that has stuck with me ever since.
The principle of the last independent signal
The principle is straightforward: when all other systems fail, you need one independent signal that still works. The Royal Navy understood this and chose a broadcast system with no shared infrastructure, no authentication requirements, and no dependency on their more conventional monitoring systems.
Most of us don’t operate a nuclear deterrent, but the same principle applies to our crisis response systems. When primary operations fail, we need the ability to communicate with customers, coordinate the response, and monitor the situation. These capabilities must work at the very moment when everything else doesn’t.
The October 2025 AWS us-east-1 outage
In October 2025, AWS’s us-east-1 region suffered a DNS configuration error that took down services for 14 hours. This particular region is a core dependency for several of AWS’s own services, so the blast radius was severe: major brands (including Snapchat, BT, Vodafone, Lloyds Bank, and more) were affected during business hours across the UK and Europe. Customer-facing services were down. Internal communication systems were affected. Revenue stopped.
Significantly, AWS’s own status page was also down, and it was the last affected service to recover. The humble status page is easy to overlook – after all, when everything’s working perfectly, it’s rarely consulted. When disaster strikes, the status page finally gets its moment to shine. Unfortunately, this time the outage took out the status page along with the very services it was meant to report on.
Silence is not neutral: the business cost of missing information
The business impact of AWS’s missing status page was substantial. Affected organisations couldn’t tell their customers what was happening or when service might resume. Support teams fielded the same questions repeatedly through whatever channels remained. Social media filled with speculation. The absence of authoritative information amplified the reputational damage beyond the technical failure itself.
This pattern recurs across incidents of all scales. The systems organisations depend on to manage crises share infrastructure with the systems that fail during those crises. Status pages go down along with the services they monitor. Incident management tools become unreachable when you need them most. Communication platforms fail when you need to coordinate a response. Monitoring systems lose visibility during the outages they should be tracking.
DNS as the hidden single point of failure
Whether the ultimate cause is an attack or human error, the system most often at the root of a major incident is DNS – the Domain Name System, the global “phonebook” that translates domain names into IP addresses so that resources can be located across the Internet. Indeed, it was a DNS configuration flaw that caused the October 2025 AWS outage.
Consider what could happen during a major incident at any organisation – and consider what would happen at yours – when DNS is impacted. Your domain name is affected, naturally. So your email is affected. Staff logins are affected. What else? Call centre systems. Incident management tools. Content management systems. Monitoring dashboards. And perhaps your status page, if it shares infrastructure with the affected systems (including, importantly, DNS infrastructure).
These otherwise hidden dependencies may be discovered at the worst possible moment, when the cost is highest.
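One practical first step is to enumerate these services and check where their DNS actually lives. The sketch below is a minimal illustration rather than a tool: it assumes the third-party dnspython package, all of the domain names are placeholders, and it flags any crisis-critical domain whose authoritative nameservers overlap with those of the primary production domain (check at the zone apex, since subdomains normally inherit their parent zone’s nameservers).

```python
# Minimal sketch: flag crisis-response domains whose DNS is served by the same
# nameservers as the primary production domain.
# Assumes the third-party "dnspython" package (pip install dnspython);
# every domain below is a placeholder.
import dns.resolver

SERVICES = {
    "production site": "example.com",
    "status page": "example-status.com",
    "incident tool": "example-incidents.io",
    "staff SSO": "example-sso.com",
}

def nameservers(domain: str) -> set[str]:
    """Return the authoritative NS names for a domain (empty set if lookup fails)."""
    try:
        return {str(rr.target).rstrip(".").lower()
                for rr in dns.resolver.resolve(domain, "NS")}
    except Exception:
        return set()

primary_ns = nameservers(SERVICES["production site"])

for name, domain in SERVICES.items():
    ns = nameservers(domain)
    shared = ns & primary_ns
    if not ns:
        print(f"CHECK MANUALLY: no NS records found for {name} ({domain})")
    elif name != "production site" and shared:
        print(f"WARNING: {name} ({domain}) shares nameservers with production: {sorted(shared)}")
    else:
        print(f"OK: {name} ({domain})")
```

Shared nameservers aren’t automatically wrong, but every overlap this reports is a dependency someone should be able to explain – and the same questioning applies to hosting, CDNs, and certificate authorities.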
Why organisations miss these dependencies
I think most serious organisations, if asked, would claim their crisis response systems are independent and robust. I hope they are right. But clearly not all of them are: look at how many major brands went dark when a single AWS region failed.
Did the people who built the DNS infrastructure also build the communications and monitoring systems? The call centre? Staff authentication? It’s unlikely – and did those teams ever compare notes on shared dependencies? Probably not: these systems accrete over time, and mapping the dependencies between them is genuinely hard.
Testing all of this properly is non-trivial – it requires simulating realistic failure modes without breaking production systems. But start with the status page: has that been tested? That one independent signal (like the radio broadcast for the nuclear submarines) – can it be tested, and is it really independent of everything else? Clearly not in AWS’s case. And unlike the BBC broadcast, where silence itself carries the message, a silent status page tells you nothing – it’s supposed to be informative about the state of the platform. If AWS can get that wrong, so can anyone else.
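To make that concrete, here is roughly what probing that one independent signal can look like. This is a minimal sketch under stated assumptions – it relies on the third-party dnspython package, a placeholder hostname, and the well-known public resolvers 1.1.1.1 and 8.8.8.8 – and it deliberately avoids your own resolvers and load balancers: name resolution goes straight to public DNS, and the TLS check connects directly to the resolved IP address.

```python
# Minimal sketch of an "independent signal" probe for a status page.
# Assumes the third-party "dnspython" package; the hostname is a placeholder.
# Resolution bypasses the organisation's own resolvers, and the HTTPS check
# connects straight to the resolved IP, so the probe can keep working even
# when internal DNS or the primary platform is down.
import socket
import ssl

import dns.resolver

STATUS_HOST = "status.example-status.com"   # placeholder status page hostname
PUBLIC_RESOLVERS = ["1.1.1.1", "8.8.8.8"]   # Cloudflare and Google public DNS

def resolve_via(resolver_ip: str, hostname: str) -> list[str]:
    """Resolve A records using one specific public resolver only."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [resolver_ip]
    return [rr.address for rr in resolver.resolve(hostname, "A")]

def https_status_line(ip: str, hostname: str, timeout: float = 5.0) -> str:
    """Connect to the IP directly, validate the certificate for hostname, return the HTTP status line."""
    context = ssl.create_default_context()
    with socket.create_connection((ip, 443), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            tls.sendall(f"HEAD / HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n".encode())
            return tls.recv(1024).split(b"\r\n", 1)[0].decode()

for resolver_ip in PUBLIC_RESOLVERS:
    for ip in resolve_via(resolver_ip, STATUS_HOST):
        print(f"{resolver_ip} -> {ip}: {https_status_line(ip, STATUS_HOST)}")
```

The probe only earns its keep if it runs from infrastructure that shares nothing with production – a different cloud, a different DNS provider, ideally a different account – and alerts through an equally independent channel.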
Three common failure patterns in incident response
I’ve observed three common failure patterns:
- Shared DNS or network dependencies. Superficially, this is obvious, but aspects are easily overlooked. A status page cannot depend on the same DNS configuration as the systems on which it reports. A similar principle applies to monitoring systems, incident management tools, and internal communication platforms.
- Shared authentication systems. Crisis response tools might be hosted independently, but if accessing them requires logging in through systems that are down, they cannot be used. The people who need to coordinate the response cannot reach the tools designed for that purpose (a quick check for this is sketched at the end of this section).
- Testing only during normal operation. Systems work correctly when nothing is wrong. During an actual incident, under load, with degraded network conditions, they fail in ways nobody anticipated. This extends beyond status pages to all crisis response capability: can monitoring systems handle the spike in queries during an outage? Can communication platforms function when primary systems are degraded?
And what happens when all the systems that were down come back online at the same time? As galling as it may seem, I’ve seen this delay recovery more often than you might think (with anything from power spikes to floods of concurrent requests to blame).
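For the second pattern above, a cheap smoke test is to follow each crisis tool’s login flow and see whose domains it touches. The sketch below is illustrative only – it assumes the third-party requests library, and every URL and the identity provider domain are placeholders – and it simply flags any redirect hop that lands on the corporate single sign-on domain.

```python
# Minimal sketch: does reaching a crisis tool route through the corporate SSO?
# Assumes the third-party "requests" package; all URLs and the IdP domain are placeholders.
from urllib.parse import urlparse

import requests

CORPORATE_IDP = "sso.example.com"            # placeholder identity provider domain
CRISIS_TOOLS = {
    "status page admin": "https://admin.example-status.com/",
    "incident tool": "https://example-incidents.io/",
    "on-call paging": "https://example-paging.com/",
}

for name, url in CRISIS_TOOLS.items():
    response = requests.get(url, allow_redirects=True, timeout=10)
    # response.history holds every redirect hop taken before the final response.
    hops = [hop.url for hop in response.history] + [response.url]
    if any(urlparse(hop).hostname == CORPORATE_IDP for hop in hops):
        print(f"WARNING: {name} routes through {CORPORATE_IDP} - unusable if SSO is down")
    else:
        print(f"OK: {name} ({len(hops) - 1} redirects, none via the corporate IdP)")
```

It won’t catch every coupling (some tools only touch SSO after you press “log in”), but it surfaces the obvious ones long before an incident does.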
Why this matters to boards and regulators
The business case for tackling this properly is straightforward. During incidents, reputational damage compounds quickly. Customers who cannot get information make assumptions. The longer the silence, the worse the assumptions. Social media amplifies uncertainty. The incident itself may be a technical failure and, often, beyond reasonable control; communication failure is an organisational choice that makes it worse.
For regulated organisations from banking to energy to transport, the stakes are higher. Regulators increasingly expect robust operational resilience, including the ability to communicate during disruptions. The absence of working crisis communication during an incident becomes evidence of inadequate operational resilience.
At Softwire, we build systems for organisations where failure would carry serious consequences: critical national infrastructure, regulated financial services, healthcare systems, even systems ensuring our fundamental democratic processes. These systems include independent capabilities to manage and communicate status during incidents. More importantly, they include verification that this capability actually works when needed. And, yes, they include resilient status pages.
I’ll finish with this: if you’re confident your crisis response systems will work during an incident, that confidence should rest on having thought through these dependencies and tested them properly. If instead you have doubts, that’s a reasonable position. Most organisations (as the October 2025 AWS outage abundantly demonstrated) haven’t solved this problem well.
If you’d like to discuss how to test whether your incident response capability will actually work during an outage, or how to build something genuinely independent, let’s talk. We’ve been there before, and we can share what works. And we’ve a lot more to recommend than listening out for radio broadcasts, but if you’re reading this from a nuclear submarine: I very much hope they’re still on air!