Traffic was still hitting the dead primary. A monitoring alert fired at 02:32, just seconds after the on-call engineer closed the incident bridge call at 02:31, declaring the DNS failover ‘complete.’ The records updated. The runbook was followed. Health checks on the secondary were passing. Yet, for a full 18 minutes, the system failed operationally. Only one of those numbers—the 18 minutes—actually mattered.
This wasn’t a typical config error or a vendor hiccup. Every component, from the DNS records to the health checks, performed exactly as designed. The flaw wasn’t in the individual layers but in the assumptions about how they’d behave in concert during a crisis. The problem is the ‘Declaration Gap’ – the insidious period between when a failover is declared complete and when traffic actually begins flowing to the new destination.
The Declaration Gap: A Model of Failure
What is this gap, really? It’s the operational reality of DNS failover, which isn’t a clean switch flicked off and on, but more akin to a slow drain. Traffic doesn’t magically jump ship the instant a DNS record is updated. Instead, it continues to flow to the old address until every active path that was relying on that record has exhausted its cached state and re-resolved. This re-resolution process takes time—time that most runbooks conveniently ignore.
The incident described is a textbook example. TTLs had been reduced to 60 seconds two weeks prior, a sensible precaution for planned maintenance. Health check intervals on the secondary were set to a brisk 30 seconds. The DNS record update itself propagated to authoritative nameservers within 90 seconds. By all the documented metrics, the failover should have been over within two or three minutes. But the system designers, and the engineers following the runbook, had the wrong definition of ‘complete.’
Here’s the breakdown of the four layers that, while individually functioning correctly, collectively created this critical delay:
DNS TTL: A Floor, Not a Ceiling
The reduced 60-second TTL meant that resolvers re-querying after that period would receive the updated record promptly. TTL, however, is merely a guideline. Resolvers aren’t obligated to adhere to it rigidly; many, especially under load, will cache records for longer than specified. The 60-second TTL narrowed the stale-cache window, but it didn’t close it.
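The drain this describes can be modeled. Here is a minimal sketch, with hypothetical numbers: it assumes each resolver’s last refresh is uniformly spread over its cache window, and that 10% of resolvers clamp the TTL upward by 5x, the longer-than-specified caching the paragraph describes.

```python
def stale_fraction(t, ttl, clamp_share=0.10, clamp_factor=5):
    """Estimate the fraction of resolvers still serving the old record
    t seconds after the authoritative change.  clamp_share and
    clamp_factor are illustrative assumptions, not measured values."""
    def remaining(effective_ttl):
        # A resolver that refreshed uniformly at random within the past
        # `effective_ttl` seconds still holds the old answer with this
        # probability.
        return max(0.0, 1.0 - t / effective_ttl)

    well_behaved = (1 - clamp_share) * remaining(ttl)
    clamped = clamp_share * remaining(ttl * clamp_factor)
    return well_behaved + clamped

print(stale_fraction(60, 60))   # well-behaved caches drained; ~8% linger
print(stale_fraction(300, 60))  # 0.0 -- even clamped caches have expired
```

The point of the model: one full TTL after the change, traffic has not fully moved; the tail belongs to the caches that ignored the TTL.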
Health Check Lag: Endpoint vs. Traffic State
The health check confirmed the secondary’s availability before the failover was declared. This verified the endpoint’s state. What it didn’t account for was the transitional window: the period between the primary’s failure being recognized and all traffic paths ceasing to point to it. Health checks assess operational status, not traffic flow dynamics.
CDN Origin Cache: The Hidden Holdout
Content Delivery Networks introduce their own caching layer with independent TTLs. Even after the DNS record change, the CDN didn’t immediately re-resolve the origin. It continued serving from its cached data for the remainder of its own cache’s TTL. This meant traffic routed through the CDN kept reaching the old origin until the CDN’s internal cache expired—a separate timing event that hadn’t been factored into the Recovery Time Objective (RTO).
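Folding the CDN into the RTO model changes the arithmetic. A back-of-the-envelope sketch with hypothetical values (the 15-minute CDN origin TTL and 2-minute detection time are assumptions for illustration, not figures from the incident):

```python
detection = 120         # seconds to recognize the primary failure
auth_propagation = 90   # change visible on authoritative nameservers
resolver_ttl = 60       # pre-reduced DNS record TTL
cdn_origin_ttl = 900    # hypothetical CDN origin-cache TTL (15 minutes)

# Traffic keeps flowing to the old origin until the slowest cache layer
# expires, so the DNS contribution to RTO is gated by the maximum TTL
# across layers, not by the DNS record TTL alone.
dns_rto = detection + auth_propagation + max(resolver_ttl, cdn_origin_ttl)
print(dns_rto / 60)  # 18.5 minutes, despite a 60-second DNS TTL
```

With these assumed inputs, the aggressive 60-second DNS TTL is irrelevant: the CDN’s origin cache alone produces a gap on the scale of the 18 minutes in the incident.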
Client-Side Resolver Persistence: Independent Caches Abound
Enterprise clients, corporate recursive resolvers, browsers with built-in DNS caches, and mobile devices all maintain their own DNS record caches. Each of these systems correctly honored their individual caching logic, independent of authoritative server changes. The aggregate effect? The system failed, despite each part doing its job.
Most DNS failover testing validates the wrong thing. A test that confirms the DNS record updated and the health check passed has validated the protection plane: the mechanisms in place to enable a failover. It has not validated the recovery plane, which is the actual user experience: whether traffic moved, when it moved, and what the distribution looked like during the transition window.
Rethinking Failover Testing and RTO
True DNS failover testing needs to go beyond checking record updates and health. It must actively measure traffic distribution. This requires simulating production traffic, accounting for CDN-transited requests and enterprise resolver caches, and timestamping both the DNS record change execution and the point at which traffic distribution on the secondary crosses a predefined threshold. The delta between these two timestamps is the true contribution of DNS failover to the RTO.
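That delta can be computed mechanically from access logs. A sketch, under assumptions: the log is reduced to (timestamp, destination) tuples sorted by time, and the 95% threshold and 60-second rolling window are illustrative choices.

```python
from collections import deque

def declaration_gap(change_ts, requests, threshold=0.95, window=60):
    """Seconds between the DNS record change (change_ts) and the moment
    the rolling share of requests hitting the secondary crosses
    `threshold`.  `requests` is an iterable of (timestamp, destination)
    tuples sorted by timestamp."""
    recent = deque()
    for ts, dest in requests:
        recent.append((ts, dest))
        # Drop entries that have aged out of the rolling window.
        while recent[0][0] < ts - window:
            recent.popleft()
        share = sum(1 for _, d in recent if d == "secondary") / len(recent)
        if ts >= change_ts and share >= threshold:
            return ts - change_ts
    return None  # traffic never converged within the log
```

Timestamping the record change and feeding real traffic logs through a check like this yields the recovery-plane number: the figure a runbook should report instead of ‘record updated.’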
Pre-reducing TTL is a necessary step, but it’s insufficient on its own. The CDN’s cache TTL must also be pre-reduced. Neglecting this critical detail makes the CDN the bottleneck, negating aggressive DNS TTL tuning. Monitoring during a failover window should focus on application-layer traffic distribution, not just nameserver-level DNS propagation.
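A pre-failover audit can surface that bottleneck before the incident does. A small sketch (the layer names and TTL values are hypothetical) flagging any cache layer whose TTL exceeds the cache-drain budget inside the RTO:

```python
# Hypothetical per-layer TTLs collected before a planned failover window.
layer_ttls = {
    "dns_record": 60,
    "cdn_origin_cache": 900,
    "internal_resolver": 30,
}
expiry_budget = 120  # seconds the RTO allots to cache drain

# Any layer over budget will hold traffic on the old endpoint longer
# than the RTO assumes -- here the CDN, not the DNS record, gates recovery.
bottlenecks = {layer: ttl for layer, ttl in layer_ttls.items()
               if ttl > expiry_budget}
print(bottlenecks)
```

Run against real values, an empty result is the precondition for trusting the DNS TTL number at all.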
The fundamental truth is this: DNS failover isn’t complete when the record changes. It’s only complete when the traffic distribution changes. This simple distinction demands a complete reevaluation of how we model RTO for any system relying on DNS-based failover strategies.
Frequently Asked Questions
What is the ‘Declaration Gap’ in DNS failover?
The Declaration Gap is the period between when a DNS failover is declared complete and when traffic has actually shifted to the new destination. This occurs because traffic doesn’t immediately stop flowing to the old record; it persists until cached DNS records expire across various layers.
How can I prevent my DNS failover from failing?
Ensure your failover testing measures actual traffic movement, not just DNS record updates. Account for multiple caching layers like CDNs and client resolvers, and pre-reduce TTLs on all relevant services. Monitor application-layer traffic distribution during transitions.
Is a 60-second TTL enough for failover?
A 60-second TTL is a good start for reducing lag, but it’s not a guarantee. Other caching mechanisms, like those in CDNs and client resolvers, can cause traffic to persist on the old endpoint even after the DNS record has updated and the TTL has expired.