SEV-1 · public access

Pages returned 500 errors after octodns automation deleted a backend DNS record

GitHub · Source

Started: Apr 13, 2026
Duration: 1h 37m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: GitHub Pages traffic globally
Services: pages, pages-frontend, octodns
configuration error, customer-facing, data repair, dns, dns failure, frontend, partial outage, scheduled job run

Summary

An automated DNS management tool, octodns, deleted the DNS record for a Pages backend storage host after its upstream data source intermittently failed to return that record. The automation treated the missing record as stale and removed it, so Pages requests routed to that host returned HTTP 500 errors. Engineers mitigated the incident by re-creating the deleted record. The incident also showed that the Pages frontend did not fail over to healthy backend hosts when one became unresolvable.

Impact

During the impact window, the Pages service averaged an error rate of 10.58 percent, peaking at 12.77 percent of requests. Approximately 17.5 million requests failed with HTTP 500 errors.
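
For a rough sense of scale, the figures above imply a total request volume. The sketch below is back-of-the-envelope arithmetic only. It assumes the 17.5 million failures and the 10.58 percent average cover the same set of requests, and that the impact window matches the reported 1h 37m duration. Neither assumption is confirmed by the report.

```python
# Figures taken from the report.
failed_requests = 17.5e6          # approximate failed requests (HTTP 500)
avg_error_rate = 0.1058           # 10.58 percent average error rate

# Assumption: the impact window equals the reported 1h 37m duration.
window_seconds = (1 * 60 + 37) * 60

# Implied total requests served to Pages during the window.
implied_total_requests = failed_requests / avg_error_rate

# Implied average request rate over the window.
implied_requests_per_second = implied_total_requests / window_seconds

print(f"implied total requests: {implied_total_requests:,.0f}")
print(f"implied average rate:   {implied_requests_per_second:,.0f} req/s")
```

Under those assumptions the implied total is on the order of 165 million requests. Treat it as an estimate derived from rounded inputs, not a reported figure.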

Root cause

The octodns automation periodically syncs DNS records against an upstream data source.

The upstream data source intermittently failed to return a record for a Pages backend storage host.

octodns interpreted the missing record as stale and deleted it from authoritative DNS rather than treating the missing data as ambiguous.

Pages frontend routing did not fail over to healthy hosts when a backend became unresolvable; instead it returned HTTP 500 errors.

There was no safeguard preventing octodns from deleting DNS records owned by other systems.
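
The two automation gaps above (acting on a single missing read, and deleting records owned by other systems) can be illustrated with a small guard. This is a minimal sketch and not octodns code or its API. The class `DeletionGuard`, the `managed` ownership set, and the `miss_threshold` of three consecutive misses are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionGuard:
    """Decides which records a sync pass may delete (hypothetical helper)."""

    managed: set[str]            # assumption: records this automation owns
    miss_threshold: int = 3      # assumption: consecutive misses before delete
    misses: dict[str, int] = field(default_factory=dict)

    def plan_deletions(self, current: set[str], upstream: set[str]) -> set[str]:
        """Return the records considered safe to delete after this pass.

        current:  record names present in authoritative DNS
        upstream: record names returned by the upstream data source
        """
        to_delete: set[str] = set()
        for name in current:
            if name in upstream:
                # Record was seen again, so any earlier miss was transient.
                self.misses.pop(name, None)
                continue
            if name not in self.managed:
                # Owned by another system: never a deletion candidate.
                continue
            # Missing this pass: treat as ambiguous until it repeats.
            count = self.misses.get(name, 0) + 1
            self.misses[name] = count
            if count >= self.miss_threshold:
                to_delete.add(name)
        return to_delete
```

With a guard like this, one intermittent miss from the upstream source only increments a counter, and a record outside the `managed` set is skipped regardless of how often it is missing. Whether a consecutive-miss count is the right signal depends on how the upstream source fails in practice.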

Resolution

Engineers re-created the deleted DNS record, which restored resolution for the affected Pages backend host and returned error rates to normal.

Lessons

  • Automated DNS management that treats a missing upstream record as 'stale, delete it' lets a single flaky read from the data source delete production state in one step.
  • A frontend tier that depends on a single backend host being resolvable rather than failing over to healthy peers turns any DNS hiccup into a user-visible 500.
  • The blast radius of automation that mutates production state is set by the worst-case behavior of its input source, not its average behavior.

Action items

  • Implement availability-zone-tolerant routing in the Pages frontend so that an unresolvable backend host triggers failover to healthy hosts rather than returning errors.
  • Add safeguards to prevent automated deletion of DNS records owned by other systems.
  • Treat intermittently missing upstream data as ambiguous in octodns rather than as authoritative deletion intent.
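
The first action item can be sketched as a host-selection step that skips unresolvable backends. This is an illustration under stated assumptions, not the Pages frontend's actual routing logic, which the report does not describe. It assumes the requested content is available from more than one backend host. `pick_backend` and `default_resolver` are hypothetical names; `socket.getaddrinfo` and `socket.gaierror` are from the Python standard library.

```python
import socket
from typing import Callable, Iterable, Optional


def default_resolver(host: str) -> bool:
    """Return True if the host name currently resolves in DNS."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def pick_backend(
    hosts: Iterable[str],
    resolves: Callable[[str], bool] = default_resolver,
) -> Optional[str]:
    """Return the first backend host that resolves, or None if none do.

    Assumption: `hosts` is ordered by preference and every host in it
    can serve the request.
    """
    for host in hosts:
        if resolves(host):
            return host
    return None  # caller decides how to fail when no backend is usable


# Usage with a fake resolver, simulating one deleted DNS record.
deleted = {"storage-1.internal.example"}
candidates = ["storage-1.internal.example", "storage-2.internal.example"]
chosen = pick_backend(candidates, resolves=lambda h: h not in deleted)
print(chosen)  # storage-2.internal.example
```

Injecting the resolver keeps the selection logic testable without real DNS. A production version would likely also cache resolution results and check backend health, both of which are outside this sketch.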