Billing service config change overwhelmed cache, degrading github.com, Codespaces, Packages, Copilot, and Actions
GitHub · Source
- Started: Apr 23, 2026
- Duration: 48m
- Users affected: Not disclosed
- Revenue impact: Not disclosed
- Blast radius: github.com web, Codespaces, Packages, Copilot, Actions
- Services: billing-service, shared-cache, github-web, codespaces, packages, copilot, actions
Summary
A configuration change to an internal billing service overwhelmed a shared cache, causing request timeouts and degraded experiences across github.com, Codespaces, Packages, Copilot, and Actions. Web requests returned 5xx errors, Codespaces create and resume requests failed at high rates, and a large fraction of Actions jobs were delayed or failed. The incident was mitigated by rolling back or correcting the billing configuration, after which Actions drained its queued backlog.
Impact
Approximately 1.5 percent of all github.com web requests returned 5xx errors and unicorn pages (GitHub's server error page). Codespaces failures peaked at 45 percent for create requests and 65 percent for resume requests. Packages impact was concentrated on the Maven path, with 50 percent of downloads and 70 percent of uploads failing. Actions saw a peak of 8 percent failed jobs, and up to 85 percent of jobs were delayed by more than 5 minutes during the window.
Root cause
A configuration change applied to an internal billing service altered the workload that hit a shared cache.
The new workload pattern overwhelmed the cache, causing requests through it to time out.
Many user-facing services (web, Codespaces, Packages, Copilot, Actions) depend on this shared cache for billing and entitlement checks, so the cache degradation cascaded into a wide blast radius.
The shared cache had no isolation between billing's traffic and traffic from other consumers, so the billing change directly degraded other services' latency.
There was no pre-deploy load model or canary gate that would have surfaced the cache impact before broad rollout.
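A pre-deploy load model of the kind described as missing can be very simple. The sketch below is illustrative only: the class `CacheLoadModel`, the function `gate_config_change`, and every number are hypothetical, not GitHub's tooling or real capacity figures. It compares the projected cache traffic after a config change against a capacity budget that reserves some headroom.

```python
from dataclasses import dataclass


@dataclass
class CacheLoadModel:
    """Hypothetical capacity model for a shared cache (illustrative values only)."""

    capacity_rps: float             # sustained requests/sec the cache can serve
    baseline_rps: float             # current aggregate traffic from all consumers
    headroom_fraction: float = 0.2  # share of capacity kept free for spikes

    def allows(self, added_rps: float) -> bool:
        """Return True if baseline plus added load fits within the budget."""
        budget = self.capacity_rps * (1.0 - self.headroom_fraction)
        return self.baseline_rps + added_rps <= budget


def gate_config_change(
    model: CacheLoadModel,
    current_consumer_rps: float,
    projected_consumer_rps: float,
) -> str:
    """Decide whether a config change may roll out broadly.

    The projected rate would come from a canary or a load estimate; how it is
    measured is outside this sketch.
    """
    added = projected_consumer_rps - current_consumer_rps
    return "proceed" if model.allows(added) else "block"
```

A gate like this only helps if the projected rate is measured on a small canary slice first, since the point is to learn the new traffic pattern before it reaches the whole cache.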
Resolution
The mitigation corrected the billing service configuration and removed the load pattern that was overwhelming the cache. github.com, Codespaces, Packages, and Copilot recovered quickly. Actions worked through its queued backlog before fully recovering.
Lessons
- Shared caches that sit between billing and many user-facing services are a hidden coupling point; a config change in one consumer can degrade every other consumer.
- Billing and entitlement checks on hot paths benefit from per-consumer cache isolation, since their traffic patterns can change suddenly with feature or pricing updates.
- Async work queues like Actions have a recovery tail beyond the time-to-mitigate; reporting recovery time should include backlog burndown.
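The last lesson is simple arithmetic: a queue drains only as fast as processing capacity exceeds new arrivals, so full recovery lags mitigation. A minimal sketch follows, with hypothetical function names and made-up inputs (the incident report does not disclose backlog size or drain rates).

```python
def backlog_drain_minutes(
    backlog_jobs: float, drain_per_min: float, arrival_per_min: float
) -> float:
    """Minutes needed to clear a backlog while new jobs keep arriving."""
    net_rate = drain_per_min - arrival_per_min
    if net_rate <= 0:
        # The queue never shrinks if arrivals match or exceed processing capacity.
        raise ValueError("drain rate must exceed arrival rate for the backlog to clear")
    return backlog_jobs / net_rate


def total_recovery_minutes(
    time_to_mitigate_min: float,
    backlog_jobs: float,
    drain_per_min: float,
    arrival_per_min: float,
) -> float:
    """Recovery time reported as time-to-mitigate plus backlog burndown."""
    return time_to_mitigate_min + backlog_drain_minutes(
        backlog_jobs, drain_per_min, arrival_per_min
    )
```

With illustrative inputs of 30 minutes to mitigate, 6,000 queued jobs, 500 jobs processed per minute, and 300 new jobs arriving per minute, the backlog takes another 30 minutes to clear, so the reported recovery time would be 60 minutes rather than 30.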
Action items
- Add isolation or per-consumer rate limits on the shared cache so that one service's traffic pattern cannot overwhelm capacity used by others.
- Require canary or load-modeling gates for billing service configuration changes that affect cache traffic patterns.
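One common way to implement per-consumer rate limits is a token bucket per consumer in front of the cache. The sketch below assumes that approach; the report does not say which mechanism GitHub chose, and the class names, consumer names, and limits are all hypothetical. Each consumer draws from its own bucket, so a burst from one cannot spend another's budget.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Token bucket that refills at `rate_per_sec` up to `burst` tokens."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.tokens = burst      # start with a full bucket
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; refill based on elapsed time first."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerConsumerLimiter:
    """Keeps one bucket per consumer so their cache budgets are independent."""

    def __init__(
        self,
        limits: Dict[str, Tuple[float, float]],  # consumer -> (rate_per_sec, burst)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.buckets = {
            name: TokenBucket(rate, burst, clock)
            for name, (rate, burst) in limits.items()
        }

    def allow(self, consumer: str) -> bool:
        """Return True if this consumer may send one request to the cache."""
        bucket = self.buckets.get(consumer)
        # Unknown consumers are rejected in this sketch; a real system might
        # instead assign them a default bucket.
        return bucket is not None and bucket.try_acquire()
```

A limiter like this bounds request rate, not cost per request, so it would need to be paired with capacity planning if a config change makes individual cache operations more expensive.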