SEV-1 · public access

Billing service config change overwhelmed cache, degrading github.com, Codespaces, Packages, and Actions

GitHub · Source

Started
Apr 23, 2026
Duration
48m
Users affected
Not disclosed
Revenue impact
Not disclosed
Blast radius
github.com web, Codespaces, Packages, Copilot, Actions
Services
billing-service, shared-cache, github-web, codespaces, packages, copilot, actions
Tags
cache, cache stampede, cascading failure, ci/cd, configuration change, configuration error, configuration fix, customer-facing, degraded performance, frontend

Summary

A configuration change to an internal billing service overwhelmed a shared cache, leading to request timeouts and degraded experiences across github.com, Codespaces, Packages, Copilot, and Actions. Web requests returned 5xx errors, Codespaces create and resume requests failed at high rates, and a large fraction of Actions jobs were delayed or failed. The mitigation corrected the billing service configuration, after which Actions drained its queued backlog.

Impact

Approximately 1.5 percent of all github.com web requests returned 5xx errors and unicorn pages. Codespaces failures peaked at 45 percent for create requests and 65 percent for resume requests. Packages impact was concentrated on the Maven path, with 50 percent of downloads and 70 percent of uploads failing. Actions saw a peak of 8 percent failed jobs, and up to 85 percent of jobs were delayed by more than 5 minutes during the incident window.

Root cause

A configuration change applied to an internal billing service altered the workload that hit a shared cache.

The new workload pattern overwhelmed the cache, causing requests through it to time out.

Many user-facing services (web, Codespaces, Packages, Copilot, Actions) depend on this shared cache for billing and entitlement checks, so the cache degradation cascaded into a wide blast radius.

The shared cache had no isolation between billing's traffic and traffic from other consumers, so the billing change directly degraded other services' latency.
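
The missing isolation can be illustrated with a per-consumer token bucket in front of the cache. This is a minimal sketch, not a description of GitHub's cache: the class names, the rates, and the choice of a token bucket are all assumptions made for illustration. Only the consumer names are taken from the services listed above.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` tokens."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    last_refill: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket
        self.last_refill = 0.0

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerConsumerLimiter:
    """Hypothetical guard: each cache consumer draws from its own bucket."""

    def __init__(self, limits: dict[str, tuple[float, float]]) -> None:
        # limits maps consumer name -> (rate per second, burst capacity)
        self._buckets = {
            name: TokenBucket(rate, capacity)
            for name, (rate, capacity) in limits.items()
        }

    def allow(self, consumer: str, now: float | None = None) -> bool:
        bucket = self._buckets.get(consumer)
        if bucket is None:
            return False  # this sketch rejects consumers without a budget
        return bucket.allow(time.monotonic() if now is None else now)


# Illustrative limits only; real budgets would come from capacity planning.
limiter = PerConsumerLimiter({
    "billing-service": (2.0, 2.0),
    "github-web": (100.0, 100.0),
})

# A burst of five billing requests at t=0 exhausts only billing's budget.
billing = [limiter.allow("billing-service", now=0.0) for _ in range(5)]
web = limiter.allow("github-web", now=0.0)
print(billing, web)
```

With separate budgets, a burst from billing-service is throttled at its own limit while github-web requests continue to be admitted. A real deployment would also need to decide what a throttled billing request does (fail open, fail closed, or serve a stale entitlement), which this sketch leaves out.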

There was no pre-deploy load model or canary gate that would have surfaced the cache impact before broad rollout.

Resolution

The mitigation corrected the billing service configuration and removed the load pattern that was overwhelming the cache. github.com, Codespaces, Packages, and Copilot recovered quickly. Actions worked through its queued backlog before fully recovering.

Lessons

  • Shared caches that sit between billing and many user-facing services are a hidden coupling point; a config change in one consumer can degrade every other consumer.
  • Billing and entitlement checks on hot paths benefit from per-consumer cache isolation, since their traffic patterns can change suddenly with feature or pricing updates.
  • Async work queues like Actions have a recovery tail beyond the time-to-mitigate; reporting recovery time should include backlog burndown.
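
The recovery-tail point in the last lesson can be made concrete with a rough estimate. The sketch below assumes a steady arrival rate and a steady processing capacity after mitigation, which real queues rarely have; the function names and all numbers are hypothetical and are not figures from this incident.

```python
import math


def backlog_drain_minutes(backlog_jobs: float,
                          arrival_per_min: float,
                          capacity_per_min: float) -> float:
    """Minutes needed to clear a backlog while new jobs keep arriving.

    Only the headroom (capacity minus arrivals) works down the backlog.
    """
    headroom = capacity_per_min - arrival_per_min
    if headroom <= 0:
        return math.inf  # the queue never drains without spare capacity
    return backlog_jobs / headroom


def reported_recovery_minutes(time_to_mitigate_min: float,
                              backlog_jobs: float,
                              arrival_per_min: float,
                              capacity_per_min: float) -> float:
    """Recovery time that includes backlog burndown, not just mitigation."""
    return time_to_mitigate_min + backlog_drain_minutes(
        backlog_jobs, arrival_per_min, capacity_per_min
    )


# Made-up example: 6000 queued jobs, 1000 new jobs/min, 1500 jobs/min capacity.
drain = backlog_drain_minutes(6000, 1000, 1500)
total = reported_recovery_minutes(30, 6000, 1000, 1500)
print(drain, total)
```

In this made-up case the queue needs 12 minutes after mitigation to drain, so a report that stops the clock at mitigation would understate the user-visible impact by that amount.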

Action items

  • Add isolation or per-consumer rate limits on the shared cache so that one service's traffic pattern cannot overwhelm capacity used by others.
  • Require canary or load-modeling gates for billing service configuration changes that affect cache traffic patterns.
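
The canary gate in the last action item could take roughly the following shape: compare cache metrics from a canary running the new billing configuration against the current baseline, and block the rollout on regression. This is a sketch under stated assumptions. The `CacheStats` type, the `canary_passes` function, and every threshold are hypothetical; the post-mortem does not describe the actual gate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CacheStats:
    """Cache metrics observed for one deployment group (hypothetical shape)."""
    requests: int
    timeouts: int
    p99_latency_ms: float

    @property
    def timeout_rate(self) -> float:
        return self.timeouts / self.requests if self.requests else 0.0


def canary_passes(baseline: CacheStats,
                  canary: CacheStats,
                  max_latency_ratio: float = 1.5,
                  max_timeout_rate_increase: float = 0.01,
                  min_requests: int = 1000) -> tuple[bool, str]:
    """Return (ok, reason). Thresholds are illustrative defaults."""
    if canary.requests < min_requests:
        return False, "not enough canary traffic to judge"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return False, "cache p99 latency regressed"
    if canary.timeout_rate - baseline.timeout_rate > max_timeout_rate_increase:
        return False, "cache timeout rate regressed"
    return True, "ok"


# Made-up metrics for illustration.
baseline = CacheStats(requests=10_000, timeouts=10, p99_latency_ms=8.0)
healthy = CacheStats(requests=2_000, timeouts=2, p99_latency_ms=9.0)
overloaded = CacheStats(requests=2_000, timeouts=300, p99_latency_ms=40.0)

print(canary_passes(baseline, healthy))
print(canary_passes(baseline, overloaded))
```

A gate like this only helps if the canary exercises the shared cache with production-like traffic; a canary that sees too little load would pass a change that later overwhelms the cache at full rollout, which is why the sketch refuses to judge below a minimum request count.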