SEV-1 · public access

Elasticsearch overload from suspected botnet traffic degraded search across GitHub

GitHub · Source

Started: Apr 27, 2026
Duration: 6h 15m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: search-backed UI surfaces across GitHub: Issues, Pull Requests, Projects, Actions, Packages
Services: elasticsearch, search, issues, pull-requests, projects, actions, packages
Tags: abuse event, capacity shortfall, customer-facing, ddos or abuse traffic, partial outage, scale down, search

Summary

GitHub's Elasticsearch cluster was overloaded by traffic that engineers later attributed to suspected botnet activity. Search-backed UI surfaces, including Issues, Pull Requests, Projects, Actions workflow runs, and Packages, returned timed-out or empty results. Engineers identified the source of the additional load and disabled it, allowing the cluster to recover. After the cluster stabilized, GitHub had to reindex Pull Request data, and that reindexing continued into the following days.

Impact

During the impact window, users experienced intermittent failures viewing Issues, Pull Requests, Projects, and Actions workflow runs. Search requests timed out or returned empty results. Pull Request listing pages remained incomplete for an extended period after the immediate incident, while the Elasticsearch indexes were rebuilt. Packages and Actions also showed degraded performance during the window.

Root cause

An external traffic source, suspected to be a botnet, drove a large volume of search load against GitHub's Elasticsearch cluster.

The cluster's capacity headroom was insufficient to absorb the additional load while continuing to serve legitimate traffic.

Many user-facing pages (Issues lists, PR lists, Projects, Actions runs, Packages) read from Elasticsearch on the hot path, so the cluster degradation produced a wide blast radius.

The cluster's degraded state caused indexing and read paths to fall behind, requiring a multi-day reindex once the load was shed.

Rate limiting or shaping at the edge did not identify and isolate the abusive traffic before it overwhelmed the cluster.
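
Edge shaping of this kind can take many forms; one common building block is a per-client token bucket in front of the search tier. The sketch below is illustrative only. The class names, the notion of a client key, and the rate and capacity values are assumptions made for the example, not details from GitHub's report.

```python
class TokenBucket:
    """Permits `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so ordinary clients are not penalized
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would reject or queue the search request


class SearchRateLimiter:
    """Keeps one bucket per client key (hypothetical: an IP, token, or session)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_key, now):
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_key] = bucket
        return bucket.allow(now)


# Example: 1 request/second sustained, bursts of 2 (assumed values).
limiter = SearchRateLimiter(rate=1.0, capacity=2.0)
decisions = [limiter.allow("client-a", 0.0) for _ in range(3)]
```

A botnet typically spreads requests across many client keys, so a per-key limit like this would likely need to be combined with a global budget for the search tier or with traffic classification to be effective.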

Resolution

Engineers identified the source of the additional load and disabled it, after which Elasticsearch began recovering. Service degradation across Actions, Issues, Packages, and Pull Requests was mitigated by 22:35 UTC and the incident was closed at 22:46 UTC. Reindexing of Pull Request data continued for several days, with full backfill completing on May 1.

Lessons

  • Search clusters that back hot-path UI surfaces are an attractive target for abuse and need capacity headroom plus shaping that keeps them serving during an event.
  • When the read path falls behind during an event, recovery time is often dominated by reindexing rather than by mitigation of the original cause.
  • Coupling many user-facing pages to a single search cluster means the cluster's worst day is the platform's worst day.

Action items

  • Improve identification and isolation of abusive traffic patterns at the edge so they do not reach Elasticsearch.
  • Reduce coupling between hot-path UI surfaces and the search cluster, or provide degraded fallbacks when search is unavailable.
  • Build faster reindexing paths so that recovery after a cluster overload does not take multiple days.
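
The degraded-fallback item above can be sketched as a small circuit breaker around search calls. This is a minimal illustration under stated assumptions: `SearchCircuitBreaker`, `search_fn`, `fallback_fn`, and the threshold and cooldown values are hypothetical names and numbers, not part of GitHub's systems or of any Elasticsearch client API.

```python
class SearchCircuitBreaker:
    """Stops calling a failing search backend and serves a fallback instead."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, search_fn, fallback_fn, now):
        # While open and still cooling down, skip the backend entirely.
        if self.opened_at is not None and now - self.opened_at < self.cooldown_seconds:
            return fallback_fn()
        try:
            result = search_fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # open (or re-open) the circuit
            return fallback_fn()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def failing_search():
    raise TimeoutError("search backend timed out")  # simulated overload


def healthy_search():
    return ["pr-1", "pr-2"]  # placeholder results


def degraded_listing():
    return "degraded"  # e.g. a listing served without search filters


breaker = SearchCircuitBreaker(failure_threshold=2, cooldown_seconds=10.0)
first = breaker.call(failing_search, degraded_listing, now=0.0)
second = breaker.call(failing_search, degraded_listing, now=1.0)   # opens the circuit
during = breaker.call(healthy_search, degraded_listing, now=2.0)   # still open
after = breaker.call(healthy_search, degraded_listing, now=12.0)   # trial call succeeds
```

The design choice here is that an open circuit also removes load from the struggling cluster, which may help it recover sooner, while pages keep rendering in a reduced form.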