SEV-1 · public access

Actions Ubuntu hosted runners delayed by performance regression in VM reimage process

GitHub · Source

Started: Apr 28, 2026
Duration: 4h 28m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: Actions Standard Ubuntu 22 and Ubuntu 24 hosted runner jobs
Services: actions, actions-hosted-runners, vm-reimage
Tags: capacity shortfall, ci/cd, container orchestration, customer-facing, delayed processing, deploy, regression from deploy, rollback

Summary

A performance regression in the VM reimage process for Actions hosted runners slowed the rate at which Standard Ubuntu 22 and Ubuntu 24 runners returned to the available pool, lowering effective runner capacity. About 8 percent of jobs on those runners were delayed past 5 minutes or failed during the window. Engineers mitigated by rolling back to a known-good image version, after which capacity recovered.

Impact

Approximately 8 percent of hosted runner jobs using Standard Ubuntu 22 and Ubuntu 24 experienced delays greater than 5 minutes or failures during the impact window. Larger and self-hosted runners were not affected.

Root cause

A change to the VM reimage process for Actions hosted runners introduced a performance regression that lengthened the reimage step.

Slower reimage reduced the rate at which fresh runners returned to the available pool.

With lower effective capacity, jobs queued past their normal start time; about 8 percent were delayed by more than 5 minutes or failed.

Telemetry on reimage performance was not granular enough to surface the slowdown immediately, which lengthened time to detect.

There was no automated capacity-vs-queue-depth signal that would have triggered a mitigation before user-visible delay.
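A capacity-vs-queue-depth signal of the kind described above could be sketched roughly as follows. This is a minimal illustration, not GitHub's implementation: the `PoolSample` fields, the function names, the 5-minute threshold, and the three-sample window are all assumptions, and the drain estimate deliberately ignores newly arriving jobs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PoolSample:
    """One point-in-time reading of a runner pool (hypothetical fields)."""
    queued_jobs: int                 # jobs waiting for a runner
    idle_runners: int                # runners ready to accept a job now
    runners_returned_per_min: float  # runners finishing reimage per minute


def estimated_drain_minutes(sample: PoolSample) -> float:
    """Minutes until the current backlog gets runners, ignoring new arrivals."""
    backlog = max(sample.queued_jobs - sample.idle_runners, 0)
    if backlog == 0:
        return 0.0
    if sample.runners_returned_per_min <= 0:
        return float("inf")  # nothing is coming back to the pool
    return backlog / sample.runners_returned_per_min


def should_alert(samples: Sequence[PoolSample],
                 threshold_minutes: float = 5.0,
                 consecutive: int = 3) -> bool:
    """Alert only when the last `consecutive` samples all exceed the threshold."""
    if len(samples) < consecutive:
        return False
    recent = samples[-consecutive:]
    return all(estimated_drain_minutes(s) > threshold_minutes for s in recent)
```

Requiring several consecutive breaches is one way to avoid paging on a brief burst of queued jobs; a production signal would also need to account for job arrival rate.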

Resolution

Engineers mitigated by rolling back to a known-good image version, which restored normal reimage performance and let the runner pool refill. The incident was closed at 17:09 UTC after queue depth and start times returned to baseline.

Lessons

  • Capacity in pool-based systems like hosted runners is a function of recycle rate, not just count; a slow recycle is functionally a capacity loss.
  • Reimage performance is an internal metric that stays invisible on user-facing dashboards until users feel queue delay; explicit telemetry on it is worth the cost.
  • A rollback of an image is a clean mitigation when the regression is in the image-build step rather than in runner runtime.
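The first lesson, that capacity depends on recycle rate and not only on runner count, can be illustrated with a small steady-state model. This is a simplified sketch that assumes each runner handles one job and is then reimaged before reuse; the function name and all numbers are illustrative and do not come from the incident report.

```python
def jobs_per_hour(pool_size: int, job_minutes: float, reimage_minutes: float) -> float:
    """Steady-state job starts per hour for a pool of single-use runners.

    Each runner spends job_minutes running a job, then reimage_minutes
    being reimaged, before it can accept the next job.
    """
    cycle_minutes = job_minutes + reimage_minutes
    return pool_size * 60.0 / cycle_minutes


# Illustrative numbers only: same pool size, reimage time doubles.
baseline = jobs_per_hour(pool_size=1000, job_minutes=6.0, reimage_minutes=2.0)
regressed = jobs_per_hour(pool_size=1000, job_minutes=6.0, reimage_minutes=4.0)

# Fraction of throughput lost although no runner was removed from the pool.
capacity_loss = 1.0 - regressed / baseline
```

In this toy model the runner count never changes, yet throughput drops by a fifth, which is why a slow recycle behaves like a capacity loss.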

Action items

  • Improve granularity of reimage telemetry across the Actions runner service and the underlying compute provider so similar regressions are diagnosed faster.
  • Address the underlying performance issue in the reimage process so the rolled-back image can be replaced with a forward-fixed version.
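As one possible shape for the more granular reimage telemetry in the first action item, the sketch below compares per-phase reimage durations against a baseline and flags phases that slowed down. The phase names, the 1.5x ratio, and the sample durations are hypothetical; the report does not describe how the reimage process is actually broken down.

```python
from statistics import median
from typing import Dict, List


def slow_phases(baseline: Dict[str, List[float]],
                current: Dict[str, List[float]],
                max_ratio: float = 1.5) -> Dict[str, float]:
    """Return phases whose median duration grew by more than max_ratio."""
    flagged = {}
    for phase, durations in current.items():
        base = baseline.get(phase)
        if not base or not durations:
            continue  # no data to compare for this phase
        ratio = median(durations) / median(base)
        if ratio > max_ratio:
            flagged[phase] = round(ratio, 2)
    return flagged


# Hypothetical per-phase durations in seconds, for illustration only.
baseline_seconds = {
    "wipe_disk": [20, 22, 21],
    "apply_image": [60, 58, 62],
    "boot_and_register": [30, 31, 29],
}
current_seconds = {
    "wipe_disk": [21, 20, 22],
    "apply_image": [120, 118, 125],
    "boot_and_register": [30, 32, 31],
}

regressions = slow_phases(baseline_seconds, current_seconds)
```

Timing each phase separately, instead of only the end-to-end reimage, narrows a slowdown to one step and makes it easier to attribute it to the runner service or the underlying compute provider.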