SEV-0 · public access

Data Loss from Database Migration TRUNCATE CASCADE

Linear · Source

Started: Jan 24, 2024
Duration: 3h 47m
Users affected: Not disclosed
Revenue impact: Not disclosed
Blast radius: —
Services: database, sync-engine, api, notifications

Tags: configuration error, customer-facing, data corruption, data loss, data repair, database, deploy, full outage, human error, rollback

Summary

A faulty database migration using TRUNCATE TABLE ... CASCADE accidentally deleted production data across multiple core tables, including issue and document descriptions, comments, notifications, favorites, and reactions. Multi-layer caching hid the deletion for about 30 minutes. Linear was taken offline for about one hour, the database was restored from a full backup taken roughly two hours before the migration ran, and a two-day data restoration effort recovered over 99% of the lost data.

Impact

All users experienced approximately one hour of total platform downtime. 12% of workspaces had data unavailable until restoration completed; an additional 7% lost automated-process changes such as generated Cycles. Over 99% of data was recovered within 36 hours, with a small number of unresolvable conflicts leaving roughly 0.44 sync packets lost per affected workspace on average.

Root cause

A database migration intended to clear data from two new, non-user-facing tables used the SQL statement TRUNCATE TABLE <new_table> CASCADE. The CASCADE keyword propagates the truncation to all tables with foreign keys pointing at the target table, which unexpectedly deleted production data from several critical tables. The migration was code-reviewed and tested locally, but both the author and the peer reviewers missed the dangerous behavior of CASCADE in this context. Caching at multiple layers (a local client cache and a database-layer sync cache) masked the deletion for approximately 30 minutes after the migration was reported complete (07:20 to 07:52 UTC in the timeline), delaying detection.
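The propagation described above can be pictured as a graph walk over foreign-key references. The sketch below is illustrative only: the table names and foreign-key edges are hypothetical, not Linear's schema, and it assumes PostgreSQL-style semantics in which CASCADE also follows references to the tables it pulls in.

```python
from collections import deque


def truncate_cascade_targets(target, referenced_by):
    """Return every table that TRUNCATE <target> CASCADE would empty.

    referenced_by maps a table name to the tables that hold a foreign key
    pointing at it. The walk is transitive: a table pulled in by CASCADE
    pulls in the tables that reference it as well.
    """
    seen = {target}
    queue = deque([target])
    while queue:
        table = queue.popleft()
        for child in referenced_by.get(table, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical foreign-key graph, for illustration only.
referenced_by = {
    "new_table": ["comment"],
    "comment": ["reaction", "notification"],
}

print(sorted(truncate_cascade_targets("new_table", referenced_by)))
# ['comment', 'new_table', 'notification', 'reaction']
```

Running a walk like this against the real schema before a migration ships would list every table a CASCADE could reach, which is the information the reviewers were missing.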

Resolution

Linear was placed into maintenance mode to prevent further writes. The database was restored from a full backup taken at 04:47 UTC, approximately two hours before the migration ran. Point-in-time recovery was available but had never been tested or tooled, so the full-backup path was used instead. A sync-engine cursor conflict caused by the database rollback was resolved by resetting the Redis cache and restarting the sync service. A custom restoration script then replayed captured sync actions to recover changes made between the backup timestamp and the maintenance window.
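The source does not describe how the restoration script worked internally. As a rough sketch of the general idea (replaying captured actions that fall between the backup and the maintenance window, oldest first, and setting aside any that conflict), something like the following could apply. SyncAction, ReplayConflict, and replay_window are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SyncAction:
    """Hypothetical shape of one captured sync action."""
    workspace_id: str
    applied_at: datetime
    payload: dict


class ReplayConflict(Exception):
    """Raised by the apply callback when an action cannot be replayed."""


def replay_window(actions, start, end, apply):
    """Replay actions with start < applied_at <= end, oldest first.

    Returns (replayed, conflicts). Actions whose apply() raises
    ReplayConflict are collected rather than aborting the run.
    """
    replayed, conflicts = [], []
    in_window = [a for a in actions if start < a.applied_at <= end]
    for action in sorted(in_window, key=lambda a: a.applied_at):
        try:
            apply(action)
        except ReplayConflict:
            conflicts.append(action)
        else:
            replayed.append(action)
    return replayed, conflicts


# Window taken from the timeline: 04:47 to 09:56 UTC on Jan 24, 2024.
BACKUP_AT = datetime(2024, 1, 24, 4, 47, tzinfo=timezone.utc)
MAINTENANCE_AT = datetime(2024, 1, 24, 9, 56, tzinfo=timezone.utc)

actions = [
    SyncAction("ws1", datetime(2024, 1, 24, 4, 0, tzinfo=timezone.utc), {"op": "old"}),
    SyncAction("ws1", datetime(2024, 1, 24, 8, 0, tzinfo=timezone.utc), {"op": "update"}),
    SyncAction("ws1", datetime(2024, 1, 24, 6, 0, tzinfo=timezone.utc), {"op": "create"}),
]
applied = []
replayed, conflicts = replay_window(actions, BACKUP_AT, MAINTENANCE_AT, applied.append)
print([a.payload["op"] for a in replayed])  # ['create', 'update']
```

The action captured before the backup is skipped because the restored database already contains it; only changes made after 04:47 need replaying.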

Timeline

04:47 DETECT (database)
Full database backup completed; this becomes the recovery baseline.
07:01 INVEST (database)
Faulty migration merged and deployed to production. The TRUNCATE ... CASCADE statement executes, silently deleting data from multiple production tables.
07:20 INVEST (database)
Migration reported as completed. Impact remains invisible due to multi-layer caching.
07:52 DETECT (notifications)
Unusual notification behavior noticed by an engineer using their own inbox. Engineering and support alerted to investigate user reports.
08:10 INVEST (database)
Critical incident opened. Engineers join a Zoom call; additional responders paged.
08:36 INVEST (api)
Status page updated and incident posted publicly on X: investigating data access issues.
09:20 INVEST (database)
Additional status update published. Root cause narrowed to the faulty migration commit via a spike in GraphQL endpoint warnings correlated with the migration timestamp.
09:56 MITIG (api)
Linear placed into maintenance mode, blocking all writes and presenting a maintenance screen to all users.
10:48 RESOLV (database)
Database restored from the 04:47 full backup. Linear brought back online. Sync cursor conflict fixed by flushing the Redis cache and restarting the sync service.
11:30 RESOLV (database)
Data restoration process started for all changes made between 04:47 and 09:56 UTC using captured sync action logs.
17:49 RESOLV (database)
Workspace-by-workspace restoration script executed after dry-run validation. 98% of affected workspaces recovered by end of day.
08:39, next day RESOLV (database)
Final workspace restoration completed. Over 99% of lost data recovered.

Attribution

Linear

By Engineering

Published Jan 24, 2024

Lessons

Multi-layer caching can mask data destruction events for extended periods, and error metric baselines that exclude 'expected' warnings can silently miss catastrophic failures.
Point-in-time recovery capability is only useful if it has been tested and tooled; untested recovery paths are effectively unavailable during a high-pressure incident.
Dangerous SQL operations like CASCADE truncation need review beyond standard code review: database administrator review backed by schema-aware linting.
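As a minimal sketch of the kind of lint check the last lesson points toward, the function below flags TRUNCATE statements that include CASCADE in a migration's SQL text. It is a plain regular-expression check, not schema-aware: it does not skip comments or string literals, and a production linter would more likely rely on a real SQL parser. The function name is hypothetical.

```python
import re

# A TRUNCATE statement, up to its terminating semicolon, that contains CASCADE.
TRUNCATE_CASCADE = re.compile(r"\bTRUNCATE\b[^;]*\bCASCADE\b", re.IGNORECASE)


def find_truncate_cascade(sql):
    """Return the 1-based line numbers where a TRUNCATE ... CASCADE starts."""
    return [
        sql.count("\n", 0, match.start()) + 1
        for match in TRUNCATE_CASCADE.finditer(sql)
    ]


migration = (
    "CREATE TABLE t (id int);\n"
    "TRUNCATE TABLE new_table CASCADE;\n"
    "TRUNCATE other;\n"
)
print(find_truncate_cascade(migration))  # [2]
```

A CI step could fail the build whenever this list is non-empty and require an explicit override from a database reviewer.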

Action items

Remove TRUNCATE privileges from all production database users.
Implement linting rules that flag CASCADE on TRUNCATE in migration diffs and block merge without explicit DBA sign-off.
Build and routinely test tooling for point-in-time database recovery.
Automate database migration testing in a staging environment populated with production-representative data.
Implement read-only mode so clients can continue reading data during recovery operations.
Add data integrity monitors that detect unexpected row-count drops across core tables.
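The row-count monitor in the last action item could start from a comparison as simple as the one sketched below. The table names, the counts, and the 50% threshold are assumptions chosen for illustration; how the counts are collected (for example, periodic snapshots) is left open.

```python
def find_row_count_drops(previous, current, max_drop_ratio=0.5):
    """Return tables whose row count fell by more than max_drop_ratio.

    previous and current map table names to row counts from two consecutive
    snapshots. A table missing from current is treated as having zero rows.
    """
    alerts = {}
    for table, before in previous.items():
        after = current.get(table, 0)
        if before > 0 and (before - after) / before > max_drop_ratio:
            alerts[table] = (before, after)
    return alerts


# Hypothetical snapshots taken a few minutes apart.
previous = {"comment": 1_000_000, "reaction": 500_000, "issue": 200_000}
current = {"comment": 0, "reaction": 499_000, "issue": 201_000}

print(find_row_count_drops(previous, current))
# {'comment': (1000000, 0)}
```

Because the check reads counts from the database rather than through client or sync caches, a truncation would surface at the next snapshot instead of waiting for a user to notice.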