Anatomy of a Production Outage
A detailed postmortem analysis of a cascading failure that brought down a payment processing system, and the systemic improvements that followed.
Executive Summary
On October 15th at 14:32 UTC, a routine database migration triggered a cascading failure across the payment processing pipeline. The outage lasted 47 minutes and affected approximately 12,000 transactions. This analysis examines the root cause, contributing factors, and the systemic improvements we implemented.
This is a reconstructed postmortem for educational purposes. All identifying details have been anonymized.
Timeline
- 14:32 — Database migration begins on primary payment database
- 14:33 — Lock contention causes query latency to spike from 15ms to 2,400ms
- 14:35 — Connection pool exhaustion begins cascading to upstream services
- 14:38 — Circuit breakers trip across 6 dependent services
- 14:42 — Customer-facing error rate exceeds 40%
- 14:45 — Incident declared, on-call SRE begins investigation
- 15:02 — Root cause identified: long-running ALTER TABLE holding an exclusive table lock
- 15:12 — Migration rolled back, connection pools begin recovering
- 15:19 — All services return to normal operation
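The cascade at 14:35 — slow queries holding connections until the pool runs dry — can be sketched with a minimal simulation. The pool size, hold times, and timeouts below are illustrative only, not the production values:

```python
import queue
import threading
import time

POOL_SIZE = 5           # hypothetical pool size, not the production value
CHECKOUT_TIMEOUT = 0.1  # how long a caller waits for a free connection

# Model the connection pool as a bounded queue of connection tokens.
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

def slow_query(hold_seconds):
    """Check out a connection and hold it, as a lock-blocked query would."""
    conn = pool.get(timeout=CHECKOUT_TIMEOUT)
    time.sleep(hold_seconds)
    pool.put(conn)

# Queries stuck behind the migration's lock hold their connections far
# longer than new callers are willing to wait.
workers = [threading.Thread(target=slow_query, args=(0.5,))
           for _ in range(POOL_SIZE)]
for w in workers:
    w.start()
time.sleep(0.05)  # let the workers drain the pool

# A new request now finds no free connection and times out — this is the
# error that cascaded to the upstream services.
try:
    pool.get(timeout=CHECKOUT_TIMEOUT)
    exhausted = False
except queue.Empty:
    exhausted = True

for w in workers:
    w.join()
```

Once every connection is pinned behind the blocked queries, the failure mode upstream is not "slow" but "no connection available", which is why the error rate climbed so quickly.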
Root Cause Analysis
The root cause was an ALTER TABLE operation that acquired an ACCESS EXCLUSIVE lock on the payments table, blocking all reads and writes for its duration. Staging tests suggested the operation would complete in seconds, but the production table was significantly larger, and the migration was still running when it was rolled back 40 minutes later.
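One of the guardrails we adopted can be sketched as a small helper that wraps DDL in fail-fast timeouts. This is a hypothetical function, not code from the incident; it assumes PostgreSQL-style `lock_timeout` and `statement_timeout` settings, with illustrative values:

```python
def guarded_ddl(statement: str,
                lock_timeout_ms: int = 2_000,
                statement_timeout_ms: int = 60_000) -> list[str]:
    """Build a migration transaction that fails fast instead of blocking.

    Hypothetical helper; PostgreSQL semantics are assumed:
      * lock_timeout aborts the DDL if its lock is not granted within the
        window, instead of queueing behind live traffic (and making every
        later query queue behind the lock request in turn).
      * statement_timeout aborts the DDL itself if it runs too long once
        it actually holds the lock.
    SET LOCAL scopes both settings to this transaction only.
    """
    return [
        "BEGIN;",
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms';",
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms';",
        statement,
        "COMMIT;",
    ]

# Hypothetical column name, for illustration only.
migration = guarded_ddl("ALTER TABLE payments ADD COLUMN refund_ref bigint;")
```

If either timeout fires, the transaction rolls back cleanly and the migration can be retried off-peak or replaced with an online schema change, rather than holding the table lock against live traffic.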
Systemic Improvements
- Implemented online schema migration tooling (gh-ost) for zero-downtime DDL
- Added lock timeout configuration to all database migrations
- Created staging environments with production-scale data volumes
- Improved circuit breaker configuration to fail fast under database pressure
- Added connection pool monitoring with proactive alerting
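The fail-fast behavior behind the circuit breaker improvement can be sketched as follows. This is a minimal illustration of the pattern, not the production implementation; the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (illustrative, not production code).

    After `failure_threshold` consecutive failures the breaker opens and
    rejects calls immediately for `reset_after` seconds, so callers fail
    fast instead of tying up connections behind a struggling database.
    """

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the breaker
        return result

# Demo: two failing calls open the breaker; the third is rejected fast.
breaker = CircuitBreaker(failure_threshold=2, reset_after=60.0)

def failing_query():
    raise ConnectionError("database overloaded")

for _ in range(2):
    try:
        breaker.call(failing_query)
    except ConnectionError:
        pass

try:
    breaker.call(failing_query)
    rejected = False
except RuntimeError:
    rejected = True
```

The key property during this incident would have been the rejection path: once open, the breaker returns an error in microseconds instead of holding a connection for the full checkout timeout, which stops the exhaustion from propagating upstream.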
Lessons Learned
“The gap between staging and production is where incidents hide. Close the gap or accept the risk.”