Anatomy of a Production Outage
A detailed postmortem analysis of a cascading failure that brought down a payment processing system, and the systemic improvements that followed.
Executive Summary
On October 15th at 14:32 UTC, a routine database migration triggered a cascading failure across the payment processing pipeline. The outage lasted 47 minutes and affected approximately 12,000 transactions. This analysis examines the root cause, contributing factors, and the systemic improvements we implemented.
This is a reconstructed postmortem for educational purposes. All identifying details have been anonymized.
Timeline
- 14:32 — Database migration begins on primary payment database
- 14:33 — Lock contention causes query latency to spike from 15ms to 2,400ms
- 14:35 — Connection pool exhaustion begins cascading to upstream services
- 14:38 — Circuit breakers trip across 6 dependent services
- 14:42 — Customer-facing error rate exceeds 40%
- 14:45 — Incident declared, on-call SRE begins investigation
- 15:02 — Root cause identified: long-running ALTER TABLE holding an exclusive table lock
- 15:12 — Migration rolled back, connection pools begin recovering
- 15:19 — All services return to normal operation
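The cascade at 14:35 — slow queries holding connections until the pool runs dry — can be sketched with a minimal simulation. The pool size, hold times, and timeouts below are illustrative only, not the production values:

```python
import queue
import threading
import time

POOL_SIZE = 5           # hypothetical pool size, not the production value
CHECKOUT_TIMEOUT = 0.1  # how long a caller waits for a free connection

# Model the connection pool as a bounded queue of connection tokens.
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

def slow_query(hold_seconds):
    """Check out a connection and hold it, as a lock-blocked query would."""
    conn = pool.get(timeout=CHECKOUT_TIMEOUT)
    time.sleep(hold_seconds)
    pool.put(conn)

# Queries stuck behind the migration's lock hold their connections far
# longer than new callers are willing to wait.
workers = [threading.Thread(target=slow_query, args=(0.5,))
           for _ in range(POOL_SIZE)]
for w in workers:
    w.start()
time.sleep(0.05)  # let the workers drain the pool

# A new request now finds no free connection and times out — this is the
# error that cascaded to the upstream services.
try:
    pool.get(timeout=CHECKOUT_TIMEOUT)
    exhausted = False
except queue.Empty:
    exhausted = True

for w in workers:
    w.join()
```

Once every connection is pinned behind the blocked queries, the failure mode upstream is not "slow" but "no connection available", which is why the error rate climbed so quickly.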
Root Cause Analysis
The root cause was an ALTER TABLE operation that acquired an ACCESS EXCLUSIVE lock on the payments table, blocking all reads and writes for its duration. Staging tests suggested the operation would complete in seconds, but the production table was significantly larger, and the migration was still running when it was rolled back 40 minutes later.
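One of the guardrails we adopted can be sketched as a small helper that wraps DDL in fail-fast timeouts. This is a hypothetical function, not code from the incident; it assumes PostgreSQL-style `lock_timeout` and `statement_timeout` settings, with illustrative values:

```python
def guarded_ddl(statement: str,
                lock_timeout_ms: int = 2_000,
                statement_timeout_ms: int = 60_000) -> list[str]:
    """Build a migration transaction that fails fast instead of blocking.

    Hypothetical helper; PostgreSQL semantics are assumed:
      * lock_timeout aborts the DDL if its lock is not granted within the
        window, instead of queueing behind live traffic (and making every
        later query queue behind the lock request in turn).
      * statement_timeout aborts the DDL itself if it runs too long once
        it actually holds the lock.
    SET LOCAL scopes both settings to this transaction only.
    """
    return [
        "BEGIN;",
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms';",
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms';",
        statement,
        "COMMIT;",
    ]

# Hypothetical column name, for illustration only.
migration = guarded_ddl("ALTER TABLE payments ADD COLUMN refund_ref bigint;")
```

If either timeout fires, the transaction rolls back cleanly and the migration can be retried off-peak or replaced with an online schema change, rather than holding the table lock against live traffic.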
Systemic Improvements
- Implemented online schema migration tooling (gh-ost) for zero-downtime DDL
- Added lock timeout configuration to all database migrations
- Created staging environments with production-scale data volumes
- Improved circuit breaker configuration to fail fast under database pressure
- Added connection pool monitoring with proactive alerting
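The fail-fast behavior behind the circuit breaker improvement can be sketched as follows. This is a minimal illustration of the pattern, not the production implementation; the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (illustrative, not production code).

    After `failure_threshold` consecutive failures the breaker opens and
    rejects calls immediately for `reset_after` seconds, so callers fail
    fast instead of tying up connections behind a struggling database.
    """

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the breaker
        return result

# Demo: two failing calls open the breaker; the third is rejected fast.
breaker = CircuitBreaker(failure_threshold=2, reset_after=60.0)

def failing_query():
    raise ConnectionError("database overloaded")

for _ in range(2):
    try:
        breaker.call(failing_query)
    except ConnectionError:
        pass

try:
    breaker.call(failing_query)
    rejected = False
except RuntimeError:
    rejected = True
```

The key property during this incident would have been the rejection path: once open, the breaker returns an error in microseconds instead of holding a connection for the full checkout timeout, which stops the exhaustion from propagating upstream.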
Lessons Learned
“The gap between staging and production is where incidents hide. Close the gap or accept the risk.”