Escalation Management Runbook

Handle high-severity incidents without losing the account — roles, flow, comms, metrics and rollout.

Figure: War room with roles, timeline and customer comms for a major incident

Audience & situation

For CS, Support, Engineering and Sales leaders who must stabilize high-severity incidents (SEV-1/2) while protecting revenue and trust. Use this runbook when outages, security events, data issues or high-impact defects trigger executive visibility and churn risk.

Introduction

Most escalations are lost in the first hour—not because engineering cannot fix the issue, but because ownership and communication are unclear. Customers can tolerate defects; they will not tolerate silence or spin. A repeatable escalation system protects the relationship while the fix is in flight: we declare quickly, assign clear roles, publish a cadence for updates and document decisions in one place.

This runbook defines when to declare, who leads what, how we communicate and how we close. It compresses chaos into a rhythm: contain risk, inform stakeholders, fix fast, and learn so it does not happen again. We keep executives focused on decisions (risk, customer impact, credits) while the incident team handles technical work. Everyone sees the same timeline and facts.

We design for mixed contexts: SaaS outages, integrations failing at a key customer, and security events with legal implications. In all cases, we bias toward early declaration and over-communication. We prefer a plain “we are on it” update every 2 hours over a perfectly formatted post at hour six.

Finally, we protect the account. Sales and CS are part of the core team from minute one. They inform the update cadence and align on credits, extension of terms or executive calls. We avoid promises the team cannot meet and escalate decisions that change scope or risk. After resolution, we run a blameless post-mortem that produces specific code/process changes and an outbound narrative that rebuilds confidence.

What good looks like

Common pitfalls

Playbook

1) Declare & triage (first 15 minutes)

2) Assign roles

3) Contain & communicate (first hour)

4) Fix & track

5) Close cleanly

6) Post-mortem & remediation

Artifacts

Core

  • Incident record (ID, severity, owners, start, status, affected scope).
  • Timeline log (UTC stamps, action, owner, result).
  • Customer update template (acknowledge → impact → action → next update).
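The three core artifacts can be sketched as plain data structures. The following is a minimal illustration in Python, not a prescribed schema: the field names mirror the bullets above, while the class names, the `render_customer_update` helper, the incident ID, the people, and the times are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class TimelineEntry:
    """One timeline row: UTC stamp, action, owner, result."""
    at: datetime
    action: str
    owner: str
    result: str = ""


@dataclass
class IncidentRecord:
    """Incident record: ID, severity, owners, start, status, affected scope."""
    incident_id: str
    severity: str
    owners: dict
    start: datetime
    status: str
    affected_scope: str
    timeline: list = field(default_factory=list)

    def log(self, at: datetime, action: str, owner: str, result: str = "") -> None:
        """Append a timeline entry, normalising the stamp to UTC."""
        if at.tzinfo is None:
            raise ValueError("timeline stamps must be timezone-aware")
        self.timeline.append(
            TimelineEntry(at.astimezone(timezone.utc), action, owner, result)
        )


def render_customer_update(record, impact, action, now, cadence):
    """Fill the acknowledge -> impact -> action -> next update template."""
    next_update = (now + cadence).astimezone(timezone.utc)
    return "\n".join([
        f"[{record.incident_id}] We are aware of an issue affecting {record.affected_scope}.",
        f"Impact: {impact}",
        f"Action: {action}",
        f"Next update: {next_update:%Y-%m-%d %H:%M} UTC",
    ])


# Hypothetical incident; the ID, names, and times are placeholders.
t0 = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
incident = IncidentRecord(
    incident_id="INC-0001",
    severity="SEV-1",
    owners={"IC": "Alex", "Comms Lead": "Sam"},
    start=t0,
    status="mitigating",
    affected_scope="write requests",
)
incident.log(t0 + timedelta(minutes=9), "SEV-1 declared", "Alex", "war room opened")
update = render_customer_update(
    incident,
    impact="write errors for 18% of sessions",
    action="failing over to a replica",
    now=t0 + timedelta(minutes=15),
    cadence=timedelta(minutes=60),
)
print(update)
```

Computing the next-update time from the cadence, rather than typing it by hand, keeps every customer message on the published clock.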

Supporting

  • Executive brief (one pager, refreshed every 4h).
  • Post-mortem template (blameless, actions with owners/dates).
  • Credit/terms calculator and approval matrix.
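The runbook does not specify how the credit calculator or approval matrix works, so the following is only one possible shape for it. It expresses a proposed credit as a share of the account's ARR and looks up who must sign off. The tier ceilings and approver roles are invented placeholders and should be replaced with your own policy.

```python
from decimal import Decimal

# HYPOTHETICAL approval matrix: (ceiling as a fraction of ARR, approver role).
# These tiers are placeholders, not values taken from the runbook.
APPROVAL_MATRIX = [
    (Decimal("0.01"), "CS lead"),
    (Decimal("0.05"), "VP Customer Success"),
]
DEFAULT_APPROVER = "CFO"  # anything above the highest ceiling


def credit_share_of_arr(credit: Decimal, arr: Decimal) -> Decimal:
    """Return the credit as a fraction of annual recurring revenue."""
    if arr <= 0:
        raise ValueError("ARR must be positive")
    if credit < 0:
        raise ValueError("credit cannot be negative")
    return credit / arr


def required_approver(credit: Decimal, arr: Decimal) -> str:
    """Pick the first tier whose ceiling covers the credit's share of ARR."""
    share = credit_share_of_arr(credit, arr)
    for ceiling, approver in APPROVAL_MATRIX:
        if share <= ceiling:
            return approver
    return DEFAULT_APPROVER


arr = Decimal("100000")
small = required_approver(Decimal("500"), arr)     # 0.5% of ARR
medium = required_approver(Decimal("3000"), arr)   # 3% of ARR
large = required_approver(Decimal("10000"), arr)   # 10% of ARR
```

Using `Decimal` avoids floating-point rounding surprises when the result feeds a finance approval.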

Worked examples

Example A — SaaS outage (database failover)

Situation: Global read latency spikes → write errors for 18% of sessions.

Actions: SEV-1 declared in 9 min; rollback feature flag; failover to replica; status page + 60-min cadence; exec brief every 4h.

Result: Mitigated in 72 min; root cause: connection pool misconfig; shipped guardrails in 48h; zero churn; one credit issued.
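To make the clock in Example A concrete, the sketch below lists when updates fall due under a fixed cadence. It assumes, purely for illustration, that both the 9-minute declaration and the 72-minute mitigation are measured from incident start and that the cadence clock starts at declaration. The `update_deadlines` helper and the start time are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def update_deadlines(declared_at, cadence, until):
    """Return the times by which an update is due, from declaration up to `until`."""
    if cadence <= timedelta(0):
        raise ValueError("cadence must be positive")
    deadlines = []
    due = declared_at + cadence
    while due <= until:
        deadlines.append(due)
        due += cadence
    return deadlines


# Hypothetical start time; offsets follow Example A under the stated assumptions.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
declared = start + timedelta(minutes=9)
mitigated = start + timedelta(minutes=72)

customer_updates = update_deadlines(declared, timedelta(minutes=60), mitigated)
exec_briefs = update_deadlines(declared, timedelta(hours=4), mitigated)
```

Under these assumptions one scheduled customer update falls due before mitigation (at minute 69) and no 4-hour executive brief refresh is due yet. The initial declaration post is separate from this schedule.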

Example B — Security incident (credential leak suspicion)

Situation: Anomalous logins in two regions; potential token reuse.

Actions: SEV-1; forced token rotation; geo blocks; forensic logging; legal engaged; 2-hour updates.

Result: No data exfil confirmed; rotated secrets; updated playbooks; customer trust preserved via transparent comms.

Example C — P1 integration failure at strategic account

Situation: Nightly sync fails; finance cannot close month.

Actions: SEV-2; hotfix patch; backfill job; daily exec call with customer CFO; MAP for hardening.

Result: Resolved in 36h; created reliability program; expansion continued.

Metrics

Leading: time-to-ack (TTA), time-to-first-update (TTU), update SLA adherence, executive brief cadence kept, % incidents with named IC and Comms Lead.

Lagging: MTTR, customer sentiment delta (before/after), SLA credits issued, churn risk change on affected accounts, recurrence rate within 90 days.
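Several of these metrics fall straight out of the timeline stamps. The sketch below shows one way to compute TTA, TTU, update SLA adherence and MTTR. The definitions are assumptions, since the runbook does not pin them down: TTA and TTU are measured from incident start, and an update counts as on time if it lands within one cadence interval of the previous update (or of the declaration, for the first one). All timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def minutes_between(earlier, later):
    """Elapsed minutes between two timezone-aware datetimes."""
    return (later - earlier).total_seconds() / 60


def update_sla_adherence(update_times, declared_at, cadence):
    """Fraction of updates posted within `cadence` of the previous checkpoint."""
    if not update_times:
        return 0.0
    previous = declared_at
    on_time = 0
    for posted in sorted(update_times):
        if posted - previous <= cadence:
            on_time += 1
        previous = posted
    return on_time / len(update_times)


def mttr_minutes(restore_durations):
    """Mean time to restore, given per-incident restore durations in minutes."""
    return mean(restore_durations)


# Hypothetical timeline for one incident.
t0 = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
acked = t0 + timedelta(minutes=9)
updates = [t0 + timedelta(minutes=m) for m in (30, 85, 160)]

tta = minutes_between(t0, acked)          # time-to-ack
ttu = minutes_between(t0, updates[0])     # time-to-first-update
adherence = update_sla_adherence(updates, acked, timedelta(minutes=60))
mttr = mttr_minutes([72.0, 48.0])         # two hypothetical incidents
```

Here the third update arrives 75 minutes after the second, so it misses a 60-minute cadence and adherence comes out at two out of three.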

Escalation flow: Declare → Roles → Contain → Communicate → Fix → Resolve → Post-mortem → Remediate
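The flow above can be modeled as a small state machine that refuses to skip stages. This is a deliberately strict sketch: it treats the flow as purely linear, whereas a real incident may loop between containing and communicating. The `EscalationFlow` class is hypothetical; only the stage names come from the flow line.

```python
FLOW = [
    "Declare", "Roles", "Contain", "Communicate",
    "Fix", "Resolve", "Post-mortem", "Remediate",
]


class EscalationFlow:
    """Tracks the current stage and only allows moving to the next one."""

    def __init__(self):
        self._index = 0

    @property
    def stage(self):
        return FLOW[self._index]

    def advance(self, to):
        """Move to stage `to`; raise ValueError if it is not the next stage."""
        if self._index + 1 >= len(FLOW):
            raise ValueError(f"{self.stage} is the final stage")
        expected = FLOW[self._index + 1]
        if to != expected:
            raise ValueError(f"cannot go from {self.stage} to {to}; next is {expected}")
        self._index += 1
        return self.stage


flow = EscalationFlow()
flow.advance("Roles")
flow.advance("Contain")
```

Rejecting a jump such as Declare → Fix makes it harder to start technical work before roles are assigned.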

Declare early, communicate on a clock, close with evidence and learning.

Implementation checklist

Measurement

Team level: % incidents declared <15 min, update SLA hit rate, MTTR trend, recurrence, credits cost as % ARR.

Individual level: IC cadence kept, timeline completeness, post-mortem timeliness, remediation action closure rate.
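Two of the team-level numbers can be rolled up from closed incident records, as sketched below. The recurrence definition is an assumption: an incident counts as a recurrence if an earlier incident with the same root-cause tag was resolved no more than 90 days before it started. The `ClosedIncident` type, the tags, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ClosedIncident:
    root_cause: str
    started: datetime
    declared: datetime
    resolved: datetime


def pct_declared_within(incidents, limit=timedelta(minutes=15)):
    """Percentage of incidents declared in under `limit` from their start."""
    if not incidents:
        return 0.0
    fast = sum(1 for i in incidents if i.declared - i.started < limit)
    return 100.0 * fast / len(incidents)


def recurrence_rate(incidents, window=timedelta(days=90)):
    """Fraction of incidents that repeat a root cause resolved within `window`."""
    ordered = sorted(incidents, key=lambda i: i.started)
    if not ordered:
        return 0.0
    last_resolved = {}
    recurrences = 0
    for inc in ordered:
        previous = last_resolved.get(inc.root_cause)
        if previous is not None and inc.started - previous <= window:
            recurrences += 1
        last_resolved[inc.root_cause] = inc.resolved
    return recurrences / len(ordered)


def utc(month, day):
    return datetime(2024, month, day, 9, 0, tzinfo=timezone.utc)


# Hypothetical quarter with three closed incidents.
history = [
    ClosedIncident("pool-config", utc(1, 1), utc(1, 1) + timedelta(minutes=9),
                   utc(1, 1) + timedelta(minutes=72)),
    ClosedIncident("sync-job", utc(2, 1), utc(2, 1) + timedelta(minutes=20),
                   utc(2, 1) + timedelta(hours=36)),
    ClosedIncident("pool-config", utc(3, 1), utc(3, 1) + timedelta(minutes=10),
                   utc(3, 1) + timedelta(hours=1)),
]
declared_fast = pct_declared_within(history)
recurrence = recurrence_rate(history)
```

In this sample two of three incidents were declared in under 15 minutes, and the March incident counts as a recurrence of the January one.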

Team buy-in

Why it matters

Pair with the MAP for recovery plans and Stage Criteria to keep deal rhythm during incidents.

Metrics & pitfalls

Watch

  • TTA <= 15m; TTU <= 60–120m
  • Update SLA adherence
  • Post-mortem scheduled <= 5d

Avoid

  • “We’ll update soon” (no timestamp)
  • Competing narratives across teams
  • Closing without evidence or actions
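The first pitfall is easy to catch mechanically before a message goes out. The sketch below flags any customer update that does not name a clock time with a UTC marker. The `has_timestamped_commitment` helper and the accepted format (`HH:MM UTC` or `HH:MMZ`) are assumptions; adapt the pattern to whatever format your updates use.

```python
import re

# Matches a 24-hour clock time followed by "UTC" or "Z", e.g. "14:30 UTC".
_TIME_WITH_ZONE = re.compile(r"\b([01]?\d|2[0-3]):[0-5]\d\s*(UTC|Z)\b")


def has_timestamped_commitment(message: str) -> bool:
    """Return True if the message names a specific UTC time for the next update."""
    return _TIME_WITH_ZONE.search(message) is not None


vague = has_timestamped_commitment("We'll update soon.")
specific = has_timestamped_commitment("Next update by 14:30 UTC.")
```

A check like this only proves a time is present, not that it is realistic, so the Comms Lead still owns the commitment.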

90-day rollout

Weeks 1–2 — Stand up

Weeks 3–4 — Drill

Weeks 5–6 — Pilot on live incidents

Weeks 7–8 — Instrument

Weeks 9–10 — Cross-functional SLAs

Weeks 11–12 — Bake into rhythm

Sources & terms

Terms: SEV-1/2 (severity levels), IC (Incident Commander), TTA (Time to Acknowledge), TTU (Time to first Update), MTTR (Mean Time to Restore), PIR (Post-Incident Review), RFO (Reason for Outage), RCA (Root Cause Analysis).