Escalation Management Runbook
Handle high-severity incidents without losing the account — roles, flow, comms, metrics and rollout.
Audience & situation
For CS, Support, Engineering and Sales leaders who must stabilize high-severity incidents (SEV-1/2) while protecting revenue and trust. Use this runbook when outages, security events, data issues or high-impact defects trigger executive visibility and churn risk.
Introduction
Most escalations are lost in the first hour—not because engineering cannot fix the issue, but because ownership and communication are unclear. Customers can tolerate defects; they will not tolerate silence or spin. A repeatable escalation system protects the relationship while the fix is in flight: we declare quickly, assign clear roles, publish a cadence for updates and document decisions in one place.
This runbook defines when to declare, who leads what, how we communicate and how we close. It compresses chaos into a rhythm: contain risk, inform stakeholders, fix fast, and learn so it does not happen again. We keep executives focused on decisions (risk, customer impact, credits) while the incident team handles technical work. Everyone sees the same timeline and facts.
We design for mixed contexts: SaaS outages, integrations failing at a key customer, security events with legal implications. In all cases, we bias toward early declaration and over-communication. We prefer a plain “we are on it; next update in 2 hours” sent on time over a perfectly formatted post at hour six.
Finally, we protect the account. Sales and CS are part of the core team from minute one. They inform the update cadence and align on credits, extension of terms or executive calls. We avoid promises the team cannot meet and escalate decisions that change scope or risk. After resolution, we run a blameless post-mortem that produces specific code/process changes and an outbound narrative that rebuilds confidence.
What good looks like
- Clear declaration: severity, customer impact, incident commander assigned within 15 minutes.
- Roles & channel: IC, Comms Lead, Engineering Owner, CS/Sales owner; single incident channel and timeline log.
- Comms cadence: public updates every 60–120 minutes, exec brief every 4 hours, one source of truth.
- Unambiguous status: Contained → Mitigated → Resolved; each state is either reached or not, with exit criteria and evidence.
- Post-mortem: blameless, scheduled within 5 business days, actions with owners and due dates.
Common pitfalls
- Slow declaration: teams wait for perfect facts → customer fills the gap.
- Role confusion: IC vs. Eng lead vs. Comms lead not explicit → conflicting messages.
- Drift in cadence: “we’ll update soon” becomes hours → trust evaporates.
- Vague closure: “seems stable” without formal exit → incidents reopen.
- No learning loop: post-mortem postponed → recurrence risk stays high.
Playbook
1) Declare & triage (first 15 minutes)
- Assess user impact, data risk, security indicators and executive visibility.
- Set severity (SEV-1/2), name Incident Commander (IC) and open the incident channel.
- Create incident record (ID, start time, affected services, customers, first hypothesis).
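As a rough illustration of the incident record and the 15-minute declaration target, here is a minimal Python sketch. The class name, field names and the INC-0001 ID format are our own assumptions for illustration; the runbook only specifies which facts the record should capture.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Illustrative record; field names are assumptions, not a fixed schema."""
    incident_id: str
    severity: str                 # "SEV-1" or "SEV-2"
    started_at: datetime          # when customer impact began (UTC)
    declared_at: datetime         # when the incident was declared (UTC)
    incident_commander: str
    affected_services: list[str] = field(default_factory=list)
    affected_customers: list[str] = field(default_factory=list)
    first_hypothesis: str = ""

    def declared_within_target(self, target_minutes: int = 15) -> bool:
        """True if declaration happened within the target window after start."""
        elapsed = (self.declared_at - self.started_at).total_seconds() / 60
        return elapsed <= target_minutes


record = IncidentRecord(
    incident_id="INC-0001",  # hypothetical ID format
    severity="SEV-1",
    started_at=datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc),
    declared_at=datetime(2024, 1, 10, 9, 9, tzinfo=timezone.utc),
    incident_commander="on-call IC",
    affected_services=["api"],
    affected_customers=["example-customer"],
    first_hypothesis="connection pool exhaustion",
)
print(record.declared_within_target())
```

Keeping the start and declaration timestamps on the record itself makes the declaration target something that can be checked later rather than remembered.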
2) Assign roles
- Incident Commander: owns flow, decisions, timeline, and status.
- Engineering Owner: leads technical triage/fix; assigns SMEs.
- Comms Lead: drafts updates (customer + internal) and runs cadence.
- CS/Sales Owner: customer briefings, sentiment, credits path, exec calls.
- Executive Sponsor: shields team, aligns on risk and approvals.
3) Contain & communicate (first hour)
- Publish first customer update: acknowledge, impact, current mitigation, next update time.
- Open executive brief doc (one pager): status, risks, asks, ETA, customer list.
- Decide mitigation (rollback, feature flag, rate limiting, hotfix) with risk notes.
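The first customer update follows a fixed order: acknowledge, impact, current mitigation, next update time. One way to keep that order, and to force a concrete timestamp for the next update, is a small rendering helper. This is a sketch only; the function name, labels and 60-minute default are assumptions.

```python
from datetime import datetime, timedelta, timezone


def render_customer_update(acknowledgement, impact, mitigation,
                           sent_at, cadence_minutes=60):
    """Build an update in the order: acknowledge, impact, mitigation, next update."""
    next_update = sent_at + timedelta(minutes=cadence_minutes)
    return "\n".join([
        f"Acknowledgement: {acknowledgement}",
        f"Impact: {impact}",
        f"Current mitigation: {mitigation}",
        # Always a concrete time, never "soon".
        f"Next update by: {next_update:%Y-%m-%d %H:%M} UTC",
    ])


message = render_customer_update(
    acknowledgement="We are investigating elevated error rates.",
    impact="Some write requests are failing.",
    mitigation="We are rolling back a recent change.",
    sent_at=datetime(2024, 1, 10, 9, 30, tzinfo=timezone.utc),
)
print(message)
```

Because the next update time is computed from the send time and the cadence, the Comms Lead cannot publish an update without committing to the next one.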
4) Fix & track
- Timebox hypotheses; log each attempt in the timeline with result.
- Keep 60–120 minute public update cadence; internal updates as needed.
- Mark status transitions with evidence: logs/metrics for Contained, validation for Resolved.
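The rule that every status transition carries evidence can be sketched as a tiny state machine. We assume here an initial "Open" state and strictly sequential transitions; neither is stated in the runbook, so adjust if your process allows moving straight from Contained to Resolved.

```python
# Assumed ordering: Open -> Contained -> Mitigated -> Resolved ("Open" is our addition).
ALLOWED_TRANSITIONS = {
    "Open": "Contained",
    "Contained": "Mitigated",
    "Mitigated": "Resolved",
}


class IncidentStatus:
    """Tracks status and refuses transitions that skip a state or lack evidence."""

    def __init__(self):
        self.current = "Open"
        self.history = []  # list of (status, evidence) tuples

    def advance(self, new_status, evidence):
        if not evidence.strip():
            raise ValueError("a status transition needs evidence")
        if ALLOWED_TRANSITIONS.get(self.current) != new_status:
            raise ValueError(f"cannot move from {self.current} to {new_status}")
        self.history.append((new_status, evidence))
        self.current = new_status


status = IncidentStatus()
status.advance("Contained", "error rate flat for 15 min per dashboard")
status.advance("Mitigated", "rollback deployed; writes succeeding")
print(status.current)
```

Rejecting empty evidence is what turns “seems stable” into a formal exit: the IC has to name the log, metric or validation that justifies the change.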
5) Close cleanly
- Resolution update: cause (known/under investigation), customer impact window, prevention steps.
- Decide credits/terms with CS/Sales; book post-mortem within 5 business days.
- Handoff to remediation: owners, dates, acceptance criteria, comms plan.
6) Post-mortem & remediation
- Blameless write-up: timeline, contributing factors, detection gaps, “five whys”.
- Actions across code, infra, process and monitoring; tie to OKRs.
- Outbound “we learned” update to key customers where appropriate.
Artifacts
Core
- Incident record (ID, severity, owners, start, status, affected scope).
- Timeline log (UTC stamps, action, owner, result).
- Customer update template (acknowledge → impact → action → next update).
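A timeline log needs little more than UTC stamps, action, owner and result. The sketch below shows one possible shape; the class names and the pipe-separated line format are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime  # must be timezone-aware
    action: str
    owner: str
    result: str

    def as_line(self) -> str:
        # Normalize to UTC so every reader sees the same clock.
        stamp = self.timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return f"{stamp} | {self.owner} | {self.action} | {self.result}"


class TimelineLog:
    def __init__(self):
        self._entries = []

    def add(self, action, owner, result, timestamp=None):
        if timestamp is None:
            timestamp = datetime.now(timezone.utc)
        if timestamp.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        entry = TimelineEntry(timestamp, action, owner, result)
        self._entries.append(entry)
        return entry

    def lines(self):
        """Entries in chronological order, one formatted line each."""
        return [e.as_line() for e in sorted(self._entries, key=lambda e: e.timestamp)]


log = TimelineLog()
log.add("Declared SEV-1", "ic", "channel opened",
        datetime(2024, 1, 10, 9, 9, tzinfo=timezone.utc))
log.add("Rolled back feature flag", "eng-owner", "error rate unchanged",
        datetime(2024, 1, 10, 9, 20, tzinfo=timezone.utc))
print("\n".join(log.lines()))
```

Logging failed attempts with their result, as in the second entry, is what makes the timeline useful for the post-mortem.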
Supporting
- Executive brief (one pager, refreshed every 4h).
- Post-mortem template (blameless, actions with owners/dates).
- Credit/terms calculator and approval matrix.
Worked examples
Example A — SaaS outage (database failover)
Situation: Global read latency spikes → write errors for 18% of sessions.
Actions: SEV-1 declared in 9 min; feature flag rolled back; failover to replica; status page updates on a 60-min cadence; exec brief every 4h.
Result: Mitigated in 72 min; root cause: connection pool misconfig; shipped guardrails in 48h; zero churn; one credit issued.
Example B — Security incident (credential leak suspicion)
Situation: Anomalous logins in two regions; potential token reuse.
Actions: SEV-1; forced token rotation; geo blocks; forensic logging; legal engaged; 2-hour updates.
Result: No confirmed data exfiltration; rotated secrets; updated playbooks; customer trust preserved via transparent comms.
Example C — P1 integration failure at strategic account
Situation: Nightly sync fails; finance cannot close month.
Actions: SEV-2; hotfix patch; backfill job; daily exec call with customer CFO; MAP for hardening.
Result: Resolved in 36h; created reliability program; expansion continued.
Metrics
Leading: time-to-ack (TTA), time-to-first-update (TTU), update SLA adherence, executive brief cadence kept, % incidents with named IC and Comms Lead.
Lagging: MTTR, customer sentiment delta (before/after), SLA credits issued, churn risk change on affected accounts, recurrence rate within 90 days.
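To make the timing metrics concrete, here is a small sketch that derives TTA, TTU and MTTR from incident timestamps. We assume all three clocks start when the incident starts; the dictionary keys and sample timestamps are invented for illustration.

```python
from datetime import datetime, timezone
from statistics import mean


def minutes_between(start, end):
    return (end - start).total_seconds() / 60


def utc(hour, minute):
    # Helper for sample data only; all times fall on one illustrative day.
    return datetime(2024, 1, 10, hour, minute, tzinfo=timezone.utc)


# Illustrative data, not real incidents.
incidents = [
    {"started": utc(9, 0), "acknowledged": utc(9, 9),
     "first_update": utc(9, 40), "restored": utc(10, 12)},
    {"started": utc(14, 0), "acknowledged": utc(14, 11),
     "first_update": utc(14, 50), "restored": utc(16, 0)},
]


def summarize(incidents):
    """Mean minutes from incident start to ack, first update and restore."""
    return {
        "tta_minutes": mean(minutes_between(i["started"], i["acknowledged"]) for i in incidents),
        "ttu_minutes": mean(minutes_between(i["started"], i["first_update"]) for i in incidents),
        "mttr_minutes": mean(minutes_between(i["started"], i["restored"]) for i in incidents),
    }


print(summarize(incidents))
# → {'tta_minutes': 10.0, 'ttu_minutes': 45.0, 'mttr_minutes': 96.0}
```

Means hide outliers, so a team with few incidents may prefer to review each incident's values alongside the average.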
Declare early, communicate on a clock, close with evidence and learning.
Implementation checklist
- Publish severity matrix and declaration rules (SEV-1/2/3).
- Define roles (IC, Eng Owner, Comms, CS/Sales) and on-call rotation.
- Create incident channel convention and timeline logging template.
- Stand up status page or customer update mechanism.
- Adopt post-mortem template; schedule monthly drills.
- Set credit approval matrix and exec brief template.
Measurement
Team level: % incidents declared <15 min, update SLA hit rate, MTTR trend, recurrence rate, credit cost as % of ARR.
Individual level: IC cadence kept, timeline completeness, post-mortem timeliness, remediation action closure rate.
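Two of the team-level measures can be computed directly from data the incident process already produces. The sketch below treats the update SLA as "no gap between consecutive public updates longer than 120 minutes", which is our reading of the 60–120 minute cadence; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def update_sla_hit_rate(update_times, max_gap_minutes=120):
    """Share of gaps between consecutive updates that stay within the limit."""
    ordered = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None  # fewer than two updates: nothing to measure
    limit = timedelta(minutes=max_gap_minutes)
    return sum(gap <= limit for gap in gaps) / len(gaps)


def pct_declared_within(declaration_minutes, target_minutes=15):
    """Percentage of incidents declared in under target_minutes."""
    if not declaration_minutes:
        return None
    hits = sum(m < target_minutes for m in declaration_minutes)
    return 100 * hits / len(declaration_minutes)


# Illustrative data: updates at 09:00, 10:00, 12:00 and 15:00 UTC.
updates = [datetime(2024, 1, 10, h, 0, tzinfo=timezone.utc) for h in (9, 10, 12, 15)]
print(update_sla_hit_rate(updates))
print(pct_declared_within([9, 12, 20, 14]))
```

In the sample data the third gap is 180 minutes, so two of three gaps meet the SLA; three of the four incidents were declared in under 15 minutes.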
Team buy-in
- Position escalations as trust-events: how we respond matters more than the defect.
- Coach blamelessness and single-source-of-truth comms; celebrate clean runs.
- Rotate roles to build depth; shadow IC on drills.
Why it matters
- Customer trust: clear ownership and cadence prevent rumor spirals.
- Revenue protection: faster, cleaner incidents reduce credits and churn risk.
- Learning culture: post-mortems compound reliability and speed.
Pair with the MAP for recovery plans and Stage Criteria to keep deal rhythm during incidents.
Metrics & pitfalls
Watch
- TTA <= 15m; TTU <= 60m; public updates every 60–120m
- Update SLA adherence
- Post-mortem scheduled <= 5 business days
Avoid
- “We’ll update soon” (no timestamp)
- Competing narratives across teams
- Closing without evidence or actions
90-day rollout
Weeks 1–2 — Stand up
- Owners: Support/CS (lead), Engineering, Legal/Comms.
- Artifacts: severity matrix, role cards, templates (updates, exec brief, post-mortem).
- Exit: on-call rota live; incident channel convention published.
Weeks 3–4 — Drill
- Tabletop exercise; measure TTA/TTU; tune templates and handoffs.
Weeks 5–6 — Pilot on live incidents
- Run the cadence on 3+ incidents; collect customer feedback; adjust update timing.
Weeks 7–8 — Instrument
- Dashboard: TTA, TTU, MTTR, SLA credits; add IC cadence metric.
Weeks 9–10 — Cross-functional SLAs
- Legal/security response times; exec brief schedule; credit approval matrix.
Weeks 11–12 — Bake into rhythm
- Monthly drills; post-mortem review in ops forum; refresh examples quarterly.
Next steps & CTA
- Publish the severity matrix and declaration rules this week.
- Run a 45-minute tabletop with IC/Eng/Comms/CS.
- Adopt the companion template for the next incident.
Sources & terms
Terms: SEV-1/2 (severity), IC (Incident Commander), TTA (Time to Acknowledge), TTU (Time to first Update), MTTR (Mean Time to Restore), PIR (Post-Incident Review), RFO (Reason for Outage), RCA (Root Cause Analysis).