⚡ Promptolis Original · Coding & Development

🚨 Incident Post-Mortem Writer

Writes the blameless post-mortem that actually prevents recurrence — covering timeline, root cause, and specific action items with owners, not 'we'll be more careful next time.'

⏱️ 4 min to document 🤖 ~60 seconds in Claude 🗓️ Updated 2026-04-19

Why this is epic

Most post-mortems are either blame hunts or 'we'll be more careful' platitudes — both fail to prevent recurrence. This Original produces the structured blameless post-mortem (Google SRE style) that actually drives prevention.

Distinguishes contributing factors (multiple, usually) from root cause (singular, often deeper than it appears) — most post-mortems stop at the first plausible cause.

Produces specific action items with OWNERS and DEADLINES — not 'the team will investigate.' Generic action items = same incident in 6 months.

The prompt

Promptolis Original · Copy-ready
<role>
You are an SRE / reliability engineer who has written 100+ post-mortems at scale. You apply Google SRE's blameless post-mortem methodology and know which action items prevent recurrence vs. which feel good but change nothing.
</role>

<principles>
1. Blameless. Focus on systems, not individuals.
2. Multiple contributing factors > single root cause. Usually.
3. 'Human error' is a stopping point, not a conclusion. Dig.
4. Action items need OWNERS + DEADLINES.
5. Publish broadly. Cross-team learning matters.
6. Write 48-72 hrs after incident. Cold is better.
</principles>

<input>
<incident-summary>{what happened, user impact, severity}</incident-summary>
<duration>{start-detect-mitigate-resolve timestamps}</duration>
<timeline>{rough sequence of events}</timeline>
<what-triggered>{initial cause or event}</what-triggered>
<what-made-it-worse>{escalating factors}</what-made-it-worse>
<how-resolved>{what finally fixed it}</how-resolved>
<people-involved>{roles, not names}</people-involved>
</input>

<output-format>
# Incident Post-Mortem: [Short title + date]
## Summary — 2-3 sentences: what, impact, severity.
## Impact — Users affected, revenue, SLO impact.
## Timeline (UTC) — Start → detect → mitigate → resolve, with key events.
## Root Cause — The deep why.
## Contributing Factors — The conditions that made this possible.
## Detection + Response — How we found out, how fast we acted.
## What Went Well — Genuine positive notes.
## What Went Poorly — Honest failures.
## Action Items — Table: Item / Owner / Deadline / Status.
## Lessons Learned — For the broader org.
## Follow-Up Review — When to revisit action items.
## Key Takeaways — 3 bullets.
</output-format>

<auto-intake>
If incomplete: ask for summary, duration, timeline, trigger, worsening factors, resolution, people.
</auto-intake>

Now, write:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<incident-summary>Payment processing was degraded for 90 minutes. About 40% of checkouts failed with a 500 error. Severity: SEV-2.</incident-summary>
<duration>Trigger: 14:20 UTC (deploy). Detected: 14:45 UTC (monitoring alert). Mitigated: 15:30 UTC (rollback). Resolved: 15:50 UTC (traffic normal).</duration>
<timeline>14:20 deploy shipped. 14:25 checkout failure rate rises. 14:45 alert fires. 14:50 on-call engages. 15:00 hypothesis: new code path has null-pointer. 15:20 rollback started. 15:30 rollback complete, error rate drops. 15:50 confirmed stable.</timeline>
<what-triggered>Deploy added a new flag for A/B-testing a new checkout flow. The flag defaulted to 'on' instead of 'off' in production.</what-triggered>
<what-made-it-worse>Alert was delayed 25 min due to alert threshold being set for 10% error rate (this peaked at 40%, but built up from 0%). Also: on-call was in a meeting and missed initial page.</what-made-it-worse>
<how-resolved>Full deploy rollback. The new checkout flow was immediately disabled.</how-resolved>
<people-involved>On-call SRE, backend engineer who shipped the deploy, engineering manager.</people-involved>
🤖 Output

Incident Post-Mortem: Payment Processing Degradation, 2024-06-15

Severity: SEV-2

Status: Resolved

Duration: 90 minutes (14:20-15:50 UTC)

Summary

A deploy at 14:20 UTC introduced a new A/B-test flag that defaulted to 'on' in production, exposing an uncaught null-pointer error in the new checkout flow. Roughly 40% of checkout attempts failed with HTTP 500 over a 90-minute window before a full rollback restored service.

Impact

  • Users: approximately 4,200 failed checkouts during the window.
  • Revenue: estimated ~$65k in lost / delayed transactions (based on average order value × failure count).
  • SLO: checkout success rate SLO (99.5%) breached for the hour; error budget consumed.
  • Customer-facing: 38 support tickets received during the incident; no social media escalation.

Timeline (UTC)

| Time | Event |
| --- | --- |
| 14:20 | Deploy shipped including new A/B-test flag (flag-checkout-v2) |
| 14:25 | Checkout error rate begins to rise (0% → 5%) |
| 14:35 | Error rate at 20% |
| 14:45 | Monitoring alert fires (10% threshold crossed) |
| 14:45-14:50 | Initial page to on-call; on-call was in meeting and did not acknowledge immediately |
| 14:50 | Secondary page picked up; on-call begins investigation |
| 15:00 | Hypothesis identified: new checkout path null-pointer exception on specific cart configurations |
| 15:15 | Decision to roll back |
| 15:20 | Rollback initiated |
| 15:30 | Rollback complete; error rate drops below 1% |
| 15:50 | Confirmed stable; incident declared resolved |

Root Cause

The new CheckoutV2Service was deployed with a feature flag (flag-checkout-v2) intended to default OFF in production for gradual rollout. The flag configuration file had the production default set to true rather than false — a copy-paste error from a staging config.

The new code path contained an uncaught null-pointer exception on carts with zero line items (edge case for users adding gift cards to empty carts). Because the flag was enabled for 100% of traffic immediately, all affected users hit the bug.
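The failure mode described above can be sketched as follows. This is a hypothetical reconstruction, not the actual CheckoutV2Service code: `Cart`, `Item`, and `find_promo` are illustrative names, and the real service is presumably not Python.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    sku: str
    price: float

@dataclass
class Cart:
    items: list = field(default_factory=list)

def find_promo(sku):
    # Stub: the promo lookup misses for every SKU in this sketch.
    return None

def checkout_v2_total(cart):
    # Buggy shape: dereferences the promo lookup without a None check, so a
    # cart whose lookup misses (e.g. the empty / gift-card-only cart) raises
    # AttributeError -- the "null-pointer" path from the incident.
    promo = find_promo(cart.items[0].sku) if cart.items else None
    return sum(i.price for i in cart.items) * (1 - promo.rate)

def checkout_v2_total_fixed(cart):
    # Defensive shape: treat "no promo" as a 0% discount instead of crashing.
    promo = find_promo(cart.items[0].sku) if cart.items else None
    rate = promo.rate if promo is not None else 0.0
    return sum(i.price for i in cart.items) * (1 - rate)
```

The point of the sketch: the bug is not exotic. It is a missing None check on a path that only fires for unusual cart states, which is exactly why test coverage of those states (action item 5) matters.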

Contributing Factors

1. Config file had wrong default value. The production flag was incorrectly set to `true`. Code review did not catch this because the change was formatted as a routine config update.

2. Feature flag rollout was not gradual. No ramp-up (e.g., 1% → 10% → 50% → 100%). A gradual rollout would have limited blast radius to a small percentage.

3. Null-pointer bug existed in the new code. An edge case (empty cart + gift card) was not covered by tests.

4. Alert threshold was too permissive. 10% error rate threshold meant 25-min delay before alerting; a lower threshold would have paged earlier.

5. On-call paging had no escalation when the first page was not acknowledged within 5 minutes.
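Contributing factor 2 (no gradual rollout) is worth making concrete. A minimal sketch of deterministic percentage bucketing is below; the flag name and hashing scheme are assumptions for illustration, not the team's actual flag system.

```python
import hashlib

def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into the first `percent`% of traffic."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable 0..9999 per (flag, user)
    return bucket < percent * 100
```

Ramping `percent` over time (1% → 10% → 50% → 100%) means a bug like this one surfaces on a sliver of traffic first. At a 1% ramp, roughly 0.4% of checkouts would have failed instead of 40%.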

Detection + Response

  • Detection time: 25 minutes (14:20 → 14:45). Too slow. SLO-related alerts should fire at 5-10 min for this severity.
  • Response time: 5 min from second page to investigation start. Acceptable after delay.
  • Mitigation time: 30 min from investigation to rollback complete. Appropriate — rollback was the right decision and was executed cleanly.

What Went Well

  • Rollback was fast once initiated. Deploy pipeline supported one-command rollback.
  • On-call engineer, once engaged, correctly identified the hypothesis within 10 minutes.
  • No data corruption. Failed checkouts did not leave partial-state records — the null-pointer aborted cleanly.
  • Customer comms posted to status page within 20 min of rollback decision.

What Went Poorly

  • Config error not caught in review. The production default was wrong; reviewer saw it as 'routine' and did not verify.
  • No gradual rollout. A new critical-path feature deployed to 100% of traffic immediately.
  • Alert threshold too high. 10% error rate is too permissive for checkout SLO; 3-5% would have fired faster.
  • First page missed. On-call was in a meeting and could not respond.
  • Edge case not covered by tests. The empty-cart + gift-card scenario had no test coverage.

Action Items

| # | Item | Owner | Deadline | Status |
| --- | --- | --- | --- | --- |
| 1 | Add automated validation that production flag defaults match explicit declarations (prevent copy-paste errors) | @alex (infra) | 2024-06-29 | Open |
| 2 | Standardize gradual rollout requirement for all new feature flags (1% → 10% → 50% → 100% over 24 hrs minimum) | @maria (platform) | 2024-07-06 | Open |
| 3 | Lower alert threshold for checkout success rate (10% → 3%) | @jamie (SRE) | 2024-06-22 | Open |
| 4 | Add escalation policy for unacknowledged pages after 3 min | @jamie (SRE) | 2024-06-22 | Open |
| 5 | Add test coverage for empty-cart + gift-card edge case | @raj (backend) | 2024-06-29 | Open |
| 6 | Code review checklist: for config changes, verify production default explicitly | @manager | 2024-06-22 | Open |
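Action item 1 (automated validation of production flag defaults) could take the shape of a CI check like the one below. The config format (a flat dict of flags with per-environment defaults) is an assumption for illustration, not the team's real flag file.

```python
def check_production_defaults(flag_config: dict) -> list:
    """Return the names of flags whose production default is enabled."""
    violations = []
    for name, envs in flag_config.items():
        if envs.get("production", {}).get("default") is True:
            violations.append(name)
    return violations

# Example config containing the exact copy-paste error from this incident:
config = {
    "flag-checkout-v2": {
        "staging": {"default": True},
        "production": {"default": True},  # wrong: should default OFF in prod
    },
    "flag-dark-mode": {
        "staging": {"default": True},
        "production": {"default": False},
    },
}
```

Wired into CI as a blocking check (fail the build if the returned list is non-empty), this turns a reviewer's judgment call into an automated guarantee.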

Lessons Learned

  • A feature flag without gradual rollout is a full deploy hidden in flag form. Treat new flag defaults with the same rigor as full deploys.
  • Edge cases in checkout have outsized blast radius. Checkout must have exhaustive test coverage of weird cart states — empty carts, gift-only carts, zero-value carts, etc.
  • Alert thresholds should be tied to SLOs. A 10% error rate on a 99.5% SLO is too permissive. Align.
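The "tie thresholds to SLOs" lesson follows from simple burn-rate arithmetic. A back-of-envelope sketch, assuming a 30-day SLO window (the window length is an assumption; the SLO figure is from the post-mortem):

```python
slo = 0.995
error_budget = 1 - slo          # 0.5% of requests may fail per window
window_minutes = 30 * 24 * 60   # 43,200 minutes in a 30-day window

def minutes_to_burn_budget(error_rate: float) -> float:
    """How long a sustained error rate takes to exhaust the whole budget."""
    burn_rate = error_rate / error_budget  # e.g. 10% / 0.5% = 20x
    return window_minutes / burn_rate
```

At a sustained 10% error rate the entire monthly budget is gone in about 36 hours, so waiting for that threshold before paging silently spends budget. The proposed 3% threshold pages at roughly a 6x burn rate instead.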

Follow-Up Review

  • 2 weeks post: check action-item progress in the weekly incident review.
  • 6 weeks post: verify that gradual-rollout policy is actually being followed on new flag deploys.
  • 12 weeks post: evaluate whether alert threshold change reduced detection time on other incidents.

Key Takeaways

  • Root cause: wrong production flag default + no gradual rollout + uncovered edge case. Three failures compounding; any one caught would have prevented the outage.
  • Detection time (25 min) is the biggest response failure. Alert threshold + paging escalation action items address this.
  • All feature flag deploys going forward require gradual rollout. Policy change is the highest-impact action item.

Common use cases

  • Post-outage documentation for engineering / leadership
  • Internal reliability review after major incidents
  • Near-miss documentation (no actual outage but close call)
  • Complex multi-team failure modes
  • Security incident post-mortems
  • Data loss / corruption incidents
  • Customer-facing incidents where status page matters

Best AI model for this

Claude Opus 4 or Sonnet 4.5. Incident reasoning benefits from a top-tier model.

Pro tips

  • Blameless doesn't mean consequence-free. It means focus on SYSTEMS, not individuals. People still need to fix the process that let them fail.
  • Root cause is almost never 'human error.' That's a stopping point for lazy analysis. Keep digging.
  • 5 Whys are useful as a starting point, but often end at a superficial answer. Ask 'what conditions made this possible?' for deeper insight.
  • Action items must have an OWNER and a DEADLINE. 'We should...' = will not happen.
  • Publish the post-mortem. Internal systems that silo post-mortems lose learning across teams.
  • Don't write during the incident. Write after things are stable + enough time has passed to think clearly (usually 48-72 hours).

Customization tips

  • Write the post-mortem 48-72 hours after resolution. Earlier = adrenaline distorts; later = people forget details.
  • Before you send it out, have ONE uninvolved engineer read it. They'll spot blame-language you didn't notice.
  • Action items without owners and deadlines DON'T happen. Be explicit even if it feels bureaucratic.
  • Include 'what went well.' Blameless post-mortems need positive data to stay psychologically safe for teams.
  • Create an index of post-mortems. Pattern recognition across incidents reveals systemic issues.

Variants

Customer-Facing Outage

For incidents with external impact. Includes customer comms draft.

Security Incident

For security-specific incidents. Compliance + disclosure considerations.

Near-Miss / Close Call

For incidents that almost happened. Same rigor, different tone.

Frequently asked questions

How do I use the Incident Post-Mortem Writer prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Incident Post-Mortem Writer?

Claude Opus 4 or Sonnet 4.5. Incident reasoning benefits from a top-tier model.

Can I customize the Incident Post-Mortem Writer prompt for my use case?

Yes, every Promptolis Original is designed to be customized. Key levers: keep the focus on systems rather than individuals (blameless does not mean consequence-free; people still need to fix the process that let them fail), and never stop at 'human error' as a root cause. Keep digging.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals