⚡ Promptolis Original · Coding & Development
🚨 Incident Post-Mortem Writer
Writes the blameless post-mortem that actually prevents recurrence — covering timeline, root cause, and specific action items with owners, not 'we'll be more careful next time.'
Why this is epic
Most post-mortems are either blame hunts or 'we'll be more careful' platitudes — both fail to prevent recurrence. This Original produces the structured blameless post-mortem (Google SRE style) that actually generates prevention.
Distinguishes contributing factors (multiple, usually) from root cause (singular, often deeper than it appears) — most post-mortems stop at the first plausible cause.
Produces specific action items with OWNERS and DEADLINES — not 'the team will investigate.' Generic action items = same incident in 6 months.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<incident-summary>Payment processing was degraded for 90 minutes. About 40% of checkouts failed with a 500 error. Severity: SEV-2.</incident-summary>
<duration>Trigger: 14:20 UTC (deploy). Detected: 14:45 UTC (monitoring alert). Mitigated: 15:30 UTC (rollback). Resolved: 15:50 UTC (traffic normal).</duration>
<timeline>14:20 deploy shipped. 14:25 checkout failure rate rises. 14:45 alert fires. 14:50 on-call engages. 15:00 hypothesis: new code path has null-pointer. 15:20 rollback started. 15:30 rollback complete, error rate drops. 15:50 confirmed stable.</timeline>
<what-triggered>Deploy added a new flag for A/B-testing a new checkout flow. The flag defaulted to 'on' instead of 'off' in production.</what-triggered>
<what-made-it-worse>Alert was delayed 25 min because the threshold was set at a 10% error rate (the rate peaked at 40% but built up from 0%). Also: on-call was in a meeting and missed the initial page.</what-made-it-worse>
<how-resolved>Full deploy rollback. The new checkout flow was immediately disabled.</how-resolved>
<people-involved>On-call SRE, backend engineer who shipped the deploy, engineering manager.</people-involved>
Incident Post-Mortem: Payment Processing Degradation, 2024-06-15
Severity: SEV-2
Status: Resolved
Duration: 90 minutes (14:20-15:50 UTC)
Summary
A deploy at 14:20 UTC introduced a new A/B-test flag that defaulted to 'on' in production, exposing an uncaught null-pointer error in the new checkout flow. Roughly 40% of checkout attempts failed with HTTP 500 over a 90-minute window before a full rollback restored service.
Impact
- Users: approximately 4,200 failed checkouts during the window.
- Revenue: estimated ~$65k in lost / delayed transactions (based on average order value × failure count).
- SLO: checkout success rate SLO (99.5%) breached for the hour; error budget consumed.
- Customer-facing: 38 support tickets received during the incident; no social media escalation.
Timeline (UTC)
| Time | Event |
|---|---|
| 14:20 | Deploy shipped including new A/B-test flag (flag-checkout-v2) |
| 14:25 | Checkout error rate begins to rise (0% → 5%) |
| 14:35 | Error rate at 20% |
| 14:45 | Monitoring alert fires (10% threshold crossed) |
| 14:45-14:50 | Initial page to on-call; on-call was in meeting and did not acknowledge immediately |
| 14:50 | Secondary page picked up; on-call begins investigation |
| 15:00 | Hypothesis identified: new checkout path null-pointer exception on specific cart configurations |
| 15:15 | Decision to roll back |
| 15:20 | Rollback initiated |
| 15:30 | Rollback complete; error rate drops below 1% |
| 15:50 | Confirmed stable; incident declared resolved |
Root Cause
The new CheckoutV2Service was deployed with a feature flag (flag-checkout-v2) intended to default OFF in production for gradual rollout. The flag configuration file had the production default set to true rather than false — a copy-paste error from a staging config.
The new code path contained an uncaught null-pointer exception on carts with zero line items (edge case for users adding gift cards to empty carts). Because the flag was enabled for 100% of traffic immediately, all affected users hit the bug.
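The report doesn't include the faulty code, but the failure mode is easy to sketch. A hypothetical Python reconstruction (the CheckoutV2 names and discount logic are illustrative, not from the actual service):

```python
from dataclasses import dataclass, field

@dataclass
class Cart:
    line_items: list = field(default_factory=list)
    gift_cards: list = field(default_factory=list)

def checkout_v2_discount(cart: Cart) -> float:
    # Hypothetical bug: assumes at least one line item exists.
    # max() over an empty sequence raises, analogous to the
    # uncaught null-pointer on gift-card-only empty carts.
    priciest = max(item["price"] for item in cart.line_items)
    return priciest * 0.1

def checkout_v2_discount_fixed(cart: Cart) -> float:
    # The guard the original code was missing: the empty-cart edge case.
    if not cart.line_items:
        return 0.0
    priciest = max(item["price"] for item in cart.line_items)
    return priciest * 0.1
```

The fix is one guard clause; the lesson is that the guard was reachable by real traffic the moment the flag flipped to 100%.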
Contributing Factors
1. Config file had wrong default value. The production flag was incorrectly set to `true`. Code review did not catch this because the change was formatted as a routine config update.
2. Feature flag rollout was not gradual. No ramp-up (e.g., 1% → 10% → 50% → 100%). A gradual rollout would have limited blast radius to a small percentage.
3. Null-pointer bug existed in the new code. An edge case (empty cart + gift card) was not covered by tests.
4. Alert threshold was too permissive. 10% error rate threshold meant 25-min delay before alerting; a lower threshold would have paged earlier.
5. On-call paging had no automatic escalation for pages left unacknowledged after 5 minutes.
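Contributing factor 2 (no ramp-up) is the kind of policy that can be enforced in code rather than in a checklist. A minimal sketch of a time-based rollout schedule, assuming deterministic user bucketing (the stage percentages follow the 1% → 10% → 50% → 100% ramp mentioned above; the hour boundaries are illustrative):

```python
# Rollout stages: (hours since rollout start, percent of traffic served the flag)
ROLLOUT_STAGES = [(0, 1), (6, 10), (12, 50), (24, 100)]

def rollout_percent(hours_elapsed: float) -> int:
    """Return the traffic percentage the flag should serve at this time."""
    percent = 0
    for start_hour, pct in ROLLOUT_STAGES:
        if hours_elapsed >= start_hour:
            percent = pct
    return percent

def flag_enabled(user_id: int, hours_elapsed: float) -> bool:
    # Deterministic bucketing: a given user stays in or out of the cohort
    # for the whole ramp instead of flapping between code paths.
    return (user_id % 100) < rollout_percent(hours_elapsed)
```

Under this scheme the null-pointer bug would have hit roughly 1% of traffic in the first hours, keeping the incident well below alerting thresholds while still surfacing in error logs.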
Detection + Response
- Detection time: 25 minutes (14:20 → 14:45). Too slow. SLO-related alerts should fire at 5-10 min for this severity.
- Response time: 5 min from second page to investigation start. Acceptable after delay.
- Mitigation time: 30 min from investigation to rollback complete. Appropriate — rollback was the right decision and was executed cleanly.
What Went Well
- Rollback was fast once initiated. Deploy pipeline supported one-command rollback.
- On-call engineer, once engaged, correctly identified the hypothesis within 10 minutes.
- No data corruption. Failed checkouts did not leave partial-state records; the null-pointer exception aborted the request cleanly.
- Customer comms posted to status page within 20 min of rollback decision.
What Went Poorly
- Config error not caught in review. The production default was wrong; reviewer saw it as 'routine' and did not verify.
- No gradual rollout. A new critical-path feature deployed to 100% of traffic immediately.
- Alert threshold too high. 10% error rate is too permissive for checkout SLO; 3-5% would have fired faster.
- First page missed. On-call was in a meeting and could not respond.
- Edge case not covered by tests. The empty-cart + gift-card scenario was untested.
Action Items
| # | Item | Owner | Deadline | Status |
|---|---|---|---|---|
| 1 | Add automated validation that production flag defaults match explicit declarations (prevent copy-paste errors) | @alex (infra) | 2024-06-29 | Open |
| 2 | Standardize gradual rollout requirement for all new feature flags (1% → 10% → 50% → 100% over 24 hrs minimum) | @maria (platform) | 2024-07-06 | Open |
| 3 | Lower alert threshold for checkout success rate (10% → 3%) | @jamie (SRE) | 2024-06-22 | Open |
| 4 | Add escalation policy for unacknowledged pages after 3 min | @jamie (SRE) | 2024-06-22 | Open |
| 5 | Add test coverage for empty-cart + gift-card edge case | @raj (backend) | 2024-06-29 | Open |
| 6 | Code review checklist: for config changes, verify production default explicitly | @manager | 2024-06-22 | Open |
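Action item 1 can be a small CI check rather than a review-time convention. A sketch, assuming flags declare an intended production default and environment configs are plain dicts (the names and config shape are illustrative, not the team's actual format):

```python
# Declared intent: every flag's expected production default.
# Policy from this post-mortem: new flags must default OFF in production.
DECLARED_DEFAULTS = {
    "flag-checkout-v2": False,
}

def validate_flag_defaults(prod_config: dict) -> list:
    """Return violations where the production config disagrees with declarations."""
    violations = []
    for flag, expected in DECLARED_DEFAULTS.items():
        actual = prod_config.get(flag)
        if actual != expected:
            violations.append(
                f"{flag}: production default is {actual!r}, expected {expected!r}"
            )
    return violations

# A CI step fails the build whenever validate_flag_defaults(...) is non-empty,
# which would have caught the copy-paste error before deploy.
```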
Lessons Learned
- Feature flags without gradual rollout are a deploy hidden in flag form. Treat a new flag default like a full deploy and require the same rigor.
- Edge cases in checkout have outsized blast radius. Checkout must have exhaustive test coverage of weird cart states — empty carts, gift-only carts, zero-value carts, etc.
- Alert thresholds should be tied to SLOs. A 10% error rate on a 99.5% SLO is too permissive. Align.
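The "align thresholds to SLOs" lesson can be made concrete with error-budget arithmetic. A simple sketch (a single-window budget view, not a full multi-window burn-rate policy; the 30-day window is an assumption):

```python
def max_tolerable_error_rate(slo: float) -> float:
    """Error budget implied by the SLO: a 99.5% SLO tolerates 0.5% errors."""
    return 1.0 - slo

def minutes_to_burn_budget(error_rate: float, slo: float,
                           window_minutes: float = 30 * 24 * 60) -> float:
    """At a constant error rate, how long until the 30-day budget is exhausted?"""
    budget = max_tolerable_error_rate(slo)
    if error_rate <= budget:
        return float("inf")  # within budget; never burns out
    return window_minutes * budget / error_rate
```

At this incident's 40% error rate, a 99.5% SLO budget burns in well under a day of cumulative exposure, which is why a 10% alert threshold is far too permissive for checkout: by the time it fires, a large slice of the monthly budget is already gone.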
Follow-Up Review
- 2 weeks post: check progress on all action items in the weekly incident review.
- 6 weeks post: verify that gradual-rollout policy is actually being followed on new flag deploys.
- 12 weeks post: evaluate whether alert threshold change reduced detection time on other incidents.
Key Takeaways
- Root cause: wrong production flag default + no gradual rollout + uncovered edge case. Three failures compounding; any one caught would have prevented the outage.
- Detection time (25 min) is the biggest response failure. Alert threshold + paging escalation action items address this.
- All feature flag deploys going forward require gradual rollout. Policy change is the highest-impact action item.
Common use cases
- Post-outage documentation for engineering / leadership
- Internal reliability review after major incidents
- Near-miss documentation (no actual outage but close call)
- Complex multi-team failure modes
- Security incident post-mortems
- Data loss / corruption incidents
- Customer-facing incidents where status page matters
Best AI model for this
Claude Opus 4 or Sonnet 4.5. Incident reasoning benefits from top-tier models.
Pro tips
- Blameless doesn't mean consequence-free. It means focus on SYSTEMS, not individuals. People still need to fix the process that let them fail.
- Root cause is almost never 'human error.' That's a stopping point for lazy analysis. Keep digging.
- 5 Whys are useful as a starting point, but often end at a superficial answer. Ask 'what conditions made this possible?' for deeper insight.
- Action items must have an OWNER and a DEADLINE. 'We should...' = will not happen.
- Publish the post-mortem. Internal systems that silo post-mortems lose learning across teams.
- Don't write during the incident. Write after things are stable + enough time has passed to think clearly (usually 48-72 hours).
Customization tips
- Write the post-mortem 48-72 hours after resolution. Earlier = adrenaline distorts; later = people forget details.
- Before you send it out, have ONE uninvolved engineer read it. They'll spot blame-language you didn't notice.
- Action items without owners and deadlines DON'T happen. Be explicit even if it feels bureaucratic.
- Include 'what went well.' Blameless post-mortems need positive data to stay psychologically safe for teams.
- Create an index of post-mortems. Pattern recognition across incidents reveals systemic issues.
Variants
Customer-Facing Outage
For incidents with external impact. Includes customer comms draft.
Security Incident
For security-specific incidents. Compliance + disclosure considerations.
Near-Miss / Close Call
For incidents that almost happened. Same rigor, different tone.
Frequently asked questions
How do I use the Incident Post-Mortem Writer prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Incident Post-Mortem Writer?
Claude Opus 4 or Sonnet 4.5. Incident reasoning benefits from top-tier models.
Can I customize the Incident Post-Mortem Writer prompt for my use case?
Yes: every Promptolis Original is designed to be customized. Key levers: blameless means focusing on systems rather than individuals (people still need to fix the process that let them fail), and root cause is almost never 'human error', so keep digging past the first plausible cause.