⚡ Promptolis Original · Coding & Development
🚨 Incident Post-Mortem Writer
Writes the blameless post-mortem that actually prevents recurrence — covering timeline, root cause, and specific action items with owners, not 'we'll be more careful next time.'
Why this is epic
Most post-mortems are either blame hunts or 'we'll be more careful' platitudes — both fail to prevent recurrence. This Original produces the structured blameless post-mortem (Google SRE style) that actually generates prevention.
Distinguishes contributing factors (multiple, usually) from root cause (singular, often deeper than it appears) — most post-mortems stop at the first plausible cause.
Produces specific action items with OWNERS and DEADLINES — not 'the team will investigate.' Generic action items = same incident in 6 months.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<incident-summary>Payment processing was degraded for 90 minutes. About 40% of checkouts failed with a 500 error. Severity: SEV-2.</incident-summary>
<duration>Trigger: 14:20 UTC (deploy). Detected: 14:45 UTC (monitoring alert). Mitigated: 15:30 UTC (rollback). Resolved: 15:50 UTC (traffic normal).</duration>
<timeline>14:20 deploy shipped. 14:25 checkout failure rate rises. 14:45 alert fires. 14:50 on-call engages. 15:00 hypothesis: new code path has null-pointer. 15:20 rollback started. 15:30 rollback complete, error rate drops. 15:50 confirmed stable.</timeline>
<what-triggered>Deploy added a new flag for A/B-testing a new checkout flow. The flag defaulted to 'on' instead of 'off' in production.</what-triggered>
<what-made-it-worse>Alert was delayed 25 min because the threshold was set at a 10% error rate (the rate peaked at 40% but built up from 0%). Also: on-call was in a meeting and missed the initial page.</what-made-it-worse>
<how-resolved>Full deploy rollback. The new checkout flow was immediately disabled.</how-resolved>
<people-involved>On-call SRE, backend engineer who shipped the deploy, engineering manager.</people-involved>
Incident Post-Mortem: Payment Processing Degradation, 2024-06-15
Severity: SEV-2
Status: Resolved
Duration: 90 minutes (14:20-15:50 UTC)
Summary
A deploy at 14:20 UTC introduced a new A/B-test flag that defaulted to 'on' in production, exposing an uncaught null-pointer error in the new checkout flow. Roughly 40% of checkout attempts failed with HTTP 500 over a 90-minute window before a full rollback restored service.
Impact
- Users: approximately 4,200 failed checkouts during the window.
- Revenue: estimated ~$65k in lost / delayed transactions (based on average order value × failure count).
- SLO: checkout success rate SLO (99.5%) breached for the hour; error budget consumed.
- Customer-facing: 38 support tickets received during the incident; no social media escalation.
Timeline (UTC)
| Time | Event |
|---|---|
| 14:20 | Deploy shipped including new A/B-test flag (flag-checkout-v2) |
| 14:25 | Checkout error rate begins to rise (0% → 5%) |
| 14:35 | Error rate at 20% |
| 14:45 | Monitoring alert fires (10% threshold crossed) |
| 14:45-14:50 | Initial page to on-call; on-call was in meeting and did not acknowledge immediately |
| 14:50 | Secondary page picked up; on-call begins investigation |
| 15:00 | Hypothesis identified: new checkout path null-pointer exception on specific cart configurations |
| 15:15 | Decision to roll back |
| 15:20 | Rollback initiated |
| 15:30 | Rollback complete; error rate drops below 1% |
| 15:50 | Confirmed stable; incident declared resolved |
Root Cause
The new CheckoutV2Service was deployed with a feature flag (flag-checkout-v2) intended to default OFF in production for gradual rollout. The flag configuration file had the production default set to true rather than false — a copy-paste error from a staging config.
The new code path contained an uncaught null-pointer exception on carts with zero line items (edge case for users adding gift cards to empty carts). Because the flag was enabled for 100% of traffic immediately, all affected users hit the bug.
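The report doesn't include the faulty code, but the failure mode is easy to sketch. A hypothetical Python reconstruction (the CheckoutV2 names and discount logic are illustrative, not from the actual service):

```python
from dataclasses import dataclass, field

@dataclass
class Cart:
    line_items: list = field(default_factory=list)
    gift_cards: list = field(default_factory=list)

def checkout_v2_discount(cart: Cart) -> float:
    # Hypothetical bug: assumes at least one line item exists.
    # max() over an empty sequence raises, analogous to the
    # uncaught null-pointer on gift-card-only empty carts.
    priciest = max(item["price"] for item in cart.line_items)
    return priciest * 0.1

def checkout_v2_discount_fixed(cart: Cart) -> float:
    # The guard the original code was missing: the empty-cart edge case.
    if not cart.line_items:
        return 0.0
    priciest = max(item["price"] for item in cart.line_items)
    return priciest * 0.1
```

The fix is one guard clause; the lesson is that the guard was reachable by real traffic the moment the flag flipped to 100%.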
Contributing Factors
1. Config file had wrong default value. The production flag was incorrectly set to `true`. Code review did not catch this because the change was formatted as a routine config update.
2. Feature flag rollout was not gradual. No ramp-up (e.g., 1% → 10% → 50% → 100%). A gradual rollout would have limited blast radius to a small percentage.
3. Null-pointer bug existed in the new code. An edge case (empty cart + gift card) was not covered by tests.
4. Alert threshold was too permissive. 10% error rate threshold meant 25-min delay before alerting; a lower threshold would have paged earlier.
5. On-call paging had no automatic escalation for pages left unacknowledged after 5 minutes.
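Contributing factor 2 (no ramp-up) is the kind of policy that can be enforced in code rather than in a checklist. A minimal sketch of a time-based rollout schedule, assuming deterministic user bucketing (the stage percentages follow the 1% → 10% → 50% → 100% ramp mentioned above; the hour boundaries are illustrative):

```python
# Rollout stages: (hours since rollout start, percent of traffic served the flag)
ROLLOUT_STAGES = [(0, 1), (6, 10), (12, 50), (24, 100)]

def rollout_percent(hours_elapsed: float) -> int:
    """Return the traffic percentage the flag should serve at this time."""
    percent = 0
    for start_hour, pct in ROLLOUT_STAGES:
        if hours_elapsed >= start_hour:
            percent = pct
    return percent

def flag_enabled(user_id: int, hours_elapsed: float) -> bool:
    # Deterministic bucketing: a given user stays in or out of the cohort
    # for the whole ramp instead of flapping between code paths.
    return (user_id % 100) < rollout_percent(hours_elapsed)
```

Under this scheme the null-pointer bug would have hit roughly 1% of traffic in the first hours, keeping the incident well below alerting thresholds while still surfacing in error logs.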
Detection + Response
- Detection time: 25 minutes (14:20 → 14:45). Too slow. SLO-related alerts should fire at 5-10 min for this severity.
- Response time: 5 min from second page to investigation start. Acceptable after delay.
- Mitigation time: 30 min from investigation to rollback complete. Appropriate — rollback was the right decision and was executed cleanly.
What Went Well
- Rollback was fast once initiated. Deploy pipeline supported one-command rollback.
- On-call engineer, once engaged, correctly identified the hypothesis within 10 minutes.
- No data corruption. Failed checkouts did not leave partial-state records; the null-pointer exception aborted the request cleanly.
- Customer comms posted to status page within 20 min of rollback decision.
What Went Poorly
- Config error not caught in review. The production default was wrong; reviewer saw it as 'routine' and did not verify.
- No gradual rollout. A new critical-path feature deployed to 100% of traffic immediately.
- Alert threshold too high. 10% error rate is too permissive for checkout SLO; 3-5% would have fired faster.
- First page missed. On-call was in a meeting and could not respond.
- Edge case not covered by tests. The empty-cart + gift-card scenario was untested.
Action Items
| # | Item | Owner | Deadline | Status |
|---|---|---|---|---|
| 1 | Add automated validation that production flag defaults match explicit declarations (prevent copy-paste errors) | @alex (infra) | 2024-06-29 | Open |
| 2 | Standardize gradual rollout requirement for all new feature flags (1% → 10% → 50% → 100% over 24 hrs minimum) | @maria (platform) | 2024-07-06 | Open |
| 3 | Lower alert threshold for checkout success rate (10% → 3%) | @jamie (SRE) | 2024-06-22 | Open |
| 4 | Add escalation policy for unacknowledged pages after 3 min | @jamie (SRE) | 2024-06-22 | Open |
| 5 | Add test coverage for empty-cart + gift-card edge case | @raj (backend) | 2024-06-29 | Open |
| 6 | Code review checklist: for config changes, verify production default explicitly | @manager | 2024-06-22 | Open |
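Action item 1 can be a small CI check rather than a review-time convention. A sketch, assuming flags declare an intended production default and environment configs are plain dicts (the names and config shape are illustrative, not the team's actual format):

```python
# Declared intent: every flag's expected production default.
# Policy from this post-mortem: new flags must default OFF in production.
DECLARED_DEFAULTS = {
    "flag-checkout-v2": False,
}

def validate_flag_defaults(prod_config: dict) -> list:
    """Return violations where the production config disagrees with declarations."""
    violations = []
    for flag, expected in DECLARED_DEFAULTS.items():
        actual = prod_config.get(flag)
        if actual != expected:
            violations.append(
                f"{flag}: production default is {actual!r}, expected {expected!r}"
            )
    return violations

# A CI step fails the build whenever validate_flag_defaults(...) is non-empty,
# which would have caught the copy-paste error before deploy.
```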
Lessons Learned
- Feature flags without gradual rollout are a deploy hidden in flag form. Treat a new flag default like a full deploy and require the same rigor.
- Edge cases in checkout have outsized blast radius. Checkout must have exhaustive test coverage of weird cart states — empty carts, gift-only carts, zero-value carts, etc.
- Alert thresholds should be tied to SLOs. A 10% error rate on a 99.5% SLO is too permissive. Align.
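The "align thresholds to SLOs" lesson can be made concrete with error-budget arithmetic. A simple sketch (a single-window budget view, not a full multi-window burn-rate policy; the 30-day window is an assumption):

```python
def max_tolerable_error_rate(slo: float) -> float:
    """Error budget implied by the SLO: a 99.5% SLO tolerates 0.5% errors."""
    return 1.0 - slo

def minutes_to_burn_budget(error_rate: float, slo: float,
                           window_minutes: float = 30 * 24 * 60) -> float:
    """At a constant error rate, how long until the 30-day budget is exhausted?"""
    budget = max_tolerable_error_rate(slo)
    if error_rate <= budget:
        return float("inf")  # within budget; never burns out
    return window_minutes * budget / error_rate
```

At this incident's 40% error rate, a 99.5% SLO budget burns in well under a day of cumulative exposure, which is why a 10% alert threshold is far too permissive for checkout: by the time it fires, a large slice of the monthly budget is already gone.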
Follow-Up Review
- 2 weeks post: check progress on all action items in the weekly incident review.
- 6 weeks post: verify that gradual-rollout policy is actually being followed on new flag deploys.
- 12 weeks post: evaluate whether alert threshold change reduced detection time on other incidents.
Key Takeaways
- Root cause: wrong production flag default + no gradual rollout + uncovered edge case. Three failures compounding; any one caught would have prevented the outage.
- Detection time (25 min) is the biggest response failure. Alert threshold + paging escalation action items address this.
- All feature flag deploys going forward require gradual rollout. Policy change is the highest-impact action item.
Common use cases
- Post-outage documentation for engineering / leadership
- Internal reliability review after major incidents
- Near-miss documentation (no actual outage but close call)
- Complex multi-team failure modes
- Security incident post-mortems
- Data loss / corruption incidents
- Customer-facing incidents where status page matters
Best AI model for this
Claude Opus 4 or Sonnet 4.5. Incident reasoning benefits from top-tier models.
Pro tips
- Blameless doesn't mean consequence-free. It means focus on SYSTEMS, not individuals. People still need to fix the process that let them fail.
- Root cause is almost never 'human error.' That's a stopping point for lazy analysis. Keep digging.
- 5 Whys are useful as a starting point, but often end at a superficial answer. Ask 'what conditions made this possible?' for deeper insight.
- Action items must have an OWNER and a DEADLINE. 'We should...' = will not happen.
- Publish the post-mortem. Internal systems that silo post-mortems lose learning across teams.
- Don't write during the incident. Write after things are stable + enough time has passed to think clearly (usually 48-72 hours).
Customization tips
- Write the post-mortem 48-72 hours after resolution. Earlier = adrenaline distorts; later = people forget details.
- Before you send it out, have ONE uninvolved engineer read it. They'll spot blame-language you didn't notice.
- Action items without owners and deadlines DON'T happen. Be explicit even if it feels bureaucratic.
- Include 'what went well.' Blameless post-mortems need positive data to stay psychologically safe for teams.
- Create an index of post-mortems. Pattern recognition across incidents reveals systemic issues.
Variants
Customer-Facing Outage
For incidents with external impact. Includes customer comms draft.
Security Incident
For security-specific incidents. Compliance + disclosure considerations.
Near-Miss / Close Call
For incidents that almost happened. Same rigor, different tone.
Frequently asked questions
How do I use the Incident Post-Mortem Writer prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Incident Post-Mortem Writer?
Claude Opus 4 or Sonnet 4.5. Incident reasoning benefits from top-tier models.
Can I customize the Incident Post-Mortem Writer prompt for my use case?
Yes: every Promptolis Original is designed to be customized. Key levers: blameless means focusing on systems rather than individuals (people still need to fix the process that let them fail), and root cause is almost never 'human error', so keep digging past the first plausible cause.