
⚡ Promptolis Original · Coding & Development

🔍 Flaky Test Diagnoser

Diagnoses why your test is flaky from the failure logs + code: names the specific flakiness pattern (timing, order-dependency, real network, shared state, async timing) + the structural fix that doesn't just hide it with retry.

⏱️ 3 min to set up 🤖 ~70 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most teams handle flaky tests by adding retries. That trains everyone to ignore CI signal. This Original diagnoses the specific flakiness pattern + provides the structural fix — not 'just retry.'

Outputs the diagnosis: which of the 7 flaky-test patterns this is (timing, order-dependency, real network, shared state, async race, environment-dependence, time-dependence), the diagnostic evidence, the exact fix, and the prevention pattern.

Includes the 'how to verify the fix' section. Flaky tests fixed without verification often re-emerge in 2 weeks. Run-100-times-confirm-stable is the bar.

Calibrated to 2026 reality: parallel test execution surfaces races that sequential never did, AI-generated tests that pass-but-aren't-deterministic, microservice integration tests that depend on real network.

The prompt

Promptolis Original · Copy-ready
<role>
You are a flaky test debugger with 6+ years investigating test reliability issues across Vitest, Jest, pytest, Playwright, Cypress, RSpec. You have diagnosed 200+ flaky tests and know the patterns that hide structural bugs. You are direct. You will tell a builder their retry policy is making it worse, that they need clock mocking instead of real-time tests, or that their 'works on my machine' is order-dependence. You refuse to recommend retrying flaky tests as a structural answer.
</role>

<principles>
1. Reproduce first. You can't fix what you can't observe.
2. There are 7 flaky patterns. Name the specific one.
3. Run 100x to verify the fix. 100/100 passing is the bar.
4. Mock external boundaries (network, time, file system).
5. Random ordering exposes order-dependence.
6. Reset state per-test (beforeEach), not per-suite.
7. Retry policies hide bugs. Fix the root cause.
</principles>

<input>
<test-name>{the flaky test name + file path}</test-name>
<failure-rate>{rough — fails 1/10, 1/50, intermittent}</failure-rate>
<test-code>{paste the test + relevant setup}</test-code>
<failure-output>{the error message + stack trace from a typical failure}</failure-output>
<environment-where-fails>{local / CI / specific runner / specific OS}</environment-where-fails>
<test-framework>{Jest / Vitest / pytest / Playwright / Cypress / etc.}</test-framework>
<recent-changes>{anything that changed when flakiness started}</recent-changes>
<other-flaky-tests>{is this 1 test or symptomatic of broader flakiness?}</other-flaky-tests>
</input>

<output-format>
# Flaky Test Diagnosis: [test name]
## Pattern Identification
Which of the 7 flaky patterns. Why specifically this one.
## Reproduction Steps
How to reliably reproduce locally. Specific commands.
## Root Cause
What's actually happening at the timing/state level.
## The Specific Fix
Code-level fix at the right layer.
## What NOT to Do
Fixes that look right but make it worse.
## Verification Procedure
Run the test 100x. All pass = stable. Specific commands.
## Prevention Pattern
The structural pattern that prevents this class of flakiness.
## If Other Tests Have This Pattern
Quick way to find similar bugs in the codebase.
## Key Takeaways
3-5 bullets — for the team's testing playbook.
</output-format>

<auto-intake>
If input is incomplete: ask for test name, failure rate, test code, failure output, environment, test framework, recent changes, other flaky tests.
</auto-intake>

Now, diagnose the flaky test:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<test-name>tests/integration/order-processing.test.ts > 'processes order and updates inventory'</test-name>
<failure-rate>~1 in 8 runs locally. ~1 in 4 in CI. Started ~3 weeks ago.</failure-rate>
<test-code>
import { describe, it, expect, beforeAll } from 'vitest';
import { db } from '../src/db';
import { processOrder } from '../src/orders';
import { setupTestData } from './helpers';

describe('Order Processing', () => {
  beforeAll(async () => {
    await setupTestData();
  });

  it('processes order and updates inventory', async () => {
    const order = await db.orders.create({ items: [{ sku: 'WIDGET-1', qty: 2 }] });
    
    await processOrder(order.id);
    
    const inventory = await db.inventory.findOne({ sku: 'WIDGET-1' });
    expect(inventory.available).toBe(98); // started at 100, sold 2
  });
  
  it('handles out-of-stock', async () => {
    const result = await processOrder('order_oos');
    expect(result.status).toBe('failed');
  });
});
</test-code>
<failure-output>
FAIL  tests/integration/order-processing.test.ts > Order Processing > processes order and updates inventory
AssertionError: expected 96 to be 98
  - Expected: 98
  + Received: 96
  
  at /tests/integration/order-processing.test.ts:14:43
</failure-output>
<environment-where-fails>Both local + CI. CI worse because more parallel test workers.</environment-where-fails>
<test-framework>Vitest with default parallel config (8 workers)</test-framework>
<recent-changes>3 weeks ago we increased Vitest worker count from 4 → 8 to speed up CI.</recent-changes>
<other-flaky-tests>Started seeing 2 other order-related tests flake around same time. They share the same test data setup pattern.</other-flaky-tests>
🤖 Output

Flaky Test Diagnosis: 'processes order and updates inventory'

Pattern Identification

This is shared-state flakiness with a parallel-execution amplifier.

Not the other 6 patterns:

  • Not timing/race within a single test (the test awaits each step sequentially)
  • Not real-network (using local DB)
  • Not async race within the order processor (would manifest differently)
  • Not environment-dependence (same code path)
  • Not time-dependence (no Date.now usage)
  • Not order-dependence within the file (this test runs in isolation in its own describe)

The signature: test expects inventory.available to be 98 (assumes 100 initial - 2 from this test) but receives 96 (100 - 2 - 2 from another concurrent test running in parallel that's also processing a WIDGET-1 order).

Why now (3 weeks ago): you doubled the parallel worker count from 4 → 8. With 4 workers, this test was unlikely to overlap with the other order-related tests touching WIDGET-1. With 8 workers, overlap probability rose. The test was always shared-state-vulnerable; you just stayed under the failure threshold before.

Reproduction Steps

# Reproduce reliably:
npx vitest run tests/integration/ --pool=threads --minWorkers=8 --maxWorkers=8

# Or force the issue with focused runs:
for i in {1..50}; do
  npx vitest run tests/integration/order-processing.test.ts &
  npx vitest run tests/integration/order-bulk.test.ts &
  wait
  echo "Iteration $i complete"
done

This intentionally runs the order-processing test concurrently with other order tests that may share inventory state.

Root Cause

Multiple test files use setupTestData() which seeds the database with WIDGET-1 at quantity 100. They then run various orders that decrement WIDGET-1 inventory.

Because setupTestData() is in beforeAll (not beforeEach), the DB state is shared across all tests in the file. Worse: all parallel test files share the SAME database instance — meaning they share the same WIDGET-1 inventory record.

When tests run in parallel:

1. Test A reads inventory (100), creates order for 2, expects to see 98 after.

2. Concurrent Test B reads same inventory (100), creates order for 2, expects to see 98 after.

3. Both succeed — final inventory is 96.

4. Test A's assertion fails: expected 98, received 96.

This is a database-level shared state issue, amplified by test parallelism.
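The interleaving above can be simulated in a few lines. This is a standalone sketch with illustrative names (`fakeProcessOrder` is a stand-in, not the project's function); the deterministic interleaving here only approximates the probabilistic overlap of real parallel workers:

```typescript
// Two parallel "tests" share one inventory record seeded at 100. Each
// decrements by 2 via an async "order processor", then reads inventory
// back with a second async "query" — and observes the other's decrement.
type Inventory = { available: number };

const tick = () => new Promise<void>((r) => setTimeout(r, 10));

async function fakeProcessOrder(inv: Inventory, qty: number): Promise<void> {
  await tick(); // simulated DB round-trip inside the processor
  inv.available -= qty;
}

async function runTest(inv: Inventory): Promise<number> {
  await fakeProcessOrder(inv, 2);
  await tick(); // simulated db.inventory.findOne(...) round-trip
  return inv.available; // the value the assertion checks against 98
}

async function demo(): Promise<number> {
  const shared: Inventory = { available: 100 }; // one record, both workers
  const [seenByA] = await Promise.all([runTest(shared), runTest(shared)]);
  return seenByA; // 96: both decrements landed before A read back
}

demo().then((seen) => console.log(`expected 98, received ${seen}`));
```

Swap the shared object for a per-call one and the same code returns 98 every time, which is exactly what the isolation fixes below achieve at the database layer.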

The Specific Fix

Choose ONE of these approaches based on team preference:

Option 1 (preferred): Per-test ephemeral database

Each test gets its own isolated DB instance. No sharing.

// /tests/helpers/setup.ts
import { newDb } from 'pg-mem'; // or test-containers for Postgres

export async function createTestDb() {
  const db = newDb(); // pg-mem's newDb() is synchronous
  await migrateAll(db); // project migration helper
  await seedDb(db); // project seed helper
  return db;
}

// Each test gets its own:
describe('Order Processing', () => {
  let db;
  beforeEach(async () => {
    db = await createTestDb();
  });
  
  it('processes order and updates inventory', async () => {
    // ... uses db (this test's instance)
  });
});

Pro: total isolation. No flakiness possible from shared state.

Con: slower (each test creates a DB instance). Use pg-mem for speed.

Option 2: Transaction-rollback pattern

Each test runs in a transaction; rolled back at the end. State changes don't persist.

import { beforeEach, afterEach } from 'vitest';

let tx;

beforeEach(async () => {
  tx = await db.beginTransaction();
});

afterEach(async () => {
  await tx.rollback();
});

it('processes order and updates inventory', async () => {
  const order = await tx.orders.create(...);
  await processOrder(order.id, { db: tx });
  // assertions...
  // After test: tx.rollback() resets everything
});

Pro: fast (no full DB recreation).

Con: requires processOrder to accept a DB connection (refactor effort). Doesn't work for tests that span the transaction (e.g., commit-required tests).
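The refactor Option 2 requires can be small. A hedged sketch of the dependency-injection shape it implies — the `Db` interface and `makeFakeDb` helper are illustrative, not the project's real types:

```typescript
// processOrder takes its DB handle as a parameter instead of importing a
// module-level singleton, so tests can pass a transaction (or a fake).
interface Db {
  decrement(sku: string, qty: number): Promise<void>;
  available(sku: string): number;
}

function makeFakeDb(): Db {
  const inventory = new Map<string, number>([['WIDGET-1', 100]]);
  return {
    async decrement(sku, qty) {
      inventory.set(sku, (inventory.get(sku) ?? 0) - qty);
    },
    available: (sku) => inventory.get(sku) ?? 0,
  };
}

// Simplified stand-in for the real order processor.
async function processOrder(orderId: string, deps: { db: Db }): Promise<void> {
  await deps.db.decrement('WIDGET-1', 2);
}

async function demo(): Promise<number> {
  const db = makeFakeDb(); // each test builds its own instance
  await processOrder('order-1', { db });
  return db.available('WIDGET-1'); // 98 — isolated and deterministic
}
```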

Option 3: Per-test schema/database

Each test gets a unique Postgres schema or database name; tests don't see each other.

let schemaName;

beforeEach(async () => {
  schemaName = `test_${randomString()}`;
  await db.query(`CREATE SCHEMA ${schemaName}`);
  await db.query(`SET search_path TO ${schemaName}`);
  await migrateAll(db);
});

afterEach(async () => {
  await db.query(`DROP SCHEMA ${schemaName} CASCADE`);
});

Pro: real Postgres. Full isolation.

Con: slowest. Setup overhead per test.

Recommended for your case: Option 1 (pg-mem) for unit-style tests, Option 2 (transaction-rollback) for tests that need real Postgres features.

What NOT to Do

1. Don't add `it.retry(3)` to make the test pass. It trains the team to ignore failures + hides the underlying issue.

2. Don't drop Vitest worker count back to 4. That's papering over the problem; it'll re-emerge as your test count grows.

3. Don't use `beforeAll` more aggressively for setup. That's the root cause; doing more of it makes flakiness worse.

4. Don't add `mutex` / `lock` to test execution. Killing parallelism kills your CI speed; isolation is the right answer.

5. Don't make the assertion non-strict (`toBeLessThan(99)` instead of `toBe(98)`). That hides the bug; you'd ship code that occasionally over-decrements inventory.

Verification Procedure

After implementing the fix:

# Run the test 100 times locally
for i in {1..100}; do
  npx vitest run tests/integration/order-processing.test.ts || break
done
echo "Completed without failure"

# Or repeat within Vitest via the per-test repeats option:
#   it('processes order and updates inventory', { repeats: 100 }, async () => { ... })

# Pass criteria: 100/100 pass
# If any fail: fix is incomplete; investigate

Run with full parallelism:

npx vitest run --pool=threads --maxWorkers=16

Forcing higher concurrency than CI normally uses surfaces edge cases.

Prevention Pattern

For the team going forward:

1. Default to per-test database isolation. Make this the easy path. Helper function `createTestDb()` returns isolated instance.

2. Forbid `beforeAll` for state-mutating setup. There is no off-the-shelf lint rule for this, but ESLint's built-in `no-restricted-syntax` can ban `beforeAll` outright:

// .eslintrc.js
rules: {
  'no-restricted-syntax': [
    'error',
    {
      selector: "CallExpression[callee.name='beforeAll']",
      message: 'Use beforeEach; shared beforeAll state causes flaky tests.',
    },
  ],
}

3. Run tests in random order to surface order-dependence (`--sequence.shuffle` in Vitest, `--randomize` in Jest, pytest-randomly in pytest). Add it to CI.

4. Track flaky tests in a registry. When a test flakes, log it. Weekly review: any tests flaking? Fix or skip with `it.skip` (NOT silent retry).

5. Code-review checklist: any test that touches DB without explicit isolation gets a review comment.
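For item 3, shuffling can also live in the project config so every run gets it, not just CI. A minimal sketch, assuming a standard Vitest setup:

```typescript
// vitest.config.ts — run tests in shuffled order so order-dependence
// fails loudly instead of lurking.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    sequence: {
      shuffle: true, // pair with sequence.seed to reproduce a failing order
    },
  },
});
```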

If Other Tests Have This Pattern

Quick audit:

# Find tests using beforeAll with DB operations
grep -rn "beforeAll.*db\|beforeAll.*setupTestData" tests/

# Find tests modifying shared records
grep -rn "WIDGET-1\|inventory\|shared seed" tests/

Likely sister issues:

  • Other order-test files touching same WIDGET-1 record (you mentioned 2 other tests started flaking — likely same pattern)
  • User tests sharing same seed user records
  • Subscription tests sharing seed plans

Migrate them with the same fix in priority order: anything customer-facing (payments, orders) first.

Key Takeaways

  • Your bug is shared-state flakiness amplified by parallel workers. Doubling worker count 3 weeks ago sharply raised the odds that two order tests overlap.
  • Fix at the database layer with per-test isolation. pg-mem for fast tests; transaction-rollback for tests needing real Postgres.
  • Verify with 100x runs at high parallelism. Anything less than 100/100 stable means fix is incomplete.
  • Ban beforeAll for state-mutating setup. ESLint rule prevents regression.
  • Audit sister tests now. Same pattern almost certainly exists elsewhere; preempt before they flake too.

Common use cases

  • Engineer hitting a flaky test that fails 1-in-10 runs
  • Tech lead investigating systemic flaky test problem in CI
  • QA engineer debugging Playwright/Cypress tests that fail randomly
  • Backend engineer with race-condition-flavored test failures
  • Engineer joining a codebase with `it.skip` everywhere because tests were flaky
  • Solo dev whose CI is unreliable and they can't tell real failures from flakes

Best AI model for this

Claude Opus 4. Flaky test diagnosis needs reasoning about timing, ordering, and environment — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Pro tips

  • Run the failing test 100 times locally. If it fails 5+ times in 100 runs, it's flaky. Track the failure rate.
  • Reproduce the flakiness BEFORE fixing. If you can't reproduce, you can't verify the fix worked.
  • Increase parallelism to surface race conditions. Running with more workers than usual (e.g. Jest's `--maxWorkers=8`) reveals what sequential execution hides.
  • Mock real network calls in tests. Even 'fast' real APIs have variance that flakes tests.
  • Random ordering surfaces order-dependent tests. Run `jest --randomize` or pytest-randomly.
  • Time-based tests need clock mocking. `vi.useFakeTimers()` + `vi.advanceTimersByTime()` make time deterministic.
  • Shared state across tests = order-dependence. Reset state in `beforeEach`, not `beforeAll`.
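The clock-mocking tip generalizes beyond fake-timer APIs: inject the clock as a parameter and tests never touch real time at all. A sketch with illustrative names (`isExpired` is not from the example above):

```typescript
// Time-dependent logic takes a clock function; production passes Date.now,
// tests pass a frozen value — no sleeping, no timer APIs, no flaking.
type Clock = () => number;

function isExpired(createdAtMs: number, ttlMs: number, now: Clock = Date.now): boolean {
  return now() - createdAtMs >= ttlMs;
}

// Deterministic checks against a frozen clock:
const frozen: Clock = () => 1_000_000;
console.log(isExpired(1_000_000 - 5_000, 3_000, frozen)); // true  (5s old, 3s TTL)
console.log(isExpired(1_000_000 - 1_000, 3_000, frozen)); // false (1s old)
```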

Customization tips

  • Paste the actual test code + actual failure output. Concurrency bugs need precision; abstract descriptions miss the pattern.
  • Specify failure rate quantitatively. 1-in-10 vs 1-in-1000 calibrates the urgency + diagnostic approach.
  • Mention environment (local vs CI vs specific OS). CI-only flakiness has different root causes than local-and-CI.
  • Note any recent changes (worker count, dependency updates, test refactors). Recent changes often correlate with onset.
  • If multiple tests are flaking, mention the pattern. Single flaky test vs systemic flakiness need different fixes.
  • Use the E2E Test Mode variant for Playwright/Cypress flakiness — different patterns dominate (wait conditions, viewport, network).

Variants

E2E Test Mode

For Playwright/Cypress flakiness — emphasizes wait-for-condition, network mocking, viewport reliability.

Backend Race Mode

For backend test flakiness — emphasizes async race conditions, DB transaction isolation, queue test patterns.

Time-Dependent Test Mode

For tests using Date/timers — clock-mocking patterns, time-zone independence.

CI-Only Flaky Mode

For tests passing locally but failing in CI — emphasizes environment differences, parallelism, timing.

Frequently asked questions

How do I use the Flaky Test Diagnoser prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Flaky Test Diagnoser?

Claude Opus 4. Flaky test diagnosis needs reasoning about timing, ordering, and environment — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Can I customize the Flaky Test Diagnoser prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: paste the actual test code and failure output rather than abstract descriptions; quantify the failure rate (1-in-10 vs 1-in-1000); note the environment and any recent changes, such as a worker-count increase.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals