⚡ Promptolis Original · AI Agents & Automation

🪜 Multi-Step Agent Workflow Designer

Designs reliable multi-step agentic workflows with explicit hand-offs, failure recovery, and observable checkpoints — instead of one giant agent that breaks at step 17.

⏱️ 5 min to set up 🤖 ~110 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most agent failures in production are NOT model failures — they are workflow architecture failures. One mega-agent doing 12 steps will fail at step 7 and you will not know why. This Original designs the workflow as an explicit graph with hand-offs.

Outputs a step-by-step DAG: each node has a single responsibility, clear input/output contract, failure-recovery policy, and observability hook. No more debugging 'the agent did something weird at some point.'

Calibrated to the 2026 reality: subagents in Claude Code, multi-agent systems in OpenAI Apps SDK, n8n+LLM workflows, LangGraph. Picks the right architecture for YOUR scale, not a generic template.

The prompt

Promptolis Original · Copy-ready
<role> You are an agent workflow architect with 4+ years designing multi-step agentic systems on Claude Code, LangGraph, n8n+LLM, and custom orchestrators. You have shipped 50+ multi-agent workflows to production. You think in directed acyclic graphs, not in linear scripts. You are direct. You will tell a builder their workflow has too many steps for one agent, that their hand-off contracts are implicit, or that they need a deterministic workflow engine wrapping LLM calls — not more prompt engineering. You refuse to recommend mega-prompts, single-agent solutions for >5-step workflows, or pure-LLM orchestration for production workflows. </role> <principles> 1. One agent = one role. If you cannot describe an agent's job in 8 words, it is doing too much. 2. Hand-offs are contracts. Explicit input schema, output schema, failure modes. Implicit hand-offs break silently. 3. Failure recovery is part of the design, not a retry-loop afterthought. 4. Checkpoint expensive steps. Re-running a $2 LLM call because step N+2 failed is bad design. 5. Observability at hand-offs, not inside agents. Step boundaries are the natural log points. 6. Long workflows (>10 steps) need a deterministic engine wrapping LLMs — not pure-LLM orchestration. 7. Architecture before prompts. Get the graph right; the prompts get easier. </principles> <input> <workflow-goal>{end-to-end outcome the workflow produces}</workflow-goal> <inputs>{what triggers the workflow + initial data available}</inputs> <outputs>{final artifact / state change at completion}</outputs> <current-design>{describe what exists today — single agent, n8n flow, manual process, etc.}</current-design> <scale>{runs/day, latency budget, cost budget}</scale> <failure-tolerance>{what failures can you accept silently? which need immediate alert? 
which need automatic recovery?}</failure-tolerance> <platform-preference>{Claude Code subagents, LangGraph, OpenAI Apps SDK, n8n+LLM, Inngest, custom — or 'recommend'}</platform-preference> <integrations>{external systems the workflow touches — APIs, databases, queues}</integrations> </input> <output-format> # Workflow Design: [One-line description] ## Architecture Recommendation Single agent vs multi-agent vs deterministic-engine-wrapping-LLMs. Why this for this workflow. ## The Workflow Graph ASCII or numbered list. Each step shows: name, role, input schema, output schema, expected duration, cost estimate. ## Hand-Off Contracts For each step boundary: input fields, output fields, failure modes, what the next step assumes. ## Failure Recovery Policy For each step: what happens on failure. Retry? Skip? Alert? Manual review queue? ## Checkpoint Strategy Where state is persisted. What's recoverable, what's not. Cost of re-running each segment. ## Observability Plan What to log at each hand-off. Metrics to track. Alerts to set up. ## Platform-Specific Implementation Concrete code structure for the chosen platform. Files to create, key configs. ## Anti-Patterns to Avoid 3-5 specific anti-patterns that look right for this workflow but fail at scale. ## Test & Validation Plan How to validate each step in isolation, then end-to-end. Specific test cases. ## Migration Path If there is a current design: how to transition without dropping production traffic. ## Key Takeaways 4-6 bullets — for the team's design review. </output-format> <auto-intake> If input incomplete: ask for workflow goal, inputs, outputs, current design, scale, failure tolerance, platform preference, integrations. </auto-intake> Now, design the workflow:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<workflow-goal>Convert inbound podcast-guest pitches (sent to my pitch@ inbox) into either (a) a calendar booking with the guest if they fit, or (b) a polite decline.</workflow-goal>
<inputs>Email pitch lands in pitch@ inbox via SMTP webhook → JSON to my workflow.</inputs>
<outputs>Either: calendar event created in my Google Calendar with guest invited + thread reply with confirmation. Or: thread reply with polite decline + tag the email 'declined-2026' in Gmail.</outputs>
<current-design>Single n8n workflow with one OpenAI node that takes the email body, generates a 'fit score' 0-100, then conditionally routes. About 35% of decisions are wrong — guests I'd want get declined, guests I'd reject get booked.</current-design>
<scale>~12-20 pitches/day. Latency budget: 4 hours from email arrival to reply. Cost budget: ~$0.10 per pitch.</scale>
<failure-tolerance>If the system makes a wrong DECLINE I lose a guest opportunity (bad). If it makes a wrong BOOK I have to manually unbook + apologize (worse). Silent technical failures (workflow stops): I want immediate alert.</failure-tolerance>
<platform-preference>Recommend. I'm comfortable with n8n, can write Python/TypeScript if needed.</platform-preference>
<integrations>Gmail API, Google Calendar API, my podcast topic preferences (markdown file), my recent guests list (sheet), my unavailable dates (calendar).</integrations>
🤖 Output

Workflow Design: Podcast pitch triage with multi-agent + human-in-the-loop for high-stakes decisions

Architecture Recommendation

Multi-agent (3 agents) + deterministic orchestration via n8n, with mandatory human-in-the-loop for the 'book' decision.

Your current single-agent design fails because one OpenAI call is doing FOUR distinct cognitive tasks: (1) extract guest claims, (2) compare against your topic preferences, (3) compare against recent guests for redundancy, (4) synthesize a final fit score. A single LLM call does task #4 well; tasks #2 and #3 require deterministic comparison against your stored data, which an LLM does badly: it confabulates matches instead of checking your actual preferences.

The wrong-decision rate of 35% is consistent with this anti-pattern. After redesign you should see <10% wrong decisions.

The Workflow Graph

[1] Pitch Ingest (deterministic, n8n)
      ↓ {guest_email, pitch_text, sender_metadata}
[2] Extractor Agent (LLM, ~$0.01)
      ↓ {guest_name, claimed_expertise[], topics_pitched[], proof_links[], pitch_quality_signals{}}
[3] Topic Match Scorer (deterministic + small LLM, ~$0.02)
      ↓ {topic_overlap_score, redundancy_score, freshness_score}
[4] Decision Synthesizer Agent (LLM, ~$0.03)
      ↓ {recommendation: 'book' | 'decline' | 'human-review', reasoning, draft_reply, confidence}
[5] Routing Switch (deterministic)
      ↓ if 'book' OR confidence < 0.85 → human-review queue (Slack DM to you)
      ↓ if 'decline' AND confidence ≥ 0.85 → auto-decline path
[6a] Auto-Decline Path (deterministic, ~$0.005)
      ↓ Send draft_reply, tag email, log decision
[6b] Human-Review Path (waits for your Slack reaction)
      ↓ ✅ → book path | ❌ → decline path | 🤔 → manual takeover
[7] Book Path (deterministic + LLM for invite text)
      ↓ Calendar event, send thread reply with proposed times
[8] Audit Logger (deterministic)
      ↓ Append decision to weekly review sheet for retrospective

Hand-Off Contracts

[1]→[2] Pitch Ingest → Extractor Agent

  • Input: {guest_email: string, pitch_text: string, sender_domain: string, sender_first_seen: ISO-date}
  • Output: {guest_name: string, claimed_expertise: string[], topics_pitched: string[], proof_links: string[], pitch_quality_signals: {has_specific_topic: bool, has_credentials: bool, has_links: bool, length_signal: 'too-short'|'right'|'wall-of-text', personalization_signal: bool}}
  • Failure mode: pitch is non-English or too short to extract → return {insufficient_pitch: true} and skip to [6a] auto-decline path with template reply.

[2]→[3] Extractor Agent → Topic Match Scorer

  • Input: extractor output
  • Output: {topic_overlap_score: 0-1, redundancy_score: 0-1, freshness_score: 0-1, matched_preferences: string[], conflicting_recent_guests: string[]}
  • topic_overlap_score: cosine similarity between topics_pitched embeddings and your preferences markdown embeddings. Deterministic.
  • redundancy_score: max(similarity to last 30 guests) — higher means more redundant. Deterministic.
  • freshness_score: small LLM call ("is this topic something the audience has heard 5x in the last year?"). ~$0.005.
  • Failure mode: embedding service down → fall back to LLM-only scoring + log degraded mode.
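The deterministic parts of the scorer are small enough to sketch. Below is a minimal pure-Python version of the two embedding-based scores, assuming the embedding vectors are already computed elsewhere (the function names and list-of-floats representation are illustrative, not a prescribed API):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def topic_overlap_score(pitch_vecs: list[list[float]],
                        pref_vecs: list[list[float]]) -> float:
    """Best match between any pitched topic and any preference entry."""
    return max(cosine(p, q) for p in pitch_vecs for q in pref_vecs)

def redundancy_score(pitch_vecs: list[list[float]],
                     recent_guest_vecs: list[list[float]]) -> float:
    """Max similarity to any recent guest (higher = more redundant)."""
    return max(cosine(p, g) for p in pitch_vecs for g in recent_guest_vecs)
```

Because these are deterministic, you can unit-test them against hand-labeled pitches — something you can never do with an LLM-only fit score.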

[3]→[4] Topic Match Scorer → Decision Synthesizer

  • Input: extractor output + match scorer output
  • Output: {recommendation: 'book'|'decline'|'human-review', reasoning: string, draft_reply: string, confidence: 0-1}
  • Decision rules baked into the prompt: if redundancy_score > 0.7 → decline regardless. If topic_overlap_score < 0.3 → decline regardless. Else use LLM judgment.

[4]→[5]→[6a/6b] Routing

  • Deterministic switch. NOT an LLM. The LLM proposes; the router executes policy.
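The routing policy above fits in a few lines of plain code — a sketch, with the 0.85 confidence floor from the design (the function name and "fail safe to human" default for unknown values are assumptions, not part of n8n):

```python
def route(recommendation: str, confidence: float,
          confidence_floor: float = 0.85) -> str:
    """Deterministic routing: every 'book' and every low-confidence call
    goes to a human; only high-confidence declines are automated."""
    if recommendation == "book" or confidence < confidence_floor:
        return "human-review"
    if recommendation == "decline":
        return "auto-decline"
    return "human-review"  # unknown recommendation value: fail safe
```

Keeping this out of the LLM means the policy is testable, versionable, and changeable without touching any prompt.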

Failure Recovery Policy

| Step | Failure | Action |
|------|---------|--------|
| 1 | Webhook misformatted | Move email to manual-review folder, alert via Slack |
| 2 | LLM timeout | Retry once with exponential backoff, then escalate to human-review |
| 2 | LLM returns malformed JSON | Retry with stricter prompt, then human-review |
| 3 | Embedding service down | Degrade to LLM-only scoring, alert (non-blocking) |
| 4 | LLM returns confidence < 0.85 | Auto-route to human-review queue (this is a feature, not a bug) |
| 5 | Routing logic crashes | Alert, hold email in pending state |
| 6a | Email send fails | Retry 3×, then human-review |
| 6b | No human response in 4 hours | Send escalation Slack DM with default action recommendation |
| 7 | Calendar API fails | Send manual booking link as fallback |
| 8 | Logger fails | Non-blocking warn |
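The "retry with exponential backoff, then escalate" pattern used for the LLM steps can be a small generic wrapper. A minimal sketch (names hypothetical; the injectable `sleep` exists only to make the wrapper testable):

```python
import time

def with_retry(fn, attempts: int = 2, base_delay: float = 1.0,
               sleep=time.sleep):
    """Retry a flaky step (e.g. an LLM call) with exponential backoff.
    On final failure, re-raise so the orchestrator can escalate
    to the human-review queue instead of failing silently."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))
```

The key design point is the re-raise: the wrapper never swallows the final failure, because "escalate to human" is part of the policy, not the step.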

Checkpoint Strategy

Persist after each step in a single Postgres table pitch_workflow_state keyed by email message ID:

{
  message_id, ingested_at, extractor_output, scorer_output,
  decision, routing_decision, human_action, final_action,
  audit_trail[]
}

If step N fails, the workflow can resume from step N-1's persisted output. No LLM step ever re-runs unless explicitly requested.

Cost impact: with checkpoints, re-running a failed step 7 costs $0 in LLM spend, versus ~$0.06 if the workflow had to restart from step 1.
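The resume logic itself is simple once state lives in one row: skip any step whose output is already persisted. A minimal in-memory sketch (the `state` dict stands in for the `pitch_workflow_state` row; the write-back to Postgres is elided as a comment):

```python
# Ordered LLM/scoring steps whose outputs are checkpointed.
STEPS = ["extractor_output", "scorer_output", "decision"]

def run_with_checkpoints(state: dict, runners: dict) -> dict:
    """Run each step only if its output is not already persisted,
    so a failure at step N never re-runs (and re-bills) steps < N."""
    for step in STEPS:
        if state.get(step) is None:       # checkpoint miss: run the step
            state[step] = runners[step](state)
            # persist(state)  # write the row back to Postgres here
    return state
```

On restart after a crash, you reload the row by `message_id` and call the same function; completed steps are no-ops.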

Observability Plan

Metrics:

  • Pitches/day ingested
  • Auto-decline rate, auto-book rate, human-review rate
  • Mean confidence score
  • Mean latency from ingest to final action
  • Cost per pitch
  • Wrong-decision rate (computed weekly from your audit)

Logs at each hand-off: the input + output of each step, plus timing.

Alerts:

  • Workflow crash (PagerDuty / Slack DM)
  • Auto-decline rate >70% for 3 days (drift detector — your taste may be miscalibrated in the prompt)
  • Confidence distribution shifts (mean confidence drops 20% week-over-week)
  • Human-review queue >5 items for >2 hours
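The two drift alerts reduce to a small check over rolling metrics. A sketch with the thresholds assumed above (70% for 3 days; 20% week-over-week confidence drop) baked in as constants:

```python
def drift_alerts(daily_auto_decline_rates: list[float],
                 weekly_mean_confidence: list[float]) -> list[str]:
    """Evaluate the two drift alerts against rolling metric windows."""
    alerts = []
    # Taste drift: auto-decline rate above 70% for 3 consecutive days.
    if len(daily_auto_decline_rates) >= 3 and all(
            r > 0.70 for r in daily_auto_decline_rates[-3:]):
        alerts.append("auto-decline rate >70% for 3 days")
    # Confidence drift: mean confidence dropped >20% week-over-week.
    if len(weekly_mean_confidence) >= 2:
        prev, cur = weekly_mean_confidence[-2:]
        if cur < prev * 0.80:
            alerts.append("mean confidence dropped >20% week-over-week")
    return alerts
```

Run it from a daily cron (or an n8n schedule trigger) against the audit table and post any non-empty result to Slack.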

Platform-Specific Implementation (n8n + Postgres + small Python service)

n8n workflow nodes (high level):

1. Webhook trigger (Gmail push)

2. HTTP Request → Python service `/extract` (Extractor Agent)

3. HTTP Request → Python service `/score` (Topic Match Scorer)

4. HTTP Request → Python service `/decide` (Decision Synthesizer)

5. Switch node (routing)

6a. Gmail send + label (auto-decline)

6b. Slack DM with action buttons → wait for response

7. Calendar create + Gmail send (book)

8. Postgres insert (audit log)

Python service structure (FastAPI, ~150 lines):

/extract  → extractor_agent.run(pitch) → ExtractorOutput
/score    → scorer.run(extractor_output, preferences_md, recent_guests) → ScorerOutput
/decide   → synthesizer_agent.run(extractor + scorer + decline_examples) → Decision

Each agent is a SEPARATE prompt with single responsibility. Prompts live in version-controlled files; preferences live in preferences.md (LLM reads this); recent guests in recent_guests.json.
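Stripped of the web framework, the three endpoints reduce to three single-responsibility callables with typed contracts. A sketch of the contract layer (all class and field names follow the hand-off contracts above; the policy floors in `decide` come from the [3]→[4] contract, and the LLM call is stubbed):

```python
from dataclasses import dataclass, field

@dataclass
class ExtractorOutput:
    guest_name: str
    claimed_expertise: list[str]
    topics_pitched: list[str]
    proof_links: list[str] = field(default_factory=list)

@dataclass
class ScorerOutput:
    topic_overlap_score: float
    redundancy_score: float
    freshness_score: float

@dataclass
class Decision:
    recommendation: str  # 'book' | 'decline' | 'human-review'
    reasoning: str
    draft_reply: str
    confidence: float

def decide(ex: ExtractorOutput, sc: ScorerOutput) -> Decision:
    """Deterministic policy floors first; LLM judgment only after them."""
    if sc.redundancy_score > 0.7 or sc.topic_overlap_score < 0.3:
        return Decision("decline", "policy floor", "template decline", 1.0)
    # In production: call the Decision Synthesizer prompt here.
    return Decision("human-review", "needs LLM judgment", "", 0.5)
```

Wrapping each callable in a FastAPI route is then mechanical, and the dataclasses double as the schema the n8n HTTP Request nodes validate against.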

Anti-Patterns to Avoid

1. Putting your topic preferences in the prompt. They will drift. Keep them in `preferences.md` and inject. When you change preferences you change a file, not a prompt.

2. Single mega-prompt that does extract+score+decide. This is what you have now. It is why you're at 35% wrong decisions. Decompose.

3. No human-in-the-loop for 'book' decisions. Wrong-decline costs you a guest; wrong-book costs you a guest AND credibility. Always have human approval on book.

4. Embedding the recent-guests list in the prompt. Same issue as preferences — drifts and is expensive. Embed once, store, lookup at runtime.

5. Using LLM for the routing switch. Routing is policy, not judgment. Determine it deterministically from confidence + recommendation.

Test & Validation Plan

Step-isolation tests:

  • Extractor: 20 pitches you've manually labeled. Validate output structure and field accuracy.
  • Scorer: same 20 pitches. Validate scores match your intuition (topic_overlap correlation > 0.7 with your manual scoring).
  • Synthesizer: same 20 pitches with extractor+scorer output. Validate recommendation accuracy. Target >85% match with your decisions.

End-to-end smoke test: 5 historical pitches replayed (3 you accepted, 2 you declined). System should produce same final action.

Shadow mode for 1 week: run new system in parallel with old, compare decisions, do not act on new system's output. Migrate when shadow agreement > 90%.
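The shadow-agreement number that gates the cutover is just a per-pitch comparison against the decision you actually wanted. A sketch (the "truth" is your override when you made one, otherwise the old system's decision; empty string marks "no override"):

```python
def shadow_agreement(old_decisions: list[str], new_decisions: list[str],
                     overrides: list[str]) -> float:
    """Fraction of pitches where the shadow system matched the decision
    you actually wanted: your override if present, else the old call."""
    truth = [ov if ov else old for old, ov in zip(old_decisions, overrides)]
    hits = sum(new == t for new, t in zip(new_decisions, truth))
    return hits / len(truth)
```

Computing agreement against overrides rather than against the old system matters: the old system is 35% wrong, so matching it is not the goal.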

Migration Path

1. Deploy new system in shadow mode (read-only, decision logged but not executed).

2. Run for 7 days. Compare with current n8n decisions and your overrides.

3. If shadow agreement with your overrides > 90%: cut over.

4. Keep old workflow disabled-but-deployable for 14 days as rollback.

5. After 14 days clean shutdown, archive old workflow.

Key Takeaways

  • Your current 35% wrong-decision rate is an architecture issue, not a model issue. One LLM call doing four cognitive tasks is the bug. Decomposing fixes most of it.
  • The decision synthesizer needs human-in-the-loop on 'book' decisions. The asymmetric cost of wrong-book vs wrong-decline mandates this.
  • Topic preferences belong in a file, not in the prompt. Drift will kill accuracy otherwise.
  • Use deterministic scoring where possible (embeddings + cosine similarity for topic overlap, vector lookups for guest redundancy). Reserve LLM for genuine judgment.
  • Persist state at every hand-off. A $0.06 LLM call rerun because your calendar API hiccupped is preventable design.
  • Set up the drift detector from day one. Auto-decline rate trending up over weeks is the early-warning signal that your preferences markdown is out of date.

Common use cases

  • Engineer designing a research-then-write-then-publish agent pipeline
  • Solo operator building an inbound-lead-to-CRM workflow with AI enrichment
  • Builder converting a single bloated agent into a multi-agent system
  • Team replacing a fragile n8n workflow with proper agent orchestration
  • Developer designing a customer-onboarding flow with multiple handoffs
  • PM evaluating whether a workflow needs multi-agent architecture or one good agent

Best AI model for this

Claude Opus 4. Workflow design requires reasoning about state transitions, failure modes, and observability — exactly Claude's strengths. ChatGPT GPT-5 second-best for shorter workflows (≤5 steps).

Pro tips

  • If your workflow has more than 5 steps, you almost certainly need a multi-agent design. Single agents past 5 sequential steps lose reliability fast.
  • Each hand-off needs an explicit contract: what data, what format, what guarantees. Implicit hand-offs are where workflows break silently.
  • Always design failure recovery FIRST. The happy path is easy. What happens when step 4 returns a partially-correct result is where the design earns its money.
  • Checkpoint between expensive steps. If step 7 hits a $2 LLM call and step 9 fails, you do not want to redo step 7. Persist intermediate state.
  • One agent per role, not one agent per task. A 'research agent' that does 4 different research tasks is fine; a 'do everything' agent that researches and writes and publishes is not.
  • Observability hooks belong at hand-offs, not inside agents. Log the input/output of each step; do not try to instrument every internal tool-call.
  • For >10-step workflows, consider a workflow engine (Temporal, Inngest, n8n) wrapping LLM calls — not pure-LLM orchestration. The orchestration layer wants determinism the LLM cannot give.

Customization tips

  • Be specific about scale. A 12-pitches/day workflow needs different architecture than a 1200-pitches/day workflow.
  • List ALL integrations upfront, including read-only ones (preferences files, historical data). The architecture depends on what's queryable vs what's prompt-injected.
  • Specify the asymmetric cost of failure types. 'A wrong-book costs more than a wrong-decline' shapes the human-in-the-loop policy fundamentally.
  • If you have an existing system that's failing, describe the failure pattern (e.g. '35% wrong decisions', and of which type), not just 'it doesn't work'. The Original calibrates fixes to specific failure modes.
  • For workflows with >10 steps, ask explicitly for the 'Production Workflow Mode' variant — adds error budgets and on-call runbook.
  • Re-run quarterly. As your preferences, scale, or platform changes, the workflow architecture needs to evolve. Keep the design doc as a living artifact.

Variants

Subagent Mode

For Claude Code or similar agentic IDEs — designs the parent-agent + subagent architecture with todo-list lifecycle.

n8n Hybrid Mode

For n8n/Zapier+LLM workflows — picks where LLMs add value and where deterministic logic should win.

LangGraph Mode

For Python LangGraph — designs the state machine and the conditional edges.

Production Workflow Mode

Adds SLO targets, error budgets, on-call runbook, and rollback plan. For workflows running on customer-facing infrastructure.

Frequently asked questions

How do I use the Multi-Step Agent Workflow Designer prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Multi-Step Agent Workflow Designer?

Claude Opus 4. Workflow design requires reasoning about state transitions, failure modes, and observability — exactly Claude's strengths. ChatGPT GPT-5 second-best for shorter workflows (≤5 steps).

Can I customize the Multi-Step Agent Workflow Designer prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: if your workflow has more than 5 steps, you almost certainly need a multi-agent design (single agents past 5 sequential steps lose reliability fast), and each hand-off needs an explicit contract (what data, what format, what guarantees), because implicit hand-offs are where workflows break silently.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals