⚡ Promptolis Original · AI Agents & Automation
🪜 Multi-Step Agent Workflow Designer
Designs reliable multi-step agentic workflows with explicit hand-offs, failure recovery, and observable checkpoints — instead of one giant agent that breaks at step 17.
Why this is epic
Most agent failures in production are NOT model failures — they are workflow architecture failures. One mega-agent doing 12 steps will fail at step 7 and you will not know why. This Original designs the workflow as an explicit graph with hand-offs.
Outputs a step-by-step DAG: each node has a single responsibility, clear input/output contract, failure-recovery policy, and observability hook. No more debugging 'the agent did something weird at some point.'
Calibrated to the 2026 reality: subagents in Claude Code, multi-agent systems in OpenAI Apps SDK, n8n+LLM workflows, LangGraph. Picks the right architecture for YOUR scale, not a generic template.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<workflow-goal>Convert inbound podcast-guest pitches (sent to my pitch@ inbox) into either (a) a calendar booking with the guest if they fit, or (b) a polite decline.</workflow-goal>
<inputs>Email pitch lands in pitch@ inbox via SMTP webhook → JSON to my workflow.</inputs>
<outputs>Either: calendar event created in my Google Calendar with guest invited + thread reply with confirmation. Or: thread reply with polite decline + tag the email 'declined-2026' in Gmail.</outputs>
<current-design>Single n8n workflow with one OpenAI node that takes the email body, generates a 'fit score' 0-100, then conditionally routes. About 35% of decisions are wrong — guests I'd want get declined, guests I'd reject get booked.</current-design>
<scale>~12-20 pitches/day. Latency budget: 4 hours from email arrival to reply. Cost budget: ~$0.10 per pitch.</scale>
<failure-tolerance>If the system makes a wrong DECLINE I lose a guest opportunity (bad). If it makes a wrong BOOK I have to manually unbook + apologize (worse). Silent technical failures (workflow stops): I want immediate alert.</failure-tolerance>
<platform-preference>Recommend. I'm comfortable with n8n, can write Python/TypeScript if needed.</platform-preference>
<integrations>Gmail API, Google Calendar API, my podcast topic preferences (markdown file), my recent guests list (sheet), my unavailable dates (calendar).</integrations>
Workflow Design: Podcast pitch triage with multi-agent + human-in-the-loop for high-stakes decisions
Architecture Recommendation
Multi-agent (3 agents) + deterministic orchestration via n8n, with mandatory human-in-the-loop for the 'book' decision.
Your current single-agent design fails because one OpenAI call is doing FOUR distinct cognitive tasks: (1) extract guest claims, (2) compare against your topic preferences, (3) compare against recent guests for redundancy, (4) synthesize a final fit score. Single LLM calls do task #4 well; tasks #2 and #3 require deterministic comparison against your stored data, which the LLM does badly because it hallucinates instead of consulting your actual preferences.
The wrong-decision rate of 35% is consistent with this anti-pattern. After redesign you should see <10% wrong decisions.
The Workflow Graph
[1] Pitch Ingest (deterministic, n8n)
↓ {guest_email, pitch_text, sender_metadata}
[2] Extractor Agent (LLM, ~$0.01)
↓ {guest_name, claimed_expertise[], topics_pitched[], proof_links[], pitch_quality_signals{}}
[3] Topic Match Scorer (deterministic + small LLM, ~$0.02)
↓ {topic_overlap_score, redundancy_score, freshness_score}
[4] Decision Synthesizer Agent (LLM, ~$0.03)
↓ {recommendation: 'book' | 'decline' | 'human-review', reasoning, draft_reply, confidence}
[5] Routing Switch (deterministic)
↓ if 'book' OR confidence < 0.85 → human-review queue (Slack DM to you)
↓ if 'decline' AND confidence ≥ 0.85 → auto-decline path
[6a] Auto-Decline Path (deterministic, ~$0.005)
↓ Send draft_reply, tag email, log decision
[6b] Human-Review Path (waits for your Slack reaction)
↓ ✅ → book path | ❌ → decline path | 🤔 → manual takeover
[7] Book Path (deterministic + LLM for invite text)
↓ Calendar event, send thread reply with proposed times
[8] Audit Logger (deterministic)
↓ Append decision to weekly review sheet for retrospective
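Before the contracts, it helps to see the same graph as plain data. A minimal Python sketch (node names mirror the diagram; the `kind` labels and the structure itself are illustrative, not an n8n export):

```python
# Workflow DAG as data: one node per responsibility, explicit edges.
# 'kind' marks which nodes are LLM calls vs deterministic code vs human.
WORKFLOW = {
    "pitch_ingest": {"kind": "deterministic", "next": ["extractor"]},
    "extractor":    {"kind": "llm",           "next": ["topic_match"]},
    "topic_match":  {"kind": "hybrid",        "next": ["synthesizer"]},
    "synthesizer":  {"kind": "llm",           "next": ["router"]},
    "router":       {"kind": "deterministic", "next": ["auto_decline", "human_review"]},
    "auto_decline": {"kind": "deterministic", "next": ["audit_logger"]},
    "human_review": {"kind": "human",         "next": ["book", "auto_decline"]},
    "book":         {"kind": "hybrid",        "next": ["audit_logger"]},
    "audit_logger": {"kind": "deterministic", "next": []},
}
```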
Hand-Off Contracts
[1]→[2] Pitch Ingest → Extractor Agent
- Input: {guest_email: string, pitch_text: string, sender_domain: string, sender_first_seen: ISO-date}
- Output: {guest_name: string, claimed_expertise: string[], topics_pitched: string[], proof_links: string[], pitch_quality_signals: {has_specific_topic: bool, has_credentials: bool, has_links: bool, length_signal: 'too-short'|'right'|'wall-of-text', personalization_signal: bool}}
- Failure mode: pitch is non-English or too short to extract → return {insufficient_pitch: true} and skip to [6a] auto-decline path with template reply.
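A minimal sketch of this contract as Pydantic models, assuming Pydantic v2 (class names are illustrative; the fields mirror the contract above):

```python
from typing import Literal
from pydantic import BaseModel

class PitchIngest(BaseModel):
    guest_email: str
    pitch_text: str
    sender_domain: str
    sender_first_seen: str  # ISO-8601 date

class PitchQualitySignals(BaseModel):
    has_specific_topic: bool
    has_credentials: bool
    has_links: bool
    length_signal: Literal["too-short", "right", "wall-of-text"]
    personalization_signal: bool

class ExtractorOutput(BaseModel):
    guest_name: str
    claimed_expertise: list[str]
    topics_pitched: list[str]
    proof_links: list[str]
    pitch_quality_signals: PitchQualitySignals
    insufficient_pitch: bool = False  # failure-mode flag from the contract
```

Validating the LLM's JSON against the model at the hand-off is what turns 'malformed JSON' from a silent error into a typed, retryable failure.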
[2]→[3] Extractor Agent → Topic Match Scorer
- Input: extractor output
- Output: {topic_overlap_score: 0-1, redundancy_score: 0-1, freshness_score: 0-1, matched_preferences: string[], conflicting_recent_guests: string[]}
- topic_overlap_score: cosine similarity between topics_pitched embeddings and your preferences markdown embeddings. Deterministic.
- redundancy_score: 1 - max(similarity to last 30 guests). Deterministic.
- freshness_score: small LLM call ("is this topic something the audience has heard 5x in the last year?"). ~$0.005.
- Failure mode: embedding service down → fall back to LLM-only scoring + log degraded mode.
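A sketch of the deterministic half of the scorer, assuming the OpenAI embeddings SDK (any embedding provider works; aggregating by max over topic/preference pairs is one reasonable choice, not a spec):

```python
import numpy as np
from openai import OpenAI  # assumption: the official openai SDK

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings. The model choice here is an assumption."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def topic_overlap_score(topics_pitched: list[str], preference_lines: list[str]) -> float:
    """Max cosine similarity between any pitched topic and any preference line.
    Deterministic given fixed embeddings; no LLM judgment involved."""
    a = embed(topics_pitched)        # shape: (n_topics, dim)
    b = embed(preference_lines)      # shape: (n_prefs, dim)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).max())
```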
[3]→[4] Topic Match Scorer → Decision Synthesizer
- Input: extractor output + match scorer output
- Output: {recommendation: 'book'|'decline'|'human-review', reasoning: string, draft_reply: string, confidence: 0-1}
- Decision rules baked into the prompt: if redundancy_score > 0.7 → decline regardless. If topic_overlap_score < 0.3 → decline regardless. Else use LLM judgment.
[4]→[5]→[6a/6b] Routing
- Deterministic switch. NOT an LLM. The LLM proposes; the router executes policy.
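The whole router can be a few lines of plain code; a sketch with the thresholds from the rules above:

```python
def route(recommendation: str, confidence: float) -> str:
    """Deterministic policy switch for step [5]: the LLM proposes,
    this function executes policy. No model call, fully auditable."""
    if recommendation == "book" or confidence < 0.85:
        return "human-review"
    if recommendation == "decline":
        return "auto-decline"  # confidence >= 0.85 guaranteed by the branch above
    return "human-review"      # 'human-review' recommendation or anything unexpected
```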
Failure Recovery Policy
| Step | Failure | Action |
|---|---|---|
| 1 | Webhook misformatted | Move email to manual-review folder, alert via Slack |
| 2 | LLM timeout | Retry once with exponential backoff, then escalate to human-review |
| 2 | LLM returns malformed JSON | Retry with stricter prompt, then human-review |
| 3 | Embedding service down | Degrade to LLM-only scoring, alert (non-blocking) |
| 4 | LLM returns confidence < 0.85 | Auto-route to human-review queue (this is a feature, not a bug) |
| 5 | Routing logic crashes | Alert, hold email in pending state |
| 6a | Email send fails | Retry 3×, then human-review |
| 6b | No human response in 4 hours | Send escalation Slack DM with default action recommendation |
| 7 | Calendar API fails | Send manual booking link as fallback |
| 8 | Logger fails | Non-blocking warn |
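The LLM-step retries in the table reduce to one small helper; a sketch (the escalation itself lives in the caller):

```python
import time

def call_with_backoff(fn, max_retries: int = 1, base_delay: float = 2.0):
    """Run fn; on failure retry with exponential backoff, then re-raise.
    The caller catches the final exception and routes the pitch to the
    human-review queue instead of crashing the workflow."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```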
Checkpoint Strategy
Persist after each step in a single Postgres table `pitch_workflow_state`, keyed by email message ID:
{
message_id, ingested_at, extractor_output, scorer_output,
decision, routing_decision, human_action, final_action,
audit_trail[]
}
If step N fails, the workflow can resume from step N-1's persisted output. No LLM step ever re-runs unless explicitly requested.
Cost saved: with checkpoints, retrying after a failure at step 7 costs $0 in extra LLM spend (vs ~$0.06 if the whole pipeline had to re-run from step 1).
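A sketch of the checkpoint/resume pair, assuming psycopg2 and a unique constraint on message_id (column names from the record above):

```python
import json
import psycopg2  # assumption: psycopg2 against the Postgres table above

COLUMNS = ["extractor_output", "scorer_output", "decision",
           "routing_decision", "human_action", "final_action"]

def checkpoint(conn, message_id: str, column: str, value) -> None:
    """Upsert one step's output into pitch_workflow_state."""
    assert column in COLUMNS  # code-defined names only, never user input
    with conn, conn.cursor() as cur:
        cur.execute(
            f"INSERT INTO pitch_workflow_state (message_id, {column}) "
            f"VALUES (%s, %s) "
            f"ON CONFLICT (message_id) DO UPDATE SET {column} = EXCLUDED.{column}",
            (message_id, json.dumps(value)),
        )

def resume_at(row: dict) -> str:
    """First column with no persisted value is the step to resume at;
    earlier (already-paid-for) LLM outputs are never recomputed."""
    for col in COLUMNS:
        if row.get(col) is None:
            return col
    return "done"
```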
Observability Plan
Metrics:
- Pitches/day ingested
- Auto-decline rate, auto-book rate, human-review rate
- Mean confidence score
- Mean latency from ingest to final action
- Cost per pitch
- Wrong-decision rate (computed weekly from your audit)
Logs at each hand-off: the input + output of each step, plus timing.
Alerts:
- Workflow crash (PagerDuty / Slack DM)
- Auto-decline rate >70% for 3 days (drift detector — your taste may be miscalibrated in the prompt)
- Confidence distribution shifts (mean confidence drops 20% week-over-week)
- Human-review queue >5 items for >2 hours
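The drift detector is a small query over the audit log; a sketch (assumes each audit row carries a `day` date and a `final_action` string):

```python
from datetime import date, timedelta

def decline_drift_alert(audit_rows: list[dict],
                        days: int = 3, threshold: float = 0.70) -> bool:
    """Fire if the auto-decline rate exceeded `threshold` on each of the
    last `days` days. A sustained high rate usually means the preferences
    file is stale, not that the pitches got worse."""
    for offset in range(days):
        day = date.today() - timedelta(days=offset)
        day_rows = [r for r in audit_rows if r["day"] == day]
        if not day_rows:
            return False  # missing data for a day: no alert
        rate = sum(r["final_action"] == "auto-decline" for r in day_rows) / len(day_rows)
        if rate <= threshold:
            return False
    return True
```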
Platform-Specific Implementation (n8n + Postgres + small Python service)
n8n workflow nodes (high level):
1. Webhook trigger (Gmail push)
2. HTTP Request → Python service `/extract` (Extractor Agent)
3. HTTP Request → Python service `/score` (Topic Match Scorer)
4. HTTP Request → Python service `/decide` (Decision Synthesizer)
5. Switch node (routing)
6a. Gmail send + label (auto-decline)
6b. Slack DM with action buttons → wait for response
7. Calendar create + Gmail send (book)
8. Postgres insert (audit log)
Python service structure (FastAPI, ~150 lines):
/extract → extractor_agent.run(pitch) → ExtractorOutput
/score → scorer.run(extractor_output, preferences_md, recent_guests) → ScorerOutput
/decide → synthesizer_agent.run(extractor + scorer + decline_examples) → Decision
Each agent is a SEPARATE prompt with single responsibility. Prompts live in version-controlled files; preferences live in preferences.md (LLM reads this); recent guests in recent_guests.json.
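A minimal FastAPI sketch of one endpoint, assuming FastAPI with Pydantic v2 (the Pitch fields and the run_extractor stub are placeholders for your own contract and prompt):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Pitch(BaseModel):
    guest_email: str
    pitch_text: str
    sender_domain: str

def run_extractor(pitch: dict) -> dict:
    """Placeholder for the Extractor Agent: one prompt, one LLM call."""
    raise NotImplementedError  # wire up your model client here

@app.post("/extract")
def extract(pitch: Pitch) -> dict:
    # One endpoint per agent keeps each prompt single-responsibility and
    # makes hand-off logging trivial: log input/output at this boundary.
    return run_extractor(pitch.model_dump())
```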
Anti-Patterns to Avoid
1. Putting your topic preferences in the prompt. They will drift. Keep them in `preferences.md` and inject. When you change preferences you change a file, not a prompt.
2. Single mega-prompt that does extract+score+decide. This is what you have now. It is why you're at 35% wrong decisions. Decompose.
3. No human-in-the-loop for 'book' decisions. Wrong-decline costs you a guest; wrong-book costs you a guest AND credibility. Always have human approval on book.
4. Embedding the recent-guests list in the prompt. Same issue as preferences — drifts and is expensive. Embed once, store, lookup at runtime.
5. Using LLM for the routing switch. Routing is policy, not judgment. Determine it deterministically from confidence + recommendation.
Test & Validation Plan
Step-isolation tests:
- Extractor: 20 pitches you've manually labeled. Validate output structure and field accuracy.
- Scorer: same 20 pitches. Validate scores match your intuition (topic_overlap correlation > 0.7 with your manual scoring).
- Synthesizer: same 20 pitches with extractor+scorer output. Validate recommendation accuracy. Target >85% match with your decisions.
End-to-end smoke test: replay 5 historical pitches (3 you accepted, 2 you declined); the system should produce the same final actions.
Shadow mode for 1 week: run new system in parallel with old, compare decisions, do not act on new system's output. Migrate when shadow agreement > 90%.
Migration Path
1. Deploy new system in shadow mode (read-only, decision logged but not executed).
2. Run for 7 days. Compare with current n8n decisions and your overrides.
3. If shadow agreement with your overrides > 90%: cut over.
4. Keep old workflow disabled-but-deployable for 14 days as rollback.
5. After 14 days clean shutdown, archive old workflow.
Key Takeaways
- Your current 35% wrong-decision rate is an architecture issue, not a model issue. One LLM call doing four cognitive tasks is the bug. Decomposing fixes most of it.
- The decision synthesizer needs human-in-the-loop on 'book' decisions. The asymmetric cost of wrong-book vs wrong-decline mandates this.
- Topic preferences belong in a file, not in the prompt. Drift will kill accuracy otherwise.
- Use deterministic scoring where possible (embeddings + cosine similarity for topic overlap, vector lookups for guest redundancy). Reserve the LLM for genuine judgment.
- Persist state at every hand-off. A $0.06 LLM call rerun because your calendar API hiccupped is preventable design.
- Set up the drift detector from day one. Auto-decline rate trending up over weeks is the early-warning signal that your preferences markdown is out of date.
Common use cases
- Engineer designing a research-then-write-then-publish agent pipeline
- Solo operator building an inbound-lead-to-CRM workflow with AI enrichment
- Builder converting a single bloated agent into a multi-agent system
- Team replacing a fragile n8n workflow with proper agent orchestration
- Developer designing a customer-onboarding flow with multiple handoffs
- PM evaluating whether a workflow needs multi-agent architecture or one good agent
Best AI model for this
Claude Opus 4. Workflow design requires reasoning about state transitions, failure modes, and observability — exactly Claude's strengths. ChatGPT GPT-5 is second-best for shorter workflows (≤5 steps).
Pro tips
- If your workflow has more than 5 steps, you almost certainly need a multi-agent design. Single agents past 5 sequential steps lose reliability fast.
- Each hand-off needs an explicit contract: what data, what format, what guarantees. Implicit hand-offs are where workflows break silently.
- Always design failure recovery FIRST. The happy path is easy. What happens when step 4 returns a partially-correct result is where the design earns its money.
- Checkpoint between expensive steps. If step 7 hits a $2 LLM call and step 9 fails, you do not want to redo step 7. Persist intermediate state.
- One agent per role, not one agent per task. A 'research agent' that does 4 different research tasks is fine; a 'do everything' agent that researches and writes and publishes is not.
- Observability hooks belong at hand-offs, not inside agents. Log the input/output of each step; do not try to instrument every internal tool-call.
- For >10-step workflows, consider a workflow engine (Temporal, Inngest, n8n) wrapping LLM calls — not pure-LLM orchestration. The orchestration layer wants determinism the LLM cannot give.
Customization tips
- Be specific about scale. A 12-pitches/day workflow needs different architecture than a 1200-pitches/day workflow.
- List ALL integrations upfront, including read-only ones (preferences files, historical data). The architecture depends on what's queryable vs what's prompt-injected.
- Specify the asymmetric cost of failure types. 'A wrong-book costs more than a wrong-decline' shapes the human-in-the-loop policy fundamentally.
- If you have an existing system that's failing, describe the failure pattern (35% wrong decisions of what type) — not just 'it doesn't work'. The Original calibrates fixes to specific failure modes.
- For workflows with >10 steps, ask explicitly for the 'Production Workflow Mode' variant — adds error budgets and on-call runbook.
- Re-run quarterly. As your preferences, scale, or platform changes, the workflow architecture needs to evolve. Keep the design doc as a living artifact.
Variants
Subagent Mode
For Claude Code or similar agentic IDEs — designs the parent-agent + subagent architecture with todo-list lifecycle.
n8n Hybrid Mode
For n8n/Zapier+LLM workflows — picks where LLMs add value and where deterministic logic should win.
LangGraph Mode
For Python LangGraph — designs the state machine and the conditional edges.
Production Workflow Mode
Adds SLO targets, error budgets, on-call runbook, and rollback plan. For workflows running on customer-facing infrastructure.
Frequently asked questions
How do I use the Multi-Step Agent Workflow Designer prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Multi-Step Agent Workflow Designer?
Claude Opus 4. Workflow design requires reasoning about state transitions, failure modes, and observability — exactly Claude's strengths. ChatGPT GPT-5 is second-best for shorter workflows (≤5 steps).
Can I customize the Multi-Step Agent Workflow Designer prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: if your workflow has more than 5 steps, you almost certainly need a multi-agent design (single agents lose reliability fast past 5 sequential steps), and every hand-off needs an explicit contract (what data, what format, what guarantees), because implicit hand-offs are where workflows break silently.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.