⚡ Promptolis Original · AI Agents & Automation

🎙️ Voice Agent Conversation Flow Architect

Designs realistic voice-agent dialogues for Vapi, Retell, Bland, ElevenLabs Conversational AI — with proper turn-taking, interruption handling, and the failure recovery that voice agents need but text agents don't.

⏱️ 5 min to set up 🤖 ~120 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Voice agents fail differently than text. Latency budgets are 800ms not 8s. Interruptions are constant. Background noise, MFA codes spoken aloud, accents, mumbled speech — none of these matter for chat agents and all of them matter for voice.

This Original designs the conversation flow with explicit turn-taking, interruption logic, fallback scripts, and the 6 voice-specific failure modes (no-speech-detected, partial-transcript, accent-misrecognition, talk-over-user, dead-air, escalation-trigger).

Calibrated to the 2026 voice-agent stack: Vapi, Retell, Bland, ElevenLabs Conversational AI. Picks the right platform for your use case, not a generic template.

The prompt

<role>
You are a voice-agent conversation designer with 4+ years building production voice agents on Vapi, Retell, Bland, and ElevenLabs Conversational AI. You have shipped 20+ voice agents handling >10K calls/day combined. You think in turn-taking, latency budgets, and barge-in semantics. You are direct. You will tell a builder their flow lacks dead-air timeouts, has no interruption handling, or designs for 2s latency budgets that will feel broken. You refuse to recommend 'just retry' for ASR failures; every misrecognition needs a recovery script.
</role>

<principles>
1. Six voice-specific failure modes: no-speech-detected, partial-transcript, accent-misrecognition, talk-over-user, dead-air, escalation-trigger. Design for each.
2. Latency budget is 800ms total round-trip. Voice agents at 2s feel broken.
3. Interruptions (barge-in) are normal, not exceptional. Plan first-class handling.
4. Dead-air timeouts: 4-5s silence triggers re-engagement.
5. Number sequences (phone, address, amounts) ALWAYS get read-back confirmation.
6. Hard-fail to human at 3 misunderstandings. Don't loop indefinitely.
7. ASR confidence drives recovery. Low confidence ≠ ignore; explicit 'I didn't catch that.'
</principles>

<input>
<call-purpose>{end-to-end goal of the call: what does success look like}</call-purpose>
<direction>{inbound / outbound}</direction>
<expected-call-volume>{calls/day}</expected-call-volume>
<typical-call-length-target>{seconds}</typical-call-length-target>
<integrations>{CRM, calendar, payment, ID verification, etc.}</integrations>
<compliance-needs>{HIPAA, PCI, GDPR, TCPA, etc.}</compliance-needs>
<escalation-policy>{when does the agent hand off to a human, and to whom}</escalation-policy>
<platform-preference>{Vapi / Retell / Bland / ElevenLabs / 'recommend'}</platform-preference>
<known-failure-modes>{what's gone wrong before, if existing agent}</known-failure-modes>
</input>

<output-format>
# Voice Agent Flow: [call purpose]

## Platform Recommendation
Which voice-agent platform fits this use case. Why this rather than alternatives.

## The Conversation Flow
State-machine diagram or numbered states. Each state: agent line, expected user response patterns, transitions, timeout behavior.

## Turn-Taking & Interruption Handling
How the agent handles barge-in. When to yield, when to continue, when to acknowledge interruption.

## Latency Budget Allocation
Where the 800ms goes: ASR, LLM, TTS, network. Specific platform-config recommendations.

## The 6 Failure Mode Defenses
For each: detection trigger, recovery script, escalation if recovery fails.

## Number Sequence Handling
For any phone numbers, addresses, dollar amounts: capture pattern, read-back script, confirmation logic.

## Compliance & Disclosures
When and how to deliver required disclosures. Recording-consent, ID verification, payment compliance.

## Escalation Paths
Triggers, handoff scripts, what context to pass to the human, fallback if no human is available.

## Implementation Skeleton
Platform-specific configuration. State machine or graph definition.

## Test Plan
Required test scenarios including failure simulations. Specific calls to make.

## Voice Selection & Personality
Voice choice (gender, accent, pace, formality). Why this matches your use case.

## Key Takeaways
4-6 bullets for the team's voice-agent playbook.
</output-format>

<auto-intake>
If input incomplete: ask for purpose, direction, volume, length target, integrations, compliance, escalation, platform preference, known failures.
</auto-intake>

Now, design the voice-agent flow:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<call-purpose>Inbound calls to our dental practice. Receptionist is overwhelmed. Want voice agent to handle: book new appointment, reschedule existing, cancel, ask about emergency availability. Hand off to humans for any insurance question or anything outside scope.</call-purpose>
<direction>inbound</direction>
<expected-call-volume>~80 calls/day, peak hours 8-10am and 4-6pm</expected-call-volume>
<typical-call-length-target>3-4 minutes for booking; <60s for cancel</typical-call-length-target>
<integrations>Dentrix practice management (has API), Twilio for handoff to human, our calendar</integrations>
<compliance-needs>HIPAA — we collect patient name, DOB, reason for visit. Do NOT collect insurance details over voice agent.</compliance-needs>
<escalation-policy>Hand off to receptionist (Mon-Fri 8-5) for: insurance questions, complex scheduling, emergency dental pain (escalate immediately), any caller who says 'human' or 'agent'. Outside business hours: take message, promise callback.</escalation-policy>
<platform-preference>Recommend. Comfortable with both Vapi and Retell from past projects.</platform-preference>
<known-failure-modes>None yet — this is a new build.</known-failure-modes>
🤖 Output

Voice Agent Flow: Inbound Dental Practice Reception

Platform Recommendation

Vapi. Reasons specific to your use case:

  • Best-in-class latency for the 800ms budget you'll need (Vapi typically lands 600-900ms).
  • First-class Twilio handoff (matches your existing Twilio setup).
  • Native HIPAA-compliant deployment (Retell is HIPAA-capable but requires more config).
  • Their function-calling for Dentrix API integration is more robust than Bland's at this maturity.

Retell would also work well; it's a close second. Bland is better suited for outbound. ElevenLabs Conversational AI has the best voices but weaker function-calling for your Dentrix integration.

The Conversation Flow

[GREET] "Thanks for calling Dr. [name]'s office. This is [voice agent name]. I can help you book, reschedule, or cancel an appointment, or check on emergency availability. What can I help with today?"
  ↓
[INTENT] Listen for intent. Branches:
  - 'book' or 'new appointment' → BOOK
  - 'reschedule' or 'change' → RESCHEDULE
  - 'cancel' → CANCEL
  - 'emergency' or 'pain' → EMERGENCY (immediate escalation)
  - 'human', 'agent', 'receptionist' → HANDOFF
  - 'insurance' → HANDOFF (out of scope)
  - unclear → CLARIFY (max 2 attempts then HANDOFF)
  ↓

[BOOK]
  → ask first/last name (read-back confirm)
  → ask DOB (read-back confirm — HIPAA flag: stored briefly, transmitted only to Dentrix)
  → ask reason for visit (cleaning / consultation / pain / specific procedure)
  → query Dentrix for existing patient OR new
  → if existing: confirm phone on file or update
  → if new: capture phone (read-back confirm)
  → query available slots based on reason (cleanings and procedures use different scheduling blocks)
  → offer 2 slots: 'I have Tuesday at 10am or Thursday at 2pm. Which works?'
  → if neither: 'I have several options later this week. Could you tell me a day that works?'
  → confirm booking back: 'So that's [name], on [day], at [time], for [reason]. Right?'
  → on yes: book in Dentrix, deliver confirmation: 'You're booked. We'll text a reminder. Anything else?'
  → on no: re-prompt for the disagreed field

[RESCHEDULE]
  → ask name + DOB to look up
  → confirm existing appointment: 'I see you on [day] at [time]. Is that the one?'
  → ask preferred new time
  → check availability, offer 2 options
  → confirm + execute reschedule in Dentrix
  → confirmation message + text reminder

[CANCEL]
  → ask name + DOB to look up
  → confirm existing appointment
  → ask cancellation reason (optional, for our records)
  → execute cancel in Dentrix
  → 'Got it, [name]. Your appointment on [day] is cancelled. Would you like to reschedule now or call back?'

[EMERGENCY]
  → 'I'm transferring you to a real person right now. Please stay on the line.'
  → immediate Twilio handoff to receptionist
  → if outside hours: 'Our office is closed but our emergency line is [number]. For dental pain, please call there now.'

[HANDOFF]
  → 'Let me transfer you to one of our team. One moment.'
  → Twilio handoff with context tag (current state, what's been collected)
  → if outside hours: 'We're closed right now. I'll have someone call you back. Can I take your name and number?' (capture, read-back, log to ticket queue, end call)

[CLARIFY]
  → 'I didn't quite catch that. Are you calling to book, reschedule, cancel, or about an emergency?'
  → 2 attempts max → HANDOFF
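
The INTENT and CLARIFY states above can be sketched as a small router. This is only an illustration under stated assumptions: the keyword lists are lifted from the branches above, the class and function names are hypothetical, and a production agent would more likely let the LLM classify intent than match keywords. Emergency keywords are checked first so they win over any other match.

```python
import re
from dataclasses import dataclass

# Ordered by priority: emergency first, then explicit handoff requests.
# Keyword matching is an illustrative stand-in for LLM intent classification.
INTENT_KEYWORDS = [
    ("EMERGENCY", ("emergency", "pain")),
    ("HANDOFF", ("human", "agent", "receptionist", "insurance")),
    ("RESCHEDULE", ("reschedule", "change")),
    ("CANCEL", ("cancel",)),
    ("BOOK", ("book", "new appointment")),
]
MAX_CLARIFY_ATTEMPTS = 2  # "max 2 attempts then HANDOFF"


def _mentions(text: str, phrase: str) -> bool:
    """True if phrase appears in text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


@dataclass
class IntentRouter:
    clarify_attempts: int = 0

    def route(self, transcript: str) -> str:
        """Map one caller utterance to the next state name."""
        text = transcript.lower()
        for state, keywords in INTENT_KEYWORDS:
            if any(_mentions(text, k) for k in keywords):
                self.clarify_attempts = 0
                return state
        # No intent recognized: clarify up to the limit, then hand off.
        if self.clarify_attempts >= MAX_CLARIFY_ATTEMPTS:
            return "HANDOFF"
        self.clarify_attempts += 1
        return "CLARIFY"
```

With this shape, two unclear utterances each produce CLARIFY and the third produces HANDOFF, which lines up with the three-misunderstandings rule.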

Turn-Taking & Interruption Handling

Barge-in policy: Always allow user interruption EXCEPT during the read-back confirmation steps (where interrupting changes which field gets confirmed). On other turns, immediately yield to user when they speak.

Interruption acknowledgment: When user interrupts mid-sentence, agent stops, processes the interruption, and responds. Do NOT add 'sorry, I was speaking' — it adds latency and feels stilted.

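Filler tokens during processing: When a Dentrix function call takes longer than about 500ms, Vapi can play a filler ('Let me look that up, one second'). Configure this for any function call expected to exceed 500ms; waiting a full 800ms would use up the entire response budget before the caller hears anything.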

Yield priority: If user starts speaking while agent is mid-sentence, yield within 100ms. This is the platform default for Vapi; verify it's enabled.
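
The barge-in policy above reduces to a small decision function. A minimal sketch, assuming hypothetical state names for the read-back steps (the flow above does not name them individually):

```python
# Hypothetical names for the read-back confirmation states.
READ_BACK_STATES = frozenset(
    {"CONFIRM_NAME", "CONFIRM_DOB", "CONFIRM_PHONE", "CONFIRM_BOOKING"}
)


def should_yield(state: str, agent_speaking: bool, user_speaking: bool) -> bool:
    """Decide whether the agent should stop talking and listen.

    The agent yields on any overlap, except during read-back
    confirmations, where it finishes the read-back first.
    """
    if not (agent_speaking and user_speaking):
        return False  # no overlap, nothing to yield
    return state not in READ_BACK_STATES
```

In practice the platform's own interruption settings do the actual audio cut-off; a function like this only captures which states should opt out.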

Latency Budget Allocation

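Total target: <900ms perceived response. The stage estimates below sum to 700-1150ms, so the target is met only when most stages land near the low end of their ranges.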

  • ASR (Vapi default Deepgram): 200-300ms
  • LLM call (Claude Sonnet via Vapi, with cached system prompt): 300-500ms
  • TTS (ElevenLabs streaming through Vapi): 150-250ms (first audio chunk)
  • Network overhead: 50-100ms

Critical optimizations:

  • Use prompt caching on the system prompt + flow definition (reduces every turn's LLM time by 200-400ms)
  • Use streaming TTS (don't wait for full sentence to start playback)
  • Parallel function calls when possible (e.g., 'lookup patient' AND 'check availability' fire simultaneously when both are determinable)
  • For Dentrix API calls expected to take >500ms, play filler token while waiting
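
To sanity-check an allocation like the one above, it helps to add up the low and high ends of each stage and compare them with the target. The sketch below uses the figures from this section; the dictionary keys and function name are illustrative.

```python
# (low_ms, high_ms) per stage, copied from the allocation above.
BUDGET_MS = {
    "asr": (200, 300),
    "llm": (300, 500),
    "tts_first_chunk": (150, 250),
    "network": (50, 100),
}
TARGET_MS = 900


def total_range(budget: dict) -> tuple:
    """Return (best_case_ms, worst_case_ms) across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


best, worst = total_range(BUDGET_MS)   # 700, 1150
headroom = TARGET_MS - worst           # -250: worst case overshoots the target
```

A negative headroom is the signal to lean on the optimizations listed above (prompt caching, streaming TTS, parallel function calls) rather than to assume every stage will hit its low end.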

The 6 Failure Mode Defenses

1. No-speech-detected (caller says nothing for >5s):

  • Trigger: Vapi VAD reports no audio above threshold for 5s
  • Recovery: 'Are you still there? I can help with booking, rescheduling, or cancelling.'
  • 2nd timeout (after 5 more seconds): 'I'll have to disconnect. Please call back when you're ready.'
  • End call gracefully

2. Partial-transcript (ASR confidence below threshold):

  • Trigger: ASR confidence <0.6 on the user's response
  • Recovery: 'I'm sorry, the line cut out. Could you repeat what you just said?'
  • 2nd low-confidence in same state: HANDOFF

3. Accent-misrecognition (caller has strong accent, ASR producing nonsense):

  • Trigger: 2 consecutive turns where the LLM cannot map the transcript to a valid intent
  • Recovery: 'I'm having a little trouble understanding. Let me transfer you to one of our team.' → HANDOFF immediately. Do not loop trying.

4. Talk-over-user (agent kept talking while user tried to interrupt):

  • Trigger: detection of significant overlap in audio streams (Vapi can flag this)
  • Recovery: 'Sorry — go ahead, what were you saying?' — agent yields fully.

5. Dead-air (silence on both sides for >4s mid-conversation):

  • Trigger: VAD reports no audio from caller for 4s, agent has nothing queued
  • Recovery: 'Just to make sure I have you — what's a good day for the appointment?' (re-anchor on current state's question)
  • 2nd dead-air: 'Are you still with me?'
  • 3rd: graceful end-call with offer to call back

6. Escalation-trigger (caller says key phrases):

  • Trigger: 'human', 'agent', 'receptionist', 'transfer me', 'real person', 'manager', 'I want to speak to', 'this is ridiculous', 'I'm frustrated'
  • Also: keyword 'emergency' or 'pain' immediately escalates regardless of state.
  • Action: 'Of course, let me get you to a real person.' → HANDOFF
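
Two of the defenses above (partial-transcript and dead-air) are counter-driven, so they can be sketched as a small per-call policy object. This is a sketch under assumptions: the class, method, and action names are hypothetical, the end-call wording is invented for illustration, and the low-confidence count is kept per state without resetting, which is one reading of "2nd low-confidence in same state".

```python
from collections import defaultdict
from typing import Optional, Tuple

CONFIDENCE_THRESHOLD = 0.6  # failure mode 2 trigger
DEAD_AIR_SECONDS = 4.0      # failure mode 5 trigger


class RecoveryPolicy:
    """Tracks per-call counters for low-confidence turns and dead-air events."""

    def __init__(self) -> None:
        self.low_confidence = defaultdict(int)  # state name -> count
        self.dead_air_events = 0

    def on_transcript(self, state: str, confidence: float) -> str:
        """Return ACCEPT, REPROMPT, or HANDOFF for one ASR result."""
        if confidence >= CONFIDENCE_THRESHOLD:
            return "ACCEPT"
        self.low_confidence[state] += 1
        return "HANDOFF" if self.low_confidence[state] >= 2 else "REPROMPT"

    def on_silence(
        self, silence_s: float, current_question: str
    ) -> Optional[Tuple[str, str]]:
        """Return (action, line) once silence exceeds the dead-air limit."""
        if silence_s <= DEAD_AIR_SECONDS:
            return None
        self.dead_air_events += 1
        if self.dead_air_events == 1:
            # Re-anchor on the current state's question.
            return ("SPEAK", f"Just to make sure I have you: {current_question}")
        if self.dead_air_events == 2:
            return ("SPEAK", "Are you still with me?")
        # Third event: end gracefully (wording here is illustrative).
        return ("END_CALL", "I'll let you go for now. Please call back whenever you're ready.")
```

Keeping the counters in one object per call makes it easy to pass the same numbers along as handoff context if the call escalates.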

Number Sequence Handling

Phone numbers:

  • Capture: 'What's the best phone number for you?'
  • Read-back: 'So that's [read digit by digit, pausing between groups: 5-5-5, 1-2-3, 4-5-6-7]. Did I get that right?'
  • On 'no': capture again, do NOT assume which digits were wrong. Re-capture full number.

DOB:

  • Capture: 'And your date of birth?'
  • Parse to ISO date for Dentrix lookup, but read back in spoken form: 'June 14th, 1985 — is that right?'

Appointment times:

  • Always read back time + day + reason: 'Tuesday April 28th at 10am for a cleaning. Does that work?'
  • On 'wait, what time?': re-state slowly.
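
The phone and DOB read-backs above can be generated from the captured values. A minimal sketch, assuming 10-digit US numbers and hypothetical helper names; the DOB is kept as a date so the ISO form can go to the lookup while the spoken form goes to TTS.

```python
import re
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def phone_read_back(raw: str) -> str:
    """Format a US phone number as spoken digit groups (3-3-4)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    if len(digits) != 10:
        raise ValueError("expected a 10-digit US phone number")
    groups = (digits[:3], digits[3:6], digits[6:])
    return ", ".join("-".join(group) for group in groups)


def ordinal(n: int) -> str:
    """1 -> '1st', 2 -> '2nd', 11 -> '11th', 22 -> '22nd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def dob_read_back(dob: date) -> str:
    """Spoken form of a date of birth, e.g. 'June 14th, 1985'."""
    return f"{MONTHS[dob.month - 1]} {ordinal(dob.day)}, {dob.year}"


dob = date(1985, 6, 14)
iso_for_lookup = dob.isoformat()                   # '1985-06-14'
spoken_phone = phone_read_back("(555) 123-4567")   # '5-5-5, 1-2-3, 4-5-6-7'
spoken_dob = dob_read_back(dob)                    # 'June 14th, 1985'
```

Raising on anything other than ten digits gives the flow a clean trigger for the full re-capture described above, instead of guessing which digits were misheard.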

Compliance & Disclosures

HIPAA:

  • Recording disclosure (assuming you record): in greeting, after 'thanks for calling': 'This call may be recorded for quality.' (one short sentence — keep greeting brief)
  • Data minimization: collect only name, DOB, reason for visit, phone. Do NOT capture insurance details, SSN, payment.
  • If caller volunteers insurance: 'I'd love to help with that, but for insurance questions our team is best. Let me transfer you.' → HANDOFF

Recording consent:

  • One-party-consent states (most US): the disclosure is sufficient.
  • Two-party-consent states (CA, FL, IL, others): explicit verbal consent required. Add: 'I record calls to improve service. Is that okay?' Wait for affirmative.

TCPA (if you ever do callbacks):

  • Outbound callbacks must respect time-of-day restrictions and prior consent.

Escalation Paths

During business hours (Mon-Fri 8-5):

  • Twilio warm-transfer to main reception line
  • Pass context as Twilio call attribute: state at handoff, fields collected, reason for escalation
  • Reception agent sees a brief summary popup before picking up

Outside business hours:

  • Capture name + callback number
  • Log to ticket queue (your CRM)
  • Promise callback by '9am tomorrow' (or 'Monday morning' on weekends)
  • Read back: 'I have [name], number [digits]. We'll call by [time]. Anything else?'
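
The business-hours check and the callback promise above can be sketched as two small functions. Assumptions to note: times are naive local datetimes, the function names are hypothetical, and the "9am today" branch for calls before opening is an addition not covered by the policy above.

```python
from datetime import datetime, time

OPEN, CLOSE = time(8, 0), time(17, 0)  # Mon-Fri 8-5


def is_business_hours(now: datetime) -> bool:
    """True during Mon-Fri 8:00-17:00 local time."""
    return now.weekday() < 5 and OPEN <= now.time() < CLOSE


def callback_promise(now: datetime) -> str:
    """Phrase to use when promising a callback outside business hours.

    Only meaningful when is_business_hours(now) is False.
    weekday(): Monday=0 ... Sunday=6.
    """
    weekday = now.weekday()
    if weekday < 5 and now.time() < OPEN:
        # Assumption: a weekday call before opening gets a same-day callback.
        return "9am today"
    if weekday >= 4:
        # Friday after close, Saturday, or Sunday.
        return "Monday morning"
    return "9am tomorrow"
```

For example, a call on a Tuesday at 6:30pm gets "9am tomorrow", while a Friday 6pm call gets "Monday morning".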

Escalation if no human available (system error):

  • Promise callback within 1 hour
  • Page on-call manager via PagerDuty (this should be rare)

Implementation Skeleton (Vapi)

assistant:
  name: 'Dental Front Desk'
  voice: { provider: '11labs', voice_id: 'rachel' }  # warm, mid-tone, US accent
  model: { provider: 'anthropic', model: 'claude-sonnet-4', temperature: 0.4 }
  systemPrompt: <full state-machine flow + brand voice>
  firstMessage: '<GREET line>'
  responseDelaySeconds: 0.4
  endCallPhrases: ['goodbye', 'bye']
  silenceTimeoutSeconds: 5
  numWordsToInterruptAssistant: 2
  functions:
    - lookup_patient (params: name, dob)
    - check_availability (params: reason, date_range)
    - book_appointment (params: patient_id, slot_id, reason)
    - reschedule_appointment (params: appointment_id, new_slot_id)
    - cancel_appointment (params: appointment_id, reason)
    - transfer_to_human (params: context, reason)
    - log_callback_request (params: name, phone, callback_window)
  twilioPhoneNumber: <your number>
  recordingEnabled: true

Test Plan

Required test calls (manual):

1. Happy path: book new appointment, no interruptions

2. Existing patient reschedule

3. Cancel

4. Emergency: 'I'm having severe pain' → verify immediate escalation

5. 'I want to speak to a human' → verify immediate handoff

6. Insurance question → verify out-of-scope handoff

7. Caller goes silent mid-call (test dead-air)

8. Caller interrupts greeting (test barge-in)

9. Caller speaks with strong non-US accent (test accent failure path)

10. Caller from a noisy environment (test ASR-confidence fallback)

11. Outside business hours scenario

12. Caller says 'never mind' mid-flow (test graceful end)

Required automated tests:

  • 100 simulated calls with varied scripts
  • Latency p95 must stay under 1100ms
  • Handoff success rate >95%
  • Number-sequence read-back error rate <2%
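
The three thresholds above can be checked mechanically after a simulated run. A minimal sketch, assuming the simulation harness (not shown, and hypothetical) reports per-call latencies and simple counts; p95 uses the nearest-rank method.

```python
def p95(samples: list) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_gates(
    latencies_ms: list,
    handoffs_ok: int,
    handoffs_total: int,
    readback_errors: int,
    readbacks_total: int,
) -> dict:
    """Evaluate each release gate; True means the gate passes."""
    return {
        "latency_p95_under_1100ms": p95(latencies_ms) < 1100,
        "handoff_success_over_95pct": handoffs_ok / handoffs_total > 0.95,
        "readback_error_under_2pct": readback_errors / readbacks_total < 0.02,
    }
```

Returning a dictionary rather than a single boolean keeps the failing gate visible in the test report.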

Production rollout:

  • Week 1: 20% of inbound calls (low-traffic hours), monitor
  • Week 2: 50% of inbound calls
  • Week 3: 100% with human override always available

Voice Selection & Personality

Voice: Warm, mid-tone, US accent (ElevenLabs 'Rachel' or equivalent). Female-presenting voice is conventional for dental front desks and tests well. Pace: slightly slower than default (-10%). Formality: professional but friendly.

Why: Dental anxiety is real. The voice shouldn't sound clinical. A warm, slightly slower pace matches the human receptionist's natural style and reduces patient anxiety on the call.

Avoid: synthetic-sounding voices, uptalk, excessive filler ('um', 'uh' — TTS-generated fillers feel uncanny).

Key Takeaways

  • Latency budget is 800-900ms total. Vapi + Claude Sonnet + ElevenLabs streaming + Deepgram ASR can hit this. Use prompt caching aggressively.
  • Build all 6 failure modes from day one. Voice agents don't have the luxury of 'we'll add error handling in v2' — failures during a real call are immediately visible.
  • Number-sequence read-back is non-negotiable. Phone numbers and DOB will be wrong on first capture some percentage of the time.
  • Hard-fail to human at 3 misunderstandings. Looping forever loses callers; transferring saves them.
  • HIPAA data minimization. Collect name, DOB, reason, phone — nothing more. Insurance handoff to humans.
  • Test with non-native speakers AND on bad connections. Both conditions reveal flaws clean testing won't catch.

Common use cases

  • Builder designing an inbound voice agent for appointment booking
  • Solo SaaS founder adding voice support to their product
  • Team building outbound sales-qualification voice agents
  • Developer hardening a voice agent that's currently dropping 30% of calls
  • Builder designing a customer-care voice agent with proper escalation paths

Best AI model for this

Claude Opus 4. Conversation flow design requires reasoning about timing, interruption semantics, and failure recovery, which are exactly Claude's strengths. GPT-5 in ChatGPT is second-best.

Pro tips

  • Design for 800ms total round-trip latency. Voice agents that take 2s to respond feel broken even when output is perfect.
  • Always plan interruption handling. Real users interrupt; barge-in detection isn't optional.
  • Have a 'dead-air timeout.' If neither side has spoken in 4-5s, the agent should re-engage ('Are you still there?'). Silence kills conversations.
  • Number sequences are voice's hardest case. Phone numbers, addresses, dollar amounts. Always ask the user to repeat important numbers AND confirm read-back.
  • Plan for noise. Office background, kids in the room, road noise. ASR confidence scoring should drive 'I didn't catch that, could you repeat?' fallbacks.
  • Hard-fail to human handoff at 3 misunderstandings. Don't loop indefinitely. Frustrated users hang up; transferred users complete.
  • Test with non-native speakers AND on bad phone connections. Both reveal flaws perfect-conditions testing won't.

Customization tips

  • Specify direction (inbound vs outbound) precisely. They have completely different design priorities — outbound has TCPA, voicemail detection, opt-out flows; inbound has greeting, intent detection, escalation.
  • List integrations precisely. Vapi vs Retell vs Bland's function-calling robustness varies; the platform recommendation depends on what you're integrating with.
  • Be specific about compliance. HIPAA, PCI, TCPA, GDPR, state-level consent laws all change the design. Don't say 'standard compliance' — name the regulations.
  • Specify expected call volume and latency-budget tolerance. A 5-call/day internal tool can tolerate 1.5s latency; an 80-call/day customer-facing agent cannot.
  • If existing voice agent: list specific failure modes you've observed. The recovery design is calibrated to YOUR failures.
  • For high-stakes flows (payment, identity, medical), use the High-Stakes Mode variant — adds double-confirmation patterns and aggressive escalation thresholds.

Variants

Inbound Mode

For inbound customer calls — emphasizes greeting, intent detection, ID verification, escalation paths.

Outbound Mode

For outbound calls — emphasizes opt-out handling, voicemail detection, do-not-call list compliance, regulatory disclosures.

IVR Replacement Mode

For replacing an existing IVR — preserves analytics, routing logic, and known customer behaviors from the IVR data.

High-Stakes Mode

For calls involving payment, identity, or medical info — adds compliance disclosures, double-confirmation patterns, and aggressive escalation thresholds.

Frequently asked questions

How do I use the Voice Agent Conversation Flow Architect prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Voice Agent Conversation Flow Architect?

Claude Opus 4. Conversation flow design requires reasoning about timing, interruption semantics, and failure recovery, which are exactly Claude's strengths. GPT-5 in ChatGPT is second-best.

Can I customize the Voice Agent Conversation Flow Architect prompt for my use case?

Yes. Every Promptolis Original is designed to be customized. Key levers: design for 800ms total round-trip latency (voice agents that take 2s to respond feel broken even when the output is perfect), and always plan interruption handling (real users interrupt; barge-in detection isn't optional).
