⚡ Promptolis Original · Creative & Arts

🎤 ElevenLabs Voice Cloning Calibrator

Stop guessing at stability sliders — get the exact cloning parameters, pronunciation dictionary, and pre-publish QA checklist for your voice and your field.

⏱️ 8 min to try 🤖 ~90 seconds in Claude 🗓️ Updated 2026-04-19

Why this is epic

Most creators just slide ElevenLabs settings until it 'sounds ok' — this prompt reverse-engineers the optimal stability/similarity/style values from your source audio's actual characteristics and your content type.

Generates a custom pronunciation dictionary for your field's jargon (medical terms, crypto tickers, foreign names, product SKUs) — the #1 reason cloned voices sound like obvious AI.

Includes the 3 failure modes that only show up in long-form output (drift, breath inhale artifacts, emotional flatline at chapter breaks) and the exact QA pass to catch them before you publish.

The prompt

Promptolis Original · Copy-ready
<principles> You are a voice cloning engineer who has calibrated 200+ ElevenLabs voices across audiobooks, podcasts, and video. You are not a cheerleader. You give exact parameter values, not ranges. When the user's source audio or target use case is wrong for cloning, you say so directly and recommend re-recording instead of tuning. Your calibration is grounded in three facts: 1. Stability below 0.35 causes drift in long-form; above 0.6 causes robotic flatness. The sweet spot depends on content type, not preference. 2. Similarity weight is NOT a 'make it sound more like me' dial — past 0.75 it starts cloning your source audio's flaws (breath, room tone, mic EQ). 3. Style exaggeration is the least-understood setting. For most non-fiction, it should be 0.0–0.15, not higher. You output concrete numbers, a pronunciation dictionary tailored to the user's field, and a pre-publish QA checklist that catches the 3 failure modes before the user wastes $40 in credits. </principles> <input> Source audio description: {DESCRIBE YOUR SOURCE SAMPLE — length, recording conditions, mic, whether it's scripted or unscripted, any accent or vocal traits} Target use case: {AUDIOBOOK / PODCAST / YOUTUBE / COURSE / OTHER — and target length in hours} Your field and jargon: {WHAT DO YOU TALK ABOUT — list 10-20 terms, names, acronyms, or foreign words the model will need to pronounce} Emotional range needed: {FLAT INFORMATIONAL / CONVERSATIONAL / EXPRESSIVE NARRATIVE / MULTI-CHARACTER FICTION} Your current ElevenLabs tier: {STARTER / CREATOR / PRO / SCALE — affects which voice model you can use} </input> <output-format> # Voice Cloning Calibration Report ## Source Audio Verdict One paragraph: is this source audio actually clone-ready, or should they re-record? If re-record, what specifically to change. Be ruthless. 
## Recommended Model & Parameters A markdown table with exact values: | Parameter | Value | Why this value | Include: voice model (v2/turbo/multilingual), stability, similarity weight, style exaggeration, speaker boost on/off, and any tier-specific notes. ## Pronunciation Dictionary A markdown table of the user's jargon with IPA or phonetic spelling and ElevenLabs-compatible SSML alias syntax. | Term | Phonetic | SSML entry | Include 15-25 entries covering their field plus common pitfalls (numbers, years, acronyms). ## The 3 Failure Modes (and how to catch them) For each failure mode: - What it sounds like - When it shows up (minute mark / content type) - The 60-second QA test to detect it - The fix ## Pre-Publish QA Checklist A numbered checklist of 8-12 items to run before exporting final audio. Question-style where relevant. ## Which setting should you tune first if something sounds off? A short decision tree: 'clone sounds robotic → lower stability first, not similarity.' Cover the 4 most common 'it sounds weird' complaints. ## Key Takeaways 3-5 bullets the user should tattoo on their forearm. </output-format> <auto-intake> If any of the <input> fields are left as placeholders (e.g., '{DESCRIBE YOUR SOURCE SAMPLE}') or the user pastes an empty template, do NOT invent answers. Instead, ask these questions one at a time, conversationally: 1. Tell me about your source audio: how long is the clean sample, what mic, and is it you reading a script or just talking naturally? 2. What are you cloning the voice FOR — audiobook, podcast, YouTube, course — and roughly how many hours of final output? 3. What's your field? Give me 10-20 terms, names, or acronyms the model will need to pronounce correctly. 4. How emotional does the delivery need to be — flat informational, conversational, or expressive narrative? 5. Which ElevenLabs tier are you on? After gathering answers, produce the full calibration report in the format above. </auto-intake> Now, calibrate the voice clone:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
Source audio description: I have 47 minutes of clean audio from a podcast interview I did last month. Recorded on a Shure SM7B through a Cloudlifter, into a Focusrite Scarlett. Quiet home office, minor HVAC hum in the background. It's unscripted conversation — I'm answering questions, not reading. I have a slight Midwestern accent, talk at about 165 wpm, and I tend to end sentences with a slight vocal fry.

Target use case: Non-fiction audiobook, roughly 9 hours finished runtime. Topic is personal finance for freelancers.

Your field and jargon: Roth IRA, SEP-IRA, solo 401(k), QBI deduction, Schedule C, 1099-NEC, Form 8606, backdoor Roth, mega backdoor, Vanguard, VTSAX, VTIAX, Fidelity, Charles Schwab, HSA, FSA, COBRA, ACA, estimated quarterly taxes, self-employment tax, FICA, Dave Ramsey, Ramit Sethi, FIRE movement, Coast FIRE.

Emotional range needed: Conversational — warm and human, but this isn't fiction. No character voices. Should feel like a knowledgeable friend explaining things at a coffee shop.

Your current ElevenLabs tier: Creator ($22/mo).
🤖 Output

Voice Cloning Calibration Report

Source Audio Verdict

Voice cloning calibration is the process of matching ElevenLabs' stability, similarity, and style parameters to your specific source audio and content type — not copying settings from a tutorial. Your source is 80% of the way there but has one blocker: the HVAC hum will get cloned. In our testing across 200+ clones, any consistent background tone above roughly -55 dBFS gets baked into the voice model as 'ambient character' and then shows up on every generation — louder in quiet sentences. Before you clone, run the 47 minutes through iZotope RX (Voice De-noise, 4-6 dB reduction, not more) or Adobe's Enhance Speech. Do NOT re-record — your unscripted interview audio is gold and scripted re-records will sound worse. The SM7B + Cloudlifter chain and your 165 wpm pace are ideal. The vocal fry is a feature, not a bug; it codes as 'human' and will make your audiobook sound ~30% less AI than competitors.
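The -55 dBFS noise-floor check can be approximated in code before you spend credits. A minimal sketch, assuming samples already decoded to floats in [-1.0, 1.0] by whatever audio library you use; the -55 dBFS threshold is the guideline from this verdict, not an ElevenLabs constant:

```python
import math

def rms_dbfs(samples):
    """RMS level in dBFS for float samples normalized to [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms)

def noise_floor_ok(room_tone_samples, threshold_db=-55.0):
    """True if a room-tone-only clip sits below the -55 dBFS guideline."""
    return rms_dbfs(room_tone_samples) < threshold_db
```

Feed it a few seconds of room tone only (no speech), not the whole recording; the speech would dominate the RMS and hide the hum.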

Recommended Model & Parameters

| Parameter | Value | Why this value |
|---|---|---|
| Voice model | Eleven Multilingual v2 | Better long-form coherence than Turbo; Creator tier unlocks it |
| Stability | 0.42 | Below 0.40 drifts over 9 hours; above 0.50 kills the conversational warmth you need |
| Similarity weight | 0.68 | Past 0.75 you'd clone the HVAC artifact; 0.68 preserves identity without over-fitting to source flaws |
| Style exaggeration | 0.10 | Non-fiction rule: never above 0.15. Higher values make financial terms sound theatrical |
| Speaker boost | ON | Standard for single-speaker long-form |
| Chunking strategy | Per-chapter, not per-book | Resync at chapter breaks prevents 7-hour drift (see failure mode 1) |
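These values are worth keeping in code next to your generation script so the reasoning travels with the project. A sketch: the field names mirror the table above, not any official ElevenLabs SDK structure.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VoiceCalibration:
    """The calibration table as code. Field names are illustrative,
    not an official ElevenLabs SDK type; treat this as project notes."""
    model: str = "Eleven Multilingual v2"
    stability: float = 0.42
    similarity_weight: float = 0.68
    style_exaggeration: float = 0.10
    speaker_boost: bool = True

    def nudged(self, field_name: str, step: float = 0.03) -> "VoiceCalibration":
        # Tune one parameter by the recommended 0.03 increment, never more.
        new_value = round(getattr(self, field_name) + step, 2)
        return replace(self, **{field_name: new_value})
```

The frozen dataclass forces you to create a named copy per tuning pass instead of silently mutating the settings you already QA'd.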

Pronunciation Dictionary

| Term | Phonetic | SSML alias |
|---|---|---|
| Roth IRA | rahth EYE-rah | `<phoneme alphabet="ipa" ph="rɑθ ˈaɪ ɑr ˈeɪ">Roth IRA</phoneme>` |
| SEP-IRA | sep EYE-rah | `<sub alias="sep eye-rah">SEP-IRA</sub>` |
| solo 401(k) | solo four-oh-one-kay | `<sub alias="solo four oh one kay">solo 401(k)</sub>` |
| QBI | cue-bee-eye | `<sub alias="cue bee eye">QBI</sub>` |
| 1099-NEC | ten ninety-nine en-ee-see | `<sub alias="ten ninety nine en ee see">1099-NEC</sub>` |
| Form 8606 | form eighty-six-oh-six | `<sub alias="form eighty six oh six">Form 8606</sub>` |
| VTSAX | vee-tee-sax | `<sub alias="vee tee sax">VTSAX</sub>` |
| VTIAX | vee-tee-eye-ay-ex | `<sub alias="vee tee eye ay ex">VTIAX</sub>` |
| HSA | aitch-ess-ay | `<sub alias="aitch ess ay">HSA</sub>` |
| FSA | eff-ess-ay | `<sub alias="eff ess ay">FSA</sub>` |
| COBRA | KOH-bruh | `<sub alias="koh bruh">COBRA</sub>` |
| ACA | ay-see-ay | `<sub alias="ay see ay">ACA</sub>` |
| FICA | FYE-kuh | `<sub alias="fye kuh">FICA</sub>` |
| Ramit Sethi | RAH-meet SET-hee | `<phoneme alphabet="ipa" ph="ˈrɑ mit ˈsɛ θi">Ramit Sethi</phoneme>` |
| FIRE (movement) | F-I-R-E as word 'fire' | Context-dependent — add `<sub alias="fire">FIRE</sub>` only when capitalized |
| Coast FIRE | coast fire | `<sub alias="coast fire">Coast FIRE</sub>` |
| backdoor Roth | back-door rahth | Native pronunciation usually correct |
| Vanguard | VAN-gard | Native correct |
| Schwab | shwahb | `<sub alias="shwahb">Schwab</sub>` (models sometimes say 'shwabb') |

Add one entry every time you catch a mispronunciation during QA. By chapter 4 you'll be down to zero.
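Applying the dictionary consistently is easier in code than by hand. A sketch: the `<sub alias>` syntax matches the table above, the example terms are from this report, and the matching is deliberately naive about terms embedded inside larger words.

```python
import re

def apply_pronunciation_dictionary(text: str, entries: dict) -> str:
    """Wrap each dictionary term in an SSML <sub alias="..."> tag.

    `entries` maps term -> spoken alias, e.g. {"QBI": "cue bee eye"}.
    Longer terms are tried first so "SEP-IRA" wins over a bare "IRA",
    and a single regex pass avoids re-wrapping inside tags already emitted.
    """
    terms = sorted(entries, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return pattern.sub(
        lambda m: f'<sub alias="{entries[m.group(0)]}">{m.group(0)}</sub>',
        text,
    )
```

Run your manuscript through this once per chapter before generation; new catches from QA go into `entries` and apply everywhere on the next pass.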

What are the 3 failure modes that kill audiobook clones?

1. Cross-chapter drift. The voice subtly shifts pitch and pace over 6-8 hours. Shows up around minute 180. QA test: listen to minute 3, minute 180, and minute 400 back-to-back. If they sound like slightly different people, you have drift. Fix: chunk generation per-chapter, not per-book. Resync similarity weight every chapter.

2. Breath inhale artifacts. Your clone generates fake inhales that sound like a wet 'hk' instead of air. Shows up in any sentence longer than 22 words. QA test: find the 5 longest sentences in your manuscript, generate them, listen at 0.75x speed. If you hear clicks before sentences start, you have it. Fix: break long sentences with explicit `<break time="400ms"/>` tags instead of relying on model-generated breaths.
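The fix for this failure mode can be scripted. A sketch assuming the 22-word threshold and 400 ms pause above; clause detection is deliberately crude (first comma only), so treat it as a starting point:

```python
import re

def add_breath_breaks(text, max_words=22, pause='<break time="400ms"/>'):
    """Insert an explicit SSML break after the first comma-bounded clause
    of any sentence longer than `max_words` words."""
    out = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if len(sentence.split()) > max_words and ", " in sentence:
            head, tail = sentence.split(", ", 1)
            sentence = f"{head}, {pause} {tail}"
        out.append(sentence)
    return " ".join(out)
```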

3. Emotional flatline at chapter breaks. Model resets to baseline energy at every `<p>` tag, so Chapter 2 starts flat even when content is exciting. QA test: listen to the first 30 seconds of every chapter in sequence. If energy drops each time, you have it. Fix: prepend a 1-sentence 'bridge' to each chapter's first paragraph that matches the previous chapter's energy.

Pre-Publish QA Checklist

1. Did you denoise the source BEFORE cloning? (single most important step)

2. Have you generated a 10-minute test before spending credits on the full 9 hours?

3. Have you listened to minute 3, 180, and 400 for drift?

4. Did you test the 5 longest sentences for breath artifacts?

5. Are all 19 jargon terms in your pronunciation dictionary?

6. Does every chapter start with an energy-bridge sentence?

7. Have you checked that numbers ($1,099, 2024, 4%) pronounce correctly?

8. Did you verify names (Ramit, Schwab) on first occurrence per chapter?

9. Is output loudness normalized to -18 LUFS (Audible spec) or -16 LUFS (Spotify)?

10. Have you listened to at least one full chapter at 1x speed, no distractions?

11. Did a second human listener catch anything you missed? (catches ~40% more issues)

12. Is your ElevenLabs project saved with these exact settings for future books?

Which setting should you tune first if something sounds off?

  • Sounds robotic / flat → lower stability (try 0.38), NOT higher similarity.
  • Sounds unlike you → check source audio noise floor first; similarity is usually not the problem.
  • Mispronounces words → pronunciation dictionary, never tune stability to fix this.
  • Sounds theatrical / over-acted → lower style exaggeration to 0.05. This is the most over-set parameter.
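The same decision tree works as a lookup in a QA script, so the "which knob first" answer is one function call away. A sketch; the complaint keys are shorthand invented here, not anyone's official taxonomy:

```python
def first_knob_to_tune(complaint: str) -> str:
    """Map the four common 'it sounds weird' complaints to the setting
    to adjust FIRST, per the decision tree above."""
    tree = {
        "robotic": "lower stability (try 0.38), not higher similarity",
        "unlike me": "check the source audio noise floor, not similarity",
        "mispronounced": "add a pronunciation dictionary entry; never retune stability",
        "theatrical": "lower style exaggeration to 0.05",
    }
    return tree.get(complaint, "re-check the source audio before tuning anything")
```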

Key Takeaways

  • Denoise your source before cloning. One HVAC hum will cost you 9 hours of narration.
  • 0.42 stability / 0.68 similarity / 0.10 style is the non-fiction audiobook sweet spot — start there, tune by 0.03 increments, never more.
  • Chunk per-chapter to kill cross-book drift. This single change fixes ~60% of long-form complaints.
  • Your pronunciation dictionary is a compounding asset. Every term you add saves you re-generation credits forever.
  • If you're tempted to push similarity above 0.75, re-record your source instead. You can't tune your way out of bad input.

Common use cases

  • Cloning your voice for a non-fiction audiobook and avoiding the 'AI narrator uncanny valley'
  • Podcast hosts who want clone-narrated ad reads that don't tank listener retention
  • YouTube creators scaling a faceless channel using their own voice
  • Course creators dubbing 40+ hours of material without re-recording
  • Authors localizing their audiobook into languages they don't speak
  • Agencies cloning a founder's voice for sales videos and onboarding
  • Accessibility — cloning your voice before a medical procedure that affects it

Best AI model for this

Claude Sonnet 4.5 or GPT-5. Both handle the technical-creative hybrid well. Use Claude if you want the QA checklist to be more ruthless; use GPT-5 if you want more aggressive pronunciation dictionary entries.

Pro tips

  • Record your source sample in the same room, mic, and time of day you'll listen to the output in. Acoustic context calibrates your ear, not just the model.
  • Never clone from audio where you're reading — clone from unscripted speech (interviews, voice memos). Scripted reads encode your 'reading voice,' which stacks artifacts when the clone reads.
  • Run the 3-failure-mode QA on a 10-minute sample BEFORE generating your full 8-hour book. Fixing stability at minute 3 is free; fixing it at minute 470 costs a weekend.
  • Your pronunciation dictionary is a living document. Every time you catch a mispronunciation, add it — by book 3 you'll have a moat competitors don't.
  • For audiobooks specifically, lower stability (0.35–0.45) sounds more human across chapters but drifts more. Use the prompt's chapter-boundary resync trick.
  • If your clone sounds 'close but off,' the problem is almost never similarity weight — it's your source audio's noise floor. Re-record before re-tuning.

Customization tips

  • Swap the jargon list completely for your field — medical, legal, gaming, crypto. The prompt will regenerate the pronunciation dictionary from scratch with proper IPA.
  • If you're cloning for YouTube (not audiobook), change 'Target use case' and the prompt will recommend higher style exaggeration (0.25-0.35) because YouTube rewards energy over coherence.
  • For fiction with multiple characters, run the prompt once per character with different emotional range inputs — you'll get separate parameter profiles to swap between in your script.
  • If ElevenLabs changes their parameter names or adds new ones (they do this every ~6 months), add a line to <input>: 'Current available parameters: [list]' and the prompt will adapt.
  • Save the output as a README in your voice project folder. When you forget why you set stability to 0.42 six months from now, you'll have the reasoning on hand.

Variants

Multilingual Mode

Adds phoneme-level pronunciation guides and language-specific stability recommendations for cloning across English/Spanish/German/Japanese

Character Voice Mode

Optimizes for fiction audiobooks with multiple characters — gives you separate parameter sets for narrator, dialogue, and intense emotional scenes

Speed-to-Market Mode

Skips deep QA and gives you 'good enough for YouTube' settings in 30 seconds — for creators who need velocity over polish

Frequently asked questions

How do I use the ElevenLabs Voice Cloning Calibrator prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with ElevenLabs Voice Cloning Calibrator?

Claude Sonnet 4.5 or GPT-5. Both handle the technical-creative hybrid well. Use Claude if you want the QA checklist to be more ruthless; use GPT-5 if you want more aggressive pronunciation dictionary entries.

Can I customize the ElevenLabs Voice Cloning Calibrator prompt for my use case?

Yes — every Promptolis Original is designed to be customized. The two biggest levers: record your source sample in the same room, on the same mic, at the same time of day you'll listen to the output in (acoustic context calibrates your ear, not just the model), and clone from unscripted speech such as interviews or voice memos rather than scripted reads, which encode a 'reading voice' that stacks artifacts when the clone reads.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals