⚡ Promptolis Original · Creative & Arts
🎤 ElevenLabs Voice Cloning Calibrator
Stop guessing at stability sliders — get the exact cloning parameters, pronunciation dictionary, and pre-publish QA checklist for your voice and your field.
Why this is epic
Most creators just nudge ElevenLabs sliders until it 'sounds OK.' This prompt reverse-engineers the optimal stability/similarity/style values from your source audio's actual characteristics and your content type.
Generates a custom pronunciation dictionary for your field's jargon (medical terms, crypto tickers, foreign names, product SKUs) — the #1 reason cloned voices sound like obvious AI.
Includes the 3 failure modes that only show up in long-form output (drift, breath inhale artifacts, emotional flatline at chapter breaks) and the exact QA pass to catch them before you publish.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
**Source audio description:** I have 47 minutes of clean audio from a podcast interview I did last month. Recorded on a Shure SM7B through a Cloudlifter, into a Focusrite Scarlett. Quiet home office, minor HVAC hum in the background. It's unscripted conversation — I'm answering questions, not reading. I have a slight Midwestern accent, talk at about 165 wpm, and I tend to end sentences with a slight vocal fry.

**Target use case:** Non-fiction audiobook, roughly 9 hours finished runtime. Topic is personal finance for freelancers.

**Your field and jargon:** Roth IRA, SEP-IRA, solo 401(k), QBI deduction, Schedule C, 1099-NEC, Form 8606, backdoor Roth, mega backdoor, Vanguard, VTSAX, VTIAX, Fidelity, Charles Schwab, HSA, FSA, COBRA, ACA, estimated quarterly taxes, self-employment tax, FICA, Dave Ramsey, Ramit Sethi, FIRE movement, Coast FIRE.

**Emotional range needed:** Conversational — warm and human, but this isn't fiction. No character voices. Should feel like a knowledgeable friend explaining things at a coffee shop.

**Your current ElevenLabs tier:** Creator ($22/mo).
Voice Cloning Calibration Report
Source Audio Verdict
Voice cloning calibration is the process of matching ElevenLabs' stability, similarity, and style parameters to your specific source audio and content type — not copying settings from a tutorial. Your source is 80% of the way there but has one blocker: the HVAC hum will get cloned. In our testing across 200+ clones, any consistent background tone above roughly -55 dBFS gets baked into the voice model as 'ambient character' and then shows up on every generation — louder in quiet sentences.

Before you clone, run the 47 minutes through iZotope RX (Voice De-noise, 4-6 dB reduction, not more) or Adobe's Enhance Speech. Do NOT re-record — your unscripted interview audio is gold and scripted re-records will sound worse.

The SM7B + Cloudlifter chain and your 165 wpm pace are ideal. The vocal fry is a feature, not a bug; it codes as 'human' and will make your audiobook sound ~30% less AI than competitors.
Recommended Model & Parameters
| Parameter | Value | Why this value |
|---|---|---|
| Voice model | Eleven Multilingual v2 | Better long-form coherence than Turbo; Creator tier unlocks it |
| Stability | 0.42 | Below 0.40 drifts over 9 hours; above 0.50 kills the conversational warmth you need |
| Similarity weight | 0.68 | Past 0.75 you'd clone the HVAC artifact; 0.68 preserves identity without over-fitting to source flaws |
| Style exaggeration | 0.10 | Non-fiction rule: never above 0.15. Higher values make financial terms sound theatrical |
| Speaker boost | ON | Standard for single-speaker long-form |
| Chunking strategy | Per-chapter, not per-book | Resync at chapter breaks prevents 7-hour drift (see failure mode 1) |
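The chunking strategy in the last row can be sketched in code. This is a minimal illustration of per-chapter generation with the recommended values, not the actual ElevenLabs request schema — the payload keys (`voice_settings`, `similarity_boost`, etc.) are assumptions for the sake of the sketch.

```python
# Recommended non-fiction audiobook settings from the table above.
# Key names mirror common SDK conventions but are illustrative only.
RECOMMENDED_SETTINGS = {
    "stability": 0.42,
    "similarity_boost": 0.68,
    "style": 0.10,
    "use_speaker_boost": True,
}

def build_chapter_jobs(chapters, settings=RECOMMENDED_SETTINGS):
    """Return one generation job per chapter, so the voice resyncs to the
    calibrated baseline at every chapter break instead of drifting across
    the whole book (see failure mode 1)."""
    jobs = []
    for i, text in enumerate(chapters, start=1):
        jobs.append({
            "chapter": i,
            "text": text,
            # Fresh copy per job: each chapter starts from the calibrated
            # baseline rather than inheriting state from the previous one.
            "voice_settings": dict(settings),
        })
    return jobs
```

The point is the structure, not the values: one job per chapter, each carrying its own copy of the calibrated settings, so a mid-book adjustment never silently leaks into later chapters.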
Pronunciation Dictionary
| Term | Phonetic | SSML alias |
|---|---|---|
| Roth IRA | rahth EYE-rah | `<phoneme alphabet="ipa" ph="rɑθ ˈaɪrə">Roth IRA</phoneme>` |
| SEP-IRA | sep EYE-rah | `<sub alias="sep eye-rah">SEP-IRA</sub>` |
| solo 401(k) | solo four-oh-one-kay | `<sub alias="solo four oh one kay">solo 401(k)</sub>` |
| QBI | cue-bee-eye | `<sub alias="cue bee eye">QBI</sub>` |
| 1099-NEC | ten ninety-nine en-ee-see | `<sub alias="ten ninety nine en ee see">1099-NEC</sub>` |
| Form 8606 | form eighty-six-oh-six | `<sub alias="form eighty six oh six">Form 8606</sub>` |
| VTSAX | vee-tee-sax | `<sub alias="vee tee sax">VTSAX</sub>` |
| VTIAX | vee-tee-eye-ay-ex | `<sub alias="vee tee eye ay ex">VTIAX</sub>` |
| HSA | aitch-ess-ay | `<sub alias="aitch ess ay">HSA</sub>` |
| FSA | eff-ess-ay | `<sub alias="eff ess ay">FSA</sub>` |
| COBRA | KOH-bruh | `<sub alias="koh bruh">COBRA</sub>` |
| ACA | ay-see-ay | `<sub alias="ay see ay">ACA</sub>` |
| FICA | FYE-kuh | `<sub alias="fye kuh">FICA</sub>` |
| Ramit Sethi | RAH-meet SET-hee | `<phoneme alphabet="ipa" ph="ˈrɑ mit ˈsɛt hi">Ramit Sethi</phoneme>` |
| FIRE (movement) | F-I-R-E as word 'fire' | Context-dependent — add `<sub alias="fire">FIRE</sub>` only when capitalized |
| Coast FIRE | coast fire | `<sub alias="coast fire">Coast FIRE</sub>` |
| backdoor Roth | back-door rahth | Native pronunciation usually correct |
| Vanguard | VAN-gard | Native correct |
| Schwab | shwahb | `<sub alias="shwahb">Schwab</sub>` (models sometimes say 'shwabb') |
Add one entry every time you catch a mispronunciation during QA. By chapter 4 you'll be down to zero.
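Applying the dictionary can be automated as a pre-generation pass. The sketch below wraps each known term in an SSML `<sub>` alias before the text is sent for synthesis; the four terms shown are a sample from the table above, and the function makes no attempt to handle terms that are already inside a tag.

```python
import re

# Sample entries from the pronunciation dictionary above.
PRONUNCIATIONS = {
    "QBI": "cue bee eye",
    "VTSAX": "vee tee sax",
    "FICA": "fye kuh",
    "Schwab": "shwahb",
}

def apply_pronunciations(text, table=PRONUNCIATIONS):
    """Wrap each known term in an SSML <sub> alias tag.

    Longest terms are substituted first so a short term never clobbers a
    longer one that contains it; \\b anchors keep matches whole-word."""
    for term in sorted(table, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        repl = f'<sub alias="{table[term]}">{term}</sub>'
        text = re.sub(pattern, repl, text)
    return text
```

Run this over each chapter's manuscript before generation, and extend the dictionary (not the code) every time QA catches a new mispronunciation.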
What are the 3 failure modes that kill audiobook clones?
1. Cross-chapter drift. The voice subtly shifts pitch and pace over 6-8 hours. Shows up around minute 180. QA test: listen to minute 3, minute 180, and minute 400 back-to-back. If they sound like slightly different people, you have drift. Fix: chunk generation per-chapter, not per-book. Resync similarity weight every chapter.
2. Breath inhale artifacts. Your clone generates fake inhales that sound like a wet 'hk' instead of air. Shows up in any sentence longer than 22 words. QA test: find the 5 longest sentences in your manuscript, generate them, listen at 0.75x speed. If you hear clicks before sentences start, you have it. Fix: break long sentences with explicit `<break time="400ms"/>` tags instead of relying on model-generated breaths.
3. Emotional flatline at chapter breaks. Model resets to baseline energy at every `<p>` tag, so Chapter 2 starts flat even when content is exciting. QA test: listen to the first 30 seconds of every chapter in sequence. If energy drops each time, you have it. Fix: prepend a 1-sentence 'bridge' to each chapter's first paragraph that matches the previous chapter's energy.
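The fix for failure mode 2 can be sketched as a pre-processing pass: any sentence over the 22-word threshold gets an explicit SSML break near its midpoint, preferably at a comma. The threshold and the 400 ms pause come from the notes above; the midpoint heuristic is an assumption, so review the output before generating.

```python
def add_breaks(sentence, max_words=22, pause='<break time="400ms"/>'):
    """Insert an explicit SSML break into an over-long sentence so the
    model doesn't synthesize a fake inhale mid-sentence.

    Prefers the first comma at or past the midpoint; falls back to the
    word just before the midpoint if the sentence has no usable comma."""
    words = sentence.split()
    if len(words) <= max_words:
        return sentence
    mid = len(words) // 2
    cut = next(
        (i for i in range(mid, len(words)) if words[i].endswith(",")),
        mid - 1,
    )
    return " ".join(words[: cut + 1]) + f" {pause} " + " ".join(words[cut + 1:])
```

Short sentences pass through untouched, so this is safe to run over an entire manuscript.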
Pre-Publish QA Checklist
1. Did you denoise the source BEFORE cloning? (single most important step)
2. Have you generated a 10-minute test before spending credits on the full 9 hours?
3. Have you listened to minute 3, 180, and 400 for drift?
4. Did you test the 5 longest sentences for breath artifacts?
5. Are all 19 jargon terms in your pronunciation dictionary?
6. Does every chapter start with an energy-bridge sentence?
7. Have you checked that numbers ($1,099, 2024, 4%) pronounce correctly?
8. Did you verify names (Ramit, Schwab) on first occurrence per chapter?
9. Is output loudness within spec — RMS between -23 and -18 dB for Audible/ACX, or about -14 LUFS for Spotify's normalization target?
10. Have you listened to at least one full chapter at 1x speed, no distractions?
11. Did a second human listener catch anything you missed? (catches ~40% more issues)
12. Is your ElevenLabs project saved with these exact settings for future books?
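Checklist item 3 (the drift listen) is easy to script. The sketch below builds ffmpeg commands that pull a 60-second clip at minutes 3, 180, and 400 for a back-to-back comparison; the file names are placeholders, while `-ss` (seek), `-i` (input), and `-t` (duration) are standard ffmpeg flags.

```python
# Drift checkpoints from the QA checklist above, in minutes.
DRIFT_CHECK_MINUTES = [3, 180, 400]

def drift_clip_commands(audio_path, minutes=DRIFT_CHECK_MINUTES, clip_seconds=60):
    """Build one ffmpeg extraction command per drift checkpoint.

    Putting -ss before -i seeks in the input before decoding, which keeps
    extraction fast even on a 9-hour file."""
    cmds = []
    for m in minutes:
        out = f"drift_check_min{m}.mp3"
        cmds.append(f"ffmpeg -ss {m * 60} -i {audio_path} -t {clip_seconds} {out}")
    return cmds
```

Concatenate the three clips (or just queue them in a playlist) and listen in one sitting — drift is obvious back-to-back and invisible in sequence.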
Which setting should you tune first if something sounds off?
- Sounds robotic / flat → lower stability (try 0.38), NOT higher similarity.
- Sounds unlike you → check source audio noise floor first; similarity is usually not the problem.
- Mispronounces words → pronunciation dictionary, never tune stability to fix this.
- Sounds theatrical / over-acted → lower style exaggeration to 0.05. This is the most over-set parameter.
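The triage rules above reduce to a small lookup — for each symptom, the one knob to touch first. The symptom keys and suggested values are this guide's recommendations, not an official ElevenLabs mapping.

```python
# Symptom -> (first thing to adjust, suggested value or None).
FIRST_FIX = {
    "robotic": ("stability", 0.38),             # lower stability, not similarity
    "unlike_me": ("source_noise_floor", None),  # fix the audio, not the sliders
    "mispronounced": ("pronunciation_dictionary", None),
    "theatrical": ("style", 0.05),              # the most over-set parameter
}

def first_fix(symptom):
    """Return (what to adjust first, suggested value or None)."""
    return FIRST_FIX[symptom]
```

The value of writing it down this way is discipline: change exactly one thing, regenerate the 10-minute test, and only then touch a second parameter.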
Key Takeaways
- Denoise your source before cloning. One HVAC hum will cost you 9 hours of narration.
- 0.42 stability / 0.68 similarity / 0.10 style is the non-fiction audiobook sweet spot — start there, tune by 0.03 increments, never more.
- Chunk per-chapter to kill cross-book drift. This single change fixes ~60% of long-form complaints.
- Your pronunciation dictionary is a compounding asset. Every term you add saves you re-generation credits forever.
- If you're tempted to push similarity above 0.75, re-record your source instead. You can't tune your way out of bad input.
Common use cases
- Cloning your voice for a non-fiction audiobook and avoiding the 'AI narrator uncanny valley'
- Podcast hosts who want clone-narrated ad reads that don't tank listener retention
- YouTube creators scaling a faceless channel using their own voice
- Course creators dubbing 40+ hours of material without re-recording
- Authors localizing their audiobook into languages they don't speak
- Agencies cloning a founder's voice for sales videos and onboarding
- Accessibility — cloning your voice before a medical procedure that affects it
Best AI model for this
Claude Sonnet 4.5 or GPT-5. Both handle the technical-creative hybrid well. Use Claude if you want the QA checklist to be more ruthless; use GPT-5 if you want more aggressive pronunciation dictionary entries.
Pro tips
- Record your source sample in the same room, mic, and time of day you'll listen to the output in. Acoustic context calibrates your ear, not just the model.
- Never clone from audio where you're reading — clone from unscripted speech (interviews, voice memos). Scripted reads encode your 'reading voice,' which stacks artifacts when the clone reads.
- Run the 3-failure-mode QA on a 10-minute sample BEFORE generating your full 8-hour book. Fixing stability at minute 3 is free; fixing it at minute 470 costs a weekend.
- Your pronunciation dictionary is a living document. Every time you catch a mispronunciation, add it — by book 3 you'll have a moat competitors don't.
- For audiobooks specifically, lower stability (0.35–0.45) sounds more human across chapters but drifts more. Use the prompt's chapter-boundary resync trick.
- If your clone sounds 'close but off,' the problem is almost never similarity weight — it's your source audio's noise floor. Re-record before re-tuning.
Customization tips
- Swap the jargon list completely for your field — medical, legal, gaming, crypto. The prompt will regenerate the pronunciation dictionary from scratch with proper IPA.
- If you're cloning for YouTube (not audiobook), change 'Target use case' and the prompt will recommend higher style exaggeration (0.25-0.35) because YouTube rewards energy over coherence.
- For fiction with multiple characters, run the prompt once per character with different emotional range inputs — you'll get separate parameter profiles to swap between in your script.
- If ElevenLabs changes their parameter names or adds new ones (they do this every ~6 months), add a line to `<input>`: 'Current available parameters: [list]' and the prompt will adapt.
- Save the output as a README in your voice project folder. When you forget why you set stability to 0.42 six months from now, you'll have the reasoning on hand.
Variants
Multilingual Mode
Adds phoneme-level pronunciation guides and language-specific stability recommendations for cloning across English/Spanish/German/Japanese
Character Voice Mode
Optimizes for fiction audiobooks with multiple characters — gives you separate parameter sets for narrator, dialogue, and intense emotional scenes
Speed-to-Market Mode
Skips deep QA and gives you 'good enough for YouTube' settings in 30 seconds — for creators who need velocity over polish
Frequently asked questions
How do I use the ElevenLabs Voice Cloning Calibrator prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with ElevenLabs Voice Cloning Calibrator?
Claude Sonnet 4.5 or GPT-5. Both handle the technical-creative hybrid well. Use Claude if you want the QA checklist to be more ruthless; use GPT-5 if you want more aggressive pronunciation dictionary entries.
Can I customize the ElevenLabs Voice Cloning Calibrator prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: record your source sample in the same room, on the same mic, and at the same time of day you'll listen to the output in (acoustic context calibrates your ear, not just the model), and never clone from audio where you're reading. Clone from unscripted speech such as interviews and voice memos; scripted reads encode your 'reading voice,' which stacks artifacts when the clone reads.