Pass a cloned voice's id as reference_id to Text to Speech, and every generation speaks in that voice. This works from the API directly, the Python library, or JavaScript.
Use it in the web app
No code — clone a voice in the browser.
API reference
Every field for POST /model.
Cookbooks
Instant clones, training, and reuse.
When to use it
Brand voice
One consistent voice across product, ads, and IVR.
Personal voice
Clone your own voice for narration or assistants.
Characters
Distinct voices for games, stories, and dialogue.
Dubbing & localization
Keep a speaker’s identity across languages.
Quick start
Send one or more audio samples and get back a voice model.
Use your cloned voice
Pass the voice id as reference_id to Text to Speech, exactly like any other voice.
Implementation details
Sample quality
Clean, mono, single-speaker audio gives the best result. A short clip works for a quick clone; a minute or two of clear speech improves fidelity. Avoid background music, reverb, and overlapping voices.
Multiple samples
Pass several clips to capture more range. You can also supply the matching transcripts as texts to sharpen pronunciation.
Visibility
Models are private by default. Set unlist for a shareable link, or public to publish to the Voice Library. You can change this later; see Manage Voices.
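The sample, transcript, and visibility options above can be gathered and checked before any request is sent. The following is a minimal sketch, not the library's API: the function name build_create_fields and the "samples" key are hypothetical, and the rule that texts must hold exactly one transcript per clip is an assumption drawn from the phrase "matching transcripts". Only texts and the three visibility values come from this page.

```python
from typing import Optional, Sequence

# Visibility values named in the text above.
VISIBILITY_OPTIONS = {"private", "unlist", "public"}


def build_create_fields(
    sample_paths: Sequence[str],
    texts: Optional[Sequence[str]] = None,
    visibility: str = "private",
) -> dict:
    """Collect and validate the inputs for creating a voice model.

    The "samples" key is a hypothetical name used only in this sketch;
    "texts" and "visibility" follow the wording of this page.
    """
    if not sample_paths:
        raise ValueError("at least one audio sample is required")
    if visibility not in VISIBILITY_OPTIONS:
        raise ValueError(f"visibility must be one of {sorted(VISIBILITY_OPTIONS)}")
    # Assumption: one transcript per clip, in the same order as the clips.
    if texts is not None and len(texts) != len(sample_paths):
        raise ValueError("texts must contain one transcript per sample")

    fields = {"samples": list(sample_paths), "visibility": visibility}
    if texts is not None:
        fields["texts"] = list(texts)
    return fields
```

Validating locally like this catches a mismatched transcript list or a misspelled visibility value before any audio is uploaded.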
Instant vs. persistent clones
There are two ways to clone:
- Persistent model (above): train once with voices.create() and get back a reusable id. Best when you'll use the same voice repeatedly.
- Instant clone: pass reference audio inline on each generation, with no model to manage. Best for one-off or per-request voices.
For an instant clone, pass references instead of reference_id.
Pass multiple ReferenceAudio entries to capture more range, just as you would with multiple samples in a persistent model. The matching text for each clip sharpens pronunciation.
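To make the difference between the two styles concrete, here is a minimal sketch that builds a request body for either one. It does not use the real client library: build_tts_payload is a hypothetical helper, the "audio" and "text" keys inside each reference are assumed names, and treating reference_id and references as mutually exclusive is an assumption of this sketch rather than a documented rule.

```python
from typing import Optional, Sequence, Tuple


def build_tts_payload(
    text: str,
    reference_id: Optional[str] = None,
    references: Optional[Sequence[Tuple[bytes, str]]] = None,
) -> dict:
    """Build a Text to Speech request body for either cloning style.

    references is a sequence of (audio_bytes, transcript) pairs.
    """
    # Assumption: a request names a persistent model OR carries inline audio.
    if (reference_id is None) == (references is None):
        raise ValueError("pass exactly one of reference_id or references")
    if references is not None and len(references) == 0:
        raise ValueError("references must contain at least one clip")

    payload = {"text": text}
    if reference_id is not None:
        # Persistent model: the voice was trained earlier and is reused by id.
        payload["reference_id"] = reference_id
    else:
        # Instant clone: reference audio travels with this request.
        payload["references"] = [
            {"audio": audio, "text": transcript} for audio, transcript in references
        ]
    return payload
```

A persistent voice keeps each request small, while an instant clone resends the audio every time in exchange for having no model to manage.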
Sample audio requirements
Samples can be .wav, .mp3, .m4a, or .opus. Aim for at least 10 seconds per clip; a minute or two of clear, single-speaker speech improves fidelity.
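These guidelines can be checked locally before uploading. The sketch below is a hypothetical pre-flight helper (check_sample is not part of any library) that relies only on Python's standard wave module, so it can measure duration and channel count for uncompressed .wav files only; other accepted formats pass the extension check and are otherwise left unexamined.

```python
import wave
from pathlib import Path

# Formats and minimum length taken from the requirements above.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".opus"}
MIN_SECONDS = 10.0


def check_sample(path: str) -> list:
    """Return a list of warnings for one sample file (empty means no issues found)."""
    warnings = []
    suffix = Path(path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        warnings.append(f"unsupported extension: {suffix or '(none)'}")
        return warnings
    if suffix == ".wav":
        # The standard-library wave module reads uncompressed PCM WAV only.
        with wave.open(path, "rb") as wav:
            seconds = wav.getnframes() / wav.getframerate()
            if seconds < MIN_SECONDS:
                warnings.append(
                    f"clip is {seconds:.1f}s; aim for at least {MIN_SECONDS:.0f}s"
                )
            if wav.getnchannels() != 1:
                warnings.append("clip is not mono")
    return warnings
```

The checks are advisory: they return warnings instead of raising, since the text above describes a short clip as workable for a quick clone.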
enhance_audio_quality (on by default) removes background noise and normalizes levels before training. Turning it off (enhance_audio_quality=False) avoids any extra processing.
Model state
A new model reports a state field that moves from created to trained (or failed). With train_mode="fast" (the default) the voice is usable almost immediately, so most clones return already trained.
If you stored a reference_id earlier, re-fetch the model and confirm its state before using it in Text to Speech.
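The state check can be wrapped in a small polling loop. This is a minimal sketch under stated assumptions: wait_until_trained is a hypothetical helper, and fetch_state stands in for whatever call re-fetches the model and returns its state string, so no real client method is named here. The timeout and poll interval are arbitrary illustrative defaults.

```python
import time
from typing import Callable


def wait_until_trained(
    fetch_state: Callable[[], str],
    timeout_s: float = 60.0,
    poll_interval_s: float = 2.0,
) -> None:
    """Block until the model state is "trained"; raise on "failed" or timeout.

    fetch_state is a caller-supplied function (hypothetical) that re-fetches
    the model and returns its current state: "created", "trained", or "failed".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == "trained":
            return
        if state == "failed":
            raise RuntimeError("voice model training failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model still in state {state!r} after {timeout_s}s")
        time.sleep(poll_interval_s)
```

Passing the fetch step in as a function keeps the loop independent of any particular client and easy to exercise with a stub.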
Going further
Speak with your voice
Use reference_id in any generation.
Manage your voices
List, update, and delete your voice models.
Cloning best practices
Get the most natural results from your samples.
Create Model API
Every field for POST /model.
