Build a reusable voice model from your own audio, then use it anywhere you generate speech. You get back a voice id — pass it as reference_id to Text to Speech and every generation speaks in that voice. Works from the API directly, the Python library, or JavaScript.

Use it in the web app

No code — clone a voice in the browser.

API reference

Every field for POST /model.

Cookbooks

Instant clones, training, and reuse.

When to use it

Brand voice

One consistent voice across product, ads, and IVR.

Personal voice

Clone your own voice for narration or assistants.

Characters

Distinct voices for games, stories, and dialogue.

Dubbing & localization

Keep a speaker’s identity across languages.

Quick start

Send one or more audio samples and get back a voice model:
Python
from fishaudio import FishAudio

client = FishAudio()  # reads FISH_API_KEY

with open("sample.wav", "rb") as f:
    voice = client.voices.create(
        title="My Voice",
        voices=[f.read()],
        description="Cloned from a studio sample",
        visibility="private",
    )

print(voice.id, voice.state)

Use your cloned voice

Pass the voice id as reference_id to Text to Speech, exactly like any other voice.
Python
audio = client.tts.convert(
    text="Now I speak in my cloned voice.",
    reference_id=voice.id,
)
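To keep the result, write it to disk. This page doesn't state what client.tts.convert returns, so the sketch below assumes either a single bytes object or an iterable of bytes chunks; write_audio is a hypothetical helper, not part of the library.

```python
from pathlib import Path
from typing import Iterable, Union


def write_audio(audio: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write generated audio to `path` and return the number of bytes written.

    Assumption: `audio` is one bytes object or an iterable of bytes chunks.
    Check the SDK reference for the actual return type of tts.convert().
    """
    if isinstance(audio, (bytes, bytearray)):
        chunks = [bytes(audio)]  # treat a single buffer as one chunk
    else:
        chunks = audio  # assumed to yield bytes chunks
    written = 0
    with Path(path).open("wb") as out:
        for chunk in chunks:
            written += out.write(chunk)
    return written
```

Call it as write_audio(audio, "cloned.mp3"), picking a file extension that matches the output format you requested.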

Implementation details

Sample quality

Clean, mono, single-speaker audio gives the best result. A short clip works for a quick clone; a minute or two of clear speech improves fidelity. Avoid background music, reverb, and overlapping voices.

Multiple samples

Pass several clips to capture more range. You can also supply the matching transcripts as texts to sharpen pronunciation.
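Python
from pathlib import Path

voice = client.voices.create(
    title="My Voice",
    # read_bytes() opens, reads, and closes each file
    voices=[Path("a.wav").read_bytes(), Path("b.wav").read_bytes()],
    # one transcript per clip, listed to match the clips
    texts=["Transcript of clip A.", "Transcript of clip B."],
)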
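With more than a couple of clips, it helps to keep each file next to its transcript so the two lists can't drift apart. A minimal sketch: load_samples is a hypothetical helper, and it assumes texts are matched to voices by position, as in the example above.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def load_samples(pairs: Sequence[Tuple[str, str]]) -> Tuple[List[bytes], List[str]]:
    """Read (audio_path, transcript) pairs into two parallel lists."""
    voices: List[bytes] = []
    texts: List[str] = []
    for audio_path, transcript in pairs:
        transcript = transcript.strip()
        if not transcript:
            # A blank transcript would silently misalign the two lists.
            raise ValueError(f"empty transcript for {audio_path}")
        voices.append(Path(audio_path).read_bytes())
        texts.append(transcript)
    return voices, texts
```

Then pass both lists straight through: voices, texts = load_samples([("a.wav", "Transcript of clip A."), ("b.wav", "Transcript of clip B.")]), followed by client.voices.create(title="My Voice", voices=voices, texts=texts).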

Visibility

Models are private by default. Set visibility to "unlist" for a shareable link, or to "public" to publish to the Voice Library. You can change this later; see Manage Voices.

Instant vs. persistent clones

There are two ways to clone:
  • Persistent model (above) — train once with voices.create(), get back a reusable id. Best when you’ll use the same voice repeatedly.
  • Instant clone — pass reference audio inline on each generation with no model to manage. Best for one-off or per-request voices.
For an instant clone, send the reference audio (and its transcript) directly to Text to Speech via references instead of reference_id:
Python
from fishaudio import FishAudio
from fishaudio.types import ReferenceAudio

client = FishAudio()

with open("reference.wav", "rb") as f:
    audio = client.tts.convert(
        text="This will sound like the reference voice.",
        references=[ReferenceAudio(
            audio=f.read(),
            text="The exact words spoken in the reference clip.",
        )],
    )
Pass several ReferenceAudio entries to capture more range, just as you would with multiple samples in a persistent model. The matching text for each clip sharpens pronunciation.
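If the same code path serves both persistent and instant clones, a small guard keeps the two options from being mixed up. This page presents references as an alternative to reference_id but doesn't say whether the API accepts both at once, so the sketch below simply treats them as mutually exclusive. voice_kwargs is a hypothetical helper.

```python
from typing import Any, Dict, Optional, Sequence


def voice_kwargs(
    reference_id: Optional[str] = None,
    references: Optional[Sequence[Any]] = None,
) -> Dict[str, Any]:
    """Return the voice-related keyword arguments for a generation call.

    Assumption: exactly one of `reference_id` (persistent model) or
    `references` (instant clone) should be supplied.
    """
    has_id = reference_id is not None
    has_refs = bool(references)
    if has_id == has_refs:  # neither given, or both given
        raise ValueError("pass exactly one of reference_id or references")
    if has_id:
        return {"reference_id": reference_id}
    return {"references": list(references)}
```

Usage: client.tts.convert(text="Hello.", **voice_kwargs(reference_id=voice.id)).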

Sample audio requirements

Samples can be .wav, .mp3, .m4a, or .opus. Aim for at least 10 seconds per clip; a minute or two of clear, single-speaker speech improves fidelity. enhance_audio_quality (on by default) removes background noise and normalizes levels before training:
Python
from pathlib import Path

voice = client.voices.create(
    title="My Voice",
    voices=[Path("sample.wav").read_bytes()],  # reads and closes the file
    enhance_audio_quality=True,  # the default, shown here for clarity
)
Leave it on for noisy or lower-quality recordings. If your audio is already clean and studio-grade, turning it off (enhance_audio_quality=False) avoids any extra processing.
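You can catch obvious problems locally before uploading. The sketch below checks the file extension against the formats listed above and, for .wav files only, uses Python's standard wave module to check the 10-second guideline and the mono recommendation. check_sample is a hypothetical helper; the service may apply different or additional checks.

```python
import wave
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".opus"}
MIN_SECONDS = 10.0  # guideline from this page, not a hard limit


def check_sample(path: str) -> List[str]:
    """Return a list of warnings for one sample file (empty if none)."""
    warnings: List[str] = []
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        warnings.append(f"unsupported extension: {suffix or '(none)'}")
        return warnings
    if suffix == ".wav":
        # Only WAV is inspected; other formats would need a third-party decoder.
        with wave.open(path, "rb") as wav:
            seconds = wav.getnframes() / wav.getframerate()
            channels = wav.getnchannels()
        if seconds < MIN_SECONDS:
            warnings.append(
                f"only {seconds:.1f}s long; aim for at least {MIN_SECONDS:.0f}s"
            )
        if channels != 1:
            warnings.append(f"{channels} channels; mono is recommended")
    return warnings
```

An empty list means the file passed these basic checks; it says nothing about noise, reverb, or overlapping speakers.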

Model state

A new model reports a state field that moves from created to trained (or failed). With train_mode="fast" (the default) the voice is usable almost immediately, so most clones return already trained.
Python
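from pathlib import Path
sample = Path("sample.wav").read_bytes()
voice = client.voices.create(title="My Voice", voices=[sample])
print(voice.state)  # usually "trained" with the default fast mode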
If a generation rejects the reference_id, re-fetch the model and confirm its state before using it in Text to Speech:
Python
voice = client.voices.get(voice.id)
if voice.state == "trained":
    audio = client.tts.convert(text="Hello.", reference_id=voice.id)
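If a model is not trained yet, you can poll until it is. A minimal sketch built only on client.voices.get() and the state values named above ("created", "trained", "failed"); wait_until_trained is a hypothetical helper, and the timeout and interval defaults are arbitrary choices, not recommended values.

```python
import time


def wait_until_trained(client, voice_id: str, timeout: float = 60.0, interval: float = 2.0):
    """Poll client.voices.get() until the model's state is "trained".

    Raises RuntimeError if the state becomes "failed", and TimeoutError if
    the model is still not trained once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        voice = client.voices.get(voice_id)
        if voice.state == "trained":
            return voice
        if voice.state == "failed":
            raise RuntimeError(f"voice {voice_id} failed to train")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"voice {voice_id} still {voice.state!r} after {timeout}s"
            )
        time.sleep(interval)
```

Usage: voice = wait_until_trained(client, voice.id), then pass voice.id as reference_id as usual.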

Going further

Speak with your voice

Use reference_id in any generation.

Manage your voices

List, update, and delete your voice models.

Cloning best practices

Get the most natural results from your samples.

Create Model API

Every field for POST /model.