Text to Speech - Fish Audio

Generate natural speech from text with the s2-pro and s1 models. Pick a voice, choose a format, and go — from the API directly, the Python library, or JavaScript.

Use it in the web app

No code — type, pick a voice, generate.

API reference

Every parameter for POST /v1/tts.

Cookbooks

Ready-made recipes: streaming, telephony, and more.

When to use it

Voiceovers & narration

Audiobooks, explainers, ads, and video narration.

Conversational AI

Speak an assistant’s replies — pair with streaming for low latency.

Accessibility & IVR

Read content aloud, voice phone menus, and speak notifications.

Custom voices

Speak in a cloned voice you own.

Quick start

Send text, get back audio. In Python:

from fishaudio import FishAudio
from fishaudio.utils import save

client = FishAudio()  # reads FISH_API_KEY
audio = client.tts.convert(text="Hello from Fish Audio!")
save(audio, "out.mp3")
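
Because the client reads its key from the FISH_API_KEY environment variable, a missing key is an easy first failure. A minimal sketch of a pre-flight check, assuming you want a readable error before constructing the client; `require_api_key` is a hypothetical helper and not part of the SDK.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise a readable error.

    Hypothetical helper for illustration; not part of the fishaudio SDK.
    """
    env = os.environ if env is None else env
    key = env.get("FISH_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FISH_API_KEY is not set; export it before creating the client")
    return key
```

Call `require_api_key()` once at startup, before `FishAudio()`, so configuration problems surface immediately rather than on the first request.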

Use a specific voice

Pass a voice model id (reference_id). Find ids in the Voice Library or create your own via Voice Cloning.

audio = client.tts.convert(
    text="This uses a specific voice.",
    reference_id="802e3bc2b27e49c2995d23ef70e6ac89",
)

Implementation details

Models

s2-pro (default): highest quality, multi-speaker, natural-language expression control.
s1: previous generation; expression is controlled with emotion tags written in (parentheses).

In the API, select with the model request header. In Python, pass model="s2-pro". See Choosing a Model.

Output formats

mp3 (default), wav, pcm, opus. Set format (and optionally mp3_bitrate, sample_rate).

from fishaudio.types import TTSConfig

audio = client.tts.convert(
    text="High quality",
    config=TTSConfig(format="wav", sample_rate=44100),
)

Speed & prosody

Adjust speech speed (0.5–2.0) and volume.

audio = client.tts.convert(text="Speaking faster.", speed=1.5)
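
When the speed value comes from user input, it can help to check it against the documented 0.5–2.0 range before sending the request. A minimal sketch; `check_speed` is a hypothetical helper, and whether the API clamps or rejects out-of-range values is not stated here, so this check is an assumption about what you would want client-side.

```python
def check_speed(speed: float, low: float = 0.5, high: float = 2.0) -> float:
    """Return speed unchanged if it lies in [low, high], else raise ValueError."""
    if not low <= speed <= high:
        raise ValueError(f"speed must be between {low} and {high}, got {speed}")
    return speed
```

Used inline, this looks like `client.tts.convert(text="...", speed=check_speed(user_speed))`.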

Generation methods (Python)

The Python SDK exposes three ways to generate, depending on whether you have the full text upfront and how you want to consume the audio:

Method	Returns	Use it for
`tts.convert()`	complete audio `bytes`	most cases — you have the text, you want the file
`tts.stream()`	`AudioStream` (iterate chunks, or `.collect()`)	memory-efficient transfer of large audio; write chunks to disk as they arrive
`tts.stream_websocket()`	iterator of audio `bytes`	text arriving in real time (LLM tokens, live captions)

# Memory-efficient: write each chunk as it arrives instead of buffering
audio_stream = client.tts.stream(text="A very long passage...")
with open("out.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

For real-time text streaming with stream_websocket(), see Realtime Streaming.

Instant voice cloning (reference audio)

Instead of a saved reference_id, pass raw audio plus its transcript to clone a voice on the fly — no training step. Best with a clean 10–30s sample.

from fishaudio.types import ReferenceAudio

with open("sample.wav", "rb") as f:
    audio = client.tts.convert(
        text="Spoken in the reference voice.",
        references=[ReferenceAudio(audio=f.read(), text="Transcript of the sample.")],
    )

To reuse a voice across many requests, clone it once and pass the resulting reference_id instead.
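
Since cloning works best with a 10–30s sample, a quick duration check before uploading can catch clips that are too short or too long. A minimal sketch using only the standard library; the helper names are hypothetical, and Python's `wave` module reads plain PCM WAV files only, so other encodings would need a different reader.

```python
import io
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Duration of an in-memory PCM WAV file, computed from its header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_clone_sample(data: bytes, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """True if the sample length falls inside the suggested 10-30 s window."""
    return min_s <= wav_duration_seconds(data) <= max_s
```

This checks length only; it says nothing about how clean the recording is.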

Format & bitrate

Pick a format for your delivery channel, and tune bitrate to trade size against quality:

Format	Notes
`mp3` (default)	good size/quality balance; set `mp3_bitrate` to `64`, `128`, or `192`
`wav`	uncompressed, highest quality; set `sample_rate` (e.g. `44100`)
`pcm`	raw samples, no container — for low-latency playback and telephony pipelines
`opus`	efficient for streaming; bitrate is automatic (`opus_bitrate=-1000`)
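
To make the size side of the trade-off concrete, file size scales linearly with bitrate: bytes ≈ seconds × kbps × 1000 / 8. A rough sketch, assuming constant-bitrate encoding and ignoring headers and metadata; whether the service encodes at a strictly constant bitrate is an assumption.

```python
def estimate_mp3_bytes(duration_s: float, bitrate_kbps: int) -> int:
    """Approximate size of a constant-bitrate MP3, ignoring headers and tags."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


# One minute of audio at each documented bitrate.
for kbps in (64, 128, 192):
    print(kbps, "kbps ->", estimate_mp3_bytes(60, kbps), "bytes")
# 64 kbps -> 480000 bytes
# 128 kbps -> 960000 bytes
# 192 kbps -> 1440000 bytes
```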

from fishaudio.types import TTSConfig

audio = client.tts.convert(
    text="Smaller file, lower bitrate.",
    config=TTSConfig(format="mp3", mp3_bitrate=64),
)
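
Because pcm output has no container, most audio players will not open it directly. If you need a playable file for debugging, the raw samples can be wrapped in a WAV header with the standard library. A minimal sketch; the sample layout (16-bit, mono) and the 44100 Hz default are assumptions, so confirm the actual PCM layout in the API reference before relying on it. `pcm_to_wav` is a hypothetical helper.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 44100,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container.

    Assumed layout: 16-bit samples (sample_width=2), mono. These defaults are
    illustrative, not taken from the Fish Audio documentation.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

If the wrapped file plays too fast, too slow, or as noise, the assumed sample rate or sample width does not match what was returned.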

Latency & chunk length

latency trades stability for speed; chunk_length controls how much text the engine batches before it starts generating.

latency="balanced" (default) — lower time-to-first-audio (~300ms). Good for interactive use.
latency="normal" — most stable output, at slightly higher latency.
chunk_length (100–300, default 200) — smaller chunks start audio sooner; larger chunks are more efficient for long text.

from fishaudio.types import TTSConfig

audio = client.tts.convert(
    text="Quick, responsive output.",
    config=TTSConfig(latency="balanced", chunk_length=150),
)
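
The guidance above boils down to two profiles: interactive use favors low time-to-first-audio, while long-form text favors stability and efficiency. A small sketch of that decision; `pick_settings` is a hypothetical helper, and the chunk lengths 100 and 300 are illustrative picks from the ends of the documented range, not recommendations from Fish Audio.

```python
from typing import Dict, Union

Settings = Dict[str, Union[str, int]]


def pick_settings(interactive: bool) -> Settings:
    """Map a use case to latency and chunk_length values in the documented ranges."""
    if interactive:
        # Start audio sooner: balanced latency, smallest documented chunk.
        return {"latency": "balanced", "chunk_length": 100}
    # Long-form text: most stable output, largest documented chunk.
    return {"latency": "normal", "chunk_length": 300}
```

The returned dict uses the same keyword names as the TTSConfig example above, so it can be unpacked as `TTSConfig(**pick_settings(interactive=True))`.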

Direct API (MessagePack)

POST /v1/tts also accepts a MessagePack body (Content-Type: application/msgpack) — the path the API reference is built around. Use it to send binary reference audio in the request without base64 overhead, or when you don’t want the SDK.

import os
import httpx
import ormsgpack

payload = {"text": "Hello from the direct API.", "reference_id": "YOUR_VOICE_ID", "format": "mp3"}

resp = httpx.post(
    "https://api.fish.audio/v1/tts",
    content=ormsgpack.packb(payload),
    headers={
        "Authorization": f"Bearer {os.environ['FISH_API_KEY']}",
        "Content-Type": "application/msgpack",
        "model": "s2-pro",
    },
)
resp.raise_for_status()  # fail loudly instead of writing an error body to disk
with open("out.mp3", "wb") as f:
    f.write(resp.content)

The model header is required on every request. JSON and MessagePack accept the same fields.
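
The base64 overhead that MessagePack avoids is easy to quantify: base64 emits 4 output bytes for every 3 input bytes, so binary audio grows by about a third when it has to travel as text. A stdlib-only illustration; the 30,000-byte payload is a made-up stand-in for reference audio.

```python
import base64

audio = bytes(30_000)                # stand-in for 30,000 bytes of reference audio
encoded = base64.b64encode(audio)    # what a text-only body would have to carry

overhead = len(encoded) / len(audio) - 1
print(len(audio), len(encoded))      # 30000 40000
print(f"{overhead:.0%} larger")      # 33% larger
```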

Advanced generation tuning

For finer control, TTSConfig exposes the model’s sampling parameters. The defaults are well-tuned — reach for these only when you need to dial in determinism or curb artifacts.

from fishaudio.types import TTSConfig, Prosody

config = TTSConfig(
    prosody=Prosody(speed=1.1, volume=0),
    temperature=0.7,            # lower = more deterministic
    top_p=0.7,
    repetition_penalty=1.2,     # >1.0 curbs repeated sounds
    max_new_tokens=1024,        # cap audio length per chunk
    normalize=True,             # expand numbers/dates for natural reading
)

audio = client.tts.convert(text="Carefully tuned output.", config=config)

A TTSConfig is reusable — define it once and pass it to many convert() calls. See the full field list for every parameter and default.

Going further

Stream as it generates

Lowest latency for conversational and live apps.

Emotion & expression

Direct delivery with tags and prosody.

Full API parameters

Every field, type, and default.

Python reference

tts.convert / stream / stream_websocket.
