API reference
The live TTS WebSocket protocol.
Cookbooks
LLM-to-speech and voice agents.
Best practices
Tuning latency for production.
When to use it
Voice agents
Conversational AI where time-to-first-audio matters.
LLM to speech
Speak tokens as your model produces them — no waiting for the full reply.
Live narration
Long-form content that should start playing immediately.
Interactive apps
Anywhere a few hundred milliseconds of latency is noticeable.
Stream text you already have
When you have the full string, stream the audio chunks as they are generated and write or play them immediately. With curl, --no-buffer writes each chunk as it arrives instead of waiting for the full response.
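The consuming side of this pattern can be sketched as follows. This is a minimal sketch, not the SDK's own example: `fake_audio_stream` is a hypothetical stand-in for the iterable of byte chunks that a streaming call such as tts.stream is described as producing, and the exact call shape of tts.stream is an assumption to check against the Python reference.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def write_as_received(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each audio chunk to `sink` as soon as it arrives; return total bytes."""
    total = 0
    for chunk in chunks:
        sink.write(chunk)
        sink.flush()  # push bytes out now instead of waiting for a full buffer
        total += len(chunk)
    return total


def fake_audio_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for the chunk iterator a streaming TTS call yields."""
    yield b"\x00\x01"
    yield b"\x02\x03\x04"


# In real use, `sink` would be an open file or an audio player's input pipe.
buffer = io.BytesIO()
written = write_as_received(fake_audio_stream(), buffer)
```

Because the function only needs an iterable of bytes, the same loop should work with any streaming source that yields chunks incrementally.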
Stream from an LLM
When text arrives token by token, feed a generator to stream_websocket. It opens a WebSocket, sends text as you produce it, and yields audio chunks back, so speech keeps pace with your model.
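The generator-in, audio-out shape can be illustrated without a network connection. In the sketch below, `llm_tokens` and `fake_stream_websocket` are hypothetical stand-ins (the real stream_websocket talks to the service and buffers text before generating audio); the point is only that text is pulled lazily, so audio can be produced before the text generator is exhausted.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which things happen


def llm_tokens() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    for token in ["Hello", ", ", "world", "."]:
        events.append(f"text:{token}")
        yield token


def fake_stream_websocket(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for stream_websocket: pulls text lazily, yields one chunk per piece."""
    for piece in text_chunks:
        audio = piece.encode("utf-8")  # placeholder bytes, not real audio
        events.append("audio")
        yield audio


audio_chunks = list(fake_stream_websocket(llm_tokens()))
```

The `events` log alternates between text and audio entries, which is the interleaving that lets playback begin while the model is still writing.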
Implementation details
Which mode to use
- HTTP streaming (tts.stream): you have the full text up front and want low time-to-first-audio. Simplest option.
- WebSocket (tts.stream_websocket): text is still being produced (LLM output, live captions). Lets you start speaking before the sentence is finished.
Lower the latency further
- Use a streaming-friendly format like mp3 or pcm.
- Keep the connection warm for back-to-back generations.
- Pair with a cloned voice via reference_id (see Voice Cloning).
Control where audio generates
The WebSocket buffers incoming text and generates audio once it has enough context for natural-sounding speech, so you don’t need to batch tokens yourself. When you do want a clean break (the end of a sentence, a deliberate pause, or the end of a turn), yield a FlushEvent to force generation immediately. Wrap text in a TextEvent if you prefer explicit events over bare strings.
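One way to place flushes is to emit one whenever a token ends a sentence. The sketch below uses local stand-in classes named after the SDK's TextEvent and FlushEvent so that it runs on its own; the `text` field name and the sentence-ending rule are assumptions, and in real code you would import the event types from the SDK instead.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class TextEvent:
    """Local stand-in for the SDK's TextEvent (the `text` field name is assumed)."""
    text: str


@dataclass
class FlushEvent:
    """Local stand-in for the SDK's FlushEvent."""


# A token "ends a sentence" if it finishes with ., ! or ? (optionally followed by spaces).
SENTENCE_END = re.compile(r"[.!?]\s*$")


def with_sentence_flushes(
    tokens: Iterable[str],
) -> Iterator[Union[TextEvent, FlushEvent]]:
    """Wrap each token in a TextEvent and add a FlushEvent after each sentence end."""
    for token in tokens:
        yield TextEvent(text=token)
        if SENTENCE_END.search(token):
            yield FlushEvent()


stream_events = list(with_sentence_flushes(["Hi", " there.", " How", " are you?"]))
```

Flushing only at sentence boundaries leaves mid-sentence buffering to the service, which is where it has the most context to work with.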
Tune latency vs. quality
Both streaming paths take a latency mode:
- latency="balanced" (default): lowest time-to-first-audio. Use it for voice agents and live LLM output.
- latency="normal": slightly higher latency, best audio quality. Use it for narration where you can afford a beat.
A TTSConfig with chunk tuning gives finer control. Smaller chunks emit audio sooner (lower latency); larger chunks give the model more context (smoother prosody).
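The trade-off can be captured as a pair of presets. The sketch below is illustrative only: `StreamSettings` is a hypothetical holder, not the SDK's TTSConfig, and the `chunk_length` field name and its numeric values are assumptions to check against the Python reference before use.

```python
from dataclasses import dataclass

VALID_LATENCY_MODES = ("balanced", "normal")  # the two modes named above


@dataclass(frozen=True)
class StreamSettings:
    """Hypothetical settings holder; not the SDK's TTSConfig."""
    latency: str = "balanced"
    chunk_length: int = 200  # assumed field name and value, for illustration only

    def __post_init__(self) -> None:
        # Reject typos early instead of sending an invalid mode to the service.
        if self.latency not in VALID_LATENCY_MODES:
            raise ValueError(f"unknown latency mode: {self.latency!r}")
        if self.chunk_length <= 0:
            raise ValueError("chunk_length must be positive")


PRESETS = {
    # Smaller chunks: audio starts sooner, at some cost to prosody.
    "voice_agent": StreamSettings(latency="balanced", chunk_length=100),
    # Larger chunks: more context per generation, at some cost to latency.
    "narration": StreamSettings(latency="normal", chunk_length=300),
}
```

Keeping the presets in one place makes it easy to compare them side by side when tuning for a specific app.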
Stream asynchronously
For asyncio apps, AsyncFishAudio exposes the same streaming methods. stream_websocket accepts an async generator, so you can pipe an async LLM client straight into speech.
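The async shape mirrors the synchronous one. In this minimal sketch, `llm_tokens` and `fake_stream_websocket` are hypothetical stand-ins for an async LLM client and for the async stream_websocket method; only the async-generator plumbing is meant to carry over.

```python
import asyncio
from typing import AsyncIterator


async def llm_tokens() -> AsyncIterator[str]:
    """Hypothetical stand-in for an async LLM client's token stream."""
    for token in ["Streaming", " works", "."]:
        await asyncio.sleep(0)  # yield control, as a real network client would
        yield token


async def fake_stream_websocket(
    text_chunks: AsyncIterator[str],
) -> AsyncIterator[bytes]:
    """Stand-in for the async stream_websocket: consumes text, yields audio bytes."""
    async for piece in text_chunks:
        yield piece.encode("utf-8")  # placeholder bytes, not real audio


async def collect_audio() -> bytes:
    """Pipe the token stream into the synthesizer and gather the output."""
    audio = bytearray()
    async for chunk in fake_stream_websocket(llm_tokens()):
        audio.extend(chunk)
    return bytes(audio)


result = asyncio.run(collect_audio())
```

In an application you would play or forward each chunk inside the `async for` loop rather than collecting everything first.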
Direct API (no SDK)
Token-level streaming runs over the WebSocket endpoint, and the SDK’s stream_websocket() handles framing for you. To speak the protocol directly, send MessagePack frames over the socket. The same application/msgpack payload format also works for one-shot HTTP streaming, where it is faster to serialize than JSON for large reference audio.
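The sequence of client frames can be sketched as plain dictionaries. The event names used here (start, text, stop) and the request fields are assumptions for illustration, so confirm them against the WebSocket reference. Each dictionary would then be serialized with a MessagePack encoder (for example `msgpack.packb(frame)` from the third-party msgpack package) before being sent as a binary message.

```python
from typing import Any, Dict, Iterable, Iterator


def client_frames(
    text_pieces: Iterable[str], request: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    """Yield the frames for one session: open it, stream the text, then close it."""
    # Assumed opening frame: carries the generation settings for the session.
    yield {"event": "start", "request": dict(request)}
    # One frame per piece of text, sent as soon as the piece is available.
    for piece in text_pieces:
        yield {"event": "text", "text": piece}
    # Assumed closing frame: tells the server no more text is coming.
    yield {"event": "stop"}


frames = list(
    client_frames(["Hello, ", "world."], {"format": "mp3", "latency": "balanced"})
)
```

Building frames separately from encoding and sending them keeps the message sequence easy to test without opening a socket.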
Going further
Text to Speech
Voices, formats, and prosody for every generation.
WebSocket reference
The live TTS protocol, message by message.
Streaming best practices
Tuning latency for production voice apps.
Python reference
tts.stream and tts.stream_websocket.
