s2-pro and s1 models. Pick a voice, choose a format, and go — from the API directly, the Python library, or JavaScript.
Use it in the web app
No code — type, pick a voice, generate.
API reference
Every parameter for POST /v1/tts.
Cookbooks
Ready-made recipes: streaming, telephony, and more.
When to use it
Voiceovers & narration
Audiobooks, explainers, ads, and video narration.
Conversational AI
Speak an assistant’s replies — pair with streaming for low latency.
Accessibility & IVR
Read content aloud, phone menus, notifications.
Custom voices
Speak in a cloned voice you own.
Quick start
Send text, get back audio. Choose your implementation:
Use a specific voice
Pass a voice model id (reference_id). Find ids in the Voice Library or create your own via Voice Cloning.
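As a minimal sketch of the request described above, the following builds (but does not send) a JSON POST /v1/tts call using only the Python standard library. The base URL `https://api.fish.audio`, the `Authorization: Bearer` scheme, the `text` field name, and the helper name `build_tts_request` are assumptions that this page does not confirm; `reference_id`, `format`, and the `model` header come from the text.

```python
import json
import urllib.request

API_BASE = "https://api.fish.audio"  # assumption: base URL is not stated on this page


def build_tts_request(text, api_key, reference_id=None, model="s2-pro", fmt="mp3"):
    """Build a POST /v1/tts request without sending it (hypothetical helper)."""
    payload = {"text": text, "format": fmt}  # "text" field name is an assumption
    if reference_id is not None:
        payload["reference_id"] = reference_id  # voice model id, e.g. from the Voice Library
    return urllib.request.Request(
        f"{API_BASE}/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
            "Content-Type": "application/json",
            "model": model,  # the model is chosen through a request header
        },
    )


req = build_tts_request("Hello!", api_key="YOUR_API_KEY", reference_id="YOUR_VOICE_ID")
# To send it: audio = urllib.request.urlopen(req).read()
```

Omitting `reference_id` leaves the voice choice to the service; passing it pins the request to one saved voice.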
Implementation details
Models
s2-pro (default): highest quality, multi-speaker, natural-language expression control.
s1: previous generation, (parenthesis) emotion tags.
Select the model with the model request header. In Python, pass model="s2-pro". See Choosing a Model.
Output formats
mp3 (default), wav, pcm, opus. Set format (and optionally mp3_bitrate, sample_rate).
Speed & prosody
Adjust speech speed (0.5–2.0) and volume.
Generation methods (Python)
The Python SDK exposes three ways to generate, depending on whether you have the full text upfront and how you want to consume the audio:

| Method | Returns | Use it for |
|---|---|---|
| tts.convert() | complete audio bytes | most cases: you have the text, you want the file |
| tts.stream() | AudioStream (iterate chunks, or .collect()) | memory-efficient transfer of large audio; write chunks to disk as they arrive |
| tts.stream_websocket() | iterator of audio bytes | text arriving in real time (LLM tokens, live captions) |
For stream_websocket(), see Realtime Streaming.
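The "write chunks to disk as they arrive" pattern can be sketched without the SDK. In the example below, `fake_stream` is a hypothetical stand-in for any iterable of audio byte chunks (such as the one tts.stream() is described as returning), and `write_chunks` is an illustrative helper, not part of the library.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def fake_stream(total_bytes: int = 10_000, chunk_size: int = 4096) -> Iterator[bytes]:
    """Stand-in for a chunked audio stream; yields zero-filled byte chunks."""
    for start in range(0, total_bytes, chunk_size):
        yield bytes(min(chunk_size, total_bytes - start))


def write_chunks(chunks: Iterable[bytes], path: Path) -> int:
    """Write each chunk as it arrives, so only one chunk is held in memory at a time."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    return written


out_path = Path(tempfile.gettempdir()) / "tts_stream_demo.bin"
total = write_chunks(fake_stream(), out_path)
```

With a real stream, the same loop keeps peak memory at one chunk regardless of how long the audio is, which is the trade-off against collecting the full result first.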
Instant voice cloning (reference audio)
Instead of a saved reference_id, pass raw audio plus its transcript to clone a voice on the fly, with no training step. Best with a clean 10–30s sample.
reference_id instead.
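Before sending a sample for instant cloning, it can help to confirm that it falls in the recommended 10–30s range. The sketch below does this for WAV input with the standard-library `wave` module; the helper names are hypothetical, it checks length only (not how clean the recording is), and `make_silent_wav` exists purely to produce test input.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def is_good_clone_sample(wav_bytes: bytes, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """True if the sample length is within the recommended 10-30s window."""
    return min_s <= wav_duration_seconds(wav_bytes) <= max_s


def make_silent_wav(seconds: int, sample_rate: int = 16000) -> bytes:
    """Generate a mono 16-bit silent WAV, used here only as demo input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * (seconds * sample_rate))
    return buf.getvalue()


sample = make_silent_wav(12)
ok = is_good_clone_sample(sample)
```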
Format & bitrate
Pick a format for your delivery channel, and tune bitrate to trade size against quality:

| Format | Notes |
|---|---|
| mp3 (default) | good size/quality balance; set mp3_bitrate to 64, 128, or 192 |
| wav | uncompressed, highest quality; set sample_rate (e.g. 44100) |
| pcm | raw samples, no container; for low-latency playback and telephony pipelines |
| opus | efficient for streaming; bitrate is automatic (opus_bitrate=-1000) |
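The constraints in the table can be checked client-side before a request is made. The helper below is a hypothetical sketch: `format_options` is not an SDK function, and it only encodes the rules listed above (the four formats and the three mp3 bitrates). Rejecting mp3_bitrate for non-mp3 formats is a design choice of this sketch, not documented API behavior.

```python
FORMATS = ("mp3", "wav", "pcm", "opus")
MP3_BITRATES = (64, 128, 192)


def format_options(fmt="mp3", mp3_bitrate=None, sample_rate=None):
    """Return request fields for the chosen output format, validating the documented values."""
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}, got {fmt!r}")
    opts = {"format": fmt}
    if mp3_bitrate is not None:
        if fmt != "mp3":
            raise ValueError("mp3_bitrate only applies to format='mp3'")
        if mp3_bitrate not in MP3_BITRATES:
            raise ValueError(f"mp3_bitrate must be one of {MP3_BITRATES}")
        opts["mp3_bitrate"] = mp3_bitrate
    if sample_rate is not None:
        opts["sample_rate"] = sample_rate
    return opts


podcast = format_options("mp3", mp3_bitrate=192)
master = format_options("wav", sample_rate=44100)
```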
Latency & chunk length
latency trades stability for speed; chunk_length controls how much text the engine batches before it starts generating.
latency="balanced" (default): lower time-to-first-audio (~300ms). Good for interactive use.
latency="normal": most stable output, at slightly higher latency.
chunk_length (100–300, default 200): smaller chunks start audio sooner; larger chunks are more efficient for long text.
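One way to keep these two settings together is a small immutable value object that enforces the documented ranges. `StreamTuning` below is a hypothetical illustration, not an SDK class; it assumes the fields are sent under the names `latency` and `chunk_length` as written above.

```python
from dataclasses import dataclass

LATENCY_MODES = ("balanced", "normal")


@dataclass(frozen=True)
class StreamTuning:
    """Latency settings with the defaults and ranges given in the text."""
    latency: str = "balanced"
    chunk_length: int = 200

    def __post_init__(self):
        if self.latency not in LATENCY_MODES:
            raise ValueError(f"latency must be one of {LATENCY_MODES}")
        if not 100 <= self.chunk_length <= 300:
            raise ValueError("chunk_length must be between 100 and 300")

    def as_fields(self) -> dict:
        """Return the settings as request fields."""
        return {"latency": self.latency, "chunk_length": self.chunk_length}


interactive = StreamTuning(chunk_length=100)                  # start audio sooner
long_form = StreamTuning(latency="normal", chunk_length=300)  # favor stability and efficiency
```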
Direct API (MessagePack)
POST /v1/tts also accepts a MessagePack body (Content-Type: application/msgpack) — the path the API reference is built around. Use it to send binary reference audio in the request without base64 overhead, or when you don’t want the SDK.
The model header is required on every request. JSON and MessagePack accept the same fields.
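The base64 overhead mentioned above is easy to quantify. JSON has no binary type, so reference audio has to be carried as base64 text, which expands every 3 bytes into 4 characters; MessagePack has a native binary type and carries the bytes as they are. The sketch below uses zero-filled bytes as a stand-in for a real audio sample.

```python
import base64

raw_audio = bytes(30_000)              # stand-in for 30,000 bytes of reference audio
encoded = base64.b64encode(raw_audio)  # what a JSON body would have to carry

extra_bytes = len(encoded) - len(raw_audio)
overhead_ratio = len(encoded) / len(raw_audio)

print(f"raw: {len(raw_audio)} bytes, base64: {len(encoded)} bytes "
      f"(+{extra_bytes}, x{overhead_ratio:.2f})")
```

The roughly one-third size increase applies to every reference clip in the request, so it matters most when cloning from reference audio on each call.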
Advanced generation tuning
For finer control, TTSConfig exposes the model’s sampling parameters. The defaults are well-tuned; reach for these only when you need to dial in determinism or curb artifacts.
TTSConfig is reusable — define it once and pass it to many convert() calls. See the full field list for every parameter and default.
Going further
Stream as it generates
Lowest latency for conversational and live apps.
Emotion & expression
Direct delivery with tags and prosody.
Full API parameters
Every field, type, and default.
Python reference
tts.convert / stream / stream_websocket.
