Telephony-grade audio (8 kHz) for IVR and phone

Prerequisites

Create a Fish Audio account

Go to fish.audio/auth/signup
Fill in your details to create an account, then complete the steps to verify it.
Log in to your account and navigate to the API section

Get your API key

Once you have an account, you’ll need an API key to authenticate your requests.

Log in to your Fish Audio Dashboard
Navigate to the API Keys section
Click “Create New Key”, give it a descriptive name, and set an expiration if desired
Copy your key and store it securely

Keep your API key secret! Never commit it to version control or share it publicly.
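
One way to follow this advice is to read the key from an environment variable instead of hardcoding it. The sketch below assumes the key is exported as FISH_API_KEY, the variable name used in the curl example on this page. The helper name load_api_key is hypothetical and is not part of the Fish Audio SDK.

```python
import os


def load_api_key(env_var: str = "FISH_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    # Assumption: the key was exported in the shell, e.g. `export FISH_API_KEY=...`
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it in your shell instead of hardcoding it"
        )
    return key
```

Failing early with an explicit error tends to be easier to debug than an authentication failure returned later by the API.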

Recipe

Phone networks carry narrowband audio at 8 kHz. Generating at a higher rate just forces the carrier to downsample on the way through — wasting bandwidth and often softening the result. Synthesize at 8 kHz directly and the bytes are ready to hand to your IVR or SIP stack. Set the sample rate on TTSConfig (it is not a top-level argument) and write the WAV to disk.

from fishaudio import FishAudio
from fishaudio.types import TTSConfig
from fishaudio.utils import save

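# No key is passed explicitly; this assumes the client picks it up
# from the environment (for example FISH_API_KEY).
client = FishAudio()

# sample_rate belongs on TTSConfig, not on convert() itself.
audio = client.tts.convert(
    text="Thank you for calling. Press one to speak with an agent.",
    config=TTSConfig(format="wav", sample_rate=8000),
)

# Write the returned audio to disk.
save(audio, "out.wav")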
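
Before handing the file to an IVR, it can be worth confirming that the header really says 8 kHz. The following is a minimal sketch using Python's standard wave module. It assumes out.wav is a plain PCM WAV with a complete header; the helper names describe_wav and is_telephony_ready are illustrative and not part of the SDK.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read the WAV header and return the fields that matter for telephony."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def is_telephony_ready(info: dict) -> bool:
    """True when the file is mono and sampled at 8 kHz."""
    return info["sample_rate"] == 8000 and info["channels"] == 1
```

Calling is_telephony_ready(describe_wav("out.wav")) after the save step gives a quick sanity check before the file reaches the phone system.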

The output is a mono 8 kHz WAV, which matches the 8 kHz sampling rate that G.711 telephony uses. For a headerless stream to feed straight into a SIP or RTP pipeline, switch to raw PCM with format="pcm"; the sample rate stays on TTSConfig.

audio = client.tts.convert(
    text="Thank you for calling. Press one to speak with an agent.",
    config=TTSConfig(format="pcm", sample_rate=8000),
)
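
RTP senders usually transmit audio in fixed-duration packets, so a raw PCM buffer has to be cut into frames first. The sketch below shows the arithmetic for 20 ms frames. It assumes the PCM output has been collected into a bytes object and that samples are 16-bit mono; check both against the API's actual output. The names frame_size_bytes and iter_frames are hypothetical helpers.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit samples, mono
FRAME_MS = 20         # a common RTP packetization interval


def frame_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    frame_ms: int = FRAME_MS,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> int:
    """Bytes in one frame: samples per frame times bytes per sample."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield fixed-size frames, padding the final one with zero bytes (silence)."""
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        yield frame.ljust(frame_bytes, b"\x00")
```

At 8 kHz a 20 ms frame holds 160 samples, or 320 bytes at 16 bits per sample. G.711 itself carries 8-bit companded samples (μ-law or A-law), so a real pipeline would still need that encoding step, which this sketch does not perform.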

API (curl)

curl --request POST https://api.fish.audio/v1/tts \
  --header "Authorization: Bearer $FISH_API_KEY" \
  --header "Content-Type: application/json" \
  --header "model: s2-pro" \
  --data '{ "text": "Thank you for calling. Press one to speak with an agent.", "format": "wav", "sample_rate": 8000 }' \
  --output out.wav
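
For environments without curl, the same request can be assembled with Python's standard library. This is a sketch that mirrors the endpoint, headers, and JSON body of the curl command above and nothing more; build_tts_request is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import json
import urllib.request

TTS_URL = "https://api.fish.audio/v1/tts"


def build_tts_request(
    text: str, api_key: str, sample_rate: int = 8000
) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {"text": text, "format": "wav", "sample_rate": sample_rate}
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "model": "s2-pro",  # same model header as the curl example
        },
        method="POST",
    )
```

To send it, pass the request to urllib.request.urlopen and write the response body to out.wav in binary mode.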

8 kHz discards everything above ~4 kHz, so plosives and sibilance lose detail. Keep prompts short and articulate, and reserve higher sample rates (16/24 kHz) for VoIP or recordings that never touch the legacy phone network.
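
The ~4 kHz ceiling follows from the Nyquist limit: a signal sampled at rate fs can only represent frequencies up to fs / 2. The short illustration below is pure Python and makes no API calls. It shows that a 5 kHz tone sampled at 8 kHz produces exactly the samples of an inverted 3 kHz tone, so the two cannot be told apart once sampled. The function names are illustrative.

```python
import math


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def sample_tone(freq_hz: float, sample_rate_hz: int, n_samples: int) -> list:
    """Sample a unit-amplitude sine wave at the given rate."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


tone_5k = sample_tone(5000, 8000, 64)
tone_3k = sample_tone(3000, 8000, 64)

# At 8 kHz, a 5 kHz tone is the sample-for-sample negative of a 3 kHz tone,
# so the sum of the two sequences is zero up to floating-point error.
max_diff = max(abs(a + b) for a, b in zip(tone_5k, tone_3k))
```

This is why content above 4 kHz has to be filtered out before or during 8 kHz synthesis: it cannot survive as itself, only as a lower-frequency artifact.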

