Build a voice agent loop: speech in, reply, speech out

Prerequisites

Create a Fish Audio account

Go to fish.audio/auth/signup
Fill in your details to create an account, then complete the steps to verify it.
Log in to your account and navigate to the API section

Get your API key

Once you have an account, you’ll need an API key to authenticate your requests.

Log in to your Fish Audio Dashboard
Navigate to the API Keys section
Click “Create New Key”, give it a descriptive name, and set an expiration if desired
Copy your key and store it securely

Keep your API key secret! Never commit it to version control or share it publicly.
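
One common way to keep the key out of source control is to read it from an environment variable at startup. The sketch below is a minimal example of that pattern. The helper name load_api_key and the variable name FISH_API_KEY are assumptions made for illustration, so check the SDK documentation for the name the client actually reads.

```python
import os

def load_api_key(env_var: str = "FISH_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    The default variable name is an assumption, not a documented SDK setting.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Fish Audio API key."
        )
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error on the first request.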

Recipe

A voice agent is three stages chained together: asr.transcribe() turns the caller’s audio into text, your own LLM turns that text into a reply, and tts.stream() turns the reply back into speech. The transcript and the reply are just strings, so the only Fish Audio-specific parts are the first and last calls. Streaming the reply lets you start writing (or forwarding) audio before the whole sentence is synthesized.

from fishaudio import FishAudio
from fishaudio.utils import save

client = FishAudio()

def reply_from_llm(text: str) -> str:
    # ---- PLACEHOLDER ----
    # Call your own LLM here and return its reply as a string.
    # e.g. return openai_client.chat.completions.create(...).choices[0].message.content
    return f"You said: {text}. How can I help?"

def voice_agent_turn(audio_path: str, out_path: str) -> str:
    with open(audio_path, "rb") as f:
        heard = client.asr.transcribe(audio=f.read())

    reply = reply_from_llm(heard.text)

    audio_stream = client.tts.stream(text=reply, reference_id="<voice-id>")
    save(audio_stream, out_path)  # writes chunks as they arrive
    return reply

reply = voice_agent_turn("speech.wav", "reply.mp3")
print("Agent:", reply)

heard is an ASRResponse: heard.text is the full transcript and heard.duration is the clip length in seconds. Pass language="en" to transcribe() to skip auto-detection when you already know the input language.
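
The recipe hands the stream to save(), which writes it to a file. If you would rather forward the audio somewhere else (a socket, a playback queue), the same idea can be sketched as a loop over the chunks. This assumes tts.stream() yields bytes chunks, which the recipe's "writes chunks as they arrive" comment suggests but does not confirm. forward_chunks and sink are hypothetical names.

```python
from typing import Callable, Iterable

def forward_chunks(chunks: Iterable[bytes], sink: Callable[[bytes], None]) -> int:
    """Pass each non-empty audio chunk to sink as it arrives; return total bytes sent."""
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive style chunks
        sink(chunk)
        total += len(chunk)
    return total

# Stand-in for client.tts.stream(...): any iterable of bytes works here.
fake_stream = [b"ID3", b"", b"\x00\x01"]
received = []
sent = forward_chunks(fake_stream, received.append)
```

In a real agent, sink might be a file object's write method or a function that pushes bytes to the caller's connection.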

For the lowest latency, feed your LLM’s token stream straight into stream_websocket() instead of waiting for the full reply string — see Realtime: LLM tokens → speech.
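
Streaming LLM APIs often emit empty or None deltas alongside the text, so a small adapter can help before you hand the stream to a TTS call. The sketch below is one way to do that, and it also keeps the full reply for logging. The name text_pieces is hypothetical, and how the resulting generator is passed to stream_websocket() is not shown because its signature is not covered in this recipe.

```python
from typing import Iterable, Iterator, List, Optional

def text_pieces(deltas: Iterable[Optional[str]], transcript: List[str]) -> Iterator[str]:
    """Yield only non-empty text deltas, recording each one in transcript."""
    for piece in deltas:
        if piece:  # drops None and "" deltas
            transcript.append(piece)
            yield piece

# Stand-in for an LLM token stream.
fake_deltas = [None, "Hel", "", "lo", None, "!"]
spoken: List[str] = []
pieces = list(text_pieces(fake_deltas, spoken))
full_reply = "".join(spoken)
```

Because text_pieces is a generator, nothing is consumed from the LLM stream until the downstream consumer asks for the next piece, so no extra buffering is added.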

Reply in the caller’s voice

reference_id points the reply at a saved voice. Drop it to use the default voice, or clone the caller’s voice from the same clip you just transcribed by passing references instead — see Instant voice cloning.
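
Switching between a saved voice and the default voice can be sketched as a helper that only includes reference_id when a voice is set. The text and reference_id parameter names come from the recipe above. The helper name tts_stream_kwargs is hypothetical, and the references option for cloning is left out because its shape is not described here.

```python
from typing import Any, Dict, Optional

def tts_stream_kwargs(reply: str, voice_id: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for tts.stream(), omitting reference_id for the default voice."""
    kwargs: Dict[str, Any] = {"text": reply}
    if voice_id:
        kwargs["reference_id"] = voice_id
    return kwargs
```

The call in the recipe would then become client.tts.stream(**tts_stream_kwargs(reply, voice_id)).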

