Streaming voice in real time.
Stream synthesised speech sentence by sentence, so a conversational agent can start speaking the first sentence while the rest of the text is still being generated. HTTP streaming on the REST endpoint and the full-duplex WebSocket protocol described below are both on the roadmap.
You can hear sentence-by-sentence streaming live in the demo today. The streaming API for your own key — both the HTTP stream: true flag and the bi-directional WebSocket below — is rolling out. Today POST /v1/audio/speech returns the full audio in a single response. The contracts below are published so you can build against them ahead of launch.
HTTP streaming (coming soon)
The REST endpoint will stream its response: add stream: true to a POST /v1/audio/speech request and the server emits audio sentence by sentence as it's synthesised, so playback can begin before the full text is rendered — the same behaviour you can hear in the live demo now. This is the contract it will ship with.
const res = await fetch("https://api.leanvoice.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.LEANVOICE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    voice: "meera",
    input: "Streaming voice, sentence by sentence.",
    format: "pcm_s16le_24k",
    stream: true,
  }),
});
if (!res.ok) throw new Error(`Speech request failed: ${res.status}`);

// Audio arrives sentence by sentence; play each chunk as it arrives.
// audioPlayer stands in for your own playback queue.
const reader = res.body.getReader();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  audioPlayer.push(value);
}
WebSocket endpoint (coming soon)
The bi-directional WebSocket protocol below is on the roadmap and not yet available on our current infrastructure. The reference is published so you can build against it ahead of launch.
WSS wss://api.leanvoice.ai/v1/stream
Authenticate through the WebSocket sub-protocol list: pass bearer as the first element and your API key as the second. Query parameters set the per-session defaults; every parameter can be overridden per text frame.
wss://api.leanvoice.ai/v1/stream
  ?voice=meera            # default voice for the session
  &language=en            # default language; auto-detected if omitted
  &format=pcm_s16le_24k   # pcm_s16le_24k | pcm_s16le_16k | mulaw_8k | opus
  &steps=8                # 4 (draft) | 8 (balanced) | 10 (studio)
  &speed=1.0              # 0.5 to 2.0
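As a rough sketch of how a client might assemble that URL, the helper below validates the documented ranges before encoding the query string. The function name `build_stream_url` and the choice to reject `steps` values other than 4, 8, and 10 are assumptions made for illustration; the published contract only lists those three presets.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.leanvoice.ai/v1/stream"
FORMATS = {"pcm_s16le_24k", "pcm_s16le_16k", "mulaw_8k", "opus"}
STEPS = {4, 8, 10}  # assumption: only the three listed presets are accepted


def build_stream_url(voice="meera", language=None, fmt="pcm_s16le_24k",
                     steps=8, speed=1.0):
    """Build a session URL from the documented query parameters (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    if steps not in STEPS:
        raise ValueError(f"steps must be one of {sorted(STEPS)}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    params = {"voice": voice, "format": fmt, "steps": steps, "speed": speed}
    if language is not None:
        # Omitting language leaves auto-detection on.
        params["language"] = language
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling `build_stream_url(language="en", speed=1.2)` yields a URL you can hand to any WebSocket client together with the `bearer` sub-protocol pair.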
Audio formats
| Format | Container | Sample rate | Use case |
|---|---|---|---|
| pcm_s16le_24k | raw PCM, little-endian | 24 kHz | Browser playback, high-fidelity agents (default) |
| pcm_s16le_16k | raw PCM, little-endian | 16 kHz | VoIP, narrow-band ASR pipelines |
| mulaw_8k | μ-law | 8 kHz | Telephony, SIP trunks, legacy IVR |
| opus | Ogg/Opus | 48 kHz | WebRTC, low-bandwidth mobile |
Client → server messages
All client messages are JSON. Send input.text frames as the user (or your LLM) produces tokens; the server starts synthesising as soon as the first frame arrives.
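input.text
Carries a fragment of text to synthesise in its text field. Fragments accumulate into the current utterance until you send input.commit.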
input.commit
Marks the end of the current utterance. The server flushes any in-flight synthesis and emits an audio.done control frame.
input.cancel
Interrupts synthesis mid-sentence. Use it for barge-in: when the user speaks over the agent, the server stops rendering, discards pending audio, and responds with audio.cancelled.
{ "type": "input.text", "text": "Hello, this is your " }
{ "type": "input.text", "text": "voice agent calling." }
{ "type": "input.commit" }
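The same sequence can be produced programmatically. The sketch below turns a list of text fragments into the JSON frames shown above and finishes with a commit; `utterance_frames` is a hypothetical helper name, and skipping empty fragments is our own choice, not something the protocol requires.

```python
import json


def utterance_frames(fragments):
    """Yield one input.text frame per non-empty fragment, then input.commit."""
    for text in fragments:
        if text:  # assumption: empty fragments are not worth a frame
            yield json.dumps({"type": "input.text", "text": text})
    yield json.dumps({"type": "input.commit"})
```

Each yielded string is ready to pass to your WebSocket client's send method.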
Server → client messages
Binary frames are raw audio in the format you chose. JSON frames carry control events and timing metadata you can use to drive lip-sync, captions, or barge-in detection.
audio (binary)
A raw audio buffer in the negotiated format. Chunks are roughly 40 ms each. Concatenate them in arrival order; there's no per-chunk header to strip.
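To size playback buffers, it helps to know how many bytes a 40 ms chunk occupies. The sketch below works that out for the uncompressed formats, assuming mono audio (as in the audio.meta example) and treating 40 ms as a nominal figure. Opus is left out because its frames are compressed and vary in size. The helper names are hypothetical.

```python
# Sample rate (Hz) and bytes per sample for the uncompressed formats.
SAMPLE_RATE = {"pcm_s16le_24k": 24000, "pcm_s16le_16k": 16000, "mulaw_8k": 8000}
BYTES_PER_SAMPLE = {"pcm_s16le_24k": 2, "pcm_s16le_16k": 2, "mulaw_8k": 1}


def chunk_bytes(fmt, chunk_ms=40):
    """Approximate size in bytes of one mono chunk of chunk_ms milliseconds."""
    samples = SAMPLE_RATE[fmt] * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[fmt]


def duration_ms(fmt, n_bytes):
    """Playback duration in milliseconds of n_bytes of mono audio."""
    return n_bytes * 1000 / (SAMPLE_RATE[fmt] * BYTES_PER_SAMPLE[fmt])
```

For the default format, `chunk_bytes("pcm_s16le_24k")` gives 1920 bytes, that is 960 samples of two bytes each.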
audio.meta
Sent before the first audio frame for each utterance. Carries the sample rate, channel count, codec, and a word-level alignment (start and end times in milliseconds) that you can use to drive captions or visemes.
{
  "type": "audio.meta",
  "sample_rate": 24000,
  "channels": 1,
  "codec": "pcm_s16le",
  "voice": "meera",
  "alignment": [
    { "word": "Hello", "start_ms": 0, "end_ms": 320 },
    { "word": "this", "start_ms": 450, "end_ms": 590 }
  ]
}
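One way to use the alignment for captions is to look up which word is being spoken at the current playback position. The sketch below does a simple linear scan; `word_at` is a hypothetical helper, and treating `end_ms` as exclusive is our own convention.

```python
def word_at(alignment, t_ms):
    """Return the word spoken at t_ms, or None during a gap between words."""
    for entry in alignment:
        # Half-open interval: start_ms inclusive, end_ms exclusive (assumption).
        if entry["start_ms"] <= t_ms < entry["end_ms"]:
            return entry["word"]
    return None


# Alignment copied from the audio.meta example above.
alignment = [
    {"word": "Hello", "start_ms": 0, "end_ms": 320},
    {"word": "this", "start_ms": 450, "end_ms": 590},
]
```

With that data, `word_at(alignment, 100)` returns "Hello", while a position of 400 ms falls in the pause between the two words and returns None.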
audio.done
Emitted after the last audio frame for an utterance. Includes the total synthesised duration and the wall-clock time spent synthesising, which you can log for SLA monitoring.
audio.cancelled
Acknowledges an input.cancel. After this frame, no more audio for the cancelled utterance will be sent; you can immediately push new input.text for the next turn.
error
Carries a code and message. See Errors for the full list. Errors are non-fatal unless followed by a connection close.
Full example
A minimal browser-side WebSocket client that streams a sentence and plays audio chunks through the Web Audio API the moment they arrive, with no external dependencies, followed by equivalent Node.js and Python clients.
// Browser
const ctx = new AudioContext({ sampleRate: 24000 });
let nextStart = ctx.currentTime + 0.06; // small lead so the first chunk isn't clipped

const ws = new WebSocket(
  "wss://api.leanvoice.ai/v1/stream?voice=meera&format=pcm_s16le_24k",
  ["bearer", "sk_live_..."]
);
ws.binaryType = "arraybuffer";

ws.addEventListener("open", () => {
  ws.send(JSON.stringify({ type: "input.text", text: "Streaming voice, byte by byte." }));
  ws.send(JSON.stringify({ type: "input.commit" }));
});

ws.addEventListener("message", (evt) => {
  // Text frames are JSON control events; binary frames are audio.
  if (typeof evt.data === "string") {
    const msg = JSON.parse(evt.data);
    if (msg.type === "audio.done") ws.close();
    return;
  }

  // Decode 16-bit signed PCM and schedule playback
  const samples = new Int16Array(evt.data);
  const buf = ctx.createBuffer(1, samples.length, 24000);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < samples.length; i++) ch[i] = samples[i] / 32768;

  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);
  // If the network stalled, don't schedule a chunk in the past.
  nextStart = Math.max(nextStart, ctx.currentTime);
  src.start(nextStart);
  nextStart += buf.duration;
});
// Node.js (requires the ws and speaker packages)
import WebSocket from "ws";
import Speaker from "speaker";

const ws = new WebSocket(
  "wss://api.leanvoice.ai/v1/stream?voice=meera&format=pcm_s16le_24k",
  ["bearer", process.env.LEANVOICE_API_KEY]
);
const speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });

ws.on("open", () => {
  ws.send(JSON.stringify({ type: "input.text", text: "Hello from Node." }));
  ws.send(JSON.stringify({ type: "input.commit" }));
});

ws.on("message", (data, isBinary) => {
  if (isBinary) {
    // Binary frame: raw PCM, written straight to the output device.
    speaker.write(data);
  } else {
    const msg = JSON.parse(data.toString());
    if (msg.type === "audio.done") {
      speaker.end();
      ws.close();
    }
  }
});
# Python (requires the websockets and sounddevice packages)
import asyncio
import json
import os

import sounddevice as sd
import websockets


async def stream():
    url = "wss://api.leanvoice.ai/v1/stream?voice=meera&format=pcm_s16le_24k"
    async with websockets.connect(
        url, subprotocols=["bearer", os.environ["LEANVOICE_API_KEY"]]
    ) as ws:
        await ws.send(json.dumps({"type": "input.text", "text": "Hello from Python."}))
        await ws.send(json.dumps({"type": "input.commit"}))

        # Raw 16-bit mono output at the negotiated sample rate.
        out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16")
        out.start()
        async for msg in ws:
            if isinstance(msg, bytes):
                out.write(msg)  # binary frame: raw PCM audio
            elif json.loads(msg)["type"] == "audio.done":
                break
        out.stop()
        out.close()


asyncio.run(stream())
Throughput & streaming
The model synthesises faster than real time on a CPU — no GPU required — at a real-time factor of about 0.18, roughly 5.5× faster than playback. Because audio streams sentence by sentence, the first sentence starts playing while later sentences are still being synthesised, so perceived wait time stays short on long passages. The live demo on the home page shows the real timing end to end.
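To make the arithmetic concrete, the sketch below estimates how long a listener waits for the first audio when a passage is rendered in full versus streamed sentence by sentence. It uses the real-time factor of 0.18 quoted above and deliberately ignores network latency and queueing, so treat the numbers as a lower bound. The function name and the example sentence durations are illustrative only.

```python
RTF = 0.18  # synthesis time divided by audio duration, as quoted above


def time_to_first_audio(sentence_seconds, streamed, rtf=RTF):
    """Estimated seconds of synthesis before playback can start.

    sentence_seconds: spoken duration of each sentence, in order.
    streamed: True if audio is delivered after the first sentence,
              False if the whole passage must be rendered first.
    """
    if not sentence_seconds:
        return 0.0
    audio = sentence_seconds[0] if streamed else sum(sentence_seconds)
    return audio * rtf
```

For three sentences lasting 3, 5, and 4 seconds, full rendering needs about 12 × 0.18 ≈ 2.16 s before playback, while streaming needs about 3 × 0.18 ≈ 0.54 s.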
Pair streaming with your LLM's streaming output. As sentences come back from the LLM, send each one for synthesis; audio for the first sentence arrives while the LLM is still finishing later ones, keeping the conversation moving.
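A minimal sketch of that pairing is a buffer that accumulates LLM tokens and releases each sentence as soon as it is complete. The `SentenceBuffer` class below is a hypothetical helper, and its boundary rule (a `.`, `!`, or `?` followed by whitespace) is a deliberately naive assumption that will also split on abbreviations such as "e.g. ".

```python
import re

# A sentence ends at ., ! or ? followed by whitespace (naive assumption).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceBuffer:
    """Accumulates streamed tokens and releases complete sentences."""

    def __init__(self):
        self._pending = ""

    def feed(self, token):
        """Add a token; return the list of sentences it completed."""
        self._pending += token
        parts = _SENTENCE_END.split(self._pending)
        # The last part is still unfinished, so keep it for the next token.
        self._pending = parts.pop()
        return parts

    def flush(self):
        """Return whatever is left once the LLM stream ends."""
        rest, self._pending = self._pending.strip(), ""
        return [rest] if rest else []
```

Send each sentence returned by `feed` for synthesis immediately, and call `flush` once the LLM finishes so the final sentence is not lost.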
Barge-in & interruption
For natural conversational agents you need to stop speaking the moment the user starts talking. The flow is:
- Your VAD or ASR detects user speech.
- Send {"type":"input.cancel"} on the WebSocket.
- Stop playing audio chunks already buffered on the client.
- Wait for the audio.cancelled ack, then start a new utterance with input.text.
The server drops any synthesis still in flight, so you won't be billed for cancelled audio past the moment of input.cancel.
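On the client, the flow above amounts to a small piece of state: once you send input.cancel, drop any audio that is still in transit until the ack arrives. The sketch below models that with a hypothetical `BargeInGate` class; the `playback` list stands in for your real audio queue, and the class is not part of any SDK.

```python
import json


class BargeInGate:
    """Discards audio received between input.cancel and audio.cancelled."""

    def __init__(self):
        self.cancelling = False
        self.playback = []  # stand-in for a real audio output queue

    def cancel_frame(self):
        """Clear buffered audio and return the frame to send to the server."""
        self.cancelling = True
        self.playback.clear()
        return json.dumps({"type": "input.cancel"})

    def on_frame(self, frame):
        """Handle one incoming frame; return its kind for logging."""
        if isinstance(frame, (bytes, bytearray)):
            # Audio still in flight for the cancelled utterance is dropped.
            if not self.cancelling:
                self.playback.append(bytes(frame))
            return "audio"
        msg = json.loads(frame)
        if msg.get("type") == "audio.cancelled":
            # Ack received: safe to start the next utterance.
            self.cancelling = False
        return msg.get("type")
```

When your VAD fires, send the result of `cancel_frame()` over the socket, keep feeding incoming frames to `on_frame`, and send new input.text only after `cancelling` is False again.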
Billing
Streaming sessions are metered by the number of characters synthesised, just like the REST endpoint. A cancelled utterance is billed only for the characters whose audio was actually sent. Idle sessions cost nothing; we keep the socket open for free up to one hour of inactivity.