Streaming voice in real time.

Stream synthesised speech sentence by sentence, so a conversational agent can start speaking the first sentence while the rest of the text is still being generated. You can hear this in the live demo today; for your own API key, both HTTP streaming on the REST endpoint and the full-duplex WebSocket protocol described below are on the roadmap.

Streaming API — on the roadmap

You can hear sentence-by-sentence streaming live in the demo today. The streaming API for your own key — both the HTTP stream: true flag and the bi-directional WebSocket below — is rolling out. Today POST /v1/audio/speech returns the full audio in a single response. The contracts below are published so you can build against them ahead of launch.

HTTP streaming (coming soon)

The REST endpoint will stream its response: add stream: true to a POST /v1/audio/speech request and the server emits audio sentence by sentence as it's synthesised, so playback can begin before the full text is rendered — the same behaviour you can hear in the live demo now. This is the contract it will ship with.

javascript
const res = await fetch("https://api.leanvoice.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.LEANVOICE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    voice: "meera",
    input: "Streaming voice, sentence by sentence.",
    format: "pcm_s16le_24k",
    stream: true,
  }),
});

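if (!res.ok) throw new Error(`Speech request failed: ${res.status}`);

// Audio arrives as each sentence is synthesised. Network chunks need not align
// with sentence or sample boundaries, so the player should buffer accordingly.
// audioPlayer is a placeholder for your own playback queue.
const reader = res.body.getReader();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  audioPlayer.push(value); // value is a Uint8Array of raw audio bytes
}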
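
Because HTTP chunk boundaries are set by the network rather than by the audio, a 16-bit sample can end up split across two reads. The following is a minimal Python sketch of one way to handle that, assuming one of the pcm_s16le formats; the SampleAligner name is illustrative and not part of the API.

```python
class SampleAligner:
    """Re-chunks a byte stream so every output holds whole 16-bit samples."""

    BYTES_PER_SAMPLE = 2  # pcm_s16le: one signed 16-bit sample per 2 bytes

    def __init__(self) -> None:
        self._carry = b""  # trailing bytes of an incomplete sample

    def feed(self, chunk: bytes) -> bytes:
        """Return the sample-aligned prefix of carry + chunk; keep the rest."""
        data = self._carry + chunk
        usable = len(data) - (len(data) % self.BYTES_PER_SAMPLE)
        self._carry = data[usable:]
        return data[:usable]


aligner = SampleAligner()
# Three network reads that split samples at odd offsets.
aligned = [aligner.feed(part) for part in (b"\x01", b"\x02\x03", b"\x04\x05\x06")]
```

Each call returns only complete samples, so the result can be handed straight to a 16-bit audio sink; the leftover byte is prepended to the next read.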

WebSocket endpoint (coming soon)

The bi-directional WebSocket protocol below is on the roadmap and not yet available on our current infrastructure. The reference is published so you can build against it ahead of launch.

WSS wss://api.leanvoice.ai/v1/stream

Authenticate by passing two WebSocket sub-protocols when you connect: the literal string bearer, followed by your API key. Query parameters set the per-session defaults; every parameter can be overridden per text frame.

connect
wss://api.leanvoice.ai/v1/stream
  ?voice=meera           # default voice for the session
  &language=en           # default language; auto-detected if omitted
  &format=pcm_s16le_24k  # pcm_s16le_24k | pcm_s16le_16k | mulaw_8k | opus
  &steps=8               # 4 (draft) | 8 (balanced) | 10 (studio)
  &speed=1.0             # 0.5 to 2.0
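
As a rough sketch, the connect URL can be assembled and checked against the documented values before opening the socket. The build_stream_url helper is hypothetical, and it assumes steps accepts only the three listed values.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.leanvoice.ai/v1/stream"
FORMATS = {"pcm_s16le_24k", "pcm_s16le_16k", "mulaw_8k", "opus"}
STEPS = {4, 8, 10}  # draft, balanced, studio (assumed to be the only valid values)


def build_stream_url(voice, language=None, fmt="pcm_s16le_24k", steps=8, speed=1.0):
    """Build the connect URL, validating values against the documented ranges."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if steps not in STEPS:
        raise ValueError(f"steps must be one of {sorted(STEPS)}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    params = {"voice": voice, "format": fmt, "steps": steps, "speed": speed}
    if language is not None:  # omit to let the server auto-detect
        params["language"] = language
    return f"{BASE_URL}?{urlencode(params)}"


url = build_stream_url("meera", language="en")
```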

Audio formats

| Format | Container | Sample rate | Use case |
|---|---|---|---|
| pcm_s16le_24k | raw PCM, little-endian | 24 kHz | Browser playback, high-fidelity agents (default) |
| pcm_s16le_16k | raw PCM, little-endian | 16 kHz | VoIP, narrow-band ASR pipelines |
| mulaw_8k | μ-law | 8 kHz | Telephony, SIP trunks, legacy IVR |
| opus | Ogg/Opus | 48 kHz | WebRTC, low-bandwidth mobile |
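
For the fixed-rate formats, bandwidth and chunk sizes follow directly from the sample rate and sample width. The sketch below works through that arithmetic, assuming mono audio; opus is left out because its bitrate is variable, and the helper names are illustrative.

```python
# (sample rate in Hz, bytes per sample) for the fixed-rate formats.
RAW_FORMATS = {
    "pcm_s16le_24k": (24_000, 2),  # 16-bit PCM
    "pcm_s16le_16k": (16_000, 2),  # 16-bit PCM
    "mulaw_8k": (8_000, 1),        # mu-law stores one byte per sample
}


def bytes_per_second(fmt: str) -> int:
    """Data rate of one mono channel in the given format."""
    rate, width = RAW_FORMATS[fmt]
    return rate * width


def duration_ms(fmt: str, n_bytes: int) -> float:
    """Playback duration of n_bytes of audio in the given format."""
    return 1000 * n_bytes / bytes_per_second(fmt)
```

For example, pcm_s16le_24k works out to 48,000 bytes per second, so a 1,920-byte buffer holds 40 ms of audio.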

Client → server messages

All client messages are JSON. Send input.text frames as the user (or your LLM) produces tokens; the server starts synthesising as soon as the first frame arrives.

input.text

text (string, required)
The next chunk of text to synthesise. Partial sentences are fine; the server buffers fragments and emits audio at natural prosodic boundaries. Maximum 4,000 characters per frame.
voice (string, optional)
Override the session voice for this frame only. Switching mid-sentence is supported, but a small crossfade is inserted to avoid an audible jump.
language (string, optional)
ISO 639-1 code. Use this for code-switched copy (e.g. an English sentence followed by a Hindi one in the same session).

input.commit

Marks the end of the current utterance. The server flushes any in-flight synthesis and emits an audio.done control frame.

input.cancel

Interrupt synthesis mid-sentence. Used for barge-in: the user spoke over the agent, so stop rendering and discard pending audio. Server responds with audio.cancelled.

json
{ "type": "input.text", "text": "Hello, this is your " }
{ "type": "input.text", "text": "voice agent calling." }
{ "type": "input.commit" }
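
The frames above can also be generated programmatically. This sketch splits a text into input.text frames that respect the 4,000-character limit and ends with input.commit; the text_frames helper is hypothetical, and it cuts on character count alone, relying on the server to buffer fragments as described.

```python
import json

MAX_FRAME_CHARS = 4000  # documented per-frame limit


def text_frames(text: str, limit: int = MAX_FRAME_CHARS):
    """Yield JSON input.text frames of at most `limit` characters, then input.commit."""
    for start in range(0, len(text), limit):
        yield json.dumps({"type": "input.text", "text": text[start:start + limit]})
    yield json.dumps({"type": "input.commit"})


# A small limit is used here only to show the splitting on a short string.
frames = list(text_frames("Hello, this is your voice agent calling.", limit=20))
```

Each yielded string can be passed to the socket's send method as-is.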

Server → client messages

Binary frames are raw audio in the format you chose. JSON frames carry control events and timing metadata you can use to drive lip-sync, captions, or barge-in detection.

audio (binary)

A raw audio buffer in the negotiated format. Chunks are roughly 40 ms each. Concatenate them in arrival order; there's no per-chunk header to strip.

audio.meta

Sent before the first audio frame for each utterance. Carries the sample rate, channel count, codec, voice, and a word-level alignment (start and end times in milliseconds) that you can use to drive captions or visemes.

json
{
  "type": "audio.meta",
  "sample_rate": 24000,
  "channels": 1,
  "codec": "pcm_s16le",
  "voice": "meera",
  "alignment": [
    { "word": "Hello", "start_ms": 0,   "end_ms": 320 },
    { "word": "this",  "start_ms": 450, "end_ms": 590 }
  ]
}
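
One way to use the alignment for captions is to look up the word that covers the current playback position. The sketch below assumes end_ms is exclusive and that entries do not overlap; word_at is an illustrative name.

```python
def word_at(alignment, position_ms):
    """Return the word being spoken at position_ms, or None during a gap."""
    for entry in alignment:
        if entry["start_ms"] <= position_ms < entry["end_ms"]:
            return entry["word"]
    return None


# The alignment entries from the audio.meta example.
alignment = [
    {"word": "Hello", "start_ms": 0, "end_ms": 320},
    {"word": "this", "start_ms": 450, "end_ms": 590},
]
```

With this data, a playback position of 100 ms falls inside "Hello", while 400 ms falls in the pause between the two words.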

audio.done

Emitted after the last audio frame for an utterance. Includes the total synthesised duration and the wall-clock time spent synthesising, which you can log for SLA monitoring.

audio.cancelled

Acknowledges an input.cancel. After this frame, no more audio for the cancelled utterance will be sent; you can immediately push new input.text for the next turn.

error

Carries a code and message. See Errors for the full list. Errors are non-fatal unless followed by a connection close.
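
Since binary and JSON frames share one socket, a client typically routes each incoming frame by type first. A minimal sketch of such a router follows; handle_frame and its callbacks are hypothetical names, not part of the API.

```python
import json


def handle_frame(frame, on_audio, on_event):
    """Route one WebSocket frame: bytes are audio, str is a JSON control event."""
    if isinstance(frame, (bytes, bytearray)):
        on_audio(bytes(frame))
        return "audio"
    msg = json.loads(frame)
    on_event(msg)
    return msg["type"]


audio, events = [], []
kinds = [
    handle_frame(f, audio.append, events.append)
    for f in (
        '{"type": "audio.meta", "sample_rate": 24000}',
        b"\x00\x00\x01\x00",
        '{"type": "audio.done"}',
    )
]
```

The returned type string lets the caller decide when to stop reading, for example on audio.done or audio.cancelled.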

Full example

A minimal browser-side WebSocket client that streams a sentence and plays audio chunks through the Web Audio API the moment they arrive. No external dependencies.

javascript
const ctx = new AudioContext({ sampleRate: 24000 });
let nextStart = ctx.currentTime + 0.06;

const ws = new WebSocket(
  "wss://api.leanvoice.ai/v1/stream?voice=meera&format=pcm_s16le_24k",
  ["bearer", "sk_live_..."]
);
ws.binaryType = "arraybuffer";

ws.addEventListener("open", () => {
  ws.send(JSON.stringify({ type: "input.text", text: "Streaming voice, byte by byte." }));
  ws.send(JSON.stringify({ type: "input.commit" }));
});

ws.addEventListener("message", (evt) => {
  if(typeof evt.data === "string"){
    const msg = JSON.parse(evt.data);
    if(msg.type === "audio.done") ws.close();
    return;
  }
  // Decode 16-bit signed PCM to floats in [-1, 1) and schedule gapless playback
  const samples = new Int16Array(evt.data);
  const buf = ctx.createBuffer(1, samples.length, 24000);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < samples.length; i++) ch[i] = samples[i] / 32768;
  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);
  // If the network stalled, never schedule a chunk in the past
  nextStart = Math.max(nextStart, ctx.currentTime);
  src.start(nextStart);
  nextStart += buf.duration;
});

node
import WebSocket from "ws";
import Speaker from "speaker";

const ws = new WebSocket(
  "wss://api.leanvoice.ai/v1/stream?voice=meera&format=pcm_s16le_24k",
  ["bearer", process.env.LEANVOICE_API_KEY]
);

const speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });

ws.on("open", () => {
  ws.send(JSON.stringify({ type: "input.text", text: "Hello from Node." }));
  ws.send(JSON.stringify({ type: "input.commit" }));
});

ws.on("message", (data, isBinary) => {
  if(isBinary) speaker.write(data);
  else {
    const msg = JSON.parse(data.toString());
    if(msg.type === "audio.done") { speaker.end(); ws.close(); }
  }
});

python
import asyncio
import json
import os

import sounddevice as sd
import websockets

async def main():
    url = "wss://api.leanvoice.ai/v1/stream?voice=meera&format=pcm_s16le_24k"
    async with websockets.connect(
        url, subprotocols=["bearer", os.environ["LEANVOICE_API_KEY"]]
    ) as ws:
        await ws.send(json.dumps({"type": "input.text", "text": "Hello from Python."}))
        await ws.send(json.dumps({"type": "input.commit"}))
        # Raw 16-bit mono output stream matching the requested format
        out = sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16")
        out.start()
        try:
            async for msg in ws:
                if isinstance(msg, bytes):
                    out.write(msg)  # binary frame: raw PCM audio
                elif json.loads(msg)["type"] == "audio.done":
                    break
        finally:
            out.stop()
            out.close()

asyncio.run(main())

Throughput & streaming

The model synthesises faster than real time on a CPU — no GPU required — at a real-time factor of about 0.18, roughly 5.5× faster than playback. Because audio streams sentence by sentence, the first sentence starts playing while later sentences are still being synthesised, so perceived wait time stays short on long passages. The live demo on the home page shows the real timing end to end.
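
To make the effect on perceived wait concrete, here is the arithmetic as a short sketch. It uses the real-time factor quoted above, ignores network latency, and the 3-second and 60-second durations are illustrative assumptions.

```python
RTF = 0.18  # real-time factor: synthesis time divided by audio duration


def synthesis_seconds(audio_seconds: float, rtf: float = RTF) -> float:
    """Compute time needed to synthesise the given amount of audio."""
    return audio_seconds * rtf


speedup = 1 / RTF                      # about 5.56x faster than playback
first_sentence = synthesis_seconds(3)  # a 3 s sentence takes about 0.54 s
whole_passage = synthesis_seconds(60)  # a 60 s passage takes about 10.8 s
```

With sentence-level streaming the listener waits roughly for the first sentence (about half a second here) rather than for the whole passage.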

For voice agents

Pair streaming with your LLM's streaming output. As sentences come back from the LLM, send each one for synthesis; audio for the first sentence arrives while the LLM is still finishing later ones, keeping the conversation moving.
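
A sketch of the buffering step, assuming your LLM client yields text tokens as strings: tokens are accumulated until a sentence boundary appears, and each finished sentence is yielded for synthesis. The boundary rule is deliberately naive and the sentences helper is illustrative.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace. This rule is naive;
# abbreviations such as "e.g." will split early.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences(tokens):
    """Buffer streamed LLM tokens and yield each sentence once it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for done in parts[:-1]:
            yield done
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the LLM stops


out = list(sentences(["Hello", " there.", " How are", " you?", " Fine"]))
```

A sentence is released only once the text that follows it starts to arrive, which costs one token of delay but avoids cutting at a full stop that is still mid-token.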

Barge-in & interruption

For natural conversational agents you need to stop speaking the moment the user starts talking. The flow is:

  1. Your VAD or ASR detects user speech.
  2. Send {"type":"input.cancel"} on the WebSocket.
  3. Stop playing audio chunks already buffered on the client.
  4. Wait for the audio.cancelled ack, then start a new utterance with input.text.

The server drops any synthesis still in flight, so a cancelled utterance is billed only for the audio that had already been sent when input.cancel arrived.
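
The steps above can be sketched as a small client-side state machine that discards audio arriving between input.cancel and the audio.cancelled ack. BargeInController and the player interface (push and clear) are hypothetical; adapt them to your own playback queue.

```python
import json


class BargeInController:
    """Tracks whether incoming audio belongs to a cancelled utterance."""

    def __init__(self, send, player):
        self._send = send      # callable that sends a text frame on the socket
        self._player = player  # hypothetical object with push(bytes) and clear()
        self.cancelling = False

    def on_user_speech(self):
        """Steps 2 and 3: request cancellation and drop locally buffered audio."""
        if not self.cancelling:
            self.cancelling = True
            self._send(json.dumps({"type": "input.cancel"}))
            self._player.clear()

    def on_frame(self, frame):
        """Feed every server frame here; returns True once a new turn may start."""
        if isinstance(frame, bytes):
            if not self.cancelling:  # audio racing the cancel is discarded
                self._player.push(frame)
            return False
        if json.loads(frame).get("type") == "audio.cancelled":
            self.cancelling = False  # step 4: ack received
            return True
        return False


class ListPlayer:
    """Stand-in player used only for this demonstration."""

    def __init__(self):
        self.queue = []

    def push(self, chunk):
        self.queue.append(chunk)

    def clear(self):
        self.queue.clear()


sent = []
player = ListPlayer()
ctl = BargeInController(sent.append, player)
ctl.on_frame(b"\x00\x00")   # normal playback
ctl.on_user_speech()        # step 1 fired: the user interrupted
ctl.on_frame(b"\x01\x00")   # late audio for the cancelled utterance, dropped
ready = ctl.on_frame('{"type": "audio.cancelled"}')
```

Once on_frame returns True, the client can send input.text for the next turn.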

Billing

Streaming sessions are metered by the number of characters synthesised, just like the REST endpoint. A cancelled utterance is billed only for the characters whose audio was actually sent. Idle sessions cost nothing; we keep the socket open for free up to one hour of inactivity.