Pricing

Pay only for what you use. Two tenths of a cent a minute.

No plans, no slabs, no monthly minimums. Billed per second of audio you generate, across any of our twenty-three languages and six voices.

0.22¢ / 1k characters
$2.20 / million characters
Per minute of audio · $0.002 · two tenths of a cent
Per hour of audio · $0.12 · twelve cents
Per 1 million chars · $2.20 · about 1,100 minutes

All 23 languages and every voice included. About 50× cheaper than ElevenLabs Multilingual, 18× cheaper than Cartesia Sonic, 8× cheaper than Amazon Polly Neural. Three dollars buys roughly 1,500 minutes of speech. First 10,000 characters free every month. Billed by the second, no minimums, cancel any time.
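The character-based arithmetic can be sketched in a few lines. This is an illustration only: the function name `estimate_monthly_cost` is hypothetical, and it assumes the free 10,000 characters are subtracted first and the remainder is charged pro rata at $2.20 per million characters, with no rounding step.

```python
from decimal import Decimal

# Figures quoted in the pricing section.
PRICE_PER_MILLION_CHARS = Decimal("2.20")  # USD
FREE_CHARS_PER_MONTH = 10_000


def estimate_monthly_cost(chars_used: int) -> Decimal:
    """Estimated USD charge for one month of usage.

    Assumption (not confirmed by the pricing page): the free allowance
    is deducted first and the rest is billed pro rata, unrounded.
    """
    if chars_used < 0:
        raise ValueError("chars_used must be non-negative")
    billable = max(0, chars_used - FREE_CHARS_PER_MONTH)
    return PRICE_PER_MILLION_CHARS * billable / Decimal(1_000_000)


if __name__ == "__main__":
    for chars in (5_000, 250_000, 1_010_000):
        print(f"{chars:>9,} chars -> ${estimate_monthly_cost(chars):.4f}")
```

Using `Decimal` rather than floats keeps the cents exact, which matters when the unit price is a fraction of a cent.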

How it's this cheap

Engineering, not arithmetic.

This is not a loss leader, an introductory rate, or a thin reseller margin. It's the price you pay when you rebuild lifelike voice synthesis from first principles, ship every layer in-house, and run the whole thing on commodity CPUs. Six engineering choices, each taking a multiple off the bill, until two tenths of a cent a minute became a comfortable price.

01
−4× compute

Distilled, not throttled.

The teacher model is a 460M-parameter flow-matching giant. We distil it down to a 117M-parameter student that ships everything you can hear, including breath, laugh, sigh, prosody, and pitch contour, but at a quarter of the FLOPs. No quality knob, no "lite" tier.

02
CPU-native

Runs without a GPU.

Most "neural" TTS quietly rents A100s in the basement and passes the bill along. We ship an ONNX Runtime graph that hits 9–12× realtime on an ordinary 8-core cloud CPU. No CUDA, no driver lock-in, no $3/hour accelerator surcharge that you ultimately pay for.
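To put the 9–12× realtime figure in concrete terms, here is a rough sketch of the compute time behind one minute of audio. The helper name `compute_seconds` is hypothetical, and the calculation assumes the speed multiple holds steady for the whole request.

```python
def compute_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock seconds needed to synthesise `audio_seconds` of speech.

    `speed_multiple` is how many times faster than realtime synthesis
    runs, e.g. 9.0 means nine seconds of audio per second of compute.
    """
    if speed_multiple <= 0:
        raise ValueError("speed_multiple must be positive")
    return audio_seconds / speed_multiple


if __name__ == "__main__":
    # The two ends of the quoted 9-12x range, for one minute of audio.
    for multiple in (9.0, 12.0):
        secs = compute_seconds(60.0, multiple)
        print(f"{multiple:.0f}x realtime: {secs:.1f} s of compute per audio minute")
```

At those speeds a minute of speech costs roughly five to seven seconds of CPU time on the 8-core machine.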

03
8 steps · not 50

Flow matching, not diffusion.

Classic diffusion TTS takes 30 to 50 denoising passes for studio quality. Our flow-matching decoder needs eight, and four is still usable. Every step we don't run is latency you don't wait for and cost you don't pay for. At eight steps, the whole inference loop is about six times tighter than a published 50-step baseline.
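Why so few steps can be enough is easiest to see on a toy problem. The sketch below is not the production decoder: it is a generic fixed-step Euler integrator, and `toy_velocity` is a hypothetical stand-in for a trained network, chosen so that every trajectory is a straight line toward a fixed `TARGET`. On straight paths, Euler integration lands on the target with eight steps, or even four.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int = 8) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # largest t used is 1 - dt, so (1 - t) never hits zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target; a real model would predict velocities from text.
TARGET: Vector = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Velocity of the straight-line path from the start point to TARGET."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


if __name__ == "__main__":
    noise = [0.0, 0.0, 0.0]
    print(euler_sample(toy_velocity, noise, steps=8))
    print(euler_sample(toy_velocity, noise, steps=4))
```

A trained velocity field is only approximately straight, which is why step count still trades off against quality in practice.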

04
Streams sentence-by-sentence

Streamed, not batched.

The first audio chunk leaves the server before the rest of the text has been synthesised. Audio streams back sentence by sentence over HTTP, so the client can start playback while the remaining sentences are still being generated. No waiting for the whole paragraph, no buffering pause.
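The sentence-by-sentence pattern can be sketched as a generator. Everything here is illustrative: `fake_synthesise` is a hypothetical placeholder that returns the sentence's bytes instead of audio, and the splitter is a simplified regex that only knows `.`, `!` and `?`, so it would need extending for scripts with other sentence terminators.

```python
import re
from typing import Iterator, List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?' (simplified rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def fake_synthesise(sentence: str) -> bytes:
    """Hypothetical stand-in for the synthesiser; returns placeholder bytes."""
    return sentence.encode("utf-8")


def stream_speech(text: str) -> Iterator[bytes]:
    """Yield one chunk per sentence as soon as that sentence is ready."""
    for sentence in split_sentences(text):
        yield fake_synthesise(sentence)


if __name__ == "__main__":
    for chunk in stream_speech("Hello there. How are you? Fine!"):
        # A client could begin playback here, before later chunks exist.
        print(len(chunk), chunk)
```

Because the generator is lazy, the first chunk is available after one sentence of work; an HTTP server could hand the same iterator to a chunked response.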

05
23 langs · 1 model

One model, every language.

Most TTS providers ship a different model per language and charge you separately for "multilingual". We trained the phonemiser, the acoustic model, and the vocoder once across twenty-three languages, so there's nothing to swap, nothing to warm up, and nothing extra to bill when a Hindi voice answers a French question.

06
$0 cold-start

Scales to zero, wakes in two seconds.

The whole serving stack fits in roughly 500 MB and a single Python process. Container cold-starts complete in about two seconds, which means traffic that idles can run on serverless workers that bill by the second. No standing GPU bill, no warm-pool overhead, no "minimum committed throughput".

Every number above is from production benchmarks, not a slide deck. The full methodology, model card, and per-machine latency table live on the research page.

Hear it first

Numbers are nice. Voices are better.

Open the studio, type a line, and listen to what two tenths of a cent actually sounds like.

Open the studio