Pay only for what you use.
No plans. No slabs. No monthly minimums. You're billed per second of audio, every second. A team that ships nothing this month pays nothing this month.
LeanVoice is a voice AI company building lifelike speech synthesis the rest of the industry forgot to make affordable. One simple price, twenty-three languages, anywhere on the planet.
Late 2025. We were prototyping a voice agent for a customer-support workflow. The model worked. The conversation worked. Then we priced out a month of audio at ElevenLabs and Cartesia, multiplied by a hundred concurrent users, and the number stopped us cold. Lifelike speech, the kind a real customer actually wants to hear, cost more per minute than the entire rest of the stack put together.
That math doesn't just fail for a scrappy startup. It fails for an established product team rolling voice out across a million users, a public-service app trying to reach citizens in their own language, and a research group trying to ship anything at all. So we went looking for what it would take to run a competitive TTS model on a normal CPU, in a normal cloud region, at a price nobody flinches at.
Three months of benchmarking, distillation, and infrastructure work later, you're looking at the result. A 117-million-parameter model, eight synthesis steps, twenty-three languages, no GPU anywhere. Two tenths of a cent per minute. Three dollars buys about 1,500 minutes of speech.
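The arithmetic behind those numbers is short enough to check yourself. A minimal sketch, assuming billing is metered in whole seconds at the quoted rate; the function names are illustrative and not part of any LeanVoice SDK.

```python
from decimal import Decimal

# Rate quoted on this page: two tenths of a cent per minute of audio.
PRICE_PER_MINUTE_USD = Decimal("0.002")


def audio_cost_usd(seconds: int) -> Decimal:
    """Cost of `seconds` of audio, assuming whole-second metering (an assumption)."""
    return PRICE_PER_MINUTE_USD * seconds / 60


def minutes_for_budget(budget_usd: str) -> Decimal:
    """How many minutes of audio a given budget buys at the quoted rate."""
    return Decimal(budget_usd) / PRICE_PER_MINUTE_USD


if __name__ == "__main__":
    print("Minutes for $3:", minutes_for_budget("3"))
    print("Cost of a 45-second reply (USD):", audio_cost_usd(45))
```

Decimal is used instead of float so that small per-second amounts add up without rounding drift.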
That's the whole project. No round, no roadmap deck, no Series A theatre. Just a price that finally makes voice agents buildable, for everyone.
Two tenths of a cent a minute, the same number whether you're a solo developer prototyping over the weekend or a public-service agency rolling voice across millions of users. No tiered hand-shaking, no enterprise surcharge for the same bytes.
Every claim on this site is reproducible. Benchmarks, the model card, the inference graph — published, not shadowed in a deck. If we stop being honest about the work, the project has failed.
Most lifelike voice models lean on billion-parameter networks and racks of GPUs. We took the opposite route. The full technical write-up lives on How it works; here's the gist.
A large, expressive teacher model trained on diverse speech is distilled into a roughly 20× smaller student. The warmth survives; the bulk doesn't.
Flow matching learns a near-straight path from noise to speech. Eight Euler steps produce a clean result, where diffusion baselines often need a hundred or more.
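To see why a near-straight path needs so few steps, here is a toy sketch of fixed-step Euler sampling. It is not the LeanVoice model: the velocity function below is a hand-written stand-in that points straight at a fixed target, where a real system would call a trained network. All names are illustrative.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityField = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityField, noise: Vector, steps: int = 8) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityField:
    """Stand-in for a trained network: a perfectly straight path to `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        # On a straight path, the remaining displacement is covered in the
        # remaining time (1 - t). t never reaches 1 inside euler_sample.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Toy "speech" target and a seeded Gaussian noise start.
target = [0.25, -0.5, 1.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(make_toy_velocity(target), noise, steps=8)
```

Because the toy path is exactly straight, Euler integration lands on the target after any number of steps. A learned path is only nearly straight, so a handful of steps is a trade-off between speed and accuracy rather than a free lunch.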
The graph is exported to ONNX with operator fusion, multi-threaded execution, and tuned kernels. No CUDA, no GPU, no specialised hardware.
A single 8-core cloud instance produces 9 to 12 seconds of audio every second. Spread that across concurrent calls and a minute of speech settles in fractions of a cent.
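The throughput claim turns into a compute cost with one division. A rough sketch, assuming the instance is kept fully busy; the hourly instance price is a hypothetical placeholder, not a figure from this page, so swap in your own provider's rate.

```python
def audio_minutes_per_instance_hour(realtime_factor: float) -> float:
    """Minutes of audio one instance produces per wall-clock hour.

    A real-time factor of 9 means 9 seconds of audio per second of compute,
    which is 9 minutes of audio per minute, or 9 * 60 minutes per hour.
    """
    return realtime_factor * 60.0


def compute_cost_per_audio_minute(instance_usd_per_hour: float,
                                  realtime_factor: float) -> float:
    """Compute cost of one audio minute on a fully utilised instance."""
    return instance_usd_per_hour / audio_minutes_per_instance_hour(realtime_factor)


# HYPOTHETICAL hourly price for an 8-core instance (assumption, not from this page).
ASSUMED_INSTANCE_USD_PER_HOUR = 0.40

if __name__ == "__main__":
    for factor in (9.0, 12.0):
        minutes = audio_minutes_per_instance_hour(factor)
        cost = compute_cost_per_audio_minute(ASSUMED_INSTANCE_USD_PER_HOUR, factor)
        print(f"{factor:.0f}x real time: {minutes:.0f} audio minutes/hour, "
              f"${cost:.5f} compute per audio minute")
```

Idle capacity raises the effective cost per minute, which is why spreading load across concurrent calls matters.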
POST /v1/audio/speech with the same request body shape as OpenAI's TTS API. If you're using the openai SDK, change base_url to ours and keep the rest of your code. WAV bytes stream back the same way.
<laugh>, <breath>, and <sigh> work inline.
Pay-as-you-go, billed by the second. Get a key in your inbox; ship audio by lunch.
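For a feel of the OpenAI-compatible request described above, here is a standard-library sketch. The host, model id, and voice name are hypothetical placeholders, since this page does not state them; the body fields (model, input, voice, response_format) and the Bearer header follow OpenAI's speech endpoint. With the openai SDK, the equivalent would be passing base_url when constructing the client.

```python
import json
import urllib.request

# HYPOTHETICAL values: replace with the real host, model id, and voice name.
BASE_URL = "https://api.example.com/v1"
ASSUMED_MODEL = "example-model"
ASSUMED_VOICE = "example-voice"


def build_speech_request(text: str, api_key: str,
                         model: str = ASSUMED_MODEL,
                         voice: str = ASSUMED_VOICE) -> urllib.request.Request:
    """Build a POST /v1/audio/speech request in the OpenAI TTS body shape."""
    payload = {
        "model": model,
        "input": text,  # inline tags such as <laugh> travel as plain text
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text: str, api_key: str, path: str) -> None:
    """Send the request and stream the returned WAV bytes to disk."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(8192):
            out.write(chunk)
```

Calling synthesize_to_file("Nice to meet you <laugh> really.", "YOUR_KEY", "hello.wav") would write the audio to hello.wav once the placeholders point at a real endpoint.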