About · LeanVoice

How it started.

Late 2025. We were prototyping a voice agent for a customer-support workflow. The model worked. The conversation worked. Then we priced out a month of audio at ElevenLabs and Cartesia, multiplied by a hundred concurrent users, and the number stopped us cold. Lifelike speech, the kind a real customer actually wants to hear, cost more per minute than the entire rest of the stack put together.

That math doesn't just fail for a scrappy startup. It fails for an established product team rolling voice out across a million users, a public-service app trying to reach citizens in their own language, and a research group trying to ship anything at all. So we went looking for what it would take to run a competitive TTS model on a normal CPU, in a normal cloud region, at a price nobody flinches at.

Three months of benchmarking, distillation, and infrastructure work later, you're looking at the result. A 117-million-parameter model, eight synthesis steps, twenty-three languages, no GPU anywhere. Two tenths of a cent per minute. Three dollars buys about 1,500 minutes of speech.

That's the whole project. No round, no roadmap deck, no Series A theatre. Just a price that finally makes voice agents buildable, for everyone.

Three things we won't back off from.

Pay only for what you use.

No plans. No slabs. No monthly minimums. You're billed per second of audio, every second. A team that ships nothing this month pays nothing this month.
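
As a rough illustration of per-second billing, here is a minimal sketch. It assumes the rate implied by "three dollars buys about 1,500 minutes"; the helper name `bill_usd` is hypothetical and this is not the production billing code.

```python
from fractions import Fraction

# Assumed rate, derived from "three dollars buys about 1,500 minutes".
PRICE_PER_MINUTE_USD = Fraction(3, 1500)


def bill_usd(audio_seconds: int) -> Fraction:
    """Charge for a number of seconds of generated audio (hypothetical helper)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Exact arithmetic: seconds -> minutes -> dollars, with no rounding drift.
    return PRICE_PER_MINUTE_USD * audio_seconds / 60


print(float(bill_usd(0)))       # a month with no audio costs nothing
print(float(bill_usd(90_000)))  # 1,500 minutes of audio
```

Exact fractions are used so that many small per-second charges add up to the same total as one large charge.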

One price, fair everywhere.

Two tenths of a cent a minute, the same number whether you're a solo developer prototyping over the weekend or a public-service agency rolling voice out across millions of users. No tiered haggling, no enterprise surcharge for the same bytes.

Engineering, not marketing.

Every claim on this site is reproducible. Benchmarks, the model card, the inference graph: all published, not hidden in a deck. If we stop being honest about the work, the project has failed.

Four engineering choices.

Most lifelike voice models lean on billion-parameter networks and racks of GPUs. We took the opposite route. The full technical write-up lives on How it works; here's the gist.

I · DISTILLATION

Big teacher to small student.

A large, expressive teacher model trained on diverse speech is distilled into a roughly 20× smaller student. The warmth survives; the bulk doesn't.

II · FLOW MATCHING

Eight steps, not a hundred.

Flow matching learns a near-straight path from noise to speech. Eight Euler steps produce a clean result, where diffusion baselines often need a hundred or more.
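
To make the eight Euler steps concrete, here is a minimal sketch of fixed-step Euler integration of dx/dt = v(x, t) from noise at t = 0 to a sample at t = 1. The velocity function below is a toy stand-in for a perfectly straight path to a fixed target, not the trained network, and the names `euler_sample`, `toy_velocity`, and `TARGET` are illustrative assumptions.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, steps: int = 8) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                      # t runs over 0, 1/steps, ..., (steps-1)/steps
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned velocity field (assumption): on a straight path
# from x to TARGET, the remaining displacement is covered in the remaining time.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(target_i - xi) / (1.0 - t) for target_i, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
sample = euler_sample(toy_velocity, noise, steps=8)
print(sample)  # lands on TARGET up to floating-point error
```

When the path is exactly straight, Euler integration has no discretisation error, which is why so few steps can be enough; a real learned field is only near-straight, so the step count trades speed against quality.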

III · CPU NATIVE

Compiled for ordinary processors.

The graph is exported to ONNX with operator fusion, multi-threaded execution, and tuned kernels. No CUDA, no GPU, no specialised hardware.
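
For a sense of what CPU-only, multi-threaded execution of an ONNX graph can look like, here is a minimal sketch. It assumes ONNX Runtime as the runtime, which the text above does not specify; the function name, the thread count, and the model path are placeholders.

```python
def make_cpu_session(model_path: str, threads: int = 8):
    """Open an ONNX model for CPU-only, multi-threaded inference (sketch)."""
    # Third-party dependency, imported lazily so this module loads without it.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    # Enable all graph optimisations, which include operator (node) fusions.
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    # Threads used to parallelise work inside a single operator.
    opts.intra_op_num_threads = threads
    # Restrict execution to the CPU provider: no CUDA, no GPU.
    return ort.InferenceSession(
        model_path,
        sess_options=opts,
        providers=["CPUExecutionProvider"],
    )


# Usage, with a hypothetical file name:
# session = make_cpu_session("model.onnx", threads=8)
```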

IV · SHARED INFRA

One box serves dozens.

A single 8-core cloud instance produces 9 to 12 seconds of audio every second. Spread that across concurrent calls and a minute of speech settles in fractions of a cent.
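
The arithmetic behind that claim can be sketched as follows. The hourly instance price is a hypothetical placeholder, not a quoted cloud rate, and the calculation assumes the instance is kept fully busy, so real-world costs would sit somewhat higher.

```python
def cost_per_audio_minute(instance_usd_per_hour: float,
                          realtime_factor: float) -> float:
    """Infrastructure cost of one minute of audio on a fully utilised instance."""
    # An instance producing `realtime_factor` seconds of audio per second
    # produces 60 * realtime_factor minutes of audio per hour.
    audio_minutes_per_hour = 60.0 * realtime_factor
    return instance_usd_per_hour / audio_minutes_per_hour


# Assumption: placeholder price for an 8-core instance.
HYPOTHETICAL_INSTANCE_USD_PER_HOUR = 0.40

for factor in (9, 12):
    usd = cost_per_audio_minute(HYPOTHETICAL_INSTANCE_USD_PER_HOUR, factor)
    print(f"{factor}x real-time: {usd * 100:.3f} cents per audio minute")
```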

Want the full technical write-up? Parameter counts, real-time-factor measurements across hardware, the INT8 quantisation experiment, and the velocity-field equation, all on the research page.

Read the technical note →

Things people actually ask.

Is the audio quality really comparable to the big APIs?

On standard multilingual word-error-rate benchmarks the model lands within rounding distance of 2B-parameter open systems, despite being roughly 20× smaller. On expressiveness, it's competitive for most conversational and narrative use cases. For high-end voiceover work where every breath matters, the premium APIs still have an edge. The samples page exists so you can judge with your own ears.

How is it possible to price this so low?

Two reasons. First, the model is small enough to run on a CPU you already rent by the hour, so each new minute of audio is nearly free once the box is paid for. Second, we distilled a large teacher network into a student roughly 20× smaller, swapped diffusion for an 8-step flow-matching decoder, and exported the whole graph to ONNX, which together cut compute by roughly an order of magnitude without losing voice quality. The pricing page breaks each engineering choice down.

Is it really OpenAI-compatible?

Yes. The endpoint is POST /v1/audio/speech with the same request body shape as OpenAI's TTS API. If you're using the openai SDK, change base_url to ours and keep the rest of your code. WAV bytes stream back the same way.
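
As an illustration of the request shape, here is a minimal sketch using only the Python standard library. The base URL, model name, voice identifier, and API key are placeholders; only the path and the OpenAI-style body fields (`model`, `input`, `voice`, `response_format`) follow the description above.

```python
import json
import urllib.request

# Assumptions: placeholder host and model name, not real values.
BASE_URL = "https://api.example.com"
PLACEHOLDER_MODEL = "placeholder-model"


def build_speech_request(text: str, voice: str, api_key: str,
                         base_url: str = BASE_URL,
                         model: str = PLACEHOLDER_MODEL) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/audio/speech request."""
    body = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request: urllib.request.Request) -> bytes:
    """Send the request and return the raw audio bytes."""
    with urllib.request.urlopen(request) as response:
        return response.read()


# The voice id "arjun" is a guess at how the voice named Arjun is addressed.
request = build_speech_request("Hello from a CPU.", voice="arjun",
                               api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the body easy to inspect before any audio is billed.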

What languages and voices does it support?

Twenty-three languages, with first-class quality on English, Hindi, and the major European languages; Korean and Japanese are also included. Ten distinct voices ship out of the box (Arjun, Vikram, Kabir, Meera, Anaya, Saanvi, and others). Expression tags like <laugh>, <breath>, and <sigh> work inline.

Can I self-host or use a private deployment?

Yes, on Enterprise terms. The whole engine is a container that runs on a 2 vCPU / 4 GiB box, so private deployment in your own VPC is straightforward. Drop us a line if that's the path you want.

What's the latency like for real-time voice agents?

Audio is streamed sentence by sentence, so playback starts before the full text has finished rendering. On a single CPU core the engine generates audio roughly six times faster than real-time at 8 synthesis steps; a "Draft" quality setting hits 12× and a "Pristine" setting hits 3×. That throughput, plus sentence-by-sentence streaming, keeps it comfortably real-time for conversational voice agents — all without a GPU.
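
To see why sentence-by-sentence streaming matters, here is a rough sketch that estimates how long the first sentence takes to render at each speed-up quoted above. The speaking rate, the setting keys, and the sentence splitter are simplifying assumptions, and network and queueing delays are ignored.

```python
import re
from typing import List

# Speed-ups relative to real time, as quoted above; the key names are illustrative.
SPEEDUP = {"draft": 12.0, "default": 6.0, "pristine": 3.0}

# Assumption: about 150 spoken words per minute.
ASSUMED_WORDS_PER_SECOND = 2.5


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def time_to_first_audio(text: str, speedup: float) -> float:
    """Estimated seconds of compute before the first sentence can play."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    audio_seconds = len(sentences[0].split()) / ASSUMED_WORDS_PER_SECOND
    return audio_seconds / speedup


reply = ("Thanks for calling. I can see your order shipped yesterday. "
         "Would you like the tracking link?")
for name, speedup in SPEEDUP.items():
    delay = time_to_first_audio(reply, speedup)
    print(f"{name}: ~{delay:.2f} s before playback can start")
```

Only the first sentence sits on the critical path; later sentences render while earlier ones are playing, as long as the engine stays faster than real time.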

How do I get an API key?

For now, drop your email at the bottom of the home page or here. We're in invite-only beta while we firm up billing and rate-limiting; new accounts are reviewed within a working day.

Who's behind this?

A small, distributed team of researchers and engineers. We ship code instead of press releases. If you've got a real use case and want to talk through it, the email at the bottom of the home page reaches us directly.

Lifelike voice should cost cents, not dollars.

Build something worth listening to.