Lifelike speech, as a drop-in API — at 0.3¢ per 1,000 characters. The price voice needs before it can be everywhere, not just on a phone line.


Every roadmap points at the same place: an agent on a phone line. Meanwhile the real surface — voice across every app, aisle, factory floor, kiosk and workflow, billions of moments a day — sits unbuilt, because no one has priced it for that scale.

One consumer app, a million voice interactions a day (~30s each). Same usage, same quality — only the price per character differs:
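As a rough illustration of that comparison, the arithmetic can be sketched in a few lines. The sketch uses only figures quoted on this page (0.3¢ per 1,000 characters, 1,000 characters being about 75 seconds of speech, and a premium tier about 50× higher). It also assumes that all 30 seconds of each interaction are synthesized speech. The function names are illustrative, and the premium rate is derived from the 50× multiple rather than from any vendor's price list.

```python
# Back-of-envelope daily speech-synthesis cost for the scenario in the text.
# Assumptions: 1,000 characters ~ 75 s of speech, every second of an
# interaction is synthesized, and "premium" means 50x the 0.3 cent rate.

SECONDS_PER_1K_CHARS = 75      # "a minute and a quarter" per 1,000 characters
LEAN_USD_PER_1K = 0.003        # 0.3 cents per 1,000 characters
PREMIUM_MULTIPLE = 50          # "~50x below premium"


def chars_for(seconds: float) -> float:
    """Characters needed to fill the given number of seconds of speech."""
    return seconds * 1000 / SECONDS_PER_1K_CHARS


def daily_cost_usd(interactions: int, seconds_each: float, usd_per_1k: float) -> float:
    """Total daily synthesis cost in US dollars."""
    total_chars = interactions * chars_for(seconds_each)
    return total_chars / 1000 * usd_per_1k


if __name__ == "__main__":
    lean = daily_cost_usd(1_000_000, 30, LEAN_USD_PER_1K)
    premium = daily_cost_usd(1_000_000, 30, LEAN_USD_PER_1K * PREMIUM_MULTIPLE)
    print(f"lean: ${lean:,.0f}/day   premium: ${premium:,.0f}/day")
```

Under these assumptions a 30-second interaction is about 400 characters, so a million interactions a day cost roughly $1,200 at 0.3¢ and roughly $60,000 at a 50× premium rate.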

Real-time voice runs at batch size one: one caller, one stream. That workload is memory-bound, not compute-bound. A live call uses a fraction of a percent of the chip's compute, yet still demands sub-500ms latency. You can't batch your way to efficiency, so one H100 serves only a handful of conversations, and when you push concurrency up, latency falls off a cliff.
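The "fraction of a percent" figure is consistent with a simple roofline estimate. The sketch below assumes an autoregressive model with 16-bit weights. It uses approximate public H100 SXM numbers: about 989 TFLOP/s of dense BF16 compute and about 3.35 TB/s of memory bandwidth. These are assumptions for illustration, not measurements of any particular voice model, and the estimate ignores KV-cache and activation traffic.

```python
# Rough roofline estimate for decoding at a given batch size.
# Hardware numbers are approximate public H100 SXM figures (assumed).

PEAK_FLOPS = 989e12        # dense BF16 FLOP/s, approximate
MEM_BANDWIDTH = 3.35e12    # HBM bytes/s, approximate
BYTES_PER_WEIGHT = 2       # 16-bit weights (assumed)
FLOPS_PER_WEIGHT = 2       # one multiply and one add per weight per token


def compute_utilisation(batch_size: int) -> float:
    """Upper bound on the fraction of peak compute usable at this batch size.

    Each decoding step streams every weight from memory once and performs
    FLOPS_PER_WEIGHT * batch_size operations with it.
    """
    intensity = FLOPS_PER_WEIGHT * batch_size / BYTES_PER_WEIGHT  # FLOP per byte
    ridge = PEAK_FLOPS / MEM_BANDWIDTH  # intensity needed to saturate compute
    return min(1.0, intensity / ridge)


if __name__ == "__main__":
    for batch in (1, 8, 64, 512):
        print(f"batch {batch:>3}: {compute_utilisation(batch):.2%} of peak compute")
```

With these numbers, batch size one can use only about 0.3% of peak compute. Utilisation rises with batch size, which is exactly the lever a latency-bound live call cannot pull.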
Inference — not training — is now 55–80% of enterprise AI spend, and it rides the most contested hardware on earth. As models grow and GPUs stay scarce, the per-minute floor under everyone built on that stack only goes up.

Cost to synthesise 1,000 characters — roughly a minute and a quarter of talking — across the field:
The field is funded like it has already won. But outside ElevenLabs, public revenue is small or undisclosed, and nearly every player is pre-profit — burning the raise on compute.
Conversational AI is a ~$19B market in 2025 heading to ~$44B by 2030 (~18% CAGR on those endpoints); the voice-agent slice alone compounds near 39%. It surfaces in every industry:

A screen forces one layout, one menu tree, one language on everyone — so teams tune the default for the revenue-driving segment, and everyone else adapts. There is no average user; a fixed UI quietly excludes whoever falls outside it. Voice removes the trade-off: the interface is generated per turn — the same agent, in Hindi or Tamil, terse or patient, no menus — adapting to the person instead of to a revenue-weighted average.
In India the marginal new user is rural, not metro — and a screen built for an English-literate majority leaves most of them out. Voice is how they leapfrog in.

How four everyday apps are built for the literate metro majority today — and what each becomes when the interface is a voice that meets the user in their own language.
Shopping, today: Grid, filters, English search box. A tier-3 user who can't type "earbuds under 1000" simply bounces.
Shopping, with voice: "1000 ke andar achhe earphone dikhao" ("show me good earphones under 1000"). The agent narrates options and checks out. No typing, no English. (Flipkart's voice haggle-bot already proved it.)
Banking, today: Dense dashboard, NEFT/IMPS tabs, beneficiary forms, OTP flows, all built for the salaried, financially literate customer.
Banking, with voice: "Mera balance kitna hai?" ("What's my balance?") / "500 pay karo." ("Pay 500.") One spoken turn, local language. Already live for SBI balances and UPI voice payments.
Government schemes, today: PDF circulars and form portals in English/Hindi text, locking out the 60%+ of rural users who cite language as the barrier.
Government schemes, with voice: "PM-KISAN ka paisa kab aayega?" ("When will the PM-KISAN money arrive?"), answered, spoken, in any language (Bhashini / AI4Bharat). One scheme, every citizen.
Food delivery, today: Restaurant cards, map pin-drop, typed instructions. Easy for the repeat metro user, hard for a first-time tier-4 user.
Food delivery, with voice: "Pas wale dhaba se do roti aur dal ghar pe" ("two rotis and dal from the nearby dhaba, delivered home"). Saved address inferred, customizations captured as natural speech.

We distilled the model small enough to serve, in real time, on ordinary CPUs — the cheap, abundant compute nobody is fighting over. That collapses the cost floor. You never touch the model: you call a drop-in, OpenAI-compatible API.
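A minimal sketch of what "drop-in, OpenAI-compatible" could look like from the caller's side, assuming the service mirrors the request shape of OpenAI's text-to-speech endpoint (`POST /audio/speech` with `model`, `input` and `voice` fields). The base URL, model name and voice name below are placeholders and not documented values for this product.

```python
import json
import urllib.request

# Hypothetical placeholder: substitute the provider's real base URL.
BASE_URL = "https://api.example.com/v1"


def build_speech_request(text, api_key, model="example-tts",
                         voice="example-voice", base_url=BASE_URL):
    """Build a POST to {base_url}/audio/speech in the OpenAI request shape."""
    payload = {
        "model": model,            # placeholder model name
        "input": text,             # the text to be spoken
        "voice": voice,            # placeholder voice name
        "response_format": "mp3",
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, api_key, out_path="speech.mp3"):
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text, api_key)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

If the endpoint is compatible in this sense, an application already using an OpenAI-style client would only need to change its base URL, key, model and voice, which is what makes the swap "drop-in".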
~50× below premium — the line between voice as a feature you ration, and voice as something you leave on for everyone.
When every voice sounds human, the differentiator stops being "is it good enough?" and becomes "what does each minute cost?"
At billions of daily touchpoints, a fraction of a cent decides which use cases are even possible.
The mass of voice is commodity speech — read-alouds, support, kiosks — where buyers need a price, not a better voice.
Bandwidth, cloud and SMS each went ubiquitous only once unit cost collapsed. Voice is at that inflection.
I've spent my career making big models cheap to run — distilling them, and squeezing inference onto the humblest hardware that will hold it. LeanVoice is that instinct, pointed at voice.
Staff Software Engineer, most recently at xAI (eval + inference for Grok-4), and before that Coupang (air-gapped RAG and distilled 7B student models), Atlassian, Disney+ Hotstar, and Zomato, where scale meant billions of events a day. The same lesson kept repeating: the model is rarely the hard part; serving it cheaply, at scale, is.

The next decade of software won't be typed, tapped or stared at. It'll be spoken, and the only thing between that world and this one is what it costs to say a word.
