The thesis · leanvoice

Everyone solved how voice sounds.
Nobody solved what it costs.

Lifelike speech, as a drop-in API — at 0.3¢ per 1,000 characters. The price voice needs before it can be everywhere, not just on a phone line.

0.3¢ / 1,000 characters
real-time on a CPU · 23 languages · OpenAI-compatible · streams sentence-by-sentence
Context · 2,000 years, one strip

Making voice sound human took two millennia. That part is finished.

c. 25 BC: Stone acoustics
1791: Speaking machine
1877: Recorded voice
1939: Electronic speech
1978: Voice on a chip
2016 onward: Neural · indistinguishable
Naturalness: solved. The race everyone is still running is already over.
[Image 01: a lone telephone handset. Caption: The only use case the market sees]
The blind spot

The industry thinks voice is a call-centre feature.

Every roadmap points at the same place: an agent on a phone line. Meanwhile the real surface — voice across every app, aisle, factory floor, kiosk and workflow, billions of moments a day — sits unbuilt, because no one has priced it for that scale.

8.4B voice-capable devices
$62B bought by voice · '25
~1% is phone agents
The wrong race

So the field optimises the one thing already solved: sounding better.

[Image 02: a precarious stack of servers. Caption: Bigger models, heavier infrastructure]
Parameters of notable TTS models · 7M → ~2.5B
2016: 7M · 2017: 28M · 2021: 30M · 2023: 750M · 2024: 1B · 2025: ~2.5B
More parameters → more GPU time per second of audio → higher latency, higher cost. The curve points the wrong way.
The unit-economics trap

At real volume, premium voice turns success into a loss.

One consumer app, a million voice interactions a day (~30s each). Same usage, same quality — only the price per character differs:

Premium · ~$0.15 / 1K chars
$23M / yr
A bill that grows every time the product succeeds. Most teams quietly cap usage — or never ship.
LeanVoice · 0.3¢ / 1K chars
$0.47M / yr
~50× less, for the same million-a-day. "Voice everywhere" finally pencils out.
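The comparison above can be reproduced with a few lines of arithmetic. This is a rough sketch, not the author's own model: the speaking rate of about 14 characters per second is an assumption chosen to sit close to "1,000 characters is roughly a minute and a quarter", and all names below are illustrative.

```python
# Hypothetical back-of-the-envelope model; the speaking rate is an assumption.
CHARS_PER_SECOND = 14            # assumed: roughly 1,000 characters per 70-75 s of speech
INTERACTIONS_PER_DAY = 1_000_000
SECONDS_PER_INTERACTION = 30
DAYS_PER_YEAR = 365


def annual_cost_usd(price_per_1k_chars: float) -> float:
    """Yearly synthesis bill for the scenario above at a given per-1K-character price."""
    chars_per_day = INTERACTIONS_PER_DAY * SECONDS_PER_INTERACTION * CHARS_PER_SECOND
    return chars_per_day / 1_000 * price_per_1k_chars * DAYS_PER_YEAR


premium = annual_cost_usd(0.15)    # ~$0.15 per 1K chars
lean = annual_cost_usd(0.003)      # 0.3 cents per 1K chars

print(f"Premium:   ${premium / 1e6:.1f}M / yr")
print(f"LeanVoice: ${lean / 1e6:.2f}M / yr")
print(f"Ratio:     {premium / lean:.0f}x")
```

Under that assumed speaking rate the premium bill lands at about $23M a year and the LeanVoice bill just under half a million, close to the figures quoted above. The ratio depends only on the two prices, so it stays at about 50× whatever speaking rate is assumed.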

Every successful call makes the next one harder to afford.

Why it doesn't scale

A live call is the worst case for a GPU.

Real-time voice runs at batch size one (one caller, one stream). That's memory-bound, not compute-bound: a live call uses a fraction of a percent of the chip, yet still demands sub-500 ms latency. You can't batch your way to efficiency, so one H100 serves only a handful of conversations, and when concurrency is pushed up, latency falls off a cliff.

End-to-end voice latency vs. concurrent calls · one GPU · ASR + 8B LLM + TTS
1 call: 984 ms · 10 concurrent: ~2.5 s · 100 concurrent: 28.8 s (conversational limit: 500 ms)
Past ~50 calls on one GPU, success rates collapse. The fix for GPU economics — big batches — is the one thing real-time voice can't do.
8–14 voice streams per H100
~0.1–1% of the GPU a live call uses
~60× slower at 100 calls vs 1
+40% H100 rental in ~6 months
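A simplified roofline estimate shows why batch size one leaves the chip almost idle. This is a sketch under stated assumptions, not a measurement: the H100 figures are approximate public spec numbers, and the model assumes weight reads dominate memory traffic, ignoring KV cache, activations and kernel overheads.

```python
# Approximate public figures for an H100 SXM; treat both as assumptions.
PEAK_FLOPS = 989e12        # dense FP16/BF16 tensor throughput, FLOP/s
MEM_BANDWIDTH = 3.35e12    # HBM bandwidth, bytes/s
BYTES_PER_WEIGHT = 2       # FP16 weights


def compute_utilisation(batch_size: int) -> float:
    """Upper bound on the fraction of peak FLOPs used when weight reads dominate."""
    # Each weight is read once per step and performs one multiply-add
    # (2 FLOPs) for every stream in the batch.
    flops_per_byte = 2 * batch_size / BYTES_PER_WEIGHT
    # FLOPs per byte needed before compute, not memory, becomes the limit.
    ridge = PEAK_FLOPS / MEM_BANDWIDTH
    return min(1.0, flops_per_byte / ridge)


for batch in (1, 8, 64, 512):
    print(f"batch {batch:>3}: {compute_utilisation(batch):.2%} of peak compute")
```

With these numbers a single stream can use only a fraction of a percent of peak compute, in line with the ~0.1–1% range quoted above. Utilisation climbs only as the batch grows, which is the lever a live call cannot pull.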
The rising floor

And the cost floor keeps rising.

Inference — not training — is now 55–80% of enterprise AI spend, and it rides the most contested hardware on earth. As models grow and GPUs stay scarce, the per-minute floor under everyone built on that stack only goes up.

Where the AI budget goes · inference dominates
Inference: 55–80%
Training + rest: 20–45%
[Image 03: a wall of servers. Caption: GPU serving · ~$2–8 / hour and climbing]
The competitive landscape

Same job, priced from premium to negligible.

Cost to synthesise 1,000 characters — roughly a minute and a quarter of talking — across the field:

Cost per 1,000 characters · true scale · LeanVoice = the green sliver
ElevenLabs: $0.150
Google / Azure: $0.016
OpenAI tts: $0.015
LeanVoice: $0.003
Not a discount on the same stack: a different architecture (see "The unlock" below).
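The per-character prices above translate into per-minute prices with one conversion. This is a minimal sketch that takes the stated "minute and a quarter" per 1,000 characters at face value (75 seconds). The function and variable names are illustrative.

```python
SECONDS_PER_1K_CHARS = 75   # "a minute and a quarter", as stated above

PRICES_PER_1K_CHARS = {     # USD, copied from the list above
    "ElevenLabs": 0.150,
    "Google / Azure": 0.016,
    "OpenAI tts": 0.015,
    "LeanVoice": 0.003,
}


def price_per_minute(price_per_1k_chars: float) -> float:
    """Convert a per-1,000-character price into a per-minute-of-speech price."""
    return price_per_1k_chars * 60 / SECONDS_PER_1K_CHARS


baseline = PRICES_PER_1K_CHARS["LeanVoice"]
for vendor, price in PRICES_PER_1K_CHARS.items():
    multiple = price / baseline
    print(f"{vendor:<15} ${price_per_minute(price):.4f}/min  {multiple:.1f}x LeanVoice")
```

On these list prices the premium tier works out to about 12 cents per minute of speech, the cloud tiers to a little over a cent, and LeanVoice to roughly a quarter of a cent.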
Follow the money

Billions raised. Revenue — for almost everyone — still thin.

The field is funded like it has already won. But outside ElevenLabs, public revenue is small or undisclosed, and nearly every player is pre-profit — burning the raise on compute.

Capital raised vs. known revenue (est.) · top voice-AI pure-plays · mostly private
ElevenLabs · $811M raised · $11B val · revenue: ~$500M ARR
Sesame · $308M raised · revenue: pre-revenue / beta
Deepgram · $246M raised · $1.3B val · revenue: ~$22M
Cartesia · $191M raised · revenue: undisclosed
Speechmatics · $91M raised · revenue: ~$11M
Hume AI · $74M raised · $219M val · revenue: undisclosed
Sources: company announcements, TechCrunch / CNBC, third-party scrapers. Revenue for most is estimate or undisclosed; profitability is essentially never reported — these are capital-intensive, pre-profit businesses.
Where voice lives

Voice isn't a vertical. It's horizontal.

Conversational AI is a ~$19B market in 2025 heading to ~$44B by 2030 (~22% CAGR); the voice-agent slice alone compounds near 39%. It surfaces in every industry:

#1 · Customer service / telephony: Largest today (BFSI ~28–33%, telecom ~20%). IVR + live agents.
$59B · Healthcare · by 2030: ~26% CAGR. Ambient notes, triage, scheduling.
$64B · Automotive · by 2031: In-car assistants, navigation, embedded alerts.
$19B · Media & entertainment: ~58% of voice-assistant sales. Narration, dubbing, localization.
Retail · Voice commerce: Shopping assistants, order status, recommendations.
Edu · Learning: Narration, language learning, pronunciation.
A11y · Accessibility: Screen readers, assistive voices; a compliance driver.
Games · Gaming: Dynamic NPC dialogue, real-time character voice.
[Image 05: three different people each speaking to a device. Caption: One agent, a different answer for each person]
Why it matters · the trade-off disappears

Today's apps are built for the majority that pays.

A screen forces one layout, one menu tree, one language on everyone — so teams tune the default for the revenue-driving segment, and everyone else adapts. There is no average user; a fixed UI quietly excludes whoever falls outside it. Voice removes the trade-off: the interface is generated per turn — the same agent, in Hindi or Tamil, terse or patient, no menus — adapting to the person instead of to a revenue-weighted average.

Who it unlocks

The app the world's top users get — now for the next billion.

In India the marginal new user is rural, not metro — and a screen built for an English-literate majority leaves most of them out. Voice is how they leapfrog in.

958M India internet users · 57% rural
90% of new users not English-fluent
26% of adults can't use text UIs
400M voice users in India already
35.7% voice-assistant CAGR to 2030
[Image 06: a small-town shopkeeper speaking to her phone. Caption: The next user speaks, not types]

The same job, two interfaces.

How four everyday apps are built for the literate metro majority today — and what each becomes when the interface is a voice that meets the user in their own language.

E-commerce
Screen-first · for the majority

Grid, filters, English search box. A tier-3 user who can't type "earbuds under 1000" simply bounces.

Voice-first · for each user

"1000 ke andar achhe earphone dikhao" — the agent narrates options and checks out. No typing, no English. (Flipkart's voice haggle-bot already proved it.)

Banking
Screen-first

Dense dashboard, NEFT/IMPS tabs, beneficiary forms, OTP flows — built for the salaried, financially literate customer.

Voice-first

"Mera balance kitna hai?" / "500 pay karo." One spoken turn, local language — already live for SBI balances and UPI voice payments.

Govt & welfare
Screen-first

PDF circulars and form portals in English/Hindi text — locking out the 60%+ of rural users who cite language as the barrier.

Voice-first

"PM-KISAN ka paisa kab aayega?" — answered, spoken, in any language (Bhashini / AI4Bharat). One scheme, every citizen.

Food delivery
Screen-first

Restaurant cards, map pin-drop, typed instructions — easy for the repeat metro user, hard for a first-time tier-4 user.

Voice-first

"Pas wale dhaba se do roti aur dal ghar pe" — saved address inferred, customizations captured as natural speech.

[Image 04: a single CPU chip. Caption: Lifelike voice, served on CPUs]
The unlock

The frontier rides GPUs. We went the other way.

We distilled the model small enough to serve, in real time, on ordinary CPUs — the cheap, abundant compute nobody is fighting over. That collapses the cost floor. You never touch the model: you call a drop-in, OpenAI-compatible API.

0.3¢ per 1K chars
real-time · streams on a CPU
23 languages
API, not a model to run
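What "drop-in, OpenAI-compatible" could look like in practice is sketched below, using only the Python standard library. The request shape follows OpenAI's text-to-speech endpoint (`POST /audio/speech` with `model`, `input` and `voice`). The base URL, model name and voice name are placeholders invented for illustration, since the text does not give the real ones.

```python
import json
import urllib.request

# Hypothetical values: the real base URL, model name and voice name
# are not given in the text above.
BASE_URL = "https://api.leanvoice.example/v1"
MODEL = "leanvoice-tts"
VOICE = "default"


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build a POST in the shape of OpenAI's /audio/speech endpoint."""
    payload = {"model": MODEL, "input": text, "voice": VOICE, "response_format": "mp3"}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str, api_key: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to a file."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        # Read in chunks so audio is written to disk as it arrives.
        while chunk := response.read(8192):
            out.write(chunk)
```

If the API is compatible in the way described, an application already calling an OpenAI-style speech endpoint would mainly need to change its base URL and credentials. A call such as `synthesize("Mera balance kitna hai?", api_key, "reply.mp3")` would then write the spoken reply to disk.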
The price of speech
0.3¢ per 1,000 characters

~50× below premium — the line between voice as a feature you ration, and voice as something you leave on for everyone.

The bet

Every interface went ubiquitous only after its price collapsed.

01 · When every voice sounds human, the differentiator stops being "is it good enough?" and becomes "what does each minute cost?"

02 · At billions of daily touchpoints, a fraction of a cent decides which use cases are even possible.

03 · The mass of voice is commodity speech (read-alouds, support, kiosks), where buyers need a price, not a better voice.

04 · Bandwidth, cloud and SMS each went ubiquitous only once unit cost collapsed. Voice is at that inflection.

Why I'm building this

I've spent my career making big models cheap to run — distilling them, and squeezing inference onto the humblest hardware that will hold it. LeanVoice is that instinct, pointed at voice.

Staff Software Engineer — most recently at xAI (eval + inference for Grok-4), and before that Coupang (air-gapped RAG and distilled 7B student models), Atlassian, Disney+ Hotstar, and Zomato, where scale meant billions of events a day. The same lesson kept repeating: the model is rarely the hard part — serving it cheaply, at scale, is. The next decade of software won't be typed, tapped or stared at. It'll be spoken — and the only thing between that world and this one is what it costs to say a word.

— Aayush Gupta · founder, LeanVoice · ex-xAI · Coupang · Atlassian · Disney+ Hotstar · Zomato

The industry perfected a voice almost no one can afford to use. We made the one they can.
