The thesis — LeanVoice

The only use case the market sees

The blind spot

The industry thinks voice is a call-centre feature.

Every roadmap points at the same place: an agent on a phone line. Meanwhile the real surface — voice across every app, aisle, factory floor, kiosk and workflow, billions of moments a day — sits unbuilt, because no one has priced it for that scale.

8.4B

voice-capable devices

$62B

bought by voice · '25

~1%

is phone agents

The unit-economics trap

At real volume, premium voice turns success into a loss.

One consumer app, a million voice interactions a day (~30s each). Same usage, same quality — only the price per character differs:

Premium · ~$0.15 / 1K chars

$23M /yr

A bill that grows every time the product succeeds. Most teams quietly cap usage — or never ship.

LeanVoice · 0.3¢ / 1K chars

$0.47M /yr

~50× less, for the same million-a-day. "Voice everywhere" finally pencils out.
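The arithmetic behind the two annual bills can be sketched as follows. This is a back-of-envelope check, not a pricing model: it assumes speech runs at roughly 13 characters per second (about 1,000 characters per 75 seconds), so a 30-second interaction is about 400 characters. The function name and the speaking-rate constant are illustrative assumptions.

```python
# Back-of-envelope annual TTS bill. All inputs are assumptions for illustration.

CHARS_PER_SECOND = 1000 / 75  # assumed speaking rate: ~1,000 characters per 75 s


def annual_tts_cost(interactions_per_day, seconds_each, usd_per_1k_chars, days=365):
    """Return the yearly synthesis bill in USD for a fixed daily volume."""
    chars_per_day = interactions_per_day * seconds_each * CHARS_PER_SECOND
    return chars_per_day / 1000 * usd_per_1k_chars * days


premium = annual_tts_cost(1_000_000, 30, 0.15)   # premium tier, ~$0.15 per 1K chars
lean = annual_tts_cost(1_000_000, 30, 0.003)     # 0.3 cents per 1K chars

print(f"premium: ${premium / 1e6:.1f}M per year")
print(f"lean:    ${lean / 1e6:.2f}M per year")
print(f"ratio:   {premium / lean:.0f}x")
```

At this assumed rate the sketch gives roughly $22M against $0.44M a year, close to the rounded figures above; a slightly faster speaking rate moves both totals up together, and the 50× ratio does not change.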

Every successful call makes the next one harder to afford.

Why it doesn't scale

A live call is the worst case for a GPU.

Real-time voice runs at batch size one: one caller, one stream. That workload is memory-bound, not compute-bound: a live call uses a fraction of a percent of the chip's compute, yet still demands sub-500ms latency. You can't batch your way to efficiency, so one H100 serves only a handful of conversations, and when concurrency is pushed up, latency falls off a cliff.

End-to-end voice latency vs. concurrent calls · one GPU · ASR + 8B LLM + TTS

Past ~50 calls on one GPU, success rates collapse. The fix for GPU economics — big batches — is the one thing real-time voice can't do.
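Why batch size one is memory-bound can be checked with a rough roofline estimate. The sketch below assumes an 8B-parameter LLM stored as 16-bit weights and approximate H100 SXM peak figures (about 3.35 TB/s of memory bandwidth and about 989 TFLOP/s of dense 16-bit compute). These numbers and the function name are assumptions for illustration, and the estimate ignores the KV cache and the ASR and TTS stages. Generating one token for one caller reads every weight once but performs only about two floating-point operations per weight.

```python
# Rough roofline estimate for one decode step at batch size one.
# Hardware figures are approximate peak specs, assumed for illustration.

def decode_step_estimate(params, bytes_per_param, mem_bw_bytes_s, peak_flops):
    """Return (memory time, compute time, compute utilisation) for one token."""
    weight_bytes = params * bytes_per_param
    t_memory = weight_bytes / mem_bw_bytes_s   # time to stream all weights once
    t_compute = 2 * params / peak_flops        # ~2 FLOPs per weight per token
    return t_memory, t_compute, t_compute / t_memory


t_mem, t_cmp, utilisation = decode_step_estimate(
    params=8e9,             # 8B-parameter model
    bytes_per_param=2,      # 16-bit weights
    mem_bw_bytes_s=3.35e12, # assumed HBM bandwidth
    peak_flops=989e12,      # assumed dense 16-bit peak
)

print(f"memory time per token:  {t_mem * 1e3:.2f} ms")
print(f"compute time per token: {t_cmp * 1e3:.3f} ms")
print(f"compute utilisation:    {utilisation:.2%}")
```

Under these assumptions the step time is set almost entirely by memory traffic, and the arithmetic units are busy for well under 1% of it. Batching would amortise the weight reads across callers, which is exactly what a latency-bound live call cannot wait for.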

8–14

voice streams per H100

~0.1–1%

of the GPU a live call uses

~60×

slower at 100 calls vs 1

+40%

H100 rental in ~6 months

The rising floor

And the cost floor keeps rising.

Inference — not training — is now 55–80% of enterprise AI spend, and it rides the most contested hardware on earth. As models grow and GPUs stay scarce, the per-minute floor under everyone built on that stack only goes up.

Where the AI budget goes · inference dominates

Inference · 55–80%

Training + rest · 20–45%

GPU serving · ~$2–8 / hour and climbing

The competitive landscape

Same job, priced from premium to negligible.

Cost to synthesise 1,000 characters — roughly a minute and a quarter of talking — across the field:

Cost per 1,000 characters · true scale · LeanVoice = the green sliver

ElevenLabs · $0.150

Google / Azure · $0.016

OpenAI tts · $0.015

LeanVoice · $0.003

Not a discount on the same stack — a different architecture (next section).

Follow the money

Billions raised. Revenue — for almost everyone — still thin.

The field is funded like it has already won. But outside ElevenLabs, public revenue is small or undisclosed, and nearly every player is pre-profit — burning the raise on compute.

Capital raised vs. known revenue · top voice-AI pure-plays · est. · mostly private

ElevenLabs · $811M raised · $11B val · ~$500M ARR

Sesame · $308M raised · revenue: pre-revenue / beta

Deepgram · $246M raised · $1.3B val · ~$22M revenue

Cartesia · $191M raised · revenue: undisclosed

Speechmatics · $91M raised · ~$11M revenue

Hume AI · $74M raised · $219M val · revenue: undisclosed

Sources: company announcements, TechCrunch / CNBC, third-party scrapers. Revenue for most is estimate or undisclosed; profitability is essentially never reported — these are capital-intensive, pre-profit businesses.

Where voice lives

Voice isn't a vertical. It's horizontal.

Conversational AI is a ~$19B market in 2025 heading to ~$44B by 2030 (~22% CAGR); the voice-agent slice alone compounds near 39%. It surfaces in every industry:

Customer service / telephony

Largest today — BFSI ~28–33%, telecom ~20%. IVR + live agents.

$59B

Healthcare · by 2030

~26% CAGR. Ambient notes, triage, scheduling.

$64B

Automotive · by 2031

In-car assistants, navigation, embedded alerts.

$19B

Media & entertainment

~58% of voice-assistant sales. Narration, dubbing, localization.

Retail

Voice commerce

Shopping assistants, order status, recommendations.

Edu

Learning

Narration, language learning, pronunciation.

A11y

Accessibility

Screen readers, assistive voices — a compliance driver.

Games

Gaming

Dynamic NPC dialogue, real-time character voice.

Image: Three different people each speaking to a device

One agent, a different answer for each person

Why it matters · the trade-off disappears

Today's apps are built for the majority that pays.

A screen forces one layout, one menu tree, one language on everyone — so teams tune the default for the revenue-driving segment, and everyone else adapts. There is no average user; a fixed UI quietly excludes whoever falls outside it. Voice removes the trade-off: the interface is generated per turn — the same agent, in Hindi or Tamil, terse or patient, no menus — adapting to the person instead of to a revenue-weighted average.

Who it unlocks

The app the world's top users get — now for the next billion.

In India the marginal new user is rural, not metro — and a screen built for an English-literate majority leaves most of them out. Voice is how they leapfrog in.

958M

India internet users · 57% rural

90%

of new users not English-fluent

26%

of adults can't use text UIs

400M

voice users in India already

35.7%

voice-assistant CAGR to 2030

Image: A small-town shopkeeper speaking to her phone

The next user speaks, not types

The same job, two interfaces.

How four everyday apps are built for the literate metro majority today — and what each becomes when the interface is a voice that meets the user in their own language.

E-commerce

Screen-first · for the majority

Grid, filters, English search box. A tier-3 user who can't type "earbuds under 1000" simply bounces.

Voice-first · for each user

"1000 ke andar achhe earphone dikhao" — the agent narrates options and checks out. No typing, no English. (Flipkart's voice haggle-bot already proved it.)

Banking

Screen-first

Dense dashboard, NEFT/IMPS tabs, beneficiary forms, OTP flows — built for the salaried, financially literate customer.

Voice-first

"Mera balance kitna hai?" / "500 pay karo." One spoken turn, local language — already live for SBI balances and UPI voice payments.

Govt & welfare

Screen-first

PDF circulars and form portals in English/Hindi text — locking out the 60%+ of rural users who cite language as the barrier.

Voice-first

"PM-KISAN ka paisa kab aayega?" — answered, spoken, in any language (Bhashini / AI4Bharat). One scheme, every citizen.

Food delivery

Screen-first

Restaurant cards, map pin-drop, typed instructions — easy for the repeat metro user, hard for a first-time tier-4 user.

Voice-first

"Pas wale dhaba se do roti aur dal ghar pe" — saved address inferred, customizations captured as natural speech.

Lifelike voice, served on CPUs

The unlock

The frontier rides GPUs. We went the other way.

We distilled the model small enough to serve, in real time, on ordinary CPUs — the cheap, abundant compute nobody is fighting over. That collapses the cost floor. You never touch the model: you call a drop-in, OpenAI-compatible API.
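What "drop-in, OpenAI-compatible" could look like from the caller's side is sketched below. It builds a request in the shape of OpenAI's text-to-speech endpoint (POST /v1/audio/speech with model, input and voice fields) using only the Python standard library. The base URL, model name, voice name and API key are hypothetical placeholders, not documented LeanVoice values, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_speech_request(base_url, api_key, text,
                         model="leanvoice-tts",  # hypothetical model name
                         voice="default"):       # hypothetical voice name
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder endpoint and key, for illustration only.
req = build_speech_request(
    "https://api.example.com/v1", "YOUR_API_KEY", "Mera balance kitna hai?"
)

# To send it against a real endpoint:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

If the request shape matches, an existing OpenAI client would in principle need only its base URL and key changed. Which fields and voices are actually supported is something to confirm against the provider's own documentation.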

0.3¢

per 1K chars

real-time

streams on a CPU

languages

API

not a model to run

The price of speech

0.3¢

per 1,000 characters

~50× below premium — the line between voice as a feature you ration, and voice as something you leave on for everyone.

The bet

Every interface went ubiquitous only after its price collapsed.

When every voice sounds human, the differentiator stops being "is it good enough?" and becomes "what does each minute cost?"

At billions of daily touchpoints, a fraction of a cent decides which use cases are even possible.

The mass of voice is commodity speech — read-alouds, support, kiosks — where buyers need a price, not a better voice.

Bandwidth, cloud and SMS each went ubiquitous only once unit cost collapsed. Voice is at that inflection.

Why I'm building this

I've spent my career making big models cheap to run — distilling them, and squeezing inference onto the humblest hardware that will hold it. LeanVoice is that instinct, pointed at voice.

Staff Software Engineer — most recently at xAI (eval + inference for Grok-4), and before that Coupang (air-gapped RAG and distilled 7B student models), Atlassian, Disney+ Hotstar, and Zomato, where scale meant billions of events a day. The same lesson kept repeating: the model is rarely the hard part — serving it cheaply, at scale, is. The next decade of software won't be typed, tapped or stared at. It'll be spoken — and the only thing between that world and this one is what it costs to say a word.

— Aayush Gupta · founder, LeanVoice · ex-xAI · Coupang · Atlassian · Disney+ Hotstar · Zomato

Everyone solved how voice sounds.
Nobody solved what it costs.

Making voice sound human took two millennia. That part is finished.

So the field optimises the one thing already solved: sounding better.

The industry perfected a voice almost no one can afford to use. We made the one they can.