Aayush TTS — Technical Report

1 The idea, in plain English

Modern computers can already produce speech that sounds human. The hard part is doing it cheaply and fast enough to use everywhere — not just in a data centre. Aayush TTS is engineered for exactly that: it runs on the kind of processor in an ordinary laptop, with no expensive graphics card.

In plain English

Think of the model as a sculptor. It starts with a block of static (random noise) and, in eight quick passes, carves it into the shape of a voice saying your words. Most systems take hundreds of passes; ours takes eight — which is why it can keep up in real time on a normal computer.

2 Architecture

Under the hood, Aayush TTS is four neural networks in sequence. Text is encoded; a duration model decides how long each sound lasts; a flow-matching field sculpts the noise into a mel-spectrogram; and a vocoder turns that spectrogram into an audio waveform. Each is a separate ONNX graph.
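The data flow through the four stages can be sketched with stand-in functions. Everything in this sketch is hypothetical: the function names, the toy two-bin "mel" frames, and the duration-based frame expansion are illustrative assumptions, not the shipped ONNX graphs or their interfaces.

```python
import random

# Stand-ins for the four stages. Names, shapes, and internals are
# hypothetical; the real stages are separate ONNX graphs.

def encode_text(text):
    # One toy "embedding" per character (the real encoder is a neural net).
    return [[ord(ch) / 128.0] for ch in text]

def predict_durations(encoded):
    # Toy rule: every symbol lasts 3 frames (the real model predicts this).
    return [3 for _ in encoded]

def flow_matching_field(x, t, cond):
    # Toy velocity that pulls each frame toward its conditioning value.
    return [[c[0] - v for v in frame] for frame, c in zip(x, cond)]

def vocode(mel):
    # Toy vocoder: one sample per frame (the real one upsamples to audio).
    return [sum(frame) / len(frame) for frame in mel]

def synthesize(text, n_steps=8, n_mels=2, seed=0):
    rng = random.Random(seed)
    encoded = encode_text(text)
    durations = predict_durations(encoded)
    # Assumed length regulation: repeat each symbol for its duration.
    cond = [e for e, d in zip(encoded, durations) for _ in range(d)]
    # Start from Gaussian noise, one vector per output frame.
    x = [[rng.gauss(0.0, 1.0) for _ in range(n_mels)] for _ in cond]
    for k in range(n_steps):
        t = k / n_steps
        v = flow_matching_field(x, t, cond)
        x = [[xi + vi / n_steps for xi, vi in zip(fx, fv)] for fx, fv in zip(x, v)]
    return vocode(x)

audio = synthesize("hi")
print(len(audio))  # 2 symbols x 3 frames = 6 toy samples
```

The point of the sketch is the ordering: text is encoded once, durations fix the number of output frames, the field is evaluated eight times on those frames, and the vocoder runs once at the end.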

3 Where the model's brain is: parameters

In plain English

A parameter is a single number the model learned during training — one of millions of tiny dials that shape how it sounds. More dials can mean more nuance, but also more compute. Aayush TTS uses ~117 million of them; a frontier voice model can have ten to thirty times more.

The full model is about 117 million parameters. The four-stage pipeline we ship and run at inference accounts for ~99 M of them — measured directly from the ONNX graphs below — while the remaining ~18 M is the style/reference encoder used offline to build voices, never run at synthesis time. Two-thirds of the inference weights live in the flow-matching field — the part that does the creative work — and almost none in the duration model.

Inference graphs by component · measured · shipped ≈ 99.2 M of 117 M
component	parameters
Flow-matching field	64.0 M
Vocoder	25.3 M
Text encoder	9.0 M
Duration predictor	0.9 M

Counted from the four ONNX graphs we ship; the offline style/reference encoder makes up the rest of the 117 M. The field carries the generative capacity; the rest is light.
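The accounting can be checked with a few lines of arithmetic. The sketch below only re-adds the figures quoted in this section; the dictionary keys are illustrative names, not identifiers from the shipped graphs.

```python
# Parameter counts in millions, copied from the component breakdown.
inference_params_m = {
    "flow_matching_field": 64.0,
    "vocoder": 25.3,
    "text_encoder": 9.0,
    "duration_predictor": 0.9,
}
full_model_m = 117.0  # approximate total quoted in the report

shipped_m = sum(inference_params_m.values())
offline_m = full_model_m - shipped_m
field_share = inference_params_m["flow_matching_field"] / shipped_m

print(f"shipped: {shipped_m:.1f} M")                              # 99.2 M
print(f"offline style encoder: ~{offline_m:.0f} M")               # ~18 M
print(f"field share of inference weights: {field_share:.1%}")     # 64.5%
```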

Small on purpose

One hundred seventeen million parameters is tiny by today's standards, and that is the whole point. A model this size fits comfortably in the memory of an ordinary laptop, finishes in eight steps, and never needs a GPU.

Model size vs. a frontier TTS model · parameters
model	parameters
Frontier TTS (≈)	~3 B
Aayush TTS	117 M

~25× smaller than the largest neural voice models — comparable fidelity, a fraction of the compute.
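The size ratio, and a rough weight footprint for the shipped graphs, follow from the quoted counts. The storage precisions below are assumptions for illustration; the report does not state which precision the ONNX graphs use.

```python
full_params = 117e6        # full model, from the report
inference_params = 99.2e6  # shipped ONNX graphs, from the report
frontier_params = 3e9      # approximate frontier size quoted above

ratio = frontier_params / full_params

# Bytes per parameter for common storage precisions (assumed, not stated
# in the report).
bytes_per_param = {"float32": 4, "float16": 2, "int8": 1}
footprint_mb = {
    name: inference_params * size / 1e6 for name, size in bytes_per_param.items()
}

print(f"frontier / Aayush: {ratio:.1f}x")
for name, mb in footprint_mb.items():
    print(f"{name}: {mb:.1f} MB of inference weights")
```

With these inputs the ratio comes out at about 25.6×, matching the "~25×" figure.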

4 How the compact model is made: distillation

In plain English

Big models are accurate but slow. Distillation trains a small "student" model to imitate a large "teacher", keeping most of the quality at a fraction of the size — like an apprentice learning a master's craft, then doing the job faster and cheaper.

Aayush TTS is a distilled-class model: a compact student that reproduces the behaviour of a much larger generative teacher, paired with a few-step flow-matching solver. That combination — small network, eight evaluations — is what collapses the per-second cost of speech onto a CPU.
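The generic idea of distillation can be shown on a toy problem. This is not the training recipe used for Aayush TTS, which the report does not describe: the teacher function, the one-unit student, and the learning rate are all illustrative assumptions.

```python
import math

def teacher(x):
    # Stand-in for a large, expensive model (hypothetical).
    return math.tanh(2.0 * x) + 0.1 * x

def student(x, w, b):
    # Deliberately tiny model: a single tanh unit.
    return w * math.tanh(x) + b

def distill_loss(w, b, xs):
    # Mean squared error between student and teacher outputs.
    return sum((student(x, w, b) - teacher(x)) ** 2 for x in xs) / len(xs)

xs = [i / 10.0 for i in range(-20, 21)]  # unlabeled inputs; teacher supplies targets
w, b, lr = 0.0, 0.0, 0.1
loss_before = distill_loss(w, b, xs)

for _ in range(500):
    # Analytic gradients of the MSE with respect to w and b.
    grad_w = sum(2 * (student(x, w, b) - teacher(x)) * math.tanh(x) for x in xs) / len(xs)
    grad_b = sum(2 * (student(x, w, b) - teacher(x)) for x in xs) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

loss_after = distill_loss(w, b, xs)
print(loss_before, loss_after)
```

The student never sees ground-truth labels, only the teacher's outputs, which is the defining feature of the technique.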

5 The generative core: flow matching

The field learns a time-dependent velocity $v_\theta(x, t \mid c)$ that transports a Gaussian sample at $t{=}0$ to a mel-spectrogram at $t{=}1$, trained along a straight optimal-transport path. The math is compact:

probability path

$$x_t = (1-t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\ \ x_1 \sim p_{\text{data}}(\,\cdot \mid c) \tag{1}$$

conditional flow-matching objective

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{\,t,\ x_0,\ x_1}\!\Big[\,\big\lVert\, v_\theta(x_t, t \mid c) - (x_1 - x_0)\,\big\rVert_2^{\,2}\,\Big] \tag{2}$$
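For a single draw of $(t, x_0, x_1)$, the path and the objective reduce to a few lines. The sketch below is a toy under stated assumptions: $t$ is drawn uniformly on $[0, 1)$, the conditioning $c$ is dropped, the "data" is one fixed three-dimensional point, and the two velocity fields are hand-written stand-ins rather than a trained network.

```python
import random

def interpolate(x0, x1, t):
    # Eq. (1): straight-line path between noise x0 and data x1.
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def cfm_loss(v_fn, x0, x1, t):
    # Eq. (2) for a single (t, x0, x1) sample: squared L2 error between
    # the predicted velocity and the constant target x1 - x0.
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = v_fn(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target))

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                   # toy "data" point (hypothetical)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
t = rng.random()                        # assumed: t ~ Uniform[0, 1)

def zero_field(x, t):
    # An untrained field that predicts no motion at all.
    return [0.0 for _ in x]

def oracle_field(x, t):
    # Exact velocity when the target is the single point x1:
    # (x1 - x_t) / (1 - t) equals x1 - x0 on the straight path.
    return [(b - xi) / (1 - t) for xi, b in zip(x, x1)]

print(cfm_loss(zero_field, x0, x1, t))    # positive
print(cfm_loss(oracle_field, x0, x1, t))  # zero up to rounding
```

In training, the expectation in Eq. (2) would be approximated by averaging this per-sample loss over a batch.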

At inference we integrate the field as an ODE from noise to speech, with a fixed eight-step Euler solver:

sampling · explicit euler, N = 8

$$\frac{dx}{dt} = v_\theta(x, t \mid c), \qquad x_{k+1} = x_k + \tfrac{1}{8}\,v_\theta(x_k, t_k \mid c), \quad t_k = \tfrac{k}{8}, \quad k = 0,\dots,7 \tag{3}$$

Because the path is nearly straight, eight steps suffice where a diffusion model needs anywhere from tens of steps to around a thousand.
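The fixed-step sampler is short enough to write out. The sketch below applies the eight-step Euler update to a hand-written toy field whose paths are exactly straight; the field, the target point, and the starting sample are hypothetical stand-ins for the trained network and its inputs.

```python
def euler_sample(v_fn, x0, n_steps=8):
    # Explicit Euler: x_{k+1} = x_k + (1/N) * v(x_k, t_k), with t_k = k / N.
    x = list(x0)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        v = v_fn(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x

target = [0.5, -1.0, 2.0]  # toy "data" point (hypothetical)

def straight_field(x, t):
    # Velocity of the straight path toward a single target point.
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, target)]

x0 = [1.3, 0.2, -0.7]      # stand-in for a Gaussian noise sample
x_final = euler_sample(straight_field, x0, n_steps=8)
print(x_final)             # lands on the target up to rounding
```

On a perfectly straight path Euler integration has no discretization error, so the step count stops mattering. A learned field is only approximately straight, which is presumably why a small but non-trivial number of steps is used.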

6 Benchmarks

Fewer steps than diffusion

The number of network evaluations per utterance (NFE) is the dominant cost. At eight steps, flow matching needs roughly 6× fewer evaluations than a fast diffusion sampler (~50) and over 100× fewer than a high-quality one (~1000).

Network evaluations per utterance (NFE) · fewer is faster
method	NFE
Diffusion (high-q.)	~1000
Diffusion (fast)	~50
Aayush TTS	8

Real-time on a CPU

Measured end-to-end on a consumer laptop CPU (single request, eight steps):

Table 1 — measured inference · CPU · batch = 1
quantity	value	note
audio produced	4.03 s	44.1 kHz mono
compute time	0.72 s	warm, CPU
real-time factor (RTF)	0.18	≈ 5× faster than real time
parameters · full model	~117 M	incl. offline style encoder
parameters · inference graphs	99.2 M	measured, 4 ONNX
ODE steps (NFE)	8	fixed-step Euler
GPU required	none	CPU only

real-time factor

$$\text{RTF} = \frac{t_{\text{compute}}}{t_{\text{audio}}} \approx \frac{0.72\ \text{s}}{4.03\ \text{s}} \approx 0.18 \tag{4}$$

Time to synthesize one second of audio · below 1.0× = faster than real time
system	time per second of audio
real time	1.00×
Aayush TTS (CPU)	0.18×

The headroom under 1.0× is what a single core turns into ~5 concurrent real-time voices.
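The real-time factor and the concurrency estimate can be reproduced from the two measured values in Table 1. The stream count is an upper bound that assumes the measured RTF holds unchanged under concurrent load, which the report does not measure.

```python
import math

compute_s = 0.72  # measured compute time, from Table 1
audio_s = 4.03    # audio produced, from Table 1

rtf = compute_s / audio_s
speedup = 1.0 / rtf
# Upper bound on simultaneous real-time streams, assuming the RTF holds
# under concurrency and ignoring scheduling overhead.
max_streams = math.floor(speedup)

print(f"RTF = {rtf:.2f}")                          # RTF = 0.18
print(f"speed-up = {speedup:.1f}x real time")      # speed-up = 5.6x real time
print(f"max concurrent streams = {max_streams}")   # max concurrent streams = 5
```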

Cost per million characters

The speed translates directly into price. Synthesizing a million characters of speech, across the field:

Cost per 1,000,000 characters
provider	cost
ElevenLabs	$150
Google / Azure	$16
OpenAI tts	$15
Aayush TTS	$3

$3 / 1M characters = 0.3¢ per 1,000 characters — the price the CPU economics unlock.
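The unit conversion and the price ratios are simple arithmetic on the figures quoted in this section. The helper name below is illustrative.

```python
# Prices in USD per 1,000,000 characters, copied from the comparison above.
price_per_million = {
    "ElevenLabs": 150.0,
    "Google / Azure": 16.0,
    "OpenAI tts": 15.0,
    "Aayush TTS": 3.0,
}

def cents_per_thousand(usd_per_million):
    # 1,000,000 chars = 1,000 blocks of 1,000 chars; 1 USD = 100 cents.
    return usd_per_million / 1000.0 * 100.0

ours = price_per_million["Aayush TTS"]
for name, usd in price_per_million.items():
    print(f"{name}: {cents_per_thousand(usd):.1f} cents per 1,000 chars, "
          f"{usd / ours:.1f}x Aayush TTS")
```

On these figures the listed alternatives cost between 5× and 50× as much per character.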

7 Examples: hear it

The same compact model, synthesized on CPU, across emotions and languages. (Pre-rendered samples.)

English · neutral

“Aayush TTS runs in real time on an ordinary processor.”

English · narration

“We integrate a probability flow in just eight steps.”

Emotion · calm

A measured, even-toned read.

Emotion · cheerful

Bright and upbeat.

Use case · professional

A composed corporate voice.

Hindi · वेलकम

Vernacular synthesis, same model.

8 From milliseconds to a price

An RTF of 0.18 on a commodity core is the whole business case. A GPU serving real-time voice runs at batch one — its worst case — and idles most of its silicon; a CPU at batch one is doing exactly the work it is good at. Trading scarce, premium accelerators for abundant general-purpose compute collapses the cost floor — and the floor is the product. It is what lets Aayush TTS be offered at 0.3¢ per 1,000 characters and still leave margin. The market argument is in the thesis; this report is the machine that makes it true.

§ References & notes

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, M. Le. Flow Matching for Generative Modeling. ICLR, 2023.
A. Tong et al. Conditional flow matching with optimal-transport paths. 2023.
G. Hinton, O. Vinyals, J. Dean. Distilling the Knowledge in a Neural Network. 2015.
Foundations: an open flow-matching speech model — MIT-licensed inference code, Open RAIL-M weights. Use is subject to the RAIL-M use-based restrictions, passed through in our Acceptable Use Policy; attribution retained in the distribution NOTICES.