Technical Report · Aayush TTS · v1

Eight-step flow matching for real-time neural speech on commodity CPUs

Aayush Gupta
LeanVoice Research
Abstract

Aayush TTS turns text into lifelike 44.1 kHz speech with a compact ~117-million-parameter model whose generative core is solved in just eight steps. We export the four-stage inference pipeline (~99 M of those weights) to ONNX and tune it for batch-one, real-time inference on ordinary CPUs, with no GPU. On a consumer laptop it renders ~4 s of audio in ~0.7 s (real-time factor ≈ 0.18), which is what makes sub-cent-per-thousand-character pricing possible.

117M parameters · 8 ODE steps · 0.18× real-time factor · 0 GPUs

Built on open foundations. Aayush TTS is built on an open flow-matching speech model — MIT-licensed inference code and Open RAIL-M weights — which we export, optimize, and serve. Our contribution is the inference and serving stack, not the original training; attribution is retained per those licenses.

1 The idea, in plain English

Modern computers can already produce speech that sounds human. The hard part is doing it cheaply and fast enough to use everywhere — not just in a data centre. Aayush TTS is engineered for exactly that: it runs on the kind of processor in an ordinary laptop, with no expensive graphics card.

In plain English

Think of the model as a sculptor. It starts with a block of static (random noise) and, in eight quick passes, carves it into the shape of a voice saying your words. Diffusion-based systems typically take tens to hundreds of passes; ours takes eight, which is why it can keep up in real time on a normal computer.

[Image: random noise organizing into a speech spectrogram across eight numbered steps]
Figure 1. From noise to speech in eight steps. Each panel is one pass of the model; by the eighth, the static has become a structured spectrogram — the recipe for a waveform.

2 Architecture

Under the hood, Aayush TTS is four neural networks in sequence. Text is encoded; a duration model decides how long each sound lasts; a flow-matching field sculpts the noise into a mel-spectrogram; and a vocoder turns that spectrogram into an audio waveform. Each is a separate ONNX graph.

[Diagram: Text encoder (9.0 M) → Duration (0.9 M) → Flow-matching field vθ(x, t | c) (64.0 M · 8 steps) → Vocoder (25.3 M). Edge labels: text, h, d, mel. The field maps x₀ ~ N(0, I) → x₁ = mel, conditioned on encoded text + voice-style vector c.]
Figure 2. The four-stage pipeline, with measured parameter counts. The flow-matching field is the only iterative stage; everything else runs once.
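
The control flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the four functions are hypothetical stand-ins rather than the shipped ONNX graphs, and the exact wiring between stages (for example, which stage consumes the duration output) is assumed. What the sketch does show is the call pattern: three stages run once, and the field runs eight times.

```python
import random

N_STEPS = 8  # fixed Euler step count stated in the report
calls = {"text_encoder": 0, "duration": 0, "field": 0, "vocoder": 0}

def text_encoder(text):
    """Hypothetical stand-in: one feature per character."""
    calls["text_encoder"] += 1
    return [float(ord(ch) % 7) for ch in text]

def duration_model(h):
    """Hypothetical stand-in: every symbol lasts two frames."""
    calls["duration"] += 1
    return [2 for _ in h]

def field(x, t, h, style):
    """Hypothetical stand-in velocity: pulls every frame toward zero."""
    calls["field"] += 1
    return [-xi for xi in x]

def vocoder(mel):
    """Hypothetical stand-in: passes the frames through as 'samples'."""
    calls["vocoder"] += 1
    return list(mel)

def synthesize(text, style, seed=0):
    rng = random.Random(seed)
    h = text_encoder(text)                            # runs once
    d = duration_model(h)                             # runs once
    x = [rng.gauss(0.0, 1.0) for _ in range(sum(d))]  # x0 ~ N(0, I)
    for k in range(N_STEPS):                          # the only iterative stage
        t = k / N_STEPS
        v = field(x, t, h, style)
        x = [xi + vi / N_STEPS for xi, vi in zip(x, v)]
    return vocoder(x)                                 # runs once

audio = synthesize("hello", style=[0.0])
print(calls)  # {'text_encoder': 1, 'duration': 1, 'field': 8, 'vocoder': 1}
```

Because only the field is iterated, its per-call cost is multiplied by eight, which is why the step count matters far more than the size of the other three stages.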

3 Where the model's brain is: parameters

In plain English

A parameter is a single number the model learned during training — one of millions of tiny dials that shape how it sounds. More dials can mean more nuance, but also more compute. Aayush TTS uses ~117 million of them; a frontier voice model can have ten to thirty times more.

The full model is about 117 million parameters. The four-stage pipeline we ship and run at inference accounts for 99.2 M of them, measured directly from the ONNX graphs below. The remaining ~18 M is the style/reference encoder, which is used offline to build voices and never run at synthesis time. About 65% of the inference weights (64.0 M of 99.2 M) live in the flow-matching field, the part that does the creative work, and under 1% (0.9 M) in the duration model.

Inference graphs by component · measured · shipped ≈ 99.2 M of 117 M
Flow-matching field | 64.0 M
Vocoder | 25.3 M
Text encoder | 9.0 M
Duration predictor | 0.9 M
Counted from the four ONNX graphs we ship; the offline style/reference encoder makes up the rest of the 117 M. The field carries the generative capacity; the rest is light.
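
The breakdown can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section; the variable names are illustrative, and 117 M is the report's approximate full-model size, so the offline remainder is approximate too.

```python
# Parameter counts in millions, copied from the component breakdown.
components = {
    "Flow-matching field": 64.0,
    "Vocoder": 25.3,
    "Text encoder": 9.0,
    "Duration predictor": 0.9,
}
FULL_MODEL_M = 117.0  # approximate full-model size quoted in the report

shipped = sum(components.values())   # parameters in the four shipped graphs
offline = FULL_MODEL_M - shipped     # left for the offline style/reference encoder
shares = {name: 100.0 * n / shipped for name, n in components.items()}

print(f"shipped: {shipped:.1f} M, offline: {offline:.1f} M")
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

This gives 99.2 M shipped and about 17.8 M offline, with the field at roughly 64.5% of inference weights, the vocoder at 25.5%, the text encoder at 9.1%, and the duration predictor at 0.9%.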

Small on purpose

One hundred seventeen million parameters is tiny by today's standards, and that is the whole point. A model this size has a small memory footprint (the ~99 M inference weights take roughly 400 MB at 32-bit precision), finishes in eight steps, and never needs a GPU.

Model size vs. a frontier TTS model · parameters
Frontier TTS (≈) | ~3 B
Aayush TTS | 117 M
~25× smaller than the largest neural voice models — comparable fidelity, a fraction of the compute.
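
As a quick check of that ratio, taking the chart's approximate ~3 B frontier figure at face value:

$$\frac{3 \times 10^{9}}{117 \times 10^{6}} \approx 25.6$$

This is in line with the ~25× quoted; since the frontier figure is itself approximate, the ratio is indicative only.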

4 How the compact model is made: distillation

In plain English

Big models are accurate but slow. Distillation trains a small "student" model to imitate a large "teacher", keeping most of the quality at a fraction of the size — like an apprentice learning a master's craft, then doing the job faster and cheaper.

Aayush TTS is a distilled-class model: a compact student that reproduces the behaviour of a much larger generative teacher, paired with a few-step flow-matching solver. That combination — small network, eight evaluations — is what collapses the per-second cost of speech onto a CPU.

[Image: a large teacher network distilled into a smaller student network]
Figure 3. Distillation. A large teacher transfers its behaviour to a compact ~117 M-parameter student; the student keeps most of the fidelity and runs many times cheaper.

5 The generative core: flow matching

The field learns a time-dependent velocity \(v_\theta(x, t \mid c)\) that transports a Gaussian sample at \(t{=}0\) to a mel-spectrogram at \(t{=}1\), trained along a straight optimal-transport path. The math is compact:

probability path
$$x_t = (1-t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\ \ x_1 \sim p_{\text{data}}(\,\cdot \mid c) \tag{1}$$
conditional flow-matching objective
$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{\,t,\ x_0,\ x_1}\!\Big[\,\big\lVert\, v_\theta(x_t, t \mid c) - (x_1 - x_0)\,\big\rVert_2^{\,2}\,\Big] \tag{2}$$
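
To make the objective concrete, here is a minimal one-dimensional sketch that estimates the loss of Eq. (2) by sampling. It is a toy under stated assumptions, not the training code: the dataset is a single point, the conditioning c is dropped, and `cfm_loss`, `ideal_field`, and `zero_field` are hypothetical names introduced for illustration.

```python
import random

def cfm_loss(v_theta, data, n_samples=2000, seed=0):
    """Monte Carlo estimate of the conditional flow-matching loss in 1-D."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.random()                  # t ~ U[0, 1)
        x0 = rng.gauss(0.0, 1.0)          # noise sample
        x1 = rng.choice(data)             # data sample
        xt = (1.0 - t) * x0 + t * x1      # point on the straight path, Eq. (1)
        target = x1 - x0                  # constant velocity of that path
        total += (v_theta(xt, t) - target) ** 2
    return total / n_samples

DATA = [3.0]  # toy dataset: a single target point (assumption for illustration)

def ideal_field(x, t):
    """Closed-form minimizer for a one-point dataset: aim straight at the target."""
    return (DATA[0] - x) / (1.0 - t)

def zero_field(x, t):
    """A deliberately bad field that never moves the sample."""
    return 0.0

loss_ideal = cfm_loss(ideal_field, DATA)
loss_zero = cfm_loss(zero_field, DATA)
print(loss_ideal, loss_zero)
```

For this toy dataset the ideal field reproduces the target velocity exactly, so its loss is zero up to floating-point error, while the zero field leaves a large residual. A trained network sits between the two, and training pushes it toward the first.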

At inference we integrate the field as an ODE from noise to speech, with a fixed eight-step Euler solver:

sampling · explicit Euler, N = 8
$$\frac{dx}{dt} = v_\theta(x, t \mid c), \qquad x_{k+1} = x_k + \tfrac{1}{8}\,v_\theta(x_k, t_k \mid c), \quad t_k = \tfrac{k}{8}, \quad k = 0,\dots,7 \tag{3}$$

Because the path is nearly straight, eight steps suffice where a diffusion model needs tens to hundreds.
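
The role of straightness can be seen in a small sketch of the fixed-step Euler update. The field below is a hand-written toy (an assumption for illustration, not the trained network) whose trajectories are exactly straight lines toward one target point; `euler_sample` and `straight_field` are hypothetical names.

```python
def euler_sample(v, x0, n_steps=8):
    """Explicit Euler integration of dx/dt = v(x, t) from t = 0 to t = 1."""
    x = list(x0)
    for k in range(n_steps):
        t = k / n_steps                        # t_k = k / N
        vel = v(x, t)
        x = [xi + vi / n_steps for xi, vi in zip(x, vel)]
    return x

TARGET = [1.5, -0.5]  # toy "data" point in 2-D (assumption for illustration)

def straight_field(x, t):
    """Velocity that carries any point to TARGET along a straight line by t = 1."""
    return [(m - xi) / (1.0 - t) for m, xi in zip(TARGET, x)]

x0 = [0.3, -1.2]                               # stand-in for a noise sample
x8 = euler_sample(straight_field, x0, n_steps=8)
x1 = euler_sample(straight_field, x0, n_steps=1)
print(x8, x1)
```

With a perfectly straight path the velocity is constant along the trajectory, so Euler lands on the target (up to rounding) whether it takes eight steps or one. A learned field is only nearly straight, so a handful of steps is used to absorb the remaining curvature.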

[Image: 2-D flow-matching vector field with an 8-step Euler trajectory from noise to a mel-spectrogram]
Figure 4. The probability-flow ODE in 2-D: the learned field (arrows) carries noise at \(t{=}0\) to a clean attractor at \(t{=}1\) along the eight-step trajectory of Eq. (3). Below, the mel-spectrogram resolving along the path.

6 Benchmarks

Fewer steps than diffusion

The number of network evaluations per utterance (NFE) is the dominant cost. At eight steps, flow matching needs roughly 6× fewer evaluations than a fast diffusion sampler (~50) and over 100× fewer than a high-quality one (~1000).

Network evaluations per utterance (NFE) · drawn to scale · fewer is faster
Diffusion (high-q.) | ~1000
Diffusion (fast) | ~50
Aayush TTS | 8

Real-time on a CPU

Measured end-to-end on a consumer laptop CPU (single request, eight steps):

Table 1: measured inference · CPU · batch = 1
quantity | value | note
audio produced | 4.03 s | 44.1 kHz mono
compute time | 0.72 s | warm, CPU
real-time factor (RTF) | 0.18 | ≈ 5.6× faster than real time
parameters · full model | ~117 M | incl. offline style encoder
parameters · inference graphs | 99.2 M | measured, 4 ONNX
ODE steps (NFE) | 8 | fixed-step Euler
GPU required | none | CPU only
real-time factor
$$\text{RTF} = \frac{t_{\text{compute}}}{t_{\text{audio}}} \approx \frac{0.72\ \text{s}}{4.03\ \text{s}} \approx 0.18 \tag{4}$$
Time to synthesize one second of audio · below 1.0× = faster than real time
real time | 1.00×
Aayush TTS (CPU) | 0.18×
The headroom under 1.0× is what a single core turns into ~5 concurrent real-time voices.
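
The arithmetic behind these figures is short enough to spell out. The sketch below uses the two measured values from Table 1; treating the whole-number part of the speed-up as the count of concurrent real-time streams is a simplifying assumption that ignores scheduling overhead and contention.

```python
import math

compute_s = 0.72   # measured compute time (Table 1)
audio_s = 4.03     # measured audio duration (Table 1)

rtf = compute_s / audio_s        # real-time factor, Eq. (4)
speedup = 1.0 / rtf              # how many times faster than real time
streams = math.floor(speedup)    # whole real-time streams that fit in the headroom

print(f"RTF = {rtf:.2f}")              # RTF = 0.18
print(f"speed-up = {speedup:.1f}x")    # speed-up = 5.6x
print(f"streams = {streams}")          # streams = 5
```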

Cost per million characters

The speed translates directly into price. Synthesizing a million characters of speech, across the field:

Cost per 1,000,000 characters · true scale · Aayush TTS = the sliver
ElevenLabs | $150
Google / Azure | $16
OpenAI tts | $15
Aayush TTS | $3
$3 / 1M characters = 0.3¢ per 1,000 characters — the price the CPU economics unlock.
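
The unit conversion and the price ratios can be reproduced directly from the listed prices. This is a small sketch; the dictionary simply restates the figures above, and `cents_per_1k` is an illustrative helper name.

```python
# USD per 1,000,000 characters, as listed in the comparison above.
prices_usd_per_million = {
    "ElevenLabs": 150,
    "Google / Azure": 16,
    "OpenAI tts": 15,
    "Aayush TTS": 3,
}

def cents_per_1k(usd_per_million):
    """Convert USD per million characters to cents per thousand characters."""
    return usd_per_million * 100 / 1000

ours = prices_usd_per_million["Aayush TTS"]
ratios = {name: price / ours for name, price in prices_usd_per_million.items()}

print(f"{cents_per_1k(ours):.1f} cents per 1,000 characters")   # 0.3 cents per 1,000 characters
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the Aayush TTS price")
```

On these list prices the gap ranges from 5× (OpenAI tts) to 50× (ElevenLabs).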

7 Examples: hear it

The same compact model, synthesized on CPU, across emotions and languages. (Pre-rendered samples.)

English · neutral: “Aayush TTS runs in real time on an ordinary processor.”
English · narration: “We integrate a probability flow in just eight steps.”
Emotion · calm: A measured, even-toned read.
Emotion · cheerful: Bright and upbeat.
Use case · professional: A composed corporate voice.
Hindi · वेलकम: Vernacular synthesis, same model.
[Image: a grid of mel-spectrograms for different voices and emotions]
Figure 5. Eight voices and emotions, as mel-spectrograms. One model, conditioned on a voice-style vector — distinct timbre and prosody, identical eight-step inference path.

8 From milliseconds to a price

An RTF of 0.18 on a commodity core is the whole business case. A GPU serving real-time voice runs at batch one — its worst case — and idles most of its silicon; a CPU at batch one is doing exactly the work it is good at. Trading scarce, premium accelerators for abundant general-purpose compute collapses the cost floor — and the floor is the product. It is what lets Aayush TTS be offered at 0.3¢ per 1,000 characters and still leave margin. The market argument is in the thesis; this report is the machine that makes it true.

§ References & notes

  1. Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, M. Le. Flow Matching for Generative Modeling. ICLR, 2023.
  2. A. Tong et al. Conditional flow matching with optimal-transport paths. 2023.
  3. G. Hinton, O. Vinyals, J. Dean. Distilling the Knowledge in a Neural Network. 2015.
  4. Foundations: an open flow-matching speech model — MIT-licensed inference code, Open RAIL-M weights. Use is subject to the RAIL-M use-based restrictions, passed through in our Acceptable Use Policy; attribution retained in the distribution NOTICES.