Live Voice Latency Benchmarks (Sarvam)

This repo contains working code to benchmark:

TTS latency (text -> first audio chunk)
STT latency (audio -> first transcript message)
E2E latency (text -> audio -> text)
Length variation sweeps (how latency changes as text length changes)
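
As a rough illustration of what "text -> first audio chunk" means in practice, the sketch below times the first item produced by a stream. It is not taken from the repo's scripts: `timeToFirstChunk` and `fakeTtsStream` are hypothetical names, and the fake stream stands in for a real websocket so the example runs offline.

```javascript
// Hypothetical helper: measures how long a stream takes to yield its first chunk.
// `startStream` is any function that returns an async iterable of chunks.
async function timeToFirstChunk(startStream) {
  const startedAt = performance.now();
  let firstChunkMs = null;
  let chunkCount = 0;
  for await (const chunk of startStream()) {
    if (firstChunkMs === null) {
      // First chunk arrived: this is the latency figure the benchmarks focus on.
      firstChunkMs = performance.now() - startedAt;
    }
    chunkCount += 1;
  }
  const totalMs = performance.now() - startedAt;
  return { firstChunkMs, totalMs, chunkCount };
}

// Stand-in for a TTS stream (assumption): yields three dummy audio buffers.
async function* fakeTtsStream() {
  for (let i = 0; i < 3; i += 1) {
    await new Promise((resolve) => setTimeout(resolve, 20));
    yield Buffer.alloc(320);
  }
}
```

Usage: `timeToFirstChunk(fakeTtsStream).then(console.log)`. The same shape applies to STT by treating transcript messages as the chunks.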

Code Map

sarvam/scripts/sarvam_tts_low_latency.js: optimized Sarvam TTS websocket benchmark
sarvam/scripts/sarvam_stt.js: optimized Sarvam STT websocket benchmark
sarvam/scripts/sarvam_e2e_latency.js: end-to-end benchmark (TTS + STT), including multi-trial stats and length sweeps
sarvam/scripts/generate_multilang_tts.js: generate long/short multilingual audio samples
sarvam/context/: project behavior and learnings
sarvam/benchmarks/: benchmark outputs

Setup

Install deps:

npm install

Create env file:

cp .env.example .env

Fill in the API keys in .env.

Required Environment

See .env.example in this repo.

Minimum required for Sarvam flows:

SARVAM_API_KEY (or SARVAM_API_SUBSCRIPTION_KEY)
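
A minimal sketch of how a script could accept either variable. `resolveSarvamApiKey` is a hypothetical helper, and the precedence (SARVAM_API_KEY first) is an assumption; the repo's scripts may resolve the key differently.

```javascript
// Hypothetical helper, not the repo's actual key lookup.
// Assumption: SARVAM_API_KEY wins when both variables are set.
function resolveSarvamApiKey(env = process.env) {
  const key = env.SARVAM_API_KEY || env.SARVAM_API_SUBSCRIPTION_KEY;
  if (!key) {
    throw new Error("Set SARVAM_API_KEY (or SARVAM_API_SUBSCRIPTION_KEY) in .env");
  }
  return key;
}
```

Failing fast here gives a clear message before any websocket connection is attempted.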

Run Commands

1) STT benchmark

npm run sarvam:stt

2) TTS benchmark

npm run sarvam:tts:low-latency

3) E2E single input

npm run sarvam:e2e -- "नमस्ते, यह end to end latency test है।"

4) E2E multi-trial stats (min/avg/p95)

E2E_TRIALS=10 E2E_WARMUP_TRIALS=2 npm run sarvam:e2e -- "your text here"
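
For reference, min/avg/p95 over a set of trials can be computed as in the sketch below. `summarizeTrials` is a hypothetical name. Two assumptions: the input array lists warmup trials first, and p95 uses the nearest-rank method. The benchmark script may count warmup trials or compute percentiles differently.

```javascript
// Hypothetical summary helper; assumes warmup trials come first in the array.
function summarizeTrials(latenciesMs, warmupTrials = 0) {
  const measured = latenciesMs.slice(warmupTrials);
  if (measured.length === 0) {
    throw new Error("No measured trials left after warmup");
  }
  const sorted = [...measured].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, value) => acc + value, 0);
  // Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * sorted.length);
  return {
    count: sorted.length,
    min: sorted[0],
    avg: sum / sorted.length,
    p95: sorted[rank - 1],
  };
}
```

With a small number of trials, nearest-rank p95 is simply the slowest measured trial, so it is worth reading alongside min and avg rather than on its own.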

5) E2E length sweep (parallel)

Runs several text lengths concurrently, then prints a per-length variation table and the overall averages.

E2E_LENGTH_SWEEP=true E2E_LENGTH_PARALLELISM=2 E2E_TRIALS=3 E2E_WARMUP_TRIALS=1 npm run sarvam:e2e
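
One way to bound concurrency in the way E2E_LENGTH_PARALLELISM suggests is a small worker-lane pool. This is a sketch under assumptions, not the script's implementation: `runWithParallelism` and its parameters are hypothetical.

```javascript
// Hypothetical concurrency limiter: at most `parallelism` workers run at once.
async function runWithParallelism(items, parallelism, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each lane pulls the next unclaimed item until none are left.
  async function lane() {
    while (nextIndex < items.length) {
      const index = nextIndex;
      nextIndex += 1;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(parallelism, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results; // Same order as `items`, regardless of completion order.
}
```

Higher parallelism shortens the sweep, but concurrent requests can share bandwidth and rate limits, so the latency of each individual run may rise.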

6) Groq -> Sarvam TTS live latency

npm run groq:tts:live -- "नमस्ते, low latency assistant strategy बताओ।"

Speaker mode (only when explicitly needed):

npm run groq:tts:live -- --speaker "नमस्ते, यह live speaker test है।"

7) Groq -> Sarvam TTS benchmark (Hindi/Gujarati short/medium/long)

npm run groq:tts:bench

8) Cerebras -> Sarvam TTS live latency

npm run cerebras:tts:live -- "नमस्ते, low latency assistant strategy बताओ।"

9) Cerebras -> Sarvam TTS benchmark

npm run cerebras:tts:bench

10) One-command compare (Groq + Cerebras)

TTS_COMPARE_TRIALS=2 TTS_COMPARE_TARGET_MS=400 npm run tts:both -- "नमस्ते, यह compare test है।"

11) Full live voice pipeline (mic -> model -> TTS speaker)

# Groq provider
npm run pipeline::groq

# Cerebras provider
npm run pipeline::cerebras

This pipeline is isolated from the standalone STT/TTS scripts and includes:

VAD end-of-speech handoff
barge-in drop/discard logic
low-latency streaming speaker output
first-token and first-audio latency counters
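
To make the barge-in drop/discard idea concrete, here is one common pattern: tag each response's audio with a turn id and discard anything from a stale turn. This is a hypothetical sketch (`BargeInGate` is not a name from the repo), and the actual pipeline may implement barge-in differently.

```javascript
// Hypothetical sketch of barge-in handling; not taken from the pipeline code.
class BargeInGate {
  constructor() {
    this.currentTurn = 0;
  }

  // Call when a new assistant response starts; tag its audio with the returned id.
  beginTurn() {
    this.currentTurn += 1;
    return this.currentTurn;
  }

  // Call when the user starts speaking over the assistant.
  bargeIn() {
    this.currentTurn += 1;
  }

  // Audio from an interrupted turn no longer matches and should be dropped.
  shouldPlay(turnId) {
    return turnId === this.currentTurn;
  }
}
```

Checking `shouldPlay(turnId)` just before each chunk is written to the speaker lets late-arriving audio from an interrupted response be dropped without cancelling the upstream stream first.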

12) Standalone websocket server (deployment-ready)

The server is isolated under voice.ai/ and does not replace the local scripts.

npm run voice:server

Server endpoint:

ws://localhost:8081/

Live test app (mic -> websocket server -> speaker):

npm run voice:client

Variations We Tested

STT chunk/pacing variations

# 10ms chunks + 10ms pacing
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=10 SARVAM_STT_CHUNK_PACING_MS=10 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# 5ms chunks + no pacing (best practical)
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# 1ms chunks + no pacing (stress)
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=1 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# VAD off comparison
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_VAD_SIGNALS=false SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt
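
To put the chunk sizes above in byte terms, the sketch below splits raw PCM into fixed-duration chunks and sends them with optional pacing. It assumes 16 kHz, 16-bit, mono PCM (the sample rate is suggested by the `_16k` file names; the bit depth and channel count are assumptions), and that SARVAM_STT_CHUNK_MS is a chunk duration. `chunkPcm` and `sendPaced` are hypothetical names, not the script's functions.

```javascript
// Hypothetical helpers; assume raw PCM input with any WAV header already removed.
function chunkPcm(pcm, chunkMs, sampleRate = 16000, bytesPerSample = 2, channels = 1) {
  const frameBytes = bytesPerSample * channels;
  const bytesPerMs = (sampleRate * frameBytes) / 1000;
  // Round down to whole sample frames so no sample is split across chunks.
  const chunkBytes = Math.max(
    frameBytes,
    Math.floor((bytesPerMs * chunkMs) / frameBytes) * frameBytes
  );
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    chunks.push(pcm.subarray(offset, offset + chunkBytes));
  }
  return chunks;
}

// Sends chunks in order, sleeping `pacingMs` between sends (0 means no pacing).
async function sendPaced(chunks, pacingMs, send) {
  for (const chunk of chunks) {
    send(chunk);
    if (pacingMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, pacingMs));
    }
  }
}
```

Under these assumptions a 5 ms chunk is 160 bytes and a 1 ms chunk is 32 bytes, so one second of audio becomes 200 or 1000 websocket messages respectively.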

Long Gujarati + Hindi STT tests

# Gujarati long
SARVAM_STT_AUDIO_FILE=audio/gu_long_16k.wav SARVAM_STT_LANGUAGE_CODE=gu-IN SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# Hindi long
SARVAM_STT_AUDIO_FILE=audio/hi_long_16k.wav SARVAM_STT_LANGUAGE_CODE=hi-IN SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

Generate multilingual audio samples

# short sample
node sarvam/scripts/generate_multilang_tts.js short audio/multilang_short.mp3

# long sample
node sarvam/scripts/generate_multilang_tts.js long audio/multilang_long.mp3

Outputs

STT per-request outputs: audio/sarvam_stt_output_req*.json
E2E generated audio: audio/e2e_tts_*.wav
Length sweep exports: sarvam/benchmarks/length_sweep_*.json

Notes

For optimization work, use pipeline_ms (tts_total + stt_total) as the model-path latency.
overall_e2e_ms also includes process and orchestration overhead, so it will be higher than pipeline_ms.
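
The relationship between the two figures can be written out as below. The field names `tts_total`, `stt_total`, `pipeline_ms` and `overall_e2e_ms` come from the note above; `latencyBreakdown`, `overhead_ms` and `overhead_share` are hypothetical additions, and the numbers in the usage line are illustrative, not measured results.

```javascript
// Hypothetical helper: separates model-path latency from orchestration overhead.
function latencyBreakdown({ tts_total, stt_total, overall_e2e_ms }) {
  const pipeline_ms = tts_total + stt_total;
  // Whatever overall_e2e_ms adds on top of the model path is treated as overhead.
  const overhead_ms = overall_e2e_ms - pipeline_ms;
  return {
    pipeline_ms,
    overhead_ms,
    overhead_share: overhead_ms / overall_e2e_ms,
  };
}
```

Usage: `latencyBreakdown({ tts_total: 300, stt_total: 200, overall_e2e_ms: 625 })` gives a `pipeline_ms` of 500 and an `overhead_ms` of 125.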
