This repo contains working code to benchmark:
- TTS latency (text -> first audio chunk)
- STT latency (audio -> first transcript message)
- E2E latency (text -> audio -> text)
- Length variation sweeps (how latency changes as text length changes)
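
The "first chunk" and "first message" latencies above can be illustrated with a small timing helper. This is a minimal sketch, not the code used by the scripts in this repo: `measureFirstItemLatency` and `fakeAudioStream` are hypothetical names, and the fake stream stands in for a real TTS or STT websocket.

```javascript
const { performance } = require("node:perf_hooks");

// Hypothetical helper: times how long an async stream takes to yield its
// first item, and how long it takes to drain completely.
async function measureFirstItemLatency(stream) {
  const start = performance.now();
  let firstItemMs = null;
  let count = 0;
  for await (const _item of stream) {
    if (firstItemMs === null) firstItemMs = performance.now() - start;
    count += 1;
  }
  return { firstItemMs, totalMs: performance.now() - start, count };
}

// Stand-in for a streaming response: yields `chunks` fake audio buffers,
// waiting `delayMs` before each one.
async function* fakeAudioStream(delayMs, chunks) {
  for (let i = 0; i < chunks; i += 1) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield Buffer.alloc(320);
  }
}
```

Calling `measureFirstItemLatency(fakeAudioStream(20, 5))` resolves to an object whose `firstItemMs` corresponds to the first-chunk metric and whose `totalMs` corresponds to the full stream duration.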
- sarvam/scripts/sarvam_tts_low_latency.js: optimized Sarvam TTS websocket benchmark
- sarvam/scripts/sarvam_stt.js: optimized Sarvam STT websocket benchmark
- sarvam/scripts/sarvam_e2e_latency.js: end-to-end benchmark (TTS + STT), including multi-trial stats and length sweeps
- sarvam/scripts/generate_multilang_tts.js: generate long/short multilingual audio samples
- sarvam/context/: project behavior and learnings
- sarvam/benchmarks/: benchmark outputs
- Install deps:
npm install
- Create env file:
cp .env.example .env
- Fill API keys in .env.
See .env.example in this repo.
Minimum required for Sarvam flows:
SARVAM_API_KEY (or SARVAM_API_SUBSCRIPTION_KEY)
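
As an illustration of the either/or key requirement, the lookup could be sketched as below. `resolveSarvamKey` is a hypothetical name, and the precedence (SARVAM_API_KEY first) is an assumption; the repo's scripts may resolve the key differently. The sketch reads from an `env` object and does not load the .env file itself.

```javascript
// Hypothetical helper: returns the first non-empty Sarvam key found in `env`.
// Assumption: SARVAM_API_KEY takes precedence over SARVAM_API_SUBSCRIPTION_KEY.
function resolveSarvamKey(env = process.env) {
  const candidates = [env.SARVAM_API_KEY, env.SARVAM_API_SUBSCRIPTION_KEY];
  const key = candidates
    .map((value) => (value || "").trim())
    .find((value) => value.length > 0);
  if (!key) {
    throw new Error("Set SARVAM_API_KEY or SARVAM_API_SUBSCRIPTION_KEY in .env");
  }
  return key;
}
```

Failing early with a clear message avoids spending a benchmark run on requests that would be rejected for a missing key.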
npm run sarvam:stt

npm run sarvam:tts:low-latency

npm run sarvam:e2e -- "नमस्ते, यह end to end latency test है।"

E2E_TRIALS=10 E2E_WARMUP_TRIALS=2 npm run sarvam:e2e -- "your text here"

Runs multiple text lengths concurrently and prints variation table + overall averages.
E2E_LENGTH_SWEEP=true E2E_LENGTH_PARALLELISM=2 E2E_TRIALS=3 E2E_WARMUP_TRIALS=1 npm run sarvam:e2e

npm run groq:tts:live -- "नमस्ते, low latency assistant strategy बताओ।"

Speaker mode (only when explicitly needed):
npm run groq:tts:live -- --speaker "नमस्ते, यह live speaker test है।"

npm run groq:tts:bench

npm run cerebras:tts:live -- "नमस्ते, low latency assistant strategy बताओ।"

npm run cerebras:tts:bench

TTS_COMPARE_TRIALS=2 TTS_COMPARE_TARGET_MS=400 npm run tts:both -- "नमस्ते, यह compare test है।"

# Groq provider
npm run pipeline::groq
# Cerebras provider
npm run pipeline::cerebras

This pipeline is isolated from standalone STT/TTS scripts and includes:
- VAD end-of-speech handoff
- barge-in drop/discard logic
- low-latency streaming speaker output
- first-token and first-audio latency counters
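
One common way to implement barge-in drop/discard logic is to tag every assistant turn with an id and discard audio that belongs to an interrupted turn. The sketch below shows that idea only; `BargeInGate` and its methods are hypothetical and are not taken from this repo's pipeline code.

```javascript
// Hypothetical barge-in gate: each assistant turn gets an id, and audio
// chunks from turns that are no longer active are dropped, not played.
class BargeInGate {
  constructor() {
    this.activeTurn = 0;
    this.dropped = 0;
  }

  // Begin a new assistant turn and return its id.
  startTurn() {
    this.activeTurn += 1;
    return this.activeTurn;
  }

  // User speech detected: invalidate the turn that is currently speaking.
  bargeIn() {
    this.activeTurn += 1;
  }

  // True if a chunk tagged with `turnId` should still be played.
  accept(turnId) {
    if (turnId === this.activeTurn) return true;
    this.dropped += 1;
    return false;
  }
}
```

With this shape, late chunks that arrive after an interruption fail the `accept` check, so the speaker output stops without having to cancel the upstream stream synchronously.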
This is isolated under voice.ai/ and does not replace local scripts.
npm run voice:server

Server endpoint:
ws://localhost:8081/
Live test app (mic -> websocket server -> speaker):
npm run voice:client

# 10ms chunks + 10ms pacing
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=10 SARVAM_STT_CHUNK_PACING_MS=10 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt
# 5ms chunks + no pacing (best practical)
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt
# 1ms chunks + no pacing (stress)
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=1 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt
# VAD off comparison
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_VAD_SIGNALS=false SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# Gujarati long
SARVAM_STT_AUDIO_FILE=audio/gu_long_16k.wav SARVAM_STT_LANGUAGE_CODE=gu-IN SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt
# Hindi long
SARVAM_STT_AUDIO_FILE=audio/hi_long_16k.wav SARVAM_STT_LANGUAGE_CODE=hi-IN SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# short sample
node sarvam/scripts/generate_multilang_tts.js short audio/multilang_short.mp3
# long sample
node sarvam/scripts/generate_multilang_tts.js long audio/multilang_long.mp3

- STT per-request outputs: audio/sarvam_stt_output_req*.json
- E2E generated audio: audio/e2e_tts_*.wav
- Length sweep exports: sarvam/benchmarks/length_sweep_*.json
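
For a sense of scale on the SARVAM_STT_CHUNK_MS values used in the STT tuning commands (10, 5, and 1 ms), the sketch below converts a chunk duration into a byte count and splits a buffer accordingly. It assumes raw 16 kHz, 16-bit mono PCM, which the `_16k.wav` file names suggest but which is not confirmed here; `chunkPcm` is a hypothetical helper, not the repo's implementation.

```javascript
// Hypothetical helper: splits raw PCM into fixed-duration chunks.
// Assumption: 16 kHz, 16-bit (2 bytes per sample), mono audio.
function chunkPcm(pcm, chunkMs, sampleRate = 16000, bytesPerSample = 2) {
  const samplesPerChunk = Math.max(1, Math.round((sampleRate * chunkMs) / 1000));
  const bytesPerChunk = samplesPerChunk * bytesPerSample;
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += bytesPerChunk) {
    // subarray creates a view, so no audio data is copied.
    chunks.push(pcm.subarray(offset, offset + bytesPerChunk));
  }
  return chunks;
}

// One second of silence at the assumed format is 32000 bytes.
const oneSecond = Buffer.alloc(16000 * 2);
console.log(chunkPcm(oneSecond, 10).length); // 100 chunks of 320 bytes
console.log(chunkPcm(oneSecond, 5).length);  // 200 chunks of 160 bytes
console.log(chunkPcm(oneSecond, 1).length);  // 1000 chunks of 32 bytes
```

Under these assumptions, smaller chunks mean many more websocket messages per second of audio, which is why the 1 ms setting is labeled a stress case.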
- For optimization work, use pipeline_ms (tts_total + stt_total) for model path latency.
- overall_e2e_ms includes process/orchestration overhead and will be higher.
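
The relationship between the two metrics can be sketched as follows. The field names `tts_total`, `stt_total`, `pipeline_ms`, and `overall_e2e_ms` come from the note above; the per-trial object shape, the `summarizeTrials` helper, and the `overhead_ms` field are assumptions made for illustration, and the numbers in the usage example are made up.

```javascript
// Hypothetical helper: averages measured trials after discarding warmup runs.
// Assumption: each trial is { tts_total, stt_total, overall_e2e_ms } in ms.
function summarizeTrials(trials, warmupTrials = 0) {
  const measured = trials.slice(warmupTrials);
  if (measured.length === 0) {
    throw new Error("No measured trials left after discarding warmup");
  }
  const mean = (values) => values.reduce((a, b) => a + b, 0) / values.length;
  const pipelineMs = mean(measured.map((t) => t.tts_total + t.stt_total));
  const overallMs = mean(measured.map((t) => t.overall_e2e_ms));
  return {
    trials: measured.length,
    pipeline_ms: pipelineMs,
    overall_e2e_ms: overallMs,
    // Process/orchestration overhead: what overall adds beyond the model path.
    overhead_ms: overallMs - pipelineMs,
  };
}

// Made-up numbers; the first trial is treated as warmup and ignored.
const summary = summarizeTrials(
  [
    { tts_total: 900, stt_total: 700, overall_e2e_ms: 2000 },
    { tts_total: 300, stt_total: 200, overall_e2e_ms: 600 },
    { tts_total: 500, stt_total: 200, overall_e2e_ms: 800 },
  ],
  1
);
console.log(summary);
// { trials: 2, pipeline_ms: 600, overall_e2e_ms: 700, overhead_ms: 100 }
```

Tracking the overhead separately makes it easier to tell whether a change improved the model path or only the surrounding orchestration.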