ru-aish/voice.in

Live Voice Latency Benchmarks (Sarvam)

This repo contains working code to benchmark:

  • TTS latency (text -> first audio chunk)
  • STT latency (audio -> first transcript message)
  • E2E latency (text -> audio -> text)
  • Length variation sweeps (how latency changes as text length changes)
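
The "first chunk" and "first transcript message" definitions above both come down to timing the first item that arrives on a stream. A minimal sketch of that measurement, assuming the stream is exposed as an async iterable; `timeToFirstChunk` and `fakeStream` are hypothetical names for illustration, not functions from this repo:

```javascript
// Sketch: measure time from request start to the first item a stream yields.
const { performance } = require("node:perf_hooks");

async function timeToFirstChunk(startStream) {
  const t0 = performance.now();
  const stream = startStream(); // assumption: returns an async iterable of chunks
  let firstMs = null;
  let chunks = 0;
  for await (const chunk of stream) {
    // Record the clock only once, when the very first chunk arrives.
    if (firstMs === null) firstMs = performance.now() - t0;
    chunks += 1;
  }
  return { firstMs, totalMs: performance.now() - t0, chunks };
}

// Stand-in source so the sketch runs without any API access.
async function* fakeStream() {
  for (let i = 0; i < 3; i++) {
    await new Promise((resolve) => setTimeout(resolve, 20));
    yield Buffer.alloc(160);
  }
}
```

Calling `timeToFirstChunk(fakeStream)` resolves to an object with `firstMs`, `totalMs`, and `chunks`; a real benchmark would pass a function that opens the websocket instead of the stand-in.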

Code Map

  • sarvam/scripts/sarvam_tts_low_latency.js: optimized Sarvam TTS websocket benchmark
  • sarvam/scripts/sarvam_stt.js: optimized Sarvam STT websocket benchmark
  • sarvam/scripts/sarvam_e2e_latency.js: end-to-end benchmark (TTS + STT), including multi-trial stats and length sweeps
  • sarvam/scripts/generate_multilang_tts.js: generate long/short multilingual audio samples
  • sarvam/context/: project behavior and learnings
  • sarvam/benchmarks/: benchmark outputs

Setup

  1. Install deps:
npm install
  2. Create env file:
cp .env.example .env
  3. Fill API keys in .env.

Required Environment

See .env.example in this repo.

Minimum required for Sarvam flows:

  • SARVAM_API_KEY (or SARVAM_API_SUBSCRIPTION_KEY)
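
As an illustration of the either/or key requirement, here is a small sketch of resolving the key at startup. The precedence (SARVAM_API_KEY first) and the helper name `resolveSarvamKey` are assumptions; the scripts in this repo may resolve it differently.

```javascript
// Sketch: accept either variable name and fail fast when neither is set.
function resolveSarvamKey(env = process.env) {
  // Assumption: SARVAM_API_KEY wins when both variables are present.
  const key = env.SARVAM_API_KEY || env.SARVAM_API_SUBSCRIPTION_KEY;
  if (!key) {
    throw new Error("Set SARVAM_API_KEY or SARVAM_API_SUBSCRIPTION_KEY in .env");
  }
  return key;
}
```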

Run Commands

1) STT benchmark

npm run sarvam:stt

2) TTS benchmark

npm run sarvam:tts:low-latency

3) E2E single input

npm run sarvam:e2e -- "नमस्ते, यह end to end latency test है।"

4) E2E multi-trial stats (min/avg/p95)

E2E_TRIALS=10 E2E_WARMUP_TRIALS=2 npm run sarvam:e2e -- "your text here"
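
For reference, min/avg/p95 over a set of trials can be computed as in the sketch below. It assumes warmup trials are simply dropped from the front of the list and uses the nearest-rank definition of p95; `summarize` is a hypothetical helper, and the benchmark script may use a different percentile method.

```javascript
// Sketch: drop warmup trials, then report min / avg / p95 (nearest-rank).
function summarize(latenciesMs, warmup = 0) {
  const kept = latenciesMs.slice(warmup);
  if (kept.length === 0) {
    throw new Error("no measured trials left after warmup");
  }
  const sorted = [...kept].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  // Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * sorted.length);
  return {
    n: sorted.length,
    min: sorted[0],
    avg: sum / sorted.length,
    p95: sorted[rank - 1],
  };
}

// Made-up numbers: the first two (slow) trials are treated as warmup.
console.log(summarize([900, 850, 300, 320, 310, 400], 2));
// → { n: 4, min: 300, avg: 332.5, p95: 400 }
```

With only a handful of measured trials, p95 collapses to the maximum, so more trials give a more meaningful tail figure.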

5) E2E length sweep (parallel)

Runs multiple text lengths concurrently and prints a per-length variation table plus overall averages.

E2E_LENGTH_SWEEP=true E2E_LENGTH_PARALLELISM=2 E2E_TRIALS=3 E2E_WARMUP_TRIALS=1 npm run sarvam:e2e
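
A parallelism setting like the one above caps how many length cases run at once. A minimal sketch of such a cap using a small worker pool; `mapWithLimit` is a hypothetical helper and not necessarily how the sweep is implemented in this repo:

```javascript
// Sketch: run worker(item) over all items with at most `limit` in flight.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until the list is exhausted.
  async function run() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results; // same order as `items`
}
```

Keeping the limit low matters for latency benchmarks: concurrent cases share the same network link and API quota, so a high parallelism value can inflate the numbers being measured.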

6) Groq -> Sarvam TTS live latency

npm run groq:tts:live -- "नमस्ते, low latency assistant strategy बताओ।"

Speaker mode (only when explicitly needed):

npm run groq:tts:live -- --speaker "नमस्ते, यह live speaker test है।"

7) Groq -> Sarvam TTS benchmark (Hindi/Gujarati short/medium/long)

npm run groq:tts:bench

8) Cerebras -> Sarvam TTS live latency

npm run cerebras:tts:live -- "नमस्ते, low latency assistant strategy बताओ।"

9) Cerebras -> Sarvam TTS benchmark

npm run cerebras:tts:bench

10) One-command compare (Groq + Cerebras)

TTS_COMPARE_TRIALS=2 TTS_COMPARE_TARGET_MS=400 npm run tts:both -- "नमस्ते, यह compare test है।"

11) Full live voice pipeline (mic -> model -> TTS speaker)

# Groq provider
npm run pipeline::groq

# Cerebras provider
npm run pipeline::cerebras

This pipeline is isolated from standalone STT/TTS scripts and includes:

  • VAD end-of-speech handoff
  • barge-in drop/discard logic
  • low-latency streaming speaker output
  • first-token and first-audio latency counters

12) Standalone websocket server (deployment-ready)

This is isolated under voice.ai/ and does not replace local scripts.

npm run voice:server

Server endpoint:

ws://localhost:8081/

Live test app (mic -> websocket server -> speaker):

npm run voice:client

Variations We Tested

STT chunk/pacing variations

# 10ms chunks + 10ms pacing
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=10 SARVAM_STT_CHUNK_PACING_MS=10 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# 5ms chunks + no pacing (best practical)
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# 1ms chunks + no pacing (stress)
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=1 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# VAD off comparison
SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_VAD_SIGNALS=false SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt
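
To see what the chunk sizes above mean in bytes, the sketch below splits raw audio into fixed-duration chunks. It assumes 16 kHz, 16-bit mono PCM (the `_16k` file names suggest the sample rate; the rest is an assumption), and `chunkPcm` is a hypothetical helper rather than code from the STT script.

```javascript
// Sketch: split a raw PCM buffer into chunks of `chunkMs` milliseconds each.
function chunkPcm(pcm, chunkMs, sampleRate = 16000, bytesPerSample = 2) {
  const samplesPerChunk = Math.floor((sampleRate * chunkMs) / 1000);
  // Never go below one sample per chunk, so the loop always advances.
  const bytesPerChunk = Math.max(1, samplesPerChunk) * bytesPerSample;
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += bytesPerChunk) {
    chunks.push(pcm.subarray(offset, offset + bytesPerChunk));
  }
  return chunks;
}

// One second of silence at 16 kHz / 16-bit mono is 32000 bytes.
const oneSecond = Buffer.alloc(32000);
console.log(chunkPcm(oneSecond, 10).length); // → 100 (320-byte chunks)
console.log(chunkPcm(oneSecond, 5).length);  // → 200 (160-byte chunks)
console.log(chunkPcm(oneSecond, 1).length);  // → 1000 (32-byte chunks)
```

Under these assumptions the 1ms setting sends ten times as many websocket messages per second of audio as the 10ms setting, which is why it works as a stress case.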

Long Gujarati + Hindi STT tests

# Gujarati long
SARVAM_STT_AUDIO_FILE=audio/gu_long_16k.wav SARVAM_STT_LANGUAGE_CODE=gu-IN SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

# Hindi long
SARVAM_STT_AUDIO_FILE=audio/hi_long_16k.wav SARVAM_STT_LANGUAGE_CODE=hi-IN SARVAM_STT_TOTAL_REQUESTS=3 SARVAM_STT_WARMUP_REQUESTS=1 SARVAM_STT_CHUNK_MS=5 SARVAM_STT_CHUNK_PACING_MS=0 SARVAM_STT_WAIT_FOR_RESULT_MS=12000 npm run sarvam:stt

Generate multilingual audio samples

# short sample
node sarvam/scripts/generate_multilang_tts.js short audio/multilang_short.mp3

# long sample
node sarvam/scripts/generate_multilang_tts.js long audio/multilang_long.mp3

Outputs

  • STT per-request outputs: audio/sarvam_stt_output_req*.json
  • E2E generated audio: audio/e2e_tts_*.wav
  • Length sweep exports: sarvam/benchmarks/length_sweep_*.json

Notes

  • For optimization work, use pipeline_ms (tts_total + stt_total) as the model-path latency.
  • overall_e2e_ms also includes process/orchestration overhead, so it will be higher than pipeline_ms.
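
The relationship between the two figures in the notes above can be written out as a small calculation. This sketch assumes a result object carrying `tts_total`, `stt_total`, and `overall_e2e_ms` as flat numeric fields; `splitLatency` and `overhead_ms` are hypothetical names, and the numbers are made up for illustration.

```javascript
// Sketch: separate model-path latency from process/orchestration overhead.
function splitLatency({ tts_total, stt_total, overall_e2e_ms }) {
  const pipeline_ms = tts_total + stt_total;
  return {
    pipeline_ms,
    overhead_ms: overall_e2e_ms - pipeline_ms, // what is left after the model path
  };
}

console.log(splitLatency({ tts_total: 300, stt_total: 450, overall_e2e_ms: 820 }));
// → { pipeline_ms: 750, overhead_ms: 70 }
```

Tracking the overhead figure separately helps tell whether a regression comes from the API path or from local orchestration.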
