A browser-based Vietnamese Text-to-Speech application powered by Piper TTS models and ONNX Runtime Web. Generate high-quality speech directly in your browser without requiring a server for inference. Live demo: https://text2speech.work.
- Browser-Based TTS: Fully client-side text-to-speech processing using Web Workers
- Vietnamese Language Support: Advanced Vietnamese text processing with automatic conversion of:
  - Numbers to words (0 to billions)
  - Dates and date ranges
  - Time expressions
  - Currency (VND, USD)
  - Percentages and decimals
  - Phone numbers
  - Ordinals
- Multi-Speaker Models: Support for models with multiple voices
- Real-Time Streaming: Stream audio chunks as they're generated
- Speed Control: Adjustable speech speed
- Audio Download: Export generated audio as WAV files
- Dark Mode: Built-in theme toggle
- Text Statistics: Character and word count display
- Dynamic Model Loading: Load models on-demand from Cloudflare R2 storage
This project is built on top of Piper TTS and fine-tuned on a custom dataset to generate realistic voices. See the training video: https://www.youtube.com/watch?v=WgvBOljtNvE
- Based on Piper (English checkpoint)
- Lightweight, fast, and optimized for local inference
- Designed for real-time speech generation
Dataset:
- Dataset size: ~1,000 audio samples
- Voices: Multiple famous celebrity voices
- Training method: Fine-tuning on existing Piper English checkpoint
- Epochs: ~2,000
- Download training datasets: View Datasets on Google Drive
Available datasets include:
- Vietnamese celebrity voices (Mỹ Tâm, Ngọc Ngân, Trấn Thành, Việt Thảo)
- Multi-speaker datasets
- Various dataset sizes (200, 1000+ samples)
- English voice datasets
Audio Preparation:
- Cleaned and normalized audio
- Matched text-audio pairs
- Consistent sample rate
- Noise removed
What the Model Learns:
- Voice tone
- Accent
- Speech rhythm
- Natural pronunciation
- Web-based inference: No server required
- Runs fully locally: All processing happens in your browser
- Very fast inference: ~5× real-time speed
- User-friendly: Simply enter text, select a voice, and generate speech instantly
- ✅ Based on Piper TTS
- ✅ Fine-tuned with 1,000+ audio samples
- ✅ Trained for ~2,000 epochs
- ✅ No server required
- ✅ Web-based & lightweight
- ✅ Fast inference (≈5× real-time)
- ✅ Free & open-source
- ✅ Allowed for commercial use
- ✅ Easy to deploy or modify
Pre-trained Vietnamese TTS models are available for download:
Download from Google Drive: View Available Models
- calmwoman3688.onnx (~60.6 MB)
  - Configuration: calmwoman3688.onnx.json
- deepman3909.onnx (~60.6 MB)
  - Configuration: deepman3909.onnx.json
- ngocngan3701.onnx (~60.6 MB)
  - Configuration: ngocngan3701.onnx.json
- vietthao3886.onnx (~60.6 MB)
  - Configuration: vietthao3886.onnx.json
- New voices: Mỹ Tâm, Trấn Thành, Ngọc Huyền (movie reviews), Oryx (ultra-deep male voice)
Each model includes both the .onnx model file and its corresponding .onnx.json configuration file. Download both files for each model to use it in the application.
- Frontend: Vue 3 + Vite
- TTS Engine: Piper TTS (ONNX format)
- Runtime: ONNX Runtime Web (WASM)
- Hosting: Cloudflare Pages
- Storage: Cloudflare R2 (for model files)
- Styling: Tailwind CSS
- Icons: Lucide Vue Next
nghitts/
├── src/
│   ├── App.vue                      # Main application component
│   ├── components/                  # Vue components
│   │   ├── AudioChunk.vue           # Audio playback component
│   │   ├── ModelSelector.vue        # Model selection dropdown
│   │   ├── SpeedControl.vue         # Speech speed slider
│   │   ├── TextStatistics.vue       # Text stats display
│   │   ├── ThemeToggle.vue          # Dark/light mode toggle
│   │   └── VoiceSelector.vue        # Voice selection component
│   ├── lib/
│   │   └── piper-tts.js             # Piper TTS implementation
│   ├── utils/
│   │   ├── model-cache.js           # Model file caching
│   │   ├── model-detector.js        # Model discovery from API
│   │   ├── text-cleaner.js          # Text cleaning and chunking
│   │   └── vietnamese-processor.js  # Vietnamese text processing
│   └── workers/
│       └── tts-worker.js            # Web Worker for TTS processing
├── functions/
│   └── api/
│       ├── models.ts                # List available models
│       └── model/[name].ts          # Serve model files from R2
└── public/
    └── non-vietnamese-words.csv     # Word replacement dictionary
- Model Loading: Models are stored in Cloudflare R2 and served via Cloudflare Pages Functions
- Text Processing: Vietnamese text is processed to convert numbers, dates, times, etc. to spoken words
- Text Chunking: Input text is intelligently split into chunks for optimal processing
- Phoneme Conversion: Text is converted to phonemes using the phonemizer library
- Audio Generation: ONNX Runtime Web runs the Piper TTS model in a Web Worker
- Streaming: Audio chunks are streamed back to the main thread and played as they're generated
- Audio Merging: Chunks are merged, normalized, and trimmed for final output
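
The streaming step can be pictured with a small sketch of the loop a worker might run. Everything named here is an assumption for illustration: the message shapes (`type: "chunk"`, `type: "done"`), the `synthesizeChunk` callback (which would wrap phonemization plus an ONNX Runtime Web inference call), and the `post` callback (which would be `self.postMessage` inside a real Web Worker). It is not the project's actual worker code.

```javascript
// Sketch of a streaming loop as it might run inside a Web Worker.
// All names are illustrative assumptions, not the project's real API.
async function streamSpeech(chunks, synthesizeChunk, post) {
  for (let index = 0; index < chunks.length; index++) {
    // Synthesize one text chunk at a time so playback can start early.
    const audio = await synthesizeChunk(chunks[index]);
    post({ type: "chunk", index, audio });
  }
  // Tell the main thread that no more audio will arrive.
  post({ type: "done", total: chunks.length });
}
```

Passing `post` in as a parameter keeps the loop testable outside a worker; on the main thread, a `message` listener would play or queue each `chunk` payload as it arrives.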
The application includes comprehensive Vietnamese text processing that handles:
- Numbers: Automatic conversion to Vietnamese words (e.g., "123" → "một trăm hai mươi ba")
- Dates: Multiple formats (DD/MM/YYYY, DD-MM-YYYY, date ranges)
- Times: Time expressions (HH:MM, HH:MM:SS, "X giờ Y phút")
- Currency: VND (đồng) and USD conversion
- Percentages: Automatic conversion (e.g., "50%" → "năm mươi phần trăm")
- Decimals: Vietnamese decimal format (comma as decimal separator)
- Phone Numbers: Digit-by-digit reading
- Ordinals: Conversion of ordinal numbers (e.g., "thứ 2" → "thứ hai")
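
To make the number rule concrete, here is a minimal sketch of integer-to-words conversion. It is not the code in vietnamese-processor.js; the function names are hypothetical, and it assumes the common readings "lẻ" (rather than "linh"), "nghìn" (rather than "ngàn"), and "bốn" (rather than "tư") after tens. It covers 0 up to 999,999,999,999.

```javascript
const DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"];
const SCALES = ["", "nghìn", "triệu", "tỷ"];

// Read one group of up to three digits (0-999).
// `full` is true when a higher group precedes this one, so a zero
// hundreds digit is spoken ("không trăm").
function readGroup(n, full) {
  const h = Math.floor(n / 100);
  const t = Math.floor((n % 100) / 10);
  const u = n % 10;
  const words = [];

  if (h > 0 || full) words.push(DIGITS[h], "trăm");

  if (t === 0) {
    if (u > 0 && words.length > 0) words.push("lẻ"); // e.g. 105
  } else if (t === 1) {
    words.push("mười");
  } else {
    words.push(DIGITS[t], "mươi");
  }

  if (u === 1 && t >= 2) words.push("mốt");      // 21 -> "hai mươi mốt"
  else if (u === 5 && t >= 1) words.push("lăm"); // 15 -> "mười lăm"
  else if (u > 0) words.push(DIGITS[u]);

  return words.join(" ");
}

// Hypothetical helper: spell out a non-negative integer in Vietnamese.
function numberToVietnamese(n) {
  if (!Number.isInteger(n) || n < 0 || n > 999999999999) {
    throw new RangeError("expected an integer between 0 and 999,999,999,999");
  }
  if (n === 0) return DIGITS[0];

  // Split into three-digit groups, least significant first.
  const groups = [];
  for (let rest = n; rest > 0; rest = Math.floor(rest / 1000)) {
    groups.push(rest % 1000);
  }

  const parts = [];
  for (let i = groups.length - 1; i >= 0; i--) {
    if (groups[i] === 0) continue; // silent groups, e.g. 2,000,000
    const text = readGroup(groups[i], i < groups.length - 1);
    parts.push(SCALES[i] ? `${text} ${SCALES[i]}` : text);
  }
  return parts.join(" ");
}
```

A percentage such as "50%" could then be read as `numberToVietnamese(50) + " phần trăm"`, and a phone number by mapping each digit through `DIGITS`.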
- Node.js 18+
- npm or yarn

Install dependencies: npm install
Start the development server: npm run dev
Build for production: npm run build
Preview the production build: npm run preview

The project is configured for Cloudflare Pages deployment:
- Models should be stored in a Cloudflare R2 bucket named tts-bucket
- Models should be placed in the piper/ prefix
- Each model requires two files:
  - {model-name}.onnx - The ONNX model file
  - {model-name}.onnx.json - The model configuration file
The Cloudflare Pages Function at /api/models will automatically discover available models from the R2 bucket.
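
As a rough illustration of that discovery step, the sketch below pairs up files from a flat list of object keys. The function name `discoverModels` is hypothetical, and the real endpoint in functions/api/models.ts may work differently; in a Pages Function the keys would typically come from listing the bound R2 bucket with the piper/ prefix.

```javascript
// Given object keys from the bucket, return the names of models that have
// both a .onnx file and a matching .onnx.json file under the prefix.
// Hypothetical helper for illustration only.
function discoverModels(keys, prefix = "piper/") {
  const present = new Set(keys);
  return keys
    .filter((key) => key.startsWith(prefix) && key.endsWith(".onnx"))
    .filter((key) => present.has(`${key}.json`)) // require the config file too
    .map((key) => key.slice(prefix.length, -".onnx".length))
    .sort();
}
```

A model whose configuration file is missing is skipped rather than listed, so the UI never offers a voice it cannot load.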
The wrangler.toml file configures:
- Pages build output directory
- R2 bucket binding (piper → tts-bucket)
Models must be in Piper TTS ONNX format with:
- .onnx file containing the ONNX model
- .onnx.json file containing voice configuration (phoneme_id_map, audio settings, etc.)
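
The phoneme_id_map in the configuration file is what turns phonemes into model input. The sketch below shows one common Piper-style layout, assumed here rather than taken from this project: "^" marks the start, "_" is a pad inserted after each phoneme, and "$" marks the end. Pad placement differs between Piper runtimes, so treat this as an illustration and check it against the runtime you actually use. `phonemesToIds` is a hypothetical name.

```javascript
// Map a phoneme sequence to input IDs using a Piper-style phoneme_id_map.
// Assumed layout: "^" start marker, "_" pad after each phoneme, "$" end marker.
function phonemesToIds(phonemes, phonemeIdMap) {
  const ids = [...phonemeIdMap["^"]];
  for (const phoneme of phonemes) {
    const mapped = phonemeIdMap[phoneme];
    if (!mapped) continue; // skip phonemes this voice does not define
    ids.push(...mapped, ...phonemeIdMap["_"]);
  }
  ids.push(...phonemeIdMap["$"]);
  return ids;
}
```

Each map entry is an array of integers, which is why the values are spread rather than pushed directly.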
- Removes emojis and special characters
- Normalizes Unicode (NFC)
- Handles Vietnamese-specific punctuation
- Cleans whitespace
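
A minimal sketch of the cleaning steps above, assuming a single hypothetical `cleanText` function. It covers Unicode normalization, emoji removal, and whitespace cleanup; the Vietnamese-specific punctuation handling is left out because the source does not say what it does.

```javascript
// Normalize to NFC, strip emoji, and collapse whitespace.
// Hypothetical helper; the project's text-cleaner.js may differ.
function cleanText(text) {
  return text
    .normalize("NFC")
    // Pictographic emoji, regional-indicator flag letters,
    // the emoji variation selector, and the zero-width joiner.
    .replace(/[\p{Extended_Pictographic}\u{1F1E6}-\u{1F1FF}\u{FE0F}\u{200D}]/gu, "")
    .replace(/\s+/g, " ")
    .trim();
}
```

Normalizing first matters for Vietnamese: a letter typed as a base character plus combining marks becomes the single precomposed code point that later lookups are likely to expect.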
- Intelligently splits text into optimal chunks
- Respects sentence boundaries
- Handles long sentences by splitting at word boundaries
- Maintains minimum and maximum chunk sizes for optimal processing
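
The chunking rules above can be sketched as follows. This is an assumption-laden illustration, not the project's text-cleaner.js: `chunkText` and the `maxLen` default are made up, sentence detection is a simple punctuation regex (so a decimal such as "3.14" would be split), and short sentences are merged up to `maxLen` instead of enforcing an explicit minimum size.

```javascript
// Split text into chunks of at most maxLen characters, preferring
// sentence boundaries and falling back to word boundaries.
function chunkText(text, maxLen = 200) {
  const sentences = text.match(/[^.!?…]+[.!?…]*\s*/g) || [];
  const chunks = [];
  let current = "";
  const flush = () => {
    if (current) chunks.push(current);
    current = "";
  };

  for (const raw of sentences) {
    const sentence = raw.trim();
    if (!sentence) continue;

    if (sentence.length > maxLen) {
      // Long sentence: emit what we have, then split at word boundaries.
      // A single word longer than maxLen is kept whole.
      flush();
      for (const word of sentence.split(/\s+/)) {
        if (current && current.length + 1 + word.length > maxLen) flush();
        current = current ? `${current} ${word}` : word;
      }
      flush();
    } else if (current && current.length + 1 + sentence.length > maxLen) {
      flush();
      current = sentence;
    } else {
      // Merge short sentences so chunks do not get too small.
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  flush();
  return chunks;
}
```

Smaller chunks let the first audio arrive sooner, while larger ones give the model more context per inference call; `maxLen` is the knob that trades one against the other.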
- Real-time streaming of audio chunks
- Automatic normalization and peak limiting
- Silence trimming
- Sample rate preservation
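
For the merging, normalization, and trimming steps, a minimal sketch on `Float32Array` samples might look like this. The function names, the 0.95 target peak, and the 0.01 silence threshold are assumptions chosen for illustration, and plain peak normalization stands in for whatever limiter the project actually uses.

```javascript
// Concatenate audio chunks into one buffer.
function mergeChunks(chunks) {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const merged = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    merged.set(chunk, offset);
    offset += chunk.length;
  }
  return merged;
}

// Scale samples so the loudest one reaches targetPeak.
function normalizePeak(samples, targetPeak = 0.95) {
  let peak = 0;
  for (const s of samples) peak = Math.max(peak, Math.abs(s));
  if (peak === 0) return samples.slice(); // all silence, nothing to scale
  const gain = targetPeak / peak;
  return samples.map((s) => s * gain);
}

// Drop leading and trailing samples quieter than the threshold.
function trimSilence(samples, threshold = 0.01) {
  let start = 0;
  let end = samples.length;
  while (start < end && Math.abs(samples[start]) < threshold) start++;
  while (end > start && Math.abs(samples[end - 1]) < threshold) end--;
  return samples.slice(start, end);
}
```

None of these steps resample the audio, so the output keeps the sample rate declared in the model's configuration.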
- Modern browsers with WebAssembly support
- Web Workers support required
- ES Modules support required
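
A page could check these requirements up front before loading a model. The sketch below is one hypothetical way to do it; `checkBrowserSupport` is not part of the project, and the `noModule` test is a commonly used heuristic for ES module support rather than an official capability query.

```javascript
// Return the list of required features missing from the environment.
// `env` defaults to the global object so the check can be tested with a stub.
function checkBrowserSupport(env = globalThis) {
  const missing = [];
  if (typeof env.WebAssembly !== "object") missing.push("WebAssembly");
  if (typeof env.Worker !== "function") missing.push("Web Workers");
  const script = env.HTMLScriptElement;
  if (!script || !("noModule" in script.prototype)) missing.push("ES Modules");
  return missing;
}
```

An empty array means all three requirements are met; otherwise the names can be shown to the user in an "unsupported browser" message.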
This project is:
- ✅ Free to use
- ✅ Open source
- ✅ Allowed for commercial use
- ✅ Customizable and deployable
- Built on Piper TTS (GPL) by OHF-Voice
- Inspired by piper-tts-web-demo by clowerweb
- Uses ONNX Runtime Web for browser-based inference