A production-ready FastAPI server for Qwen3-TTS models, supporting CustomVoice (preset speakers), VoiceDesign (natural language voice design), and Base models (voice cloning).
- Multiple Model Support: CustomVoice, VoiceDesign, and Base (voice cloning) models
- Interactive Demo Page: Built-in web UI for testing all API endpoints
- Streaming & Batch Generation: Real-time streaming or batch processing
- Voice Prompt Caching: Intelligent LRU cache for 60-80% latency reduction on repeated requests
- Smart Audio Preprocessing: Automatic silence removal, clipping, and normalization
- Performance Monitoring: Real-Time Factor (RTF) tracking and detailed metrics
- Speed Control: Adjust speech speed from 0.5x to 2.0x without pitch changes
- API Key Authentication: Secure access control
- Docker Support: Easy deployment with GPU support
- Auto-generated Documentation: Interactive API docs with Swagger UI
- Lazy Model Loading: Efficient memory usage by loading models on-demand
- RESTful API: Standard HTTP endpoints with JSON responses
- Python 3.10 or higher
- Conda or Miniconda (recommended for CUDA management)
- NVIDIA GPU with CUDA support (recommended)
- At least 8GB GPU memory for 1.7B models (16GB recommended)
- Docker and docker-compose (for Docker deployment)
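To sanity-check your GPU before starting the server, a minimal Python sketch (requires PyTorch, so run it after installation):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
else:
    # No CUDA GPU found; the server can still run on CPU (set CUDA_DEVICE=cpu in .env)
    print("No CUDA GPU detected")
```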
Option 1: Automated Installation (Recommended)
Use the automated installation script:
```bash
# Create and activate conda environment first
conda create -n qwen-tts python=3.12
conda activate qwen-tts

# Run the installer
chmod +x install.sh
./install.sh
```

The script will:
- Detect CUDA availability
- Install PyTorch (CUDA or CPU)
- Install Flash Attention if compatible
- Configure `.env` automatically
Option 2: Manual Installation (All platforms)
- Activate conda environment

```bash
conda activate qwen-tts  # or your preferred environment
```

- Install dependencies in order (important!)
```bash
# Step 1: Install PyTorch first
pip install "torch>=2.1.0" --index-url https://download.pytorch.org/whl/cu121

# Step 2: Install other dependencies
pip install -r requirements.txt

# Step 3 (Optional): Install Flash Attention for better performance
# This requires CUDA and takes 5-15 minutes to compile
pip install "flash-attn>=2.5.0" --no-build-isolation
```

Note: Flash Attention is optional. If installation fails, the server works fine without it. Set `USE_FLASH_ATTENTION=false` in `.env`.
- Configure environment

```bash
cp .env.example .env
# Edit .env with your configuration
```

- Start the server

```bash
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Or use the unified run script:

```bash
chmod +x run.sh
./run.sh
```

Quick Start Options:
```bash
# Quick start (auto-detects conda environment)
./run.sh

# First time setup with dependency installation
./run.sh --setup

# Run with HTTPS
./run.sh --ssl

# Run with Docker
./run.sh --docker

# Development mode with auto-reload
./run.sh --dev

# Show all options
./run.sh --help
```

Important
GPU Support Requirement: To run with `--gpus all`, you must have the NVIDIA Container Toolkit installed and configured on your host machine.
If you see a `could not select device driver` error, run:
```bash
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Option 1: Pull from Docker Hub (Recommended)
```bash
# GPU version (recommended: mount models for persistence)
docker run -d --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/app/models \
  linkary/qwen-tts-server:latest

# CPU version
docker run -d -p 8000:8000 \
  -e CUDA_DEVICE=cpu \
  -v ~/.cache/huggingface:/app/models \
  linkary/qwen-tts-server:latest
```

Option 2: Build locally
```bash
# Build and start with GPU support
docker-compose up -d

# Build and start with CPU only (no GPU required)
docker-compose -f docker-compose.yml -f docker-compose.cpu.yml up -d

# Development mode (with hot reload)
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d
```

Manage containers
```bash
# Check logs
docker-compose logs -f

# Stop the server
docker-compose down
```

Edit the `.env` file to configure the server:
```bash
# Model Configuration
QWEN_TTS_CUSTOM_VOICE_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
QWEN_TTS_VOICE_DESIGN_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
QWEN_TTS_BASE_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base
QWEN_TTS_TOKENIZER=Qwen/Qwen3-TTS-Tokenizer-12Hz

# Device Configuration
CUDA_DEVICE=cuda:0
MODEL_DTYPE=bfloat16
USE_FLASH_ATTENTION=true

# API Configuration
API_KEYS=your-api-key-1,your-api-key-2,your-api-key-3
HOST=0.0.0.0
PORT=8000

# Model Caching
HF_HOME=/app/models
MODEL_CACHE_DIR=/app/models

# Logging
LOG_LEVEL=INFO

# Model Loading
PRELOAD_MODELS=false

# Voice Prompt Caching (NEW in v1.1.0)
VOICE_CACHE_ENABLED=true
VOICE_CACHE_MAX_SIZE=100
VOICE_CACHE_TTL_SECONDS=3600

# Audio Preprocessing (NEW in v1.1.0)
AUDIO_PREPROCESSING_ENABLED=true
REF_AUDIO_MAX_DURATION=15.0
REF_AUDIO_TARGET_DURATION_MIN=5.0

# Audio Validation (NEW in v1.1.0)
AUDIO_UPLOAD_MAX_SIZE_MB=5.0
AUDIO_UPLOAD_MAX_DURATION=60.0

# Performance Monitoring (NEW in v1.1.0)
ENABLE_PERFORMANCE_LOGGING=true
ENABLE_WARMUP=true
WARMUP_TEXT="This is a warmup test to initialize the model."
```

Once the server is running, access the interactive API documentation:
- Demo Page: http://localhost:8000/demo - Interactive web UI for testing all features
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI Schema: http://localhost:8000/openapi.json
The built-in demo page provides a rich visual interface for testing all TTS features:
Generate speech using preset speakers with emotional control and style instructions.
Create custom voices using natural language descriptions.
Clone any voice from a reference audio sample. Choose between Upload File and Microphone tabs for flexible input.
Access complete API reference directly within the demo. Features a widened view for better readability and interactive "View in Swagger" links.
Configure API access and monitor server status.
Features:
- 🌐 Bilingual UI (English / 中文)
- 🎙️ Built-in voice recording for cloning
- 💾 Save and reuse voice prompts
- 📊 Real-time server status monitoring
- 🎨 Retro-futuristic design
Basic health check

```bash
curl http://localhost:8000/health
```

Check which models are loaded

```bash
curl http://localhost:8000/health/models
```

Generate speech with preset speakers
```bash
curl -X POST http://localhost:8000/api/v1/custom-voice/generate \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test of the Qwen TTS system.",
    "language": "English",
    "speaker": "Ryan",
    "instruct": "Speak in a cheerful and energetic tone",
    "response_format": "wav"
  }' \
  --output output.wav
```

Stream generation with Server-Sent Events
```bash
curl -X POST http://localhost:8000/api/v1/custom-voice/generate-stream \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is streaming audio generation.",
    "language": "English",
    "speaker": "Aiden"
  }'
```

List available speakers
```bash
curl http://localhost:8000/api/v1/custom-voice/speakers \
  -H "X-API-Key: your-api-key-1"
```

List supported languages

```bash
curl http://localhost:8000/api/v1/custom-voice/languages \
  -H "X-API-Key: your-api-key-1"
```

Generate speech with natural language voice description
```bash
curl -X POST http://localhost:8000/api/v1/voice-design/generate \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Welcome to the future of voice synthesis.",
    "language": "English",
    "instruct": "A warm, professional female voice with a slight British accent, speaking confidently and clearly",
    "response_format": "wav"
  }' \
  --output voice_design.wav
```

Stream voice design generation
```bash
curl -X POST http://localhost:8000/api/v1/voice-design/generate-stream \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Streaming with custom voice design.",
    "language": "English",
    "instruct": "Deep male voice, calm and soothing"
  }'
```

Clone voice from reference audio
```bash
curl -X POST http://localhost:8000/api/v1/base/clone \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is cloned speech using the reference voice.",
    "language": "English",
    "ref_audio_url": "https://example.com/reference.wav",
    "ref_text": "This is the transcript of the reference audio.",
    "response_format": "wav"
  }' \
  --output cloned.wav
```

Create reusable voice clone prompt
```bash
curl -X POST http://localhost:8000/api/v1/base/create-prompt \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "ref_audio_url": "https://example.com/reference.wav",
    "ref_text": "This is the transcript of the reference audio.",
    "x_vector_only_mode": false
  }'
```

Response:
```json
{
  "prompt_id": "550e8400-e29b-41d4-a716-446655440000",
  "message": "Prompt created successfully"
}
```

Generate using saved prompt
```bash
curl -X POST http://localhost:8000/api/v1/base/generate-with-prompt \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "New text to synthesize with the saved voice.",
    "language": "English",
    "prompt_id": "550e8400-e29b-41d4-a716-446655440000",
    "response_format": "wav"
  }' \
  --output prompt_output.wav
```
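The same two-step prompt workflow from Python, a minimal sketch (assumes a local server, the example API key, and a reachable reference audio URL):

```python
import requests

API_KEY = "your-api-key-1"
BASE_URL = "http://localhost:8000"
headers = {"X-API-Key": API_KEY, "Content-Type": "application/json"}

# Step 1: Create a reusable prompt from reference audio
resp = requests.post(
    f"{BASE_URL}/api/v1/base/create-prompt",
    headers=headers,
    json={
        "ref_audio_url": "https://example.com/reference.wav",
        "ref_text": "This is the transcript of the reference audio.",
        "x_vector_only_mode": False
    }
)
prompt_id = resp.json()["prompt_id"]

# Step 2: Reuse the prompt for any number of generations
resp = requests.post(
    f"{BASE_URL}/api/v1/base/generate-with-prompt",
    headers=headers,
    json={
        "text": "New text to synthesize with the saved voice.",
        "language": "English",
        "prompt_id": prompt_id,
        "response_format": "wav"
    }
)
with open("prompt_output.wav", "wb") as f:
    f.write(resp.content)
```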
Upload reference audio file

```bash
curl -X POST http://localhost:8000/api/v1/base/upload-ref-audio \
  -H "X-API-Key: your-api-key-1" \
  -F "[email protected]"
```
Get voice cache statistics

```bash
curl http://localhost:8000/api/v1/base/cache/stats \
  -H "X-API-Key: your-api-key-1"
```

Response:
```json
{
  "enabled": true,
  "size": 15,
  "max_size": 100,
  "hits": 120,
  "misses": 35,
  "evictions": 2,
  "hit_rate_percent": 77.42,
  "total_requests": 155
}
```

Clear all cached voice prompts
```bash
curl -X POST http://localhost:8000/api/v1/base/cache/clear \
  -H "X-API-Key: your-api-key-1"
```

All generation endpoints now support an optional `speed` parameter to adjust speech speed:
```bash
curl -X POST http://localhost:8000/api/v1/custom-voice/generate \
  -H "X-API-Key: your-api-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This will be spoken faster.",
    "language": "English",
    "speaker": "Ryan",
    "speed": 1.5
  }' \
  --output fast_speech.wav
```

Speed range: 0.5x (slower) to 2.0x (faster); the default is 1.0x.
All generation responses include performance headers:

```
X-Generation-Time: 2.340
X-Audio-Duration: 3.500
X-RTF: 0.670
X-Cache-Status: hit
X-Preprocessing-Time: 0.120
```

- RTF (Real-Time Factor): Lower is better. An RTF of 0.67 means generation took 67% of the audio's duration (here, 2.340 s to produce 3.500 s of audio).
- Cache-Status: Shows whether the voice prompt cache was used (`hit` or `miss`).
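For example, you can read these headers from Python and compute the RTF yourself (a minimal sketch; assumes a local server and the example API key):

```python
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/custom-voice/generate",
    headers={"X-API-Key": "your-api-key-1", "Content-Type": "application/json"},
    json={"text": "Measuring performance.", "language": "English", "speaker": "Ryan"},
)
gen_time = float(resp.headers["X-Generation-Time"])
duration = float(resp.headers["X-Audio-Duration"])
print(f"RTF: {gen_time / duration:.2f} "
      f"(cache: {resp.headers.get('X-Cache-Status', 'n/a')})")
```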
Voice prompts are automatically cached for faster repeated generations. Cache behavior is transparent:
- First request with new reference audio: Cache miss, full feature extraction
- Subsequent requests with same audio: Cache hit, 60-80% faster
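To see the cache in action, send the same clone request twice and compare the headers (a sketch; assumes a local server, the example API key, and a reachable reference audio URL):

```python
import requests

headers = {"X-API-Key": "your-api-key-1", "Content-Type": "application/json"}
payload = {
    "text": "Testing the voice prompt cache.",
    "language": "English",
    "ref_audio_url": "https://example.com/reference.wav",
    "ref_text": "This is the transcript of the reference audio.",
}

# The first request should report a miss, the second a hit
for attempt in ("first", "second"):
    resp = requests.post(
        "http://localhost:8000/api/v1/base/clone", headers=headers, json=payload
    )
    print(attempt, resp.headers.get("X-Cache-Status"),
          resp.headers.get("X-Generation-Time"))
```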
Configure caching in `.env`:

```bash
VOICE_CACHE_ENABLED=true
VOICE_CACHE_MAX_SIZE=100
VOICE_CACHE_TTL_SECONDS=3600
```

Reference audio is automatically preprocessed for optimal quality:
- Smart clipping to 5-15 seconds at natural pauses
- Silence removal from beginning and end
- Mono conversion and normalization
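The server handles this automatically; for intuition only, the trimming and normalization steps are conceptually similar to this librosa sketch (an illustration, not the server's actual implementation):

```python
import librosa
import soundfile as sf

# Load as mono, trim leading/trailing silence, and peak-normalize
y, sr = librosa.load("reference.wav", sr=None, mono=True)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)
y_norm = y_trimmed / max(abs(y_trimmed).max(), 1e-9)
sf.write("reference_clean.wav", y_norm, sr)
```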
Configure preprocessing in `.env`:

```bash
AUDIO_PREPROCESSING_ENABLED=true
REF_AUDIO_MAX_DURATION=15.0
REF_AUDIO_TARGET_DURATION_MIN=5.0
```

CustomVoice example:

```python
import requests
import base64

API_KEY = "your-api-key-1"
BASE_URL = "http://localhost:8000"

headers = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

# Generate with CustomVoice
response = requests.post(
    f"{BASE_URL}/api/v1/custom-voice/generate",
    headers=headers,
    json={
        "text": "Hello from Python!",
        "language": "English",
        "speaker": "Ryan",
        "instruct": "Speak enthusiastically",
        "response_format": "base64"
    }
)

if response.status_code == 200:
    result = response.json()
    audio_data = base64.b64decode(result["audio"])
    with open("output.wav", "wb") as f:
        f.write(audio_data)
    print(f"Audio saved! Sample rate: {result['sample_rate']} Hz")
else:
    print(f"Error: {response.status_code} - {response.text}")
```

VoiceDesign example:

```python
import requests
API_KEY = "your-api-key-1"
BASE_URL = "http://localhost:8000"
headers = {
"X-API-Key": API_KEY,
"Content-Type": "application/json"
}
# Generate with VoiceDesign
response = requests.post(
f"{BASE_URL}/api/v1/voice-design/generate",
headers=headers,
json={
"text": "Custom voice design in action.",
"language": "English",
"instruct": "A young male voice, 20-25 years old, speaking with excitement and energy",
"response_format": "wav"
}
)
if response.status_code == 200:
with open("voice_design.wav", "wb") as f:
f.write(response.content)
print("Voice design audio saved!")import requests
import base64
API_KEY = "your-api-key-1"
BASE_URL = "http://localhost:8000"
headers = {
"X-API-Key": API_KEY,
"Content-Type": "application/json"
}
# Read reference audio and encode to base64
with open("reference.wav", "rb") as f:
ref_audio_base64 = base64.b64encode(f.read()).decode()
# Clone voice
response = requests.post(
f"{BASE_URL}/api/v1/base/clone",
headers=headers,
json={
"text": "This is my cloned voice speaking new words.",
"language": "English",
"ref_audio_base64": ref_audio_base64,
"ref_text": "Original text from reference audio.",
"x_vector_only_mode": False,
"response_format": "wav"
}
)
if response.status_code == 200:
with open("cloned.wav", "wb") as f:
f.write(response.content)
print("Cloned voice saved!")import requests
import base64

API_KEY = "your-api-key-1"
BASE_URL = "http://localhost:8000"

headers = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

# Stream audio generation over Server-Sent Events
response = requests.post(
    f"{BASE_URL}/api/v1/custom-voice/generate-stream",
    headers=headers,
    json={
        "text": "Streaming audio in real-time.",
        "language": "English",
        "speaker": "Aiden"
    },
    stream=True
)

chunks = []
for line in response.iter_lines():
    if not line:
        continue
    line_str = line.decode("utf-8")
    if line_str.startswith("data: "):
        data = line_str[6:]  # Remove the 'data: ' prefix
        if data != "complete":
            try:
                # Assumption: each event's data is a base64-encoded audio chunk
                chunks.append(base64.b64decode(data))
            except (ValueError, TypeError):
                pass  # Skip any non-audio events

print(f"Received {len(chunks)} audio chunks")
```

| Speaker | Description | Native Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female voice | Chinese |
| Serena | Warm, gentle young female voice | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive | English |
| Aiden | Sunny American male voice with a clear midrange | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre | Japanese |
| Sohee | Warm Korean female voice with rich emotion | Korean |
- Auto (automatic language detection)
- Chinese
- English
- Japanese
- Korean
- German
- French
- Russian
- Portuguese
- Spanish
- Italian
- Enable Voice Caching: Set `VOICE_CACHE_ENABLED=true` for 60-80% faster repeated requests
- Preload Models: Set `PRELOAD_MODELS=true` for a faster first request (requires more memory)
- Use Prompts: For voice cloning, create reusable prompts to avoid re-extracting features
- Enable Warmup: Set `ENABLE_WARMUP=true` to eliminate first-request latency
- Flash Attention: Enable for faster inference on supported GPUs (`USE_FLASH_ATTENTION=true`)
- Model Size: Use 0.6B models for faster inference with slightly lower quality
- Audio Preprocessing: Keep `AUDIO_PREPROCESSING_ENABLED=true` for better voice clone quality
- Monitor Performance: Check `X-RTF` headers to identify bottlenecks
With caching enabled on an RTX 4090:
| Operation | First Request | Cached Request | Speedup |
|---|---|---|---|
| Voice Clone (Base) | 2.5s | 0.6s | 4.2x |
| CustomVoice | 1.8s | 1.8s | N/A |
| VoiceDesign | 2.1s | 2.1s | N/A |
Note: CustomVoice and VoiceDesign don't benefit from voice caching, which applies only to reference-audio voice prompts.
If you encounter out-of-memory errors:

- Use the 0.6B models instead of 1.7B:

```bash
QWEN_TTS_CUSTOM_VOICE_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
QWEN_TTS_BASE_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base
```

- Set `PRELOAD_MODELS=false` to load models on-demand
- Reduce batch sizes in requests
Flash Attention is optional and requires:
- NVIDIA GPU (Ampere or newer)
- CUDA Toolkit 11.8+
- PyTorch installed first
If Flash Attention fails to install:
- Make sure torch is installed first:

```bash
pip install "torch>=2.1.0" --index-url https://download.pytorch.org/whl/cu121
```

- Try installing with `--no-build-isolation`:

```bash
pip install "flash-attn>=2.5.0" --no-build-isolation
```

- If it still fails, skip it and set `USE_FLASH_ATTENTION=false` in `.env`. The server works fine without Flash Attention; you'll just have slightly slower inference.
Models are automatically downloaded on first use. If you have download issues:
- Pre-download models using ModelScope or the HuggingFace CLI (see the sketch below)
- Set `MODEL_CACHE_DIR` to point to your download location
- Use local model paths instead of HuggingFace IDs
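For example, pre-downloading a model with the `huggingface_hub` Python API (a sketch; the model ID comes from the configuration above, and the target directory should match `MODEL_CACHE_DIR`):

```python
from huggingface_hub import snapshot_download

# Download once into the directory that MODEL_CACHE_DIR points to
snapshot_download(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    local_dir="/app/models/Qwen3-TTS-12Hz-1.7B-CustomVoice",
)
```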
If authentication fails:
- Verify `API_KEYS` is set in `.env`
- Ensure you're passing the `X-API-Key` header in requests
- For development without authentication, leave `API_KEYS` empty
- Preload Models: Set `PRELOAD_MODELS=true` for a faster first request (requires more memory)
- Use Prompts: For voice cloning, create reusable prompts to avoid re-extracting features
- Batch Requests: Use batch endpoints for multiple texts to improve throughput
- Flash Attention: Enable for faster inference on supported GPUs
- Model Size: Use 0.6B models for faster inference with slightly lower quality
This project follows the Qwen3-TTS license. See the official repository for details.
If you use this API server, please cite the Qwen3-TTS paper:
```bibtex
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```

- Official Qwen3-TTS: https://github.com/QwenLM/Qwen3-TTS
- Issues: Report issues on the GitHub repository
- Documentation: Visit the interactive API docs at `/docs`
This API server is built on top of the excellent Qwen3-TTS project by the Qwen team at Alibaba Cloud.




