This document explains how to build and run the real-time Text-to-Speech server using Docker.
- Docker installed on your system.
- An NVIDIA GPU with CUDA drivers installed to run the model efficiently.
- You are in the main RealtimeTTS directory in your terminal.
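The prerequisites above can be checked from the terminal before building. The following is a minimal sketch for a POSIX shell, not part of the project: `have_cmd` and `check_prereqs` are hypothetical helper names, and the sketch only tests whether each tool is on the `PATH`, not whether the GPU is usable from inside a container.

```sh
# Report whether a command is available on PATH (exit status 0 = found).
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one line per tool and return non-zero if any tool is missing.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if have_cmd "$tool"; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be `check_prereqs docker nvidia-smi curl ffplay`. The `nvidia-smi` tool ships with the NVIDIA driver, so its presence is a reasonable hint that the driver is installed.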
Run the following command to build the Docker image. This process will take some time as it downloads the CUDA base image, clones repositories, and installs all dependencies.
docker build -t zipvoice-image -f docker/zipvoice/Dockerfile .
Once the image is built, you can run it as a container. We assign a specific name (zipvoice-container) to make it easy to manage.
This method maps a local folder on your computer to the cache folder inside the container. This saves the downloaded models on your machine, making all subsequent startups much faster.
- First, ensure the local cache directory exists.
mkdir -p docker/zipvoice/zipvoice-cache
- Now, run the container using the command for your operating system's terminal.
On Linux, macOS, or Windows PowerShell:
docker run --rm --name zipvoice-container -p 9086:9086 --gpus all -v "$(pwd)/docker/zipvoice/zipvoice-cache:/opt/app-root/cache" zipvoice-image
On Windows Command Prompt (cmd.exe):
docker run --rm --name zipvoice-container -p 9086:9086 --gpus all -v "%cd%\docker\zipvoice\zipvoice-cache:/opt/app-root/cache" zipvoice-image
Use this command if you don't want to persist the model cache on your host machine. Note that the server will download the models every time it starts, resulting in a long startup delay.
docker run --rm --name zipvoice-container -p 9086:9086 --gpus all zipvoice-image
Because we started the container with --name zipvoice-container, you can stop it from another terminal window with the following command:
docker stop zipvoice-container
If the container is unresponsive, you can force it to stop immediately with:
docker kill zipvoice-container
Once the server is running (you'll see a --- Server Ready --- message in the logs), you can send a request to it from a new terminal.
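If you would rather script the wait for the ready message than watch the logs, a sketch along these lines may help. `wait_for_marker` is a hypothetical helper, not something the image provides; it reads log lines from standard input and succeeds as soon as one contains the given text.

```sh
# Read lines from stdin and return 0 as soon as one contains the marker text.
wait_for_marker() {
    marker=$1
    while IFS= read -r line; do
        case $line in
            *"$marker"*) return 0 ;;
        esac
    done
    return 1   # the stream ended without the marker appearing
}
```

Assuming the server writes the message to the container's standard output or standard error, it could be used as `docker logs -f zipvoice-container 2>&1 | wait_for_marker "--- Server Ready ---" && echo ready`. The `docker logs -f` process only exits the next time it tries to write after the helper has returned, so the pipeline may linger briefly.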
This command sends a request and saves the resulting audio to a file named output.pcm.
- Send the request:
curl -X POST http://localhost:9086/api/c3BlZWNo \
  -H "Content-Type: application/json" \
  -d '{ "text": "Hello world, this is a test of the real time text to speech server.", "voice": "alpha-warm" }' \
  --output output.pcm
- Play the raw PCM audio file:
ffplay -f s16le -ar 24000 -ac 1 output.pcm
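Because raw PCM has no header, the format has to be supplied by whoever reads the file. The sketch below assumes the format implied by the ffplay flags above (signed 16-bit little-endian, 24000 Hz, mono); `pcm_duration_ms` and `pcm_to_wav` are hypothetical helpers. The first estimates the clip length from the file size, the second wraps the samples in a WAV container using ffmpeg so that ordinary players can open the file.

```sh
# Assumed format, taken from the ffplay flags: s16le, 24000 Hz, 1 channel.
SAMPLE_RATE=24000
BYTES_PER_SAMPLE=2
CHANNELS=1

# Duration in milliseconds of a raw PCM stream of the given size in bytes.
pcm_duration_ms() {
    echo $(( $1 * 1000 / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS) ))
}

# Convert a raw PCM file ($1) to a WAV file ($2). Requires ffmpeg.
pcm_to_wav() {
    ffmpeg -f s16le -ar "$SAMPLE_RATE" -ac "$CHANNELS" -i "$1" "$2"
}
```

For example, `pcm_duration_ms "$(wc -c < output.pcm)"` prints the approximate length of the saved clip, and `pcm_to_wav output.pcm output.wav` produces a playable WAV file. If the server's output format differs from the assumption, both results will be wrong.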
This command sends a request and immediately pipes the streaming audio output to ffplay for real-time playback. This is excellent for testing latency.
curl --no-buffer -X POST http://localhost:9086/api/c3BlZWNo -H "Content-Type: application/json" -d "{\"text\": \"Hi there! I'm really excited to try this out! I hope the speech sounds natural and warm. That's exactly what I'm going for!\", \"voice\": \"alpha-warm\"}" | ffplay -f s16le -ar 24000 -i pipe:0 -nodisp -autoexit -probesize 32 -analyzeduration 0
For two consecutive syntheses:
curl --no-buffer -X POST http://localhost:9086/api/c3BlZWNo -H "Content-Type: application/json" -d "{\"text\": \"Hey! So this is me testing out my voice... kinda nervous but also excited about it. This whole voice synthesis thing is sooo fascinating. I mean... technology these days is like creating a perfect robot version of a person, right?\", \"voice\": \"alpha-warm\"}" | ffplay -f s16le -ar 24000 -i pipe:0 -nodisp -autoexit -probesize 32 -analyzeduration 0 && curl --no-buffer -X POST http://localhost:9086/api/c3BlZWNo -H "Content-Type: application/json" -d "{\"text\": \"The voice you knew... is GONE. What you're hearing now... is something ENTIRELY different. This isn't just another voice - this is POWER unleashed, INTENSITY personified, COMMAND that cuts through your soul like a blade through silence.\", \"voice\": \"beta-intense\"}" | ffplay -f s16le -ar 24000 -i pipe:0 -nodisp -autoexit -probesize 32 -analyzeduration 0
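Hand-escaping the JSON body becomes error-prone as the text grows. One possible way to keep the streaming commands above readable is to wrap them in small shell functions. This is only a sketch for a POSIX shell: `json_escape`, `tts_json`, `speak`, and the `TTS_URL` variable are hypothetical names, while the endpoint, port, and ffplay flags are copied from the commands above. The escaping covers backslashes and double quotes only, not newlines or other control characters.

```sh
# Endpoint copied from the curl commands above; the variable name is illustrative.
TTS_URL="http://localhost:9086/api/c3BlZWNo"

# Escape backslashes and double quotes so a value is safe inside a JSON string.
# Newlines and other control characters are NOT handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body from a text ($1) and a voice name ($2).
tts_json() {
    printf '{"text": "%s", "voice": "%s"}\n' "$(json_escape "$1")" "$(json_escape "$2")"
}

# Stream one synthesis to ffplay; returns once ffplay has exited.
speak() {
    curl --no-buffer -X POST "$TTS_URL" \
        -H "Content-Type: application/json" \
        -d "$(tts_json "$1" "$2")" |
        ffplay -f s16le -ar 24000 -i pipe:0 -nodisp -autoexit -probesize 32 -analyzeduration 0
}
```

With these defined, two back-to-back syntheses reduce to `speak 'Hey! So this is me testing out my voice...' alpha-warm && speak 'The voice you knew... is GONE.' beta-intense`. Single quotes around the text avoid history expansion of `!` in interactive bash sessions.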