GLM-4-Voice

Read this in Chinese

Thank you, cydxg, for quantizing and providing code for this model: https://huggingface.co/cydxg/glm-4-voice-9b-int4

GLM-4-Voice is an end-to-end voice model launched by Zhipu AI. GLM-4-Voice can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions.

Model Architecture

We provide the three components of GLM-4-Voice:

GLM-4-Voice-Tokenizer: Trained by adding vector quantization to the encoder part of Whisper, converting continuous speech input into discrete tokens. Each second of audio is converted into 12.5 discrete tokens.
GLM-4-Voice-9B: Built on GLM-4-9B, then pre-trained and aligned on the speech modality so that it can understand and generate discretized speech.
GLM-4-Voice-Decoder: A speech decoder supporting streaming inference, retrained based on CosyVoice, converting discrete speech tokens into continuous speech output. Generation can start with as few as 10 audio tokens, reducing conversation latency.
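
The token rate and the streaming threshold described above imply some simple arithmetic, sketched here. Only the two figures (12.5 tokens per second and the 10-token minimum) come from the description; the function names are illustrative and not part of the released code, and rounding up partial tokens is an assumption.

```python
import math

TOKENS_PER_SECOND = 12.5   # tokenizer rate stated in the component list
MIN_TOKENS_TO_START = 10   # decoder can begin generating after this many tokens


def tokens_for_duration(seconds: float) -> int:
    """Approximate token count for a clip (rounding up is an assumption)."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


def duration_for_tokens(num_tokens: int) -> float:
    """Seconds of audio represented by num_tokens discrete tokens."""
    return num_tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    print(tokens_for_duration(4.0))                  # 50
    print(duration_for_tokens(MIN_TOKENS_TO_START))  # 0.8
```

Under these figures, the first 10 tokens represent about 0.8 seconds of speech, which is the amount of audio the decoder needs before it can start producing output.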

A more detailed technical report will be published later.

Model List

Model	Type	Download
GLM-4-Voice-Tokenizer	Speech Tokenizer	🤗 Huggingface
GLM-4-Voice-9B-int4	Chat Model	🤗 Huggingface
GLM-4-Voice-Decoder	Speech Decoder	🤗 Huggingface

Usage

We provide a Web Demo that can be launched directly. Users can input speech or text, and the model will respond with both speech and text.

Preparation

First, download the repository.

git clone --recurse-submodules https://github.com/PasiKoodaa/GLM-4-Voice-12GB
cd GLM-4-Voice-12GB

Then, install the dependencies.

pip install -r requirements.txt

Since the Decoder model does not support initialization via transformers, the checkpoint needs to be downloaded separately.

# Git model download, please ensure git-lfs is installed
git clone https://huggingface.co/THUDM/glm-4-voice-decoder
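
If git-lfs is missing, a clone of this kind typically succeeds but leaves small pointer files in place of the checkpoint weights. The following is a minimal pre-flight sketch that checks for the tools before cloning. The helper names are hypothetical, and it assumes the git-lfs executable is on PATH under the name `git-lfs`.

```python
import shutil
import subprocess

DECODER_REPO = "https://huggingface.co/THUDM/glm-4-voice-decoder"


def build_clone_command(repo_url: str = DECODER_REPO) -> list[str]:
    """Return the same git command shown above, as an argument list."""
    return ["git", "clone", repo_url]


def missing_tools(tools=("git", "git-lfs")) -> list[str]:
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def download_decoder() -> None:
    """Clone the decoder repository after verifying git and git-lfs exist."""
    missing = missing_tools()
    if missing:
        raise RuntimeError(f"Install these tools first: {', '.join(missing)}")
    subprocess.run(build_clone_command(), check=True)
```

Calling `download_decoder()` from the repository root performs the same clone as the command above, but fails early with a clear message when a tool is absent.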

Launch Web Demo

First, start the model service.

python model_server.py --model-path glm-4-voice-9b

Then, start the web service.

python web_demo.py

You can then access the web demo at http://127.0.0.1:8888.
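
If the page does not load, it can help to confirm that something is actually listening on that address. Here is a small sketch using only the Python standard library. The function name is illustrative, and the default host and port are simply the ones from the URL above.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 8888,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    status = "reachable" if is_port_open() else "not reachable"
    print(f"Web demo at 127.0.0.1:8888 is {status}")
```

This only checks that the TCP port accepts connections; it does not verify that the model service behind the web demo has finished loading.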

Known Issues

Gradio's streaming audio playback can be unstable. For better audio quality, wait until generation is complete and then click the audio clip in the dialogue box.

Examples

We provide some dialogue cases for GLM-4-Voice, including emotion control, speech rate alteration, dialect generation, etc. (The examples are in English.)

Use a gentle voice to guide me to relax

1027.2.1.mp4

Use an excited voice to commentate a football match

1027.2.mp4

Tell a ghost story with a mournful voice

1027.2.2.mp4

Introduce how cold winter is with a Northeastern dialect

1027.2.3.mp4

Cry about your lost cat

1027.2.4.mp4

Acknowledgements

Some code in this project is from:
