OmniVoice TTS nodes for ComfyUI — Zero-shot multilingual text-to-speech with voice cloning and voice design. Supports 600+ languages with state-of-the-art quality.
- 600+ Languages — Broadest language coverage among zero-shot TTS models
- Voice Cloning — Clone any voice from 3-15 seconds of reference audio
- Voice Design — Create synthetic voices from text descriptions (gender, age, pitch, accent)
- Multi-Speaker Dialogue — Generate conversations between multiple speakers using
[Speaker_N]:tags - Fast Inference — RTF as low as 0.025 (40x faster than real-time)
- Non-Verbal Expressions — Inline tags like
[laughter],[sigh],[sniff] - SageAttention Support — GPU-optimized attention via monkey-patching Qwen3Attention (CUDA, SM80+)
- Auto-Download — Models download automatically from HuggingFace on first use
- Whisper ASR Caching — Pre-load Whisper to avoid re-downloading on each run
- VRAM Efficient — Automatic CPU offload, VBAR/aimdo integration, smart cache invalidation
comfyui-omnivoice-example.mp4
Search for "OmniVoice" in ComfyUI Manager and click Install.
cd ComfyUI/custom_nodes
git clone https://github.com/saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.pyThe omnivoice pip package specifies torch==2.8.* as a dependency, which can downgrade your PyTorch to a CPU-only version and break ComfyUI's GPU acceleration. We work around this by installing omnivoice with --no-deps in install.py, then separately installing only the missing dependencies that ComfyUI doesn't already provide.
If another package accidentally downgrades your PyTorch, see the PyTorch Compatibility Matrix for restore commands matching your setup.
1. OmniVoice Longform TTS — Long-form text-to-speech with smart chunking and optional voice cloning
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | COMBO | (auto) | OmniVoice model checkpoint |
| text | STRING, multiline | "Hello!..." |
Text to synthesize |
| ref_text | STRING, multiline | "" | Reference audio transcript (empty=auto-detect) |
| steps | INT | 32 | Diffusion steps (4-64, 16=faster, 64=best) |
| guidance_scale | FLOAT | 2.0 | Classifier-free guidance scale (0-10) |
| t_shift | FLOAT | 0.1 | Time-step shift for noise schedule (0-1) |
| speed | FLOAT | 1.0 | Speaking speed (0.5-2.0, >1=faster) |
| duration | FLOAT | 0.0 | Fixed duration in seconds (0=auto) |
| device | COMBO | auto | auto, cuda, cpu, mps |
| dtype | COMBO | auto | auto, bf16, fp16, fp32 |
| attention | COMBO | auto | auto, eager, sage_attention |
| seed | INT | 0 | Random seed (0=random) |
| words_per_chunk | INT | 100 | Words per chunk (0=no chunking) |
| position_temperature | FLOAT | 5.0 | Mask-position temperature (0=greedy, higher=more random) |
| class_temperature | FLOAT | 0.0 | Token sampling temperature (0=greedy) |
| layer_penalty_factor | FLOAT | 5.0 | Penalty on deeper codebook layers |
| denoise | BOOLEAN | True | Prepend denoise token for cleaner output |
| preprocess_prompt | BOOLEAN | True | Preprocess reference audio (remove silences) |
| postprocess_output | BOOLEAN | True | Post-process generated audio (remove long silences) |
| keep_model_loaded | BOOLEAN | True | Keep model in memory (offloads to CPU between runs) |
| instruct | STRING | "" | Dialect/style instruction. Only specific values are supported — see Dialect/Style Instructions. Applied to every chunk |
Optional Inputs:
ref_audio— Reference audio for voice cloning (3-15s optimal)whisper_model— Pre-loaded Whisper ASR model
2. OmniVoice Voice Clone TTS — Clone a voice from reference audio
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | COMBO | (auto) | OmniVoice model checkpoint |
| text | STRING, multiline | "Hello!..." |
Text to synthesize in cloned voice |
| ref_audio | AUDIO | required | Reference audio (3-15s) |
| ref_text | STRING, multiline | "" | Transcript (empty=auto-transcribe with Whisper) |
| steps | INT | 32 | Diffusion steps (4-64) |
| guidance_scale | FLOAT | 2.0 | Classifier-free guidance scale (0-10) |
| t_shift | FLOAT | 0.1 | Time-step shift for noise schedule (0-1) |
| speed | FLOAT | 1.0 | Speaking speed (0.5-2.0) |
| duration | FLOAT | 0.0 | Fixed duration in seconds (0=auto) |
| device | COMBO | auto | auto, cuda, cpu, mps |
| dtype | COMBO | auto | auto, bf16, fp16, fp32 |
| attention | COMBO | auto | auto, eager, sage_attention |
| seed | INT | 0 | Random seed (0=random) |
| position_temperature | FLOAT | 5.0 | Mask-position temperature (0=greedy) |
| class_temperature | FLOAT | 0.0 | Token sampling temperature (0=greedy) |
| layer_penalty_factor | FLOAT | 5.0 | Penalty on deeper codebook layers |
| denoise | BOOLEAN | True | Prepend denoise token for cleaner output |
| preprocess_prompt | BOOLEAN | True | Preprocess reference audio (remove silences) |
| postprocess_output | BOOLEAN | True | Post-process generated audio (remove long silences) |
| keep_model_loaded | BOOLEAN | True | Keep model in memory |
| instruct | STRING | "" | Dialect/style instruction. Only specific values are supported — see Dialect/Style Instructions |
Optional Input:
whisper_model— Pre-loaded Whisper from OmniVoice Whisper Loader
3. OmniVoice Voice Design TTS — Design voices from text descriptions, no reference audio needed
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | COMBO | (auto) | OmniVoice model checkpoint |
| text | STRING, multiline | "Hello!..." |
Text to synthesize in designed voice |
| voice_instruct | STRING, multiline | "female, low pitch..." |
Voice attributes (comma-separated) |
| steps | INT | 32 | Diffusion steps (4-64) |
| guidance_scale | FLOAT | 2.0 | Classifier-free guidance scale (0-10) |
| t_shift | FLOAT | 0.1 | Time-step shift for noise schedule (0-1) |
| speed | FLOAT | 1.0 | Speaking speed (0.5-2.0) |
| duration | FLOAT | 0.0 | Fixed duration in seconds (0=auto) |
| device | COMBO | auto | auto, cuda, cpu, mps |
| dtype | COMBO | auto | auto, bf16, fp16, fp32 |
| attention | COMBO | auto | auto, eager, sage_attention |
| seed | INT | 0 | Random seed (0=random) |
| position_temperature | FLOAT | 5.0 | Mask-position temperature (0=greedy) |
| class_temperature | FLOAT | 0.0 | Token sampling temperature (0=greedy) |
| layer_penalty_factor | FLOAT | 5.0 | Penalty on deeper codebook layers |
| denoise | BOOLEAN | True | Prepend denoise token for cleaner output |
| postprocess_output | BOOLEAN | True | Post-process generated audio (remove long silences) |
| keep_model_loaded | BOOLEAN | True | Keep model in memory |
4. OmniVoice Multi-Speaker TTS — Generate dialogue between multiple speakers using [Speaker_N]: tags
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | COMBO | (auto) | OmniVoice model checkpoint |
| text | STRING, multiline | "[Speaker_1]: Hello..." |
Multi-speaker text |
| num_speakers | DYNAMIC | 2 | Number of speakers (2-10, dynamic inputs) |
| steps | INT | 32 | Diffusion steps per speaker |
| guidance_scale | FLOAT | 2.0 | Classifier-free guidance scale (0-10) |
| t_shift | FLOAT | 0.1 | Time-step shift for noise schedule (0-1) |
| speed | FLOAT | 1.0 | Speaking speed for all speakers |
| pause_between_speakers | FLOAT | 0.3 | Silence between speakers (seconds) |
| device | COMBO | auto | auto, cuda, cpu, mps |
| dtype | COMBO | auto | auto, bf16, fp16, fp32 |
| attention | COMBO | auto | auto, eager, sage_attention |
| position_temperature | FLOAT | 5.0 | Mask-position temperature (0=greedy) |
| class_temperature | FLOAT | 0.0 | Token sampling temperature (0=greedy) |
| layer_penalty_factor | FLOAT | 5.0 | Penalty on deeper codebook layers |
| denoise | BOOLEAN | True | Prepend denoise token for cleaner output |
| preprocess_prompt | BOOLEAN | True | Preprocess reference audio |
| postprocess_output | BOOLEAN | True | Post-process generated audio |
| seed | INT | 0 | Random seed (0=random) |
| keep_model_loaded | BOOLEAN | True | Keep model in memory |
| speaker_N_audio | AUDIO | optional | Reference audio for speaker N (1-10) |
| speaker_N_ref_text | STRING | "" | Transcript for speaker N's ref audio |
| speaker_N_instruct | STRING | "" | Dialect/style instruction for speaker N. Only specific values are supported — see Dialect/Style Instructions |
Speaker inputs dynamically show/hide based on num_speakers (ComfyUI >= 0.8.1).
5. OmniVoice Whisper Loader — Pre-load Whisper ASR model for auto-transcription
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | COMBO | (auto) | Whisper model selection |
| device | COMBO | auto | auto, cuda, cpu |
| dtype | COMBO | auto | auto, bf16, fp16, fp32 |
Auto-download: Select models with "(auto-download)" suffix to download on first use.
These parameters control the diffusion-based audio generation process:
| Parameter | What it does | Tips |
|---|---|---|
steps |
Number of iterative unmasking steps | 16 = faster, 32 = balanced, 64 = best quality |
guidance_scale |
Classifier-free guidance strength | Higher = more text-aligned; 2.0 is default |
t_shift |
Time-step shift for noise schedule | Smaller values emphasise earlier decoding steps |
speed |
Speaking speed factor | >1.0 = faster, <1.0 = slower |
duration |
Fixed output length in seconds | Overrides speed when set; 0 = automatic |
position_temperature |
Randomness in mask-position selection | 0 = greedy (deterministic), higher = more random |
class_temperature |
Randomness in token sampling | 0 = greedy (deterministic), higher = more random |
layer_penalty_factor |
Penalty on deeper codebook layers | Encourages lower layers to unmask first |
denoise |
Prepend denoise token to input | Generally improves output quality |
preprocess_prompt |
Clean reference audio | Removes long silences, adds punctuation |
postprocess_output |
Clean generated audio | Removes long silences from output |
OmniVoice's architecture (Qwen3 backbone) has limited attention support through transformers. The attention dropdown offers these options:
| Option | What Actually Happens |
|---|---|
auto |
OmniVoice's default (eager) |
eager |
Standard eager attention (always works) |
sage_attention |
Monkey-patches Qwen3Attention with SageAttention CUDA kernels. GPU-only, requires SM80+ (Ampere+). Falls back to SDPA when attention masks are present. Install: pip install sageattention |
| GPU Architecture | Compute Capability | Kernel Used |
|---|---|---|
| Blackwell (RTX 5090) | SM120 | FP8 |
| Hopper (RTX 4090) | SM90 | FP8 |
| Ada Lovelace (RTX 4070) | SM89 | FP8 |
| Ampere (RTX 3090) | SM80 | FP16 |
| Below SM80 | — | Not supported |
Use [Speaker_N]: tags in text to assign lines to different speakers:
[Speaker_1]: Hello, I'm speaker one.
[Speaker_2]: And I'm speaker two!
[Speaker_1]: Nice to meet you!
Each speaker needs reference audio connected to the corresponding speaker_N_audio input.
Voice Clone, Longform, and Multi-Speaker nodes expose an instruct field that tells the model to use a specific dialect or speaking style. Only the values listed below are supported — the model validates against a fixed list and will reject unsupported values.
English values (comma-separated, e.g. male, indian accent):
| Category | Valid Values |
|---|---|
| Gender | male, female |
| Age | child, young adult, teenager, middle-aged, elderly |
| Accent | american accent, british accent, australian accent, canadian accent, chinese accent, indian accent, japanese accent, korean accent, portuguese accent, russian accent |
| Pitch | very low pitch, low pitch, moderate pitch, high pitch, very high pitch |
| Style | whisper |
Chinese values (full-width comma-separated, e.g. 男,河南话):
| Category | Valid Values |
|---|---|
| Gender | 男, 女 |
| Age | 儿童, 少年, 青年, 中年, 老年 |
| Dialect | 四川话, 东北话, 陕西话, 河南话, 云南话, 贵州话, 甘肃话, 宁夏话, 石家庄话, 济南话, 青岛话, 桂林话 |
| Pitch | 极低音调, 低音调, 中音调, 高音调, 极高音调 |
| Style | 耳语 |
Note: Use only English or only Chinese values in a single instruct string — don't mix them.
Leave the field empty for default behaviour (Standard Mandarin for Chinese text).
Note: This is distinct from the Voice Design node's
voice_instructfield, which controls gender, age, pitch, and accent for synthesising entirely new voices.
Comma-separated attributes for voice_instruct (same valid values as the instruct field above):
| Category | Options |
|---|---|
| Gender | male, female |
| Age | child, young adult, teenager, middle-aged, elderly |
| Accent | american accent, british accent, australian accent, canadian accent, chinese accent, indian accent, japanese accent, korean accent, portuguese accent, russian accent |
| Pitch | very low pitch, low pitch, moderate pitch, high pitch, very high pitch |
| Style | whisper |
| Chinese Dialect | 四川话, 东北话, 陕西话, 河南话, 云南话, 贵州话, 甘肃话, 宁夏话, 石家庄话, 济南话, 青岛话, 桂林话 |
Example: "female, young, high pitch, british accent, whisper"
Insert these directly in your text:
| Tag | Effect |
|---|---|
[laughter] |
Natural laughter |
[sigh] |
Expressive sigh |
[sniff] |
Sniffing sound |
[question-en], [question-ah], [question-oh] |
Question intonations |
[surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo] |
Surprise expressions |
[dissatisfaction-hnn] |
Dissatisfaction sound |
[confirmation-en] |
Confirmation grunt |
Example:
[laughter] You really got me! [sigh] I didn't see that coming at all.
ComfyUI/models/
omnivoice/
OmniVoice/ (~4GB, fp32)
OmniVoice-bf16/ (~2GB, bf16)
audio_encoders/
openai_whisper-large-v3-turbo/
openai_whisper-large-v3/
openai_whisper-medium/
| Model | Size | Description |
|---|---|---|
OmniVoice |
~4GB | Full fp32 model - 600+ languages |
OmniVoice-bf16 |
~2GB | Bfloat16 quantized - lower VRAM |
| Model | VRAM | Link |
|---|---|---|
| whisper-large-v3-turbo | ~1.5GB | Download |
| whisper-large-v3 | ~3GB | Download |
| whisper-medium | ~1GB | Download |
| whisper-small | ~0.5GB | Download |
| whisper-tiny | ~0.4GB | Download |
Models auto-download from HuggingFace on first use.
| Precision | VRAM (Approx) |
|---|---|
| fp32 | ~8-12 GB |
| bf16/fp16 | ~4-6 GB |
| With CPU offload | ~2-4 GB |
The node caches loaded models for reuse. Changing any of these parameters forces a full cache clear (model unload + GC + CUDA cache flush), even when keep_model_loaded is True:
- Model selection
- Device
- Precision (dtype)
- Attention backend
For detailed troubleshooting guides, see docs/TROUBLESHOOTING.md.
Quick fixes for common issues
Set the HuggingFace mirror before starting ComfyUI:
export HF_ENDPOINT="https://hf-mirror.com"Connect OmniVoice Whisper Loader to whisper_model input on Voice Clone TTS to cache the model.
- Set
keep_model_loaded = False - Use
dtype = fp16orbf16 - Use
device = cpu(slower but works)
Restart ComfyUI completely to reload Python modules.
OmniVoice requires transformers>=5.3.0. If you see an error like omnivoice import failed or cannot import name 'HiggsAudioV2TokenizerModel' in your ComfyUI logs, your transformers version may be too old.
⚠️ Only do this if you know what you are doing. Upgrading transformers may break other custom nodes that depend on an older version. Test your other nodes after upgrading.
To upgrade:
path\to\ComfyUI\venv\Scripts\python.exe -m pip install "transformers>=5.3.0"
Add your FFmpeg bin/ folder to PATH in your ComfyUI launch .bat file, or use a WAV audio save node instead.
- OmniVoice — k2-fsa/OmniVoice by k2-fsa — Original fp32 model
- OmniVoice-bf16 — drbaph/OmniVoice-bf16 by drbaph — Bfloat16 quantized model
- ComfyUI Node — saganaki22/ComfyUI-OmniVoice-TTS — This custom node
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}This custom node is released under the Apache 2.0 License. The OmniVoice model has its own license — see k2-fsa/OmniVoice for details.