ComfyUI-OmniVoice-TTS

OmniVoice TTS nodes for ComfyUI — Zero-shot multilingual text-to-speech with voice cloning and voice design. Supports 600+ languages with state-of-the-art quality.

Chinese documentation (中文文档)

Features

600+ Languages — Broadest language coverage among zero-shot TTS models
Voice Cloning — Clone any voice from 3-15 seconds of reference audio
Voice Design — Create synthetic voices from text descriptions (gender, age, pitch, accent)
Multi-Speaker Dialogue — Generate conversations between multiple speakers using [Speaker_N]: tags
Fast Inference — RTF as low as 0.025 (40x faster than real-time)
Non-Verbal Expressions — Inline tags like [laughter], [sigh], [sniff]
SageAttention Support — GPU-optimized attention via monkey-patching Qwen3Attention (CUDA, SM80+)
Auto-Download — Models download automatically from HuggingFace on first use
Whisper ASR Caching — Pre-load Whisper to avoid re-downloading on each run
VRAM Efficient — Automatic CPU offload, VBAR/aimdo integration, smart cache invalidation
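
The real-time factor (RTF) quoted under Fast Inference is synthesis time divided by the duration of the audio produced, so the speed-up over real time is its reciprocal. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of this node pack.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / length of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """An RTF below 1.0 means faster than real time; the speed-up is 1 / RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Example: 10 s of audio generated in 0.25 s gives an RTF of 0.025.
rtf = real_time_factor(0.25, 10.0)
speedup = speedup_over_real_time(rtf)
print(f"RTF={rtf:.3f}, about {speedup:.0f}x faster than real time")
```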

comfyui-omnivoice-example.mp4

Installation

Method 1: ComfyUI Manager (Recommended)

Search for "OmniVoice" in ComfyUI Manager and click Install.

Method 2: Manual Install

cd ComfyUI/custom_nodes
git clone https://github.com/saganaki22/ComfyUI-OmniVoice-TTS.git
cd ComfyUI-OmniVoice-TTS
python install.py

Why `--no-deps`?

The omnivoice pip package specifies torch==2.8.* as a dependency, which can downgrade your PyTorch to a CPU-only version and break ComfyUI's GPU acceleration. We work around this by installing omnivoice with --no-deps in install.py, then separately installing only the missing dependencies that ComfyUI doesn't already provide.
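
The two-stage approach described above can be sketched as follows. This is not the contents of install.py: the helper names and the example dependency list are assumptions made for illustration, and the commands are only printed, not executed.

```python
import importlib.util
import sys


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


def build_pip_commands(package, extra_deps):
    """Build pip commands: the package without its dependencies, then only the missing extras."""
    commands = [[sys.executable, "-m", "pip", "install", "--no-deps", package]]
    to_install = missing_packages(extra_deps)
    if to_install:
        commands.append([sys.executable, "-m", "pip", "install", *to_install])
    return commands


# "json" ships with Python, so it is never reported missing. The second name is a
# deliberately non-existent placeholder that exercises the other branch.
commands = build_pip_commands("omnivoice", ["json", "hypothetical_missing_dep_xyz"])
for command in commands:
    print(" ".join(command))  # printed only; pass each list to subprocess.check_call to run it
```

Because torch never appears in either command, pip has no reason to replace the build that ComfyUI already uses. Note that this sketch checks import names, which sometimes differ from the distribution names pip expects.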

If PyTorch Gets Broken

If another package accidentally downgrades your PyTorch, see the PyTorch Compatibility Matrix for restore commands matching your setup.

Nodes

1. OmniVoice Longform TTS — Long-form text-to-speech with smart chunking and optional voice cloning

Parameter	Type	Default	Description
model	COMBO	(auto)	OmniVoice model checkpoint
text	STRING, multiline	`"Hello!..."`	Text to synthesize
ref_text	STRING, multiline	""	Reference audio transcript (empty=auto-detect)
steps	INT	32	Diffusion steps (4-64, 16=faster, 64=best)
guidance_scale	FLOAT	2.0	Classifier-free guidance scale (0-10)
t_shift	FLOAT	0.1	Time-step shift for noise schedule (0-1)
speed	FLOAT	1.0	Speaking speed (0.5-2.0, >1=faster)
duration	FLOAT	0.0	Fixed duration in seconds (0=auto)
device	COMBO	auto	`auto`, `cuda`, `cpu`, `mps`
dtype	COMBO	auto	`auto`, `bf16`, `fp16`, `fp32`
attention	COMBO	auto	`auto`, `eager`, `sage_attention`
seed	INT	0	Random seed (0=random)
words_per_chunk	INT	100	Words per chunk (0=no chunking)
position_temperature	FLOAT	5.0	Mask-position temperature (0=greedy, higher=more random)
class_temperature	FLOAT	0.0	Token sampling temperature (0=greedy)
layer_penalty_factor	FLOAT	5.0	Penalty on deeper codebook layers
denoise	BOOLEAN	True	Prepend denoise token for cleaner output
preprocess_prompt	BOOLEAN	True	Preprocess reference audio (remove silences)
postprocess_output	BOOLEAN	True	Post-process generated audio (remove long silences)
keep_model_loaded	BOOLEAN	True	Keep model in memory (offloads to CPU between runs)
instruct	STRING	""	Dialect/style instruction. Only specific values are supported — see Dialect/Style Instructions. Applied to every chunk

Optional Inputs:

ref_audio — Reference audio for voice cloning (3-15s optimal)
whisper_model — Pre-loaded Whisper ASR model
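
To illustrate what words_per_chunk controls, here is a minimal sentence-aware chunker. It is a sketch under assumptions, not the node's actual "smart chunking" logic: the function name, the sentence-splitting rule, and the handling of oversized sentences are all illustrative.

```python
import re


def chunk_text(text: str, words_per_chunk: int = 100) -> list[str]:
    """Group whole sentences into chunks of roughly `words_per_chunk` words.

    A value of 0 disables chunking, mirroring the parameter description.
    """
    text = text.strip()
    if not text:
        return []
    if words_per_chunk <= 0:
        return [text]

    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n_words > words_per_chunk:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine."
print(chunk_text(sample, words_per_chunk=5))
```

In this sketch a single sentence longer than the budget is kept whole rather than cut mid-sentence.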

2. OmniVoice Voice Clone TTS — Clone a voice from reference audio

Parameter	Type	Default	Description
model	COMBO	(auto)	OmniVoice model checkpoint
text	STRING, multiline	`"Hello!..."`	Text to synthesize in cloned voice
ref_audio	AUDIO	required	Reference audio (3-15s)
ref_text	STRING, multiline	""	Transcript (empty=auto-transcribe with Whisper)
steps	INT	32	Diffusion steps (4-64)
guidance_scale	FLOAT	2.0	Classifier-free guidance scale (0-10)
t_shift	FLOAT	0.1	Time-step shift for noise schedule (0-1)
speed	FLOAT	1.0	Speaking speed (0.5-2.0)
duration	FLOAT	0.0	Fixed duration in seconds (0=auto)
device	COMBO	auto	`auto`, `cuda`, `cpu`, `mps`
dtype	COMBO	auto	`auto`, `bf16`, `fp16`, `fp32`
attention	COMBO	auto	`auto`, `eager`, `sage_attention`
seed	INT	0	Random seed (0=random)
position_temperature	FLOAT	5.0	Mask-position temperature (0=greedy)
class_temperature	FLOAT	0.0	Token sampling temperature (0=greedy)
layer_penalty_factor	FLOAT	5.0	Penalty on deeper codebook layers
denoise	BOOLEAN	True	Prepend denoise token for cleaner output
preprocess_prompt	BOOLEAN	True	Preprocess reference audio (remove silences)
postprocess_output	BOOLEAN	True	Post-process generated audio (remove long silences)
keep_model_loaded	BOOLEAN	True	Keep model in memory
instruct	STRING	""	Dialect/style instruction. Only specific values are supported — see Dialect/Style Instructions

Optional Input:

whisper_model — Pre-loaded Whisper from OmniVoice Whisper Loader

3. OmniVoice Voice Design TTS — Design voices from text descriptions, no reference audio needed

Parameter	Type	Default	Description
model	COMBO	(auto)	OmniVoice model checkpoint
text	STRING, multiline	`"Hello!..."`	Text to synthesize in designed voice
voice_instruct	STRING, multiline	`"female, low pitch..."`	Voice attributes (comma-separated)
steps	INT	32	Diffusion steps (4-64)
guidance_scale	FLOAT	2.0	Classifier-free guidance scale (0-10)
t_shift	FLOAT	0.1	Time-step shift for noise schedule (0-1)
speed	FLOAT	1.0	Speaking speed (0.5-2.0)
duration	FLOAT	0.0	Fixed duration in seconds (0=auto)
device	COMBO	auto	`auto`, `cuda`, `cpu`, `mps`
dtype	COMBO	auto	`auto`, `bf16`, `fp16`, `fp32`
attention	COMBO	auto	`auto`, `eager`, `sage_attention`
seed	INT	0	Random seed (0=random)
position_temperature	FLOAT	5.0	Mask-position temperature (0=greedy)
class_temperature	FLOAT	0.0	Token sampling temperature (0=greedy)
layer_penalty_factor	FLOAT	5.0	Penalty on deeper codebook layers
denoise	BOOLEAN	True	Prepend denoise token for cleaner output
postprocess_output	BOOLEAN	True	Post-process generated audio (remove long silences)
keep_model_loaded	BOOLEAN	True	Keep model in memory

4. OmniVoice Multi-Speaker TTS — Generate dialogue between multiple speakers using [Speaker_N]: tags

Parameter	Type	Default	Description
model	COMBO	(auto)	OmniVoice model checkpoint
text	STRING, multiline	`"[Speaker_1]: Hello..."`	Multi-speaker text
num_speakers	DYNAMIC	2	Number of speakers (2-10, dynamic inputs)
steps	INT	32	Diffusion steps per speaker
guidance_scale	FLOAT	2.0	Classifier-free guidance scale (0-10)
t_shift	FLOAT	0.1	Time-step shift for noise schedule (0-1)
speed	FLOAT	1.0	Speaking speed for all speakers
pause_between_speakers	FLOAT	0.3	Silence between speakers (seconds)
device	COMBO	auto	`auto`, `cuda`, `cpu`, `mps`
dtype	COMBO	auto	`auto`, `bf16`, `fp16`, `fp32`
attention	COMBO	auto	`auto`, `eager`, `sage_attention`
position_temperature	FLOAT	5.0	Mask-position temperature (0=greedy)
class_temperature	FLOAT	0.0	Token sampling temperature (0=greedy)
layer_penalty_factor	FLOAT	5.0	Penalty on deeper codebook layers
denoise	BOOLEAN	True	Prepend denoise token for cleaner output
preprocess_prompt	BOOLEAN	True	Preprocess reference audio
postprocess_output	BOOLEAN	True	Post-process generated audio
seed	INT	0	Random seed (0=random)
keep_model_loaded	BOOLEAN	True	Keep model in memory
speaker_N_audio	AUDIO	optional	Reference audio for speaker N (1-10)
speaker_N_ref_text	STRING	""	Transcript for speaker N's ref audio
speaker_N_instruct	STRING	""	Dialect/style instruction for speaker N. Only specific values are supported — see Dialect/Style Instructions

Speaker inputs dynamically show/hide based on num_speakers (ComfyUI >= 0.8.1).

5. OmniVoice Whisper Loader — Pre-load Whisper ASR model for auto-transcription

Parameter	Type	Default	Description
model	COMBO	(auto)	Whisper model selection
device	COMBO	auto	`auto`, `cuda`, `cpu`
dtype	COMBO	auto	`auto`, `bf16`, `fp16`, `fp32`

Auto-download: Select models with "(auto-download)" suffix to download on first use.

Generation Parameters Guide

These parameters control the diffusion-based audio generation process:

Parameter	What it does	Tips
`steps`	Number of iterative unmasking steps	16 = faster, 32 = balanced, 64 = best quality
`guidance_scale`	Classifier-free guidance strength	Higher = more text-aligned; 2.0 is default
`t_shift`	Time-step shift for noise schedule	Smaller values emphasise earlier decoding steps
`speed`	Speaking speed factor	>1.0 = faster, <1.0 = slower
`duration`	Fixed output length in seconds	Overrides speed when set; 0 = automatic
`position_temperature`	Randomness in mask-position selection	0 = greedy (deterministic), higher = more random
`class_temperature`	Randomness in token sampling	0 = greedy (deterministic), higher = more random
`layer_penalty_factor`	Penalty on deeper codebook layers	Encourages lower layers to unmask first
`denoise`	Prepend denoise token to input	Generally improves output quality
`preprocess_prompt`	Clean reference audio	Removes long silences, adds punctuation
`postprocess_output`	Clean generated audio	Removes long silences from output

Attention Backends

OmniVoice's architecture (Qwen3 backbone) has limited attention support through transformers. The attention dropdown offers these options:

Option	What Actually Happens
`auto`	OmniVoice's default (eager)
`eager`	Standard eager attention (always works)
`sage_attention`	Monkey-patches Qwen3Attention with SageAttention CUDA kernels. GPU-only, requires SM80+ (Ampere+). Falls back to SDPA when attention masks are present. Install: `pip install sageattention`

SageAttention GPU Compatibility

GPU Architecture	Compute Capability	Kernel Used
Blackwell (RTX 5090)	SM120	FP8
Hopper (H100)	SM90	FP8
Ada Lovelace (RTX 4070, RTX 4090)	SM89	FP8
Ampere (A100, RTX 3090)	SM80 / SM86	FP16
Below SM80	—	Not supported
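
The table can be expressed as a small lookup on the CUDA compute capability. This is a sketch, not the node's own detection code: the function name is hypothetical, and it assumes that capabilities between SM80 and SM89 (for example SM86) follow the SM80 row. With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair to pass in.

```python
from typing import Optional


def sage_kernel_for(major: int, minor: int) -> Optional[str]:
    """Map a CUDA compute capability to the kernel listed in the table above."""
    capability = major * 10 + minor  # e.g. (8, 9) -> 89
    if capability < 80:
        return None      # below SM80: not supported
    if capability >= 89:
        return "FP8"     # SM89, SM90, SM120
    return "FP16"        # SM80 up to, but not including, SM89 (assumption for SM86)


for cc in [(12, 0), (9, 0), (8, 9), (8, 0), (7, 5)]:
    print(cc, sage_kernel_for(*cc))
```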

Multi-Speaker Usage

Use [Speaker_N]: tags in text to assign lines to different speakers:

[Speaker_1]: Hello, I'm speaker one.
[Speaker_2]: And I'm speaker two!
[Speaker_1]: Nice to meet you!

Each speaker needs reference audio connected to the corresponding speaker_N_audio input.
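
To check a script before running it, the tag format above can be parsed with a short regular expression. This is an illustrative sketch; the node's real parser may treat whitespace or untagged lines differently, and the function name is an assumption.

```python
import re

# Matches lines such as "[Speaker_2]: And I'm speaker two!"
SPEAKER_LINE = re.compile(r"^\[Speaker_(\d+)\]:\s*(.*)$")


def parse_dialogue(text: str) -> list[tuple[int, str]]:
    """Return (speaker_number, line) pairs in the order they appear."""
    turns = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            turns.append((int(match.group(1)), match.group(2)))
        elif turns:
            # Untagged line: treat it as a continuation of the previous speaker.
            speaker, previous = turns[-1]
            turns[-1] = (speaker, f"{previous} {line}".strip())
    return turns


script = """[Speaker_1]: Hello, I'm speaker one.
[Speaker_2]: And I'm speaker two!
[Speaker_1]: Nice to meet you!"""

turns = parse_dialogue(script)
speakers_needed = sorted({speaker for speaker, _ in turns})
print(turns)
print("speaker_N_audio inputs used:", speakers_needed)
```

The `speakers_needed` list shows which speaker_N_audio inputs the script refers to, which helps when choosing num_speakers.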

Dialect/Style Instructions

Voice Clone, Longform, and Multi-Speaker nodes expose an instruct field that tells the model to use a specific dialect or speaking style. Only the values listed below are supported — the model validates against a fixed list and will reject unsupported values.

English values (comma-separated, e.g. male, indian accent):

Category	Valid Values
Gender	`male`, `female`
Age	`child`, `young adult`, `teenager`, `middle-aged`, `elderly`
Accent	`american accent`, `british accent`, `australian accent`, `canadian accent`, `chinese accent`, `indian accent`, `japanese accent`, `korean accent`, `portuguese accent`, `russian accent`
Pitch	`very low pitch`, `low pitch`, `moderate pitch`, `high pitch`, `very high pitch`
Style	`whisper`

Chinese values (full-width comma-separated, e.g. 男，河南话):

Category	Valid Values
Gender	`男`, `女`
Age	`儿童`, `少年`, `青年`, `中年`, `老年`
Dialect	`四川话`, `东北话`, `陕西话`, `河南话`, `云南话`, `贵州话`, `甘肃话`, `宁夏话`, `石家庄话`, `济南话`, `青岛话`, `桂林话`
Pitch	`极低音调`, `低音调`, `中音调`, `高音调`, `极高音调`
Style	`耳语`

Note: Use only English or only Chinese values in a single instruct string — don't mix them.

Leave the field empty for default behaviour (Standard Mandarin for Chinese text).

Note: This is distinct from the Voice Design node's voice_instruct field, which controls gender, age, pitch, and accent for synthesising entirely new voices.
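
Since unsupported values are rejected, it can help to pre-check an instruct string against the tables above. The following is a sketch only: the real validation happens inside the model, the function name is hypothetical, and exact lower-case matching is an assumption.

```python
ENGLISH_VALUES = {
    "male", "female",
    "child", "teenager", "young adult", "middle-aged", "elderly",
    "american accent", "british accent", "australian accent", "canadian accent",
    "chinese accent", "indian accent", "japanese accent", "korean accent",
    "portuguese accent", "russian accent",
    "very low pitch", "low pitch", "moderate pitch", "high pitch", "very high pitch",
    "whisper",
}

CHINESE_VALUES = {
    "男", "女",
    "儿童", "少年", "青年", "中年", "老年",
    "四川话", "东北话", "陕西话", "河南话", "云南话", "贵州话",
    "甘肃话", "宁夏话", "石家庄话", "济南话", "青岛话", "桂林话",
    "极低音调", "低音调", "中音调", "高音调", "极高音调",
    "耳语",
}


def check_instruct(instruct: str) -> list[str]:
    """Return a list of problems; an empty list means the string looks valid."""
    # Accept both the ASCII comma and the full-width comma as separators.
    parts = [p.strip() for p in instruct.replace("，", ",").split(",") if p.strip()]
    problems = [
        f"unsupported value: {p}"
        for p in parts
        if p not in ENGLISH_VALUES and p not in CHINESE_VALUES
    ]
    has_english = any(p in ENGLISH_VALUES for p in parts)
    has_chinese = any(p in CHINESE_VALUES for p in parts)
    if has_english and has_chinese:
        problems.append("mixes English and Chinese values")
    return problems


print(check_instruct("male, indian accent"))
print(check_instruct("男，河南话"))
print(check_instruct("male, 河南话"))
```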

Voice Design Attributes

Comma-separated attributes for voice_instruct (same valid values as the instruct field above):

Category	Options
Gender	`male`, `female`
Age	`child`, `young adult`, `teenager`, `middle-aged`, `elderly`
Accent	`american accent`, `british accent`, `australian accent`, `canadian accent`, `chinese accent`, `indian accent`, `japanese accent`, `korean accent`, `portuguese accent`, `russian accent`
Pitch	`very low pitch`, `low pitch`, `moderate pitch`, `high pitch`, `very high pitch`
Style	`whisper`
Chinese Dialect	`四川话`, `东北话`, `陕西话`, `河南话`, `云南话`, `贵州话`, `甘肃话`, `宁夏话`, `石家庄话`, `济南话`, `青岛话`, `桂林话`

Example: "female, young adult, high pitch, british accent, whisper"

Non-Verbal Tags

Insert these directly in your text:

Tag	Effect
`[laughter]`	Natural laughter
`[sigh]`	Expressive sigh
`[sniff]`	Sniffing sound
`[question-en]`, `[question-ah]`, `[question-oh]`	Question intonations
`[surprise-ah]`, `[surprise-oh]`, `[surprise-wa]`, `[surprise-yo]`	Surprise expressions
`[dissatisfaction-hnn]`	Dissatisfaction sound
`[confirmation-en]`	Confirmation grunt

Example:

[laughter] You really got me! [sigh] I didn't see that coming at all.
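
A typo in a tag is easy to miss, so a small checker against the table above can be useful. This is a hypothetical helper, not part of the node pack. It assumes any other bracketed token is a mistake, except the `[Speaker_N]` tags used by the Multi-Speaker node.

```python
import re

KNOWN_TAGS = {
    "laughter", "sigh", "sniff",
    "question-en", "question-ah", "question-oh",
    "surprise-ah", "surprise-oh", "surprise-wa", "surprise-yo",
    "dissatisfaction-hnn",
    "confirmation-en",
}

# Any text between a pair of square brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def unknown_tags(text: str) -> list[str]:
    """Return bracketed tokens that are neither known non-verbal tags nor speaker tags."""
    return [
        tag
        for tag in BRACKETED.findall(text)
        if tag not in KNOWN_TAGS and not tag.startswith("Speaker_")
    ]


print(unknown_tags("[laughter] You really got me! [sigh] I didn't see that coming at all."))
print(unknown_tags("[Speaker_1]: [laugh] That was close. [sniff]"))
```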

Model Storage

ComfyUI/models/
  omnivoice/
    OmniVoice/          (~4GB, fp32)
    OmniVoice-bf16/     (~2GB, bf16)
  audio_encoders/
    openai_whisper-large-v3-turbo/
    openai_whisper-large-v3/
    openai_whisper-medium/

Available OmniVoice Models

Model	Size	Description
`OmniVoice`	~4GB	Full fp32 model - 600+ languages
`OmniVoice-bf16`	~2GB	Bfloat16 quantized - lower VRAM

Whisper Models

Model	VRAM	Link
whisper-large-v3-turbo	~1.5GB	Download
whisper-large-v3	~3GB	Download
whisper-medium	~1GB	Download
whisper-small	~0.5GB	Download
whisper-tiny	~0.4GB	Download

Models auto-download from HuggingFace on first use.

VRAM Requirements

Precision	VRAM (Approx)
fp32	~8-12 GB
bf16/fp16	~4-6 GB
With CPU offload	~2-4 GB

Model Caching

The node caches loaded models for reuse. Changing any of these parameters forces a full cache clear (model unload + GC + CUDA cache flush), even when keep_model_loaded is True:

Model selection
Device
Precision (dtype)
Attention backend
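
The invalidation rule above amounts to keying the cache on those four settings. The class below is a simplified sketch of that idea rather than the node's actual implementation; `ModelCache` and its `loader` callback are hypothetical names.

```python
import gc
from typing import Any, Callable, Optional, Tuple

CacheKey = Tuple[str, str, str, str]  # (model, device, dtype, attention)


class ModelCache:
    """Keep one loaded model and reload it only when a key setting changes."""

    def __init__(self, loader: Callable[..., Any]) -> None:
        self._loader = loader
        self._key: Optional[CacheKey] = None
        self._model: Any = None
        self.load_count = 0

    def get(self, model: str, device: str, dtype: str, attention: str) -> Any:
        key = (model, device, dtype, attention)
        if key != self._key:
            self.clear()  # full clear before loading with the new settings
            self._model = self._loader(*key)
            self._key = key
            self.load_count += 1
        return self._model

    def clear(self) -> None:
        self._model = None
        self._key = None
        gc.collect()
        # With PyTorch on CUDA, torch.cuda.empty_cache() would be called here.


# Stand-in loader that records the settings instead of loading real weights.
cache = ModelCache(lambda *key: {"settings": key})
first = cache.get("OmniVoice-bf16", "cuda", "bf16", "eager")
second = cache.get("OmniVoice-bf16", "cuda", "bf16", "eager")   # cache hit
third = cache.get("OmniVoice-bf16", "cuda", "fp16", "eager")    # dtype changed: reload
print(cache.load_count)
```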

Troubleshooting

For detailed troubleshooting guides, see docs/TROUBLESHOOTING.md.

Quick fixes for common issues

Model download fails (China)

Set the HuggingFace mirror before starting ComfyUI:

export HF_ENDPOINT="https://hf-mirror.com"

Whisper re-downloads every run

Connect the OmniVoice Whisper Loader node to the whisper_model input on Voice Clone TTS (or Longform TTS) so the Whisper model is loaded once and cached.

CUDA out of memory

Set keep_model_loaded = False
Use dtype = fp16 or bf16
Use device = cpu (slower but works)

Import errors after install

Restart ComfyUI completely to reload Python modules.

Transformers version

OmniVoice requires transformers>=5.3.0. If you see an error like omnivoice import failed or cannot import name 'HiggsAudioV2TokenizerModel' in your ComfyUI logs, your transformers version may be too old.

⚠️ Only do this if you know what you are doing. Upgrading transformers may break other custom nodes that depend on an older version. Test your other nodes after upgrading.

To upgrade:

path\to\ComfyUI\venv\Scripts\python.exe -m pip install "transformers>=5.3.0"
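
Before upgrading, the installed version can be checked from the same Python environment that ComfyUI uses. The snippet below is a minimal sketch with hypothetical helper names. It compares only the leading numeric part of the version, so a pre-release such as 5.3.0.dev0 is treated as 5.3.0.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. "5.3.0.dev0" -> (5, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if not match:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "5.3.0") -> bool:
    """Compare versions after padding both to the same number of components."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package: str = "transformers") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("transformers")
if found is None:
    print("transformers is not installed in this environment")
else:
    print(f"transformers {found}: meets minimum = {meets_minimum(found)}")
```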

FFmpeg error on Windows when saving audio

Add your FFmpeg bin/ folder to PATH in your ComfyUI launch .bat file, or use a WAV audio save node instead.

Credits

OmniVoice — k2-fsa/OmniVoice by k2-fsa — Original fp32 model
OmniVoice-bf16 — drbaph/OmniVoice-bf16 by drbaph — Bfloat16 quantized model
ComfyUI Node — saganaki22/ComfyUI-OmniVoice-TTS — This custom node

Citation

@article{zhu2026omnivoice,
      title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
      author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
      journal={arXiv preprint arXiv:2604.00688},
      year={2026}
}

License

This custom node is released under the Apache 2.0 License. The OmniVoice model has its own license — see k2-fsa/OmniVoice for details.
