Transcribe audio and add subtitles to videos using Whisper in ComfyUI. Supports multiple languages, prompt guidance, and multiple Whisper models.
Last tested: 2 January 2026 (ComfyUI v0.7.0@f2fda02 | Torch 2.9.1 | Triton 3.5.1 | Python 3.10.12 | RTX4090 | CUDA 13.0 | Debian 12)
If you like my projects and wish to see updates and new features, please consider supporting me. It helps a lot!
Install via ComfyUI Manager
Load this workflow into ComfyUI
Models are auto-downloaded to /ComfyUI/models/stt/whisper
'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large', 'large-v3-turbo', 'turbo'
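For reference, this is roughly how one of the checkpoints above can be loaded with the openai-whisper package; the download root below mirrors the auto-download path mentioned above and should be adjusted to your install (a minimal sketch, not the node's exact code):

```python
import whisper  # pip install openai-whisper

# Download (on first use) and load one of the checkpoints listed above.
# The download_root mirrors the auto-download path; adjust to your ComfyUI install.
model = whisper.load_model("base", download_root="ComfyUI/models/stt/whisper")
```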
Transcribe audio and get timestamps for each segment and word.
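As a rough illustration of what this feature exposes (assuming the openai-whisper API; "audio.wav" and the prompt text are placeholders), segment- and word-level timestamps look like this, with `initial_prompt` being the usual way prompt guidance is passed:

```python
import whisper

model = whisper.load_model("base")  # any of the model names listed above

# "audio.wav" is a placeholder path; word_timestamps=True adds per-word times.
result = model.transcribe(
    "audio.wav",
    initial_prompt="ComfyUI, Whisper",  # optional prompt guidance
    word_timestamps=True,
)

for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text'].strip()}")
    for w in seg.get("words", []):
        print(f"  {w['word']!r} @ {w['start']:.2f}s - {w['end']:.2f}s")
```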
Add subtitles to the video frames. You can specify the font family, font color, and x/y position.
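A minimal Pillow sketch of the idea (the node's actual parameters may differ; the font path, color, and position below are placeholders):

```python
from PIL import Image, ImageDraw, ImageFont

def draw_subtitle(frame, text,
                  font_path="fonts/Roboto-Regular.ttf",  # placeholder font file
                  font_size=48, color="white", xy=(160, 960)):
    """Draw one subtitle line onto a frame at the given x/y position."""
    font = ImageFont.truetype(font_path, font_size)
    ImageDraw.Draw(frame).text(xy, text, font=font, fill=color)
    return frame

# Apply to every frame whose timestamp falls inside a segment's start/end range.
frame = Image.new("RGB", (1920, 1080), "black")
draw_subtitle(frame, "Hello from Whisper")
```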
Add subtitles as a word cloud on blank frames.
Export alignments as SRT files to the /ComfyUI/output/srt directory.
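For context, SRT is a plain-text format with an index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the subtitle text. A small sketch converting Whisper-style segments (assumed `start`/`end`/`text` shape) into SRT:

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Build an SRT document from Whisper-style segments."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example usage; the output path is illustrative.
print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": "Hello world"}]))
```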
- Export alignments as SRT
- Add torchcodec to requirements
- Merge #22 by @francislabountyjr: model patcher, support for more Whisper models, ComfyUI model directory support
- Merge #18 by @qy8502 for Prompt Guidance support
- Support YRDZST Semibold Font
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)


