This repository was archived by the owner on May 11, 2026. It is now read-only.

justdubit/just-dub-it

⚠️ Notice: This repository is now archived. Please move to our official repository and check out the latest LTX2.3 support on lipdub.

JustDubit: Video Dubbing via Joint Audio-Visual Diffusion

Paper · Website · Model · Dataset

SIGGRAPH 2026

Anthony Chen*†, Naomi Ken Korem*, Gal Zeevi, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or

Tel Aviv University · Lightricks

* Equal contribution · † Work done during visit at Tel Aviv University and internship at Lightricks

📰 News

  • [2026/05/11] 🔥 JustDubIt trained on LTX2.3 is released! Check out our official repo.
  • [2026/05/11] 🔥 JustDubIt is accepted to SIGGRAPH 2026! See you in LA!
  • [2026/02/10] 🔥 Code, checkpoints, and data released
  • [2026/01/29] 🔥 Tech report released

📄 Abstract

Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks.

Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings.

In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion.
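As a rough illustration of what a "lightweight LoRA" means in general, the sketch below applies the standard low-rank update W' = W + (alpha / r) · B · A to a frozen weight matrix. It is a generic, dependency-free example, not this repository's implementation: the function names, the `alpha` scaling value, and the toy matrices are all assumptions chosen for illustration, and the adapted layers and ranks actually used by JustDubit are documented in the Trainer README.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the LoRA rank.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    alpha: scaling hyperparameter (hypothetical value in the demo below)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update with the same shape as w
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: a 2x2 identity base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
down = [[3.0, 4.0]]        # A, shape (1, 2)
up_zero = [[0.0], [0.0]]   # B initialised to zero: the adapter is a no-op
up = [[1.0], [2.0]]        # B after some hypothetical training

unchanged = lora_weight(base, down, up_zero, alpha=2.0)  # equals base
adapted = lora_weight(base, down, up, alpha=2.0)         # [[7.0, 8.0], [12.0, 17.0]]
```

Only `a` and `b` would be trained in such a setup, which is why the adapter adds few parameters relative to the frozen base model.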

To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half.
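The pairing logic described above can be sketched as simple bookkeeping over clip metadata. This is a minimal sketch under stated assumptions: `Segment`, `inpaint_to_language`, and `build_pairs` are hypothetical names that do not come from this repository, the switch is assumed to happen at a single known timestamp, and the inpainting step is a placeholder that only relabels metadata instead of running a generative model. See the Trainer README for the actual dataset preparation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    clip_id: str
    start: float      # seconds
    end: float        # seconds
    language: str
    inpainted: bool   # True if face and audio were regenerated


def inpaint_to_language(segment, language):
    """Placeholder for the generative inpainting step (hypothetical).

    A real pipeline would regenerate the face region and audio track;
    here only the metadata is updated.
    """
    return Segment(segment.clip_id, segment.start, segment.end, language, True)


def build_pairs(clip_id, duration, switch_time, lang_a, lang_b):
    """Split a language-switching clip and pair each half with its
    counterpart inpainted into the other half's language."""
    first = Segment(clip_id, 0.0, switch_time, lang_a, False)
    second = Segment(clip_id, switch_time, duration, lang_b, False)
    return [
        (first, inpaint_to_language(first, lang_b)),
        (second, inpaint_to_language(second, lang_a)),
    ]


# Hypothetical 10-second clip that switches from English to French at 5 s.
pairs = build_pairs("clip0", duration=10.0, switch_time=5.0, lang_a="en", lang_b="fr")
```

Each pair covers the same time span of the same speaker in two languages, which is the property the paired training data relies on.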

By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.


🚀 Quick Links

| Resource | Description |
| --- | --- |
| Inference Pipeline | Run video dubbing with the JustDubit pipeline |
| Training Guide | Train your own JustDubit LoRA |

📦 Repository Structure

just-dub-it/
├── packages/
│   ├── ltx-pipelines/     # Inference pipeline for video dubbing
│   │   └── README.md      # Pipeline usage guide
│   ├── ltx-trainer/       # Training tools for JustDubit LoRA
│   │   └── README.md      # Training guide
│   └── ltx-core/          # Core model components
└── README.md              # This file

🎬 Inference

See the Pipeline README for:

  • Installation instructions
  • Model checkpoint downloads
  • Prompt format guide
  • CLI arguments reference

🏋️ Training

See the Trainer README for:

  • Dataset download and preparation
  • Preprocessing pipeline
  • Training configuration
  • Multi-GPU training setup

📝 Citation

If you find this work useful, please cite our paper:

@misc{chen2026justdubitvideodubbingjoint,
      title={JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion}, 
      author={Anthony Chen and Naomi Ken Korem and Gal Zeevi and Tavi Halperin and Matan Ben Yosef and Urska Jelercic and Ofir Bibi and Or Patashnik and Daniel Cohen-Or},
      year={2026},
      eprint={2601.22143},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2601.22143}, 
}
