Official PyTorch implementation of "Implicit Diffusion Distillation with Domain Disentanglement for Domain Generalization" (ICME 2026).
Authors: Yujun Tong, Dongliang Chang*, Junhan Chen, Yuanchen Fang, and Zhanyu Ma
Affiliation: Beijing University of Posts and Telecommunications (BUPT)
DDG (Diffusion-Distilled Generalization) is a novel framework that leverages frozen diffusion models to address the challenge of domain generalization. Instead of relying on computationally expensive data augmentation, we propose an efficient end-to-end distillation paradigm where the visual backbone is optimized as a conditioning agent to guide a frozen diffusion teacher in reconstructing source images.
- Generative Inverse Problem Paradigm: Distills structure-aware knowledge from frozen Stable Diffusion models into discriminative backbones
- Domain-Debiasing Guidance: Employs Classifier-Free Guidance (CFG) to explicitly suppress domain-specific style biases
- State-of-the-Art Performance: Achieves 79.57% average accuracy across five DomainBed benchmarks with CLIP initialization
- Two-Stage Training: Stage I aligns features with CLIP text space; Stage II performs generative distillation
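For readers unfamiliar with Classifier-Free Guidance, the generic combination rule can be sketched in plain Python as below. This is the textbook CFG formula only, not DDG's domain-debiasing variant: which conditions DDG feeds to the two branches is defined by the paper and the code, and the name `cfg_combine` and the list-based noise vectors are illustrative assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Generic classifier-free guidance on two noise predictions.

    eps_uncond / eps_cond are hypothetical flat lists standing in for the
    denoiser outputs without and with conditioning. The result moves away
    from the unconditional prediction along the conditional direction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # A scale of 1.0 returns the conditional prediction unchanged;
    # larger scales push further along the conditional direction.
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0))
```

With `guidance_scale=0.0` the function returns the unconditional prediction, and with `1.0` the conditional one, which is a quick sanity check for any guided sampler.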
Traditional methods rely on scarce source data, leading to biased alignment where learned representations fail to cover unseen target distributions. Our DDG framework leverages the robust generative prior of a frozen diffusion model to effectively expand the feature support, bridging the gap to unseen domains.
The framework operates in two stages:
- Stage I - Semantic Alignment: Aligns visual features with frozen CLIP text embeddings
- Stage II - Generative Distillation: Optimizes the backbone to maximize image likelihood under the diffusion prior while suppressing domain-specific styles
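As a rough illustration of what "aligning" a visual feature with a text embedding can mean in Stage I, the sketch below computes a cosine-distance loss between two vectors. That the repository uses exactly this loss is an assumption; the function name `cosine_alignment_loss` and the plain-list vectors are hypothetical stand-ins for the backbone output and the frozen CLIP text embedding.

```python
import math


def cosine_alignment_loss(visual_feature, text_embedding):
    """Return 1 - cos(visual_feature, text_embedding).

    0.0 means the two vectors point the same way (fully aligned);
    2.0 means they point in opposite directions.
    """
    dot = sum(v * t for v, t in zip(visual_feature, text_embedding))
    norm_v = math.sqrt(sum(v * v for v in visual_feature))
    norm_t = math.sqrt(sum(t * t for t in text_embedding))
    if norm_v == 0.0 or norm_t == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_v * norm_t)


if __name__ == "__main__":
    print(cosine_alignment_loss([1.0, 0.0], [1.0, 0.0]))  # aligned
    print(cosine_alignment_loss([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because the text embeddings are frozen, only the visual side would receive gradients from a loss of this kind.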
- Python 3.8+
- PyTorch 2.9.1+
- CUDA 12.8+ (for GPU support)
# Install dependencies
pip install -r requirements.txt
# Install CLIP from source
pip install git+https://github.com/openai/CLIP.git
The code requires the following pre-trained models:
- CLIP Model: openai/clip-vit-large-patch14
- Stable Diffusion v1.4: CompVis/stable-diffusion-v1-4
You need to manually download these models and update the paths in the code:
Step 1: Download models from Hugging Face and place them in your preferred directory.
Step 2: Update the model paths in domainbed/algorithms/distill_diffusion.py:
# Line 131: CLIP model path
clip_model_id = hparams.get("clip_model_path", "/path/to/clip-vit-large-patch14")
Step 3: Update the Stable Diffusion model path in domainbed/networks/build.py:
# Line 152: Stable Diffusion v1.4 path
model_id = "/path/to/stable-diffusion-v1-4"
Note: The config_tta.yaml file is already included in the repository root directory and does not need to be modified.
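One possible way to script the download step is sketched below with `snapshot_download` from the `huggingface_hub` package. The repository does not prescribe this tool; the `./pretrained` target directory and the helper names `plan_downloads` and `download_all` are assumptions, and the download itself needs network access plus `pip install huggingface_hub`.

```python
from pathlib import Path

# Hugging Face repo IDs named in this README, keyed by a local folder name.
MODEL_REPOS = {
    "clip-vit-large-patch14": "openai/clip-vit-large-patch14",
    "stable-diffusion-v1-4": "CompVis/stable-diffusion-v1-4",
}


def plan_downloads(root):
    """Map each repo ID to the local directory it would be saved in."""
    root = Path(root)
    return {repo_id: root / name for name, repo_id in MODEL_REPOS.items()}


def download_all(root):
    """Fetch both repositories into `root` (requires network access)."""
    # Imported lazily so the planning helper works without the package.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in plan_downloads(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


if __name__ == "__main__":
    # Print the planned layout; call download_all("./pretrained") to fetch.
    for repo_id, local_dir in plan_downloads("./pretrained").items():
        print(f"{repo_id} -> {local_dir}")
```

The printed directories are the values to substitute for the `/path/to/...` placeholders in Steps 2 and 3.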
We evaluate on five benchmarks from DomainBed:
- PACS: 4 domains (Photo, Art, Cartoon, Sketch)
- VLCS: 4 domains (VOC2007, LabelMe, Caltech101, Sun09)
- TerraIncognita: 4 domains (L100, L38, L43, L46)
- OfficeHome: 4 domains (Art, Clipart, Product, Real World)
- DomainNet: 6 domains (Clipart, Infograph, Painting, Quickdraw, Real, Sketch)
Please follow the DomainBed data preparation guide to download and organize the datasets. Update the --data_dir argument in training scripts to point to your dataset directory.
Train on PACS dataset with ViT-B/16 student backbone:
bash scripts/clipvitb_student/run.sh 0 1
Arguments:
- 0: GPU ID
- 1: Dataset index (0=OfficeHome, 1=PACS, 2=VLCS, 3=TerraIncognita, 4=DomainNet)
python train_all.py vitb-CLIP \
--clip_backbone "ViT-L/14" \
--backbone "clip_vit-b16" \
--algorithm DistillDiffusion \
--dataset PACS \
--data_dir /path/to/datasets \
--stage1_steps 5000 \
--swadstart_steps 5000 \
--steps 8000 \
--lmd 0.5 \
--seed 0 \
--swad True
- --stage1_steps: Number of steps for Stage I (semantic alignment), default: 5000
- --swadstart_steps: Step to start SWAD model averaging, default: 5000
- --steps: Total training steps, default: 8000
- --lmd: Weight for alignment loss (λ in paper), default: 0.5
- --clip_backbone: CLIP text encoder architecture for diffusion model conditioning (must match the text encoder used in Stable Diffusion), options: ViT-B/16, ViT-L/14
- --backbone: Student visual backbone architecture, options: clip_vit-b16, vit-base, resnet50
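To make the interaction of the three step arguments concrete, the sketch below maps a training step to its stage and to whether SWAD averaging would be active. It assumes 0-indexed steps and half-open boundaries (Stage II and SWAD begin exactly at the configured step); the actual boundary handling lives in the training code, and `training_phase` is a hypothetical helper, not part of the repository.

```python
def training_phase(step, stage1_steps=5000, swadstart_steps=5000, steps=8000):
    """Return (stage, swad_active) for a 0-indexed training step.

    Defaults mirror the documented defaults of --stage1_steps,
    --swadstart_steps, and --steps.
    """
    if not 0 <= step < steps:
        raise ValueError(f"step must be in [0, {steps}), got {step}")
    stage = "stage1" if step < stage1_steps else "stage2"
    swad_active = step >= swadstart_steps
    return stage, swad_active


if __name__ == "__main__":
    for step in (0, 4999, 5000, 7999):
        print(step, training_phase(step))
```

Under the defaults, Stage I covers the first 5000 steps, and the remaining 3000 steps run Stage II with SWAD averaging enabled throughout.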
| Method | OH | TerraInc | VLCS | PACS | DomainNet | Avg |
|---|---|---|---|---|---|---|
| VL2V-SD | 87.38 | 58.54 | 83.25 | 96.68 | 62.79 | 77.73 |
| DDG (Paper) | 87.90 | 65.37 | 84.43 | 96.83 | 63.31 | 79.57 |
| DDG (A800) | 87.74 | 65.44 | 84.04 | 96.97 | 63.63 | 79.56 |
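Note: The results in the "DDG (A800)" row were obtained by re-running the experiments on A800 GPUs and differ slightly from the paper, which may be due to hardware variations. The reproduced numbers are slightly higher on TerraInc, PACS, and DomainNet and slightly lower on OH and VLCS; every per-dataset gap is under 0.4 points, and the average differs by 0.01 (79.56 vs. 79.57).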
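The Avg column and the size of the paper-versus-A800 differences can be rechecked directly from the per-dataset numbers in the table. The short script below does that arithmetic; the helper names `average` and `max_gap` are illustrative only.

```python
# Per-dataset accuracies copied from the table above,
# in the order OH, TerraInc, VLCS, PACS, DomainNet.
RESULTS = {
    "VL2V-SD": [87.38, 58.54, 83.25, 96.68, 62.79],
    "DDG (Paper)": [87.90, 65.37, 84.43, 96.83, 63.31],
    "DDG (A800)": [87.74, 65.44, 84.04, 96.97, 63.63],
}


def average(scores):
    """Mean accuracy rounded to two decimals, as reported in the Avg column."""
    return round(sum(scores) / len(scores), 2)


def max_gap(a, b):
    """Largest absolute per-dataset difference between two result rows."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    for name, scores in RESULTS.items():
        print(f"{name}: {average(scores):.2f}")
    gap = max_gap(RESULTS["DDG (Paper)"], RESULTS["DDG (A800)"])
    print(f"largest paper-vs-A800 gap: {gap:.2f}")
```

The computed means reproduce the table's Avg values (77.73, 79.57, 79.56), and the largest per-dataset gap between the two DDG rows is 0.39 points, on VLCS.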
Trained model checkpoints will be available soon.
Download links: Coming soon
If you find this work helpful, please consider citing:
@inproceedings{tong2026ddg,
title={Implicit Diffusion Distillation with Domain Disentanglement for Domain Generalization},
author={Tong, Yujun and Chang, Dongliang and Chen, Junhan and Fang, Yuanchen and Ma, Zhanyu},
booktitle={Proceedings of the IEEE International Conference on Multimedia \& Expo (ICME)},
year={2026}
}
This codebase is built upon the following excellent projects:
We sincerely thank the authors for their open-source contributions.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please open an issue on GitHub or contact:
- Yujun Tong: tongyujun@bupt.edu.cn
- Dongliang Chang: changdongliang@bupt.edu.cn

