Hyunkoo Lee* · Wooseok Jang* · Jini Yang* · Taehwan Kim · Sangoh Kim · Sangwon Jung · Seungryong Kim
KAIST AI
* Equal contribution
This is the official implementation of the paper "V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping". V-Warper enables video personalization that generates videos with appearance-consistent subjects from just a few reference images. Without requiring large-scale video finetuning, our method achieves superior appearance fidelity while preserving motion dynamics and prompt alignment.
```bash
git clone https://github.com/cvlab-kaist/V-Warper.git
cd V-Warper

conda create -n v_warper python=3.12
conda activate v_warper

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
To personalize your own subject, you need to prepare a subject image dataset with the following structure. See `CogKit/data/sample/dog` as an example:
```
CogKit/data/sample/
└── <subject_name>/                # e.g., dog, fancy-boot
    ├── placeholder_token.txt      # Placeholder token for the subject (e.g., "<dog>")
    ├── initializer_token.txt      # Initial text for learnable subject embedding (e.g., "dog")
    ├── train/
    │   └── image/
    │       ├── 00.jpg             # Few-shot reference images (at least 5 images recommended)
    │       ├── 01.jpg             # Resolution: 720×480, with subject clearly visible
    │       ├── ...
    │       └── metadata.jsonl     # Records image file names: {"file_name": "00.jpg"}
    └── test/
        └── prompt.jsonl           # Validation prompts: {"prompt": "A video of <dog> running"}
```
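Each line of `metadata.jsonl` is a JSON record naming one image file. As a convenience, a small Python sketch (illustrative, not part of this repo) can generate it from the images already placed in `train/image/`:

```python
import json
from pathlib import Path

def write_metadata(image_dir: Path) -> Path:
    """Write metadata.jsonl listing every image in image_dir, one JSON record per line."""
    names = sorted(p.name for p in image_dir.iterdir()
                   if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
    out = image_dir / "metadata.jsonl"
    with out.open("w") as f:
        for name in names:
            f.write(json.dumps({"file_name": name}) + "\n")
    return out
```

The suffix filter keeps the generated `metadata.jsonl` itself (and any stray files) out of the listing, so the script is safe to re-run.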
File Descriptions:
- `placeholder_token.txt`: The special token used to refer to your subject in prompts (e.g., `<dog>`). This is the learnable subject token.
- `initializer_token.txt`: The initialization text for the learnable subject embedding (e.g., `dog`).
- `train/image/`: Training images for personalization
  - Include at least 5 reference images showing the subject from different angles
  - Images should be preprocessed to 720×480 resolution
  - Ensure the subject is clearly visible in each image
  - `metadata.jsonl` should list all image file names, one per line
- `test/prompt.jsonl`: Validation prompts for testing
  - Each line contains a prompt using the placeholder token (e.g., `{"prompt": "A video of <dog> running"}`)
  - The placeholder token must match the one defined in `placeholder_token.txt`
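Before launching training, it can help to sanity-check a subject directory against the rules above. This Python sketch (illustrative, not part of this repo) verifies that the placeholder token appears in every validation prompt and that at least 5 reference images are present:

```python
import json
from pathlib import Path

def validate_subject_dir(root: Path) -> list[str]:
    """Return a list of problems found in a <subject_name>/ dataset directory."""
    problems = []
    token = (root / "placeholder_token.txt").read_text().strip()
    # Every validation prompt must use the learnable subject token.
    for i, line in enumerate((root / "test" / "prompt.jsonl").read_text().splitlines()):
        prompt = json.loads(line)["prompt"]
        if token not in prompt:
            problems.append(f"prompt line {i}: missing placeholder token {token}")
    # At least 5 few-shot reference images are recommended.
    images = list((root / "train" / "image").glob("*.jpg"))
    if len(images) < 5:
        problems.append(f"only {len(images)} reference images; at least 5 recommended")
    return problems
```

An empty returned list means the directory passes both checks.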
1. Configure training settings in `CogKit/quickstart/scripts/t2v_stage1/config.yaml`:
   - Set checkpoint configurations
   - Adjust validation settings
   - Configure total epochs (300-600 epochs recommended)
2. Set execution parameters in `CogKit/quickstart/scripts/t2v_stage1/train.sh`:
   - `SUBJECT`: Subject name (e.g., `dog`, `fancy-boot`)
   - `GPU_IDS`: GPU device IDs (e.g., `0,1`)
   - `DATA_ROOT`: Path to your dataset directory
3. Run the optimization:

   ```bash
   bash CogKit/quickstart/scripts/t2v_stage1/train.sh
   ```

To help you quickly get started, we provide ready-to-use sample cases based on our coarse appearance adaptation stage. You can find the pre-trained weights on Link. Place the downloaded weights in the `stage1_checkpoints/sample/{subject}/` directory.
Configure parameters in `script/inference.sh`:
- `tgt_prompt`: Text prompt for video generation
- `image_path`: Path to one of the reference images used in optimization
- `subject_name`: Token name from `placeholder_token.txt` (e.g., `dog` from `<dog>`)
- `lora_checkpoint_path`: Path to the checkpoint folder from the Coarse Appearance Adaptation stage
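The subject name is simply the placeholder token with its angle brackets removed. For scripting around the inference step, a tiny helper (illustrative, not part of this repo) can derive it directly from `placeholder_token.txt`:

```python
def subject_name_from_token(token: str) -> str:
    """Strip angle brackets from a placeholder token, e.g. "<dog>" -> "dog"."""
    return token.strip().removeprefix("<").removesuffix(">")

# e.g. subject_name_from_token(Path("placeholder_token.txt").read_text())
```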
Run inference:

```bash
bash script/inference.sh
```

This code is based on the work of CogVideoX and CogKit. Many thanks to them for making their projects available.
If you find our work useful in your research, please consider citing:
```bibtex
@misc{lee2025vwarperappearanceconsistentvideodiffusion,
      title={V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping},
      author={Hyunkoo Lee and Wooseok Jang and Jini Yang and Taehwan Kim and Sangoh Kim and Sangwon Jung and Seungryong Kim},
      year={2025},
      eprint={2512.12375},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.12375},
}
```
