Hyunkoo Lee* · Wooseok Jang* · Jini Yang* · Taehwan Kim · Sangoh Kim · Sangwon Jung · Seungryong Kim
KAIST AI
* Equal contribution
This is the official implementation of the paper "V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping". V-Warper enables video personalization that generates videos with appearance-consistent subjects from just a few reference images. Without requiring large-scale video finetuning, our method achieves superior appearance fidelity while preserving motion dynamics and prompt alignment.
```bash
git clone https://github.com/cvlab-kaist/V-Warper.git
cd V-Warper

conda create -n v_warper python=3.12
conda activate v_warper

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
To personalize your own subject, you need to prepare a subject image dataset with the following structure. See `CogKit/data/sample/dog` as an example:
```
CogKit/data/sample/
└── <subject_name>/                # e.g., dog, fancy-boot
    ├── placeholder_token.txt      # Placeholder token for the subject (e.g., "<dog>")
    ├── initializer_token.txt      # Initial text for learnable subject embedding (e.g., "dog")
    ├── train/
    │   └── image/
    │       ├── 00.jpg             # Few-shot reference images (at least 5 images recommended)
    │       ├── 01.jpg             # Resolution: 720×480, with subject clearly visible
    │       ├── ...
    │       └── metadata.jsonl     # Records image file names: {"file_name": "00.jpg"}
    └── test/
        └── prompt.jsonl           # Validation prompts: {"prompt": "A video of <dog> running"}
```
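Each line of `metadata.jsonl` is a JSON record naming one image file. As a convenience, a small Python sketch (illustrative, not part of this repo) can generate it from the images already placed in `train/image/`:

```python
import json
from pathlib import Path

def write_metadata(image_dir: Path) -> Path:
    """Write metadata.jsonl listing every image in image_dir, one JSON record per line."""
    names = sorted(p.name for p in image_dir.iterdir()
                   if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
    out = image_dir / "metadata.jsonl"
    with out.open("w") as f:
        for name in names:
            f.write(json.dumps({"file_name": name}) + "\n")
    return out
```

The suffix filter keeps the generated `metadata.jsonl` itself (and any stray files) out of the listing, so the script is safe to re-run.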
File Descriptions:
- `placeholder_token.txt`: The special token used to refer to your subject in prompts (e.g., `<dog>`). This is the learnable subject token.
- `initializer_token.txt`: The initialization text for the learnable subject embedding (e.g., `dog`).
- `train/image/`: Training images for personalization
  - Include at least 5 reference images showing the subject from different angles
  - Images should be preprocessed to 720×480 resolution
  - Ensure the subject is clearly visible in each image
  - `metadata.jsonl` should list all image file names, one per line
- `test/prompt.jsonl`: Validation prompts for testing
  - Each line contains a prompt using the placeholder token (e.g., `{"prompt": "A video of <dog> running"}`)
  - The placeholder token must match the one defined in `placeholder_token.txt`
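Before launching training, it can help to sanity-check a subject directory against the rules above. This Python sketch (illustrative, not part of this repo) verifies that the placeholder token appears in every validation prompt and that at least 5 reference images are present:

```python
import json
from pathlib import Path

def validate_subject_dir(root: Path) -> list[str]:
    """Return a list of problems found in a <subject_name>/ dataset directory."""
    problems = []
    token = (root / "placeholder_token.txt").read_text().strip()
    # Every validation prompt must use the learnable subject token.
    for i, line in enumerate((root / "test" / "prompt.jsonl").read_text().splitlines()):
        prompt = json.loads(line)["prompt"]
        if token not in prompt:
            problems.append(f"prompt line {i}: missing placeholder token {token}")
    # At least 5 few-shot reference images are recommended.
    images = list((root / "train" / "image").glob("*.jpg"))
    if len(images) < 5:
        problems.append(f"only {len(images)} reference images; at least 5 recommended")
    return problems
```

An empty returned list means the directory passes both checks.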
1. Configure training settings in `CogKit/quickstart/scripts/t2v_stage1/config.yaml`:
   - Set checkpoint configurations
   - Adjust validation settings
   - Configure total epochs (300-600 epochs recommended)
2. Set execution parameters in `CogKit/quickstart/scripts/t2v_stage1/train.sh`:
   - `SUBJECT`: Subject name (e.g., `dog`, `fancy-boot`)
   - `GPU_IDS`: GPU device IDs (e.g., `0,1`)
   - `DATA_ROOT`: Path to your dataset directory
3. Run the optimization:

   ```bash
   bash CogKit/quickstart/scripts/t2v_stage1/train.sh
   ```

To help you quickly get started, we provide ready-to-use sample cases based on our coarse appearance adaptation stage. You can find the pre-trained weights on Link. Place the downloaded weights in the `stage1_checkpoints/sample/{subject}/` directory.
Configure parameters in `script/inference.sh`:
- `tgt_prompt`: Text prompt for video generation
- `image_path`: Path to one of the reference images used in optimization
- `subject_name`: Token name from `placeholder_token.txt` (e.g., `dog` from `<dog>`)
- `lora_checkpoint_path`: Path to the checkpoint folder from the Coarse Appearance Adaptation stage
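The subject name is simply the placeholder token with its angle brackets removed. For scripting around the inference step, a tiny helper (illustrative, not part of this repo) can derive it directly from `placeholder_token.txt`:

```python
def subject_name_from_token(token: str) -> str:
    """Strip angle brackets from a placeholder token, e.g. "<dog>" -> "dog"."""
    return token.strip().removeprefix("<").removesuffix(">")

# e.g. subject_name_from_token(Path("placeholder_token.txt").read_text())
```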
Run inference:

```bash
bash script/inference.sh
```

This code is based on the work of CogVideoX and CogKit. Many thanks to them for making their projects available.
If you find our work useful in your research, please consider citing:
```bibtex
@misc{lee2025vwarperappearanceconsistentvideodiffusion,
      title={V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping},
      author={Hyunkoo Lee and Wooseok Jang and Jini Yang and Taehwan Kim and Sangoh Kim and Sangwon Jung and Seungryong Kim},
      year={2025},
      eprint={2512.12375},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.12375},
}
```
