
V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping

Project Page · arXiv

Hyunkoo Lee* · Wooseok Jang* · Jini Yang* · Taehwan Kim · Sangoh Kim · Sangwon Jung · Seungryong Kim

KAIST AI 

* Equal contribution


V-Warper Teaser

This is the official implementation of the paper "V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping". V-Warper enables video personalization that generates videos with appearance-consistent subjects from just a few reference images. Without requiring large-scale video finetuning, our method achieves superior appearance fidelity while preserving motion dynamics and prompt alignment.

Environment Settings

```shell
git clone https://github.com/cvlab-kaist/V-Warper.git
cd V-Warper
conda create -n v_warper python=3.12
conda activate v_warper
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

Dataset Preparation

To personalize your own subject, you need to prepare a subject image dataset with the following structure. See CogKit/data/sample/dog as an example:

```
CogKit/data/sample/
└── <subject_name>/                    # e.g., dog, fancy-boot
    ├── placeholder_token.txt          # Placeholder token for the subject (e.g., "<dog>")
    ├── initializer_token.txt          # Initial text for learnable subject embedding (e.g., "dog")
    ├── train/
    │   └── image/
    │       ├── 00.jpg                 # Few-shot reference images (at least 5 images recommended)
    │       ├── 01.jpg                 # Resolution: 720×480, with subject clearly visible
    │       ├── ...
    │       └── metadata.jsonl         # Records image file names: {"file_name": "00.jpg"}
    └── test/
        └── prompt.jsonl               # Validation prompts: {"prompt": "A video of <dog> running"}
```

File Descriptions:

  • placeholder_token.txt: The special token used to refer to your subject in prompts (e.g., <dog>). This is the learnable subject token.
  • initializer_token.txt: The initialization text for the learnable subject embedding (e.g., dog).
  • train/image/: Training images for personalization
    • Include at least 5 reference images showing the subject from different angles
    • Images should be preprocessed to 720×480 resolution
    • Ensure the subject is clearly visible in each image
    • metadata.jsonl should list all image file names, one per line
  • test/prompt.jsonl: Validation prompts for testing
    • Each line contains a prompt using the placeholder token (e.g., {"prompt": "A video of <dog> running"})
    • The placeholder token must match the one defined in placeholder_token.txt
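The two JSONL files above can be generated automatically. The following is a minimal sketch for the sample dog subject (paths, the `<dog>` token, and the prompt text are examples; adapt them to your own subject):

```shell
SUBJECT_DIR="CogKit/data/sample/dog"     # example subject folder
mkdir -p "$SUBJECT_DIR/train/image" "$SUBJECT_DIR/test"

# metadata.jsonl: one JSON line per training image, e.g. {"file_name": "00.jpg"}
: > "$SUBJECT_DIR/train/image/metadata.jsonl"
for f in "$SUBJECT_DIR"/train/image/*.jpg; do
  [ -e "$f" ] || continue              # skip when no .jpg files exist yet
  printf '{"file_name": "%s"}\n' "$(basename "$f")" \
    >> "$SUBJECT_DIR/train/image/metadata.jsonl"
done

# prompt.jsonl: validation prompts that use the placeholder token
printf '{"prompt": "A video of <dog> running"}\n' > "$SUBJECT_DIR/test/prompt.jsonl"
```

The prompt written here must use the same token stored in placeholder_token.txt.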

Run V-Warper

Coarse Appearance Adaptation

  1. Configure training settings in CogKit/quickstart/scripts/t2v_stage1/config.yaml:

    • Set checkpoint configurations
    • Adjust validation settings
    • Configure total epochs (300-600 epochs recommended)
  2. Set execution parameters in CogKit/quickstart/scripts/t2v_stage1/train.sh:

    • SUBJECT: Subject name (e.g., dog, fancy-boot)
    • GPU_IDS: GPU device IDs (e.g., 0,1)
    • DATA_ROOT: Path to your dataset directory
  3. Run Optimization:

```shell
bash CogKit/quickstart/scripts/t2v_stage1/train.sh
```
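For reference, the execution parameters from step 2 might look like the following inside train.sh (variable names follow the list above; the values shown are examples, not defaults from the repository):

```shell
SUBJECT="dog"                    # subject name matching the dataset folder
GPU_IDS="0,1"                    # GPU device IDs to train on
DATA_ROOT="CogKit/data/sample"   # path to your dataset directory
```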

Pre-trained Weights

To help you get started quickly, we provide ready-to-use sample cases from our coarse appearance adaptation stage. The pre-trained weights are available at Link.

Place the downloaded weights in the stage1_checkpoints/sample/{subject}/ directory.

Fine Appearance Injection

Configure parameters in script/inference.sh:

  • tgt_prompt: Text prompt for video generation
  • image_path: Path to one of the reference images used in optimization
  • subject_name: Token name from placeholder_token.txt (e.g., dog from <dog>)
  • lora_checkpoint_path: Path to checkpoint folder from Coarse Appearance Adaptation stage

Run inference:

```shell
bash script/inference.sh
```
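A hedged sketch of how the parameters above might be set inside script/inference.sh (names follow the list above; paths, prompt, and token values are examples for the sample dog subject):

```shell
tgt_prompt="A video of <dog> running on the beach"                 # generation prompt
image_path="CogKit/data/sample/dog/train/image/00.jpg"             # a reference image used in optimization
subject_name="dog"                                                 # token name from placeholder_token.txt
lora_checkpoint_path="stage1_checkpoints/sample/dog"               # coarse-stage checkpoint folder
```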

Results

V-Warper Baseline Comparison

Acknowledgements

This code builds on CogVideoX and CogKit. Many thanks to the authors for making their projects available.

Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{lee2025vwarperappearanceconsistentvideodiffusion,
    title={V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping},
    author={Hyunkoo Lee and Wooseok Jang and Jini Yang and Taehwan Kim and Sangoh Kim and Sangwon Jung and Seungryong Kim},
    year={2025},
    eprint={2512.12375},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.12375},
}
```
