[📄 Paper] [🌐 Project Page] [🤗 Model Weights]
[🎬 Demo video: WorldCanvas.mp4]
Hanlin Wang1,2, Hao Ouyang2, Qiuyu Wang2, Yue Yu1,2, Yihao Meng1,2,
Wen Wang3,2, Ka Leong Cheng2, Shuailei Ma4,2, Qingyan Bai1,2, Yixuan Li5,2,
Cheng Chen6,2, Yanhong Zeng2, Xing Zhu2, Yujun Shen2, Qifeng Chen1
1HKUST, 2Ant Group, 3ZJU, 4NEU, 5CUHK, 6NTU
WorldCanvas is an I2V framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images.
We strongly recommend visiting our demo page.
If you enjoyed the videos we created, please consider giving us a star 🌟.
- Full inference code
- WorldCanvas-14B
- WorldCanvas-14B-ref
git clone https://github.com/pPetrichor/WorldCanvas.git
cd WorldCanvas

We use an environment similar to DiffSynth; if you already have a DiffSynth environment, you can probably reuse it. Our environment also requires SAM (Segment Anything) to be installed.
conda create -n WorldCanvas python=3.10
conda activate WorldCanvas
pip install -e .
pip install git+https://github.com/facebookresearch/segment-anything.git
pip install opencv-python pycocotools matplotlib onnxruntime onnx

We use FlashAttention-3 to implement the sparse inter-shot attention and highly recommend it for its speed. Brief installation instructions for FlashAttention-3 are given below.
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
cd hopper
python setup.py install

If you encounter environment problems when installing FlashAttention-3, refer to the official GitHub page: https://github.com/Dao-AILab/flash-attention.
If you cannot install FlashAttention-3, you can use FlashAttention-2 as an alternative; our code automatically detects the installed FlashAttention version. FlashAttention-2 is slower than FlashAttention-3 but produces the same correct results.
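As a rough illustration of what this fallback amounts to (a minimal sketch, not the exact logic in our code), detecting the available FlashAttention version can look like this:

```python
# Sketch only: detect which FlashAttention package is importable.
# Assumes FlashAttention-3 exposes `flash_attn_interface` and FlashAttention-2 exposes `flash_attn`.
def detect_flash_attention_version():
    try:
        import flash_attn_interface  # noqa: F401  (FlashAttention-3 "hopper" build)
        return 3
    except ImportError:
        pass
    try:
        import flash_attn  # noqa: F401  (FlashAttention-2)
        return 2
    except ImportError:
        return None  # neither installed; fall back to a plain attention implementation

print("Detected FlashAttention version:", detect_flash_attention_version())
```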
If you want to install FlashAttention-2, you can use the following command:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

If you have already downloaded Wan 2.2 14B T2V, skip this section.
If not, you need the T5 text encoder and the VAE from the original Wan 2.2 repository: https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B
Based on the repository's file structure, you only need to download models_t5_umt5-xxl-enc-bf16.pth and Wan2.1_VAE.pth.
You do not need to download the google, high_noise_model, or low_noise_model folders, nor any other files.
We recommend using huggingface-cli to download only the necessary files. Make sure you have huggingface_hub installed (pip install huggingface_hub).
This command will download only the required T5 and VAE models into the correct directory:
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B \
--local-dir checkpoints/Wan2.2-T2V-A14B \
--include "models_t5_*.pth" "Wan2.1_VAE.pth"

Alternatively, go to the "Files" tab on the Hugging Face repo and manually download the following two files:
- models_t5_umt5-xxl-enc-bf16.pth
- Wan2.1_VAE.pth
Place both files inside a new folder named checkpoints/Wan2.2-T2V-A14B/.
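If you prefer to script the download in Python instead of using the CLI, the same selective download can be done with huggingface_hub's snapshot_download:

```python
from huggingface_hub import snapshot_download

# Download only the T5 text encoder and the VAE from the Wan 2.2 repository.
snapshot_download(
    repo_id="Wan-AI/Wan2.2-T2V-A14B",
    local_dir="checkpoints/Wan2.2-T2V-A14B",
    allow_patterns=["models_t5_*.pth", "Wan2.1_VAE.pth"],
)
```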
Download the SAM vit_h checkpoint (sam_vit_h_4b8939.pth) from the official Segment Anything repository.
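To confirm the SAM checkpoint loads correctly, an optional sanity check with the segment-anything API (using the checkpoint path from the layout below) is:

```python
from segment_anything import sam_model_registry, SamPredictor

# Optional sanity check: load the ViT-H checkpoint and build a predictor.
sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
print("SAM ViT-H loaded successfully.")
```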
Download our fine-tuned high-noise and low-noise DiT checkpoints from the following link:
[➡️ Download WorldCanvas_dit Model Checkpoints Here]
This download contains four fine-tuned model files: two for the no-reference version (WorldCanvas/high_model.safetensors and WorldCanvas/low_model.safetensors) and two for the reference-based version (WorldCanvas_ref/high_model.safetensors and WorldCanvas_ref/low_model.safetensors).
Make sure your checkpoints directory looks like this:
checkpoints/
├── sam_vit_h_4b8939.pth
├── Wan2.2-T2V-A14B/
│ ├── models_t5_umt5-xxl-enc-bf16.pth
│ └── Wan2.1_VAE.pth
└── WorldCanvas_dit/
├── WorldCanvas/
│ ├── high_model.safetensors
│ └── low_model.safetensors
└── WorldCanvas_ref/
├── high_model.safetensors
└── low_model.safetensors
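If you want to double-check the layout before running inference, a small hypothetical helper (not part of the repo) can verify that every expected file is present:

```python
from pathlib import Path

# Hypothetical helper, not part of the repo: verify the expected checkpoint files exist.
EXPECTED = [
    "sam_vit_h_4b8939.pth",
    "Wan2.2-T2V-A14B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-T2V-A14B/Wan2.1_VAE.pth",
    "WorldCanvas_dit/WorldCanvas/high_model.safetensors",
    "WorldCanvas_dit/WorldCanvas/low_model.safetensors",
    "WorldCanvas_dit/WorldCanvas_ref/high_model.safetensors",
    "WorldCanvas_dit/WorldCanvas_ref/low_model.safetensors",
]

root = Path("checkpoints")
missing = [p for p in EXPECTED if not (root / p).is_file()]
print("All checkpoints in place." if not missing else f"Missing: {missing}")
```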
If you don't have a reference image, you can proceed with inference as follows:
cd gradio
python draw_traj.py

(b) Then use SAM to select the subject you want to manipulate. Use the "Point Type" option to choose which type of SAM point to add. After making your selection, click "Confirm Mask."
(c) Now, draw a trajectory for the selected subject by clicking directly on the image; the clicked points are connected sequentially into a path. The time interval between each pair of consecutive points is treated as equal, so the smaller the distance between points, the slower the movement, and the larger the distance, the faster the movement. In the illustration, we clicked three points to form the trajectory.
After you finish clicking, set the desired start and end times for the trajectory in the "st" and "et" fields under "Stage Two: Trajectory Drawing" (this can simulate an object suddenly appearing or disappearing). Once confirmed, click "Generate Trajectory [st, et)" to create the drawn trajectory.
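To make the timing rule concrete, here is a hedged sketch (hypothetical, not the repo's actual code) of how clicked points can be resampled into one position per frame over [st, et), so that closely spaced clicks produce slow motion and widely spaced clicks produce fast motion:

```python
import numpy as np

def resample_trajectory(points, st, et):
    """Hypothetical illustration: clicked (x, y) points -> one position per frame in [st, et)."""
    points = np.asarray(points, dtype=float)       # shape (K, 2)
    n_frames = et - st
    t_clicks = np.linspace(0.0, 1.0, len(points))  # equal time per click interval
    t_frames = np.linspace(0.0, 1.0, n_frames)
    x = np.interp(t_frames, t_clicks, points[:, 0])
    y = np.interp(t_frames, t_clicks, points[:, 1])
    return np.stack([x, y], axis=1)                # shape (n_frames, 2)

# Example: three clicked points, trajectory active on frames [0, 81).
traj = resample_trajectory([(100, 200), (120, 205), (400, 300)], st=0, et=81)
```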
(d) Finally, in the "Stage Three" panel, fill in the "Object ID" and its corresponding text in "Text Description". Click "Confirm and Add to Results" to record all the conditions. (Note: A single subject performing the same action can have multiple trajectories, but they should all share the same Object ID and text.)
Tip 1: After performing step (c), you can also erase certain segments of the generated trajectory to indicate that those parts are invisible, simulating scenarios such as occlusion or rotation. Simply click "Erase Mode" after clicking "Generate Trajectory [st, et)", then select any two points along the trajectory—the segment between these two points will be marked as invisible and displayed in red. Remember to turn off Erase Mode once you've finished erasing.
For example, to achieve a rotation effect for a puppy, you can erase the segment of its trajectory during which it turns, so that the turn itself is marked invisible.
Tip 2: To simulate the effect of a new object appearing out of nowhere, you can first select any arbitrary mask.
Then, when generating the trajectory, simply set the trajectory's start time to the moment you want the object to appear.
Also, associate the corresponding text with this trajectory.
Similarly, to simulate the effect of an object disappearing, simply set the trajectory's end time to the desired moment when the object should vanish, and then assign the corresponding text.
Tip 3: If you want to keep the camera stationary, you can randomly select several background objects. When drawing the trajectory, click only a single point for each mask and set the trajectory time range to 0–81 to create static trajectories. For these trajectories, simply enter "None" as the corresponding text; the code will automatically ignore this text and consider only the trajectory for camera control purposes.
If you want to control camera motion, you can randomly select several background objects and draw movement trajectories for them. Similarly, just enter "None" as the corresponding text for these trajectories.
(f) Finally, set the save path and save the generated conditions to a JSON file by clicking "Save JSON".
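Before running inference, you can optionally load the saved JSON to make sure it was written correctly; the exact structure of the file is defined by the gradio tool, so this is only a generic check (the filename below is a placeholder):

```python
import json

# Generic check only: the structure and keys are defined by the gradio tool.
with open("your_conditions.json") as f:   # placeholder path
    conditions = json.load(f)
print(json.dumps(conditions, indent=2)[:500])  # preview the first part of the file
```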
Set the 'sample' variable on line 174 of WorldCanvas_inference.py to ['your initial image path', 'your JSON file path'], and set the save paths on lines 247 and 248, then run:
python WorldCanvas_inference.py --seed 0

You can change the seed to obtain different results.
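If you want to sample several seeds in one go, an optional wrapper (not part of the repo) is:

```python
import subprocess

# Optional convenience: run inference with several seeds.
# Remember to set the 'sample' variable and save paths in WorldCanvas_inference.py first (see above).
for seed in (0, 1, 2):
    subprocess.run(["python", "WorldCanvas_inference.py", "--seed", str(seed)], check=True)
```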
If you have reference images, you first need to use the gradio tool to composite the reference image onto the background image:
cd gradio
python ref_image_generation.py

You will then see the reference-image composition interface.
First, upload the background image.
Then, if you need to expand the canvas, click "Canvas Expansion" below and set the number of pixels to extend in the top, bottom, left, and right directions. After confirming your settings, click "Confirm & Expand".
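Conceptually, canvas expansion simply pads the background with extra pixels on each side; a rough PIL illustration follows (the gradio tool does this for you, and the filenames and white fill color here are assumptions):

```python
from PIL import Image, ImageOps

# Illustration only: pad the background by (left, top, right, bottom) pixels.
# Filenames and the white fill color are assumptions for this sketch.
background = Image.open("background.jpg")
expanded = ImageOps.expand(background, border=(100, 0, 100, 0), fill=(255, 255, 255))
expanded.save("background_expanded.jpg")
```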
Next, click "2. Add Subject" in the top-right corner to add the reference image. You need to upload the reference image and use the SAM model to select the content you want. Click "Confirm Crop" to confirm. (Note: When you first enter "2. Add Subject," the background image will temporarily disappear—this is normal and can be ignored. Simply continue with the process, and the background will reappear automatically in "Step 2.2.")
Next, adjust the parameters in "Step 2.2" to control the size and position of the reference image. After confirming the reference image mask in 'Step 2.1', the reference image will not immediately appear on the background. It will only be displayed in real time on the background when you adjust the parameters in this step. Note that during adjustment, the preview will appear very blurry, but once you've finalized the size and position and click "Confirm Paste," the result will become clear.
You can repeat the above steps to continuously adjust the canvas size and insert any number of reference images. (Click the ❌ in the top-right corner of the reference image block in 'Step 2.1' to delete the current image and add a new one.)
Finally, click "Generate JPG Link" to download the resulting image. (Note: Before saving the result, please check the "Dimensions" hint below the image and try to keep the aspect ratio as close as possible to 832(width):480(height). This is because we will resize the initial image when drawing trajectories, and a significant deviation from this ratio may cause distortion of the main subject.)
Once you have the composite image, follow the steps in "Inference without reference image" above, using the composite image as the initial image when drawing the conditions for video generation.
After generating conditions, set the 'sample' variable on line 174 of WorldCanvas_inference_refimage.py to ['your initial image path', 'your JSON file path'], and set the save paths on lines 247 and 248, then run:
python WorldCanvas_inference_refimage.py --seed 0

You can change the seed to obtain different results.
We provide several examples in the example folder; they are used directly in the WorldCanvas_inference.py and WorldCanvas_inference_refimage.py scripts, so you can try them out immediately.
If you find this work useful, please consider citing our paper:
@article{wang2025worldcanvas,
title={The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text},
author={Hanlin Wang and Hao Ouyang and Qiuyu Wang and Yue Yu and Yihao Meng and Wen Wang and Ka Leong Cheng and Shuailei Ma and Qingyan Bai and Yixuan Li and Cheng Chen and Yanhong Zeng and Xing Zhu and Yujun Shen and Qifeng Chen},
journal={arXiv preprint arXiv:2512.16924},
year={2025}
}

This project is licensed under the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License).
The code is provided for academic research purposes only.
For any questions, please contact [email protected].