
# 🚀 Benchmark

This document presents LightX2V performance benchmarks across different hardware environments, with detailed comparison data for the H200 and RTX 4090 platforms.


## 🖥️ H200 Environment (~140GB VRAM)

### 📋 Software Environment Configuration

| Component     | Version     |
| ------------- | ----------- |
| Python        | 3.11        |
| PyTorch       | 2.7.1+cu128 |
| SageAttention | 2.2.0       |
| vLLM          | 0.9.2       |
| sgl-kernel    | 0.1.8       |

### 🎬 480P 5s Video Test

**Test Configuration:**

#### 📊 Performance Comparison Table

| Configuration | Inference Time (s) | GPU Memory (GB) | Speedup | Video Effect |
| --- | --- | --- | --- | --- |
| Wan2.1 Official | 366 | 71 | 1.0x | baseline.mp4 |
| FastVideo | 292 | 26 | 1.25x | fastvideo.mp4 |
| LightX2V_1 | 250 | 53 | 1.46x | output_lightx2v_wan_i2v_1.mp4 |
| LightX2V_2 | 216 | 50 | 1.70x | output_lightx2v_wan_i2v_2.mp4 |
| LightX2V_3 | 191 | 35 | 1.92x | output_lightx2v_wan_i2v_3.mp4 |
| LightX2V_3-Distill | 14 | 35 | 🏆 20.85x | distill.mp4 |
| LightX2V_4 | 107 | 35 | 3.41x | output_lightx2v_wan_i2v_4.mp4 |
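The Speedup column is the baseline inference time divided by each configuration's inference time. A quick sanity check in Python, using the non-distilled times from the table above (the published figures round slightly differently, so expect agreement within about ±0.02):

```python
# Speedup = Wan2.1 Official inference time / configuration inference time.
# Inference times (seconds) are taken from the H200 480P table above.
BASELINE = 366  # Wan2.1 Official

times = {
    "FastVideo": 292,
    "LightX2V_1": 250,
    "LightX2V_2": 216,
    "LightX2V_3": 191,
    "LightX2V_4": 107,
}

for name, seconds in times.items():
    print(f"{name}: {BASELINE / seconds:.2f}x")
```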

### 🎬 720P 5s Video Test

**Test Configuration:**

#### 📊 Performance Comparison Table

| Configuration | Inference Time (s) | GPU Memory (GB) | Speedup | Video Effect |
| --- | --- | --- | --- | --- |
| Wan2.1 Official | 974 | 81 | 1.0x | baseline.mp4 |
| FastVideo | 914 | 40 | 1.07x | fastvideo.mp4 |
| LightX2V_1 | 807 | 65 | 1.21x | output_lightx2v_wan_i2v_720_1.mp4 |
| LightX2V_2 | 751 | 57 | 1.30x | output_lightx2v_wan_i2v_720_2.mp4 |
| LightX2V_3 | 671 | 43 | 1.45x | output_lightx2v_wan_i2v_720_3.mp4 |
| LightX2V_3-Distill | 44 | 43 | 🏆 22.14x | output_lightx2v_wan_i2v_720_3_distill.mp4 |
| LightX2V_4 | 344 | 46 | 2.83x | output_lightx2v_wan_i2v_720_4.mp4 |

## 🖥️ RTX 4090 Environment (~24GB VRAM)

### 📋 Software Environment Configuration

| Component     | Version     |
| ------------- | ----------- |
| Python        | 3.9.16      |
| PyTorch       | 2.5.1+cu124 |
| SageAttention | 2.1.0       |
| vLLM          | 0.6.6       |
| sgl-kernel    | 0.0.5       |
| q8-kernels    | 0.0.0       |

### 🎬 480P 5s Video Test

**Test Configuration:**

#### 📊 Performance Comparison Table

| Configuration | Inference Time (s) | GPU Memory (GB) | Speedup | Video Effect |
| --- | --- | --- | --- | --- |
| Wan2GP(profile=3) | 779 | 20 | 1.0x | wan2gp_480p.mp4 |
| LightX2V_5 | 738 | 16 | 1.05x | lightx2v_5_480p.mp4 |
| LightX2V_5-Distill | 68 | 16 | 11.45x | lightx2v_5_distill_480p.mp4 |
| LightX2V_6 | 630 | 12 | 1.24x | lightx2v_6_480p.mp4 |
| LightX2V_6-Distill | 63 | 12 | 🏆 12.36x | lightx2v_6_distill_480p.mp4 |

### 🎬 720P 5s Video Test

**Test Configuration:**

#### 📊 Performance Comparison Table

| Configuration | Inference Time (s) | GPU Memory (GB) | Speedup | Video Effect |
| --- | --- | --- | --- | --- |
| Wan2GP(profile=3) | -- | OOM | -- | -- |
| LightX2V_5 | 2473 | 23 | -- | lightx2v_5_720p.mp4 |
| LightX2V_5-Distill | 183 | 23 | -- | lightx2v_5_distill_720p.mp4 |
| LightX2V_6 | 2169 | 18 | -- | lightx2v_6_720p.mp4 |
| LightX2V_6-Distill | 171 | 18 | -- | lightx2v_6_distill_720p.mp4 |

> Note: Wan2GP(profile=3) runs out of memory (OOM) at 720P on the RTX 4090, so no baseline is available and speedup ratios are not reported.

## 📖 Configuration Descriptions

### 🖥️ H200 Environment Configuration Descriptions

| Configuration | Technical Features |
| --- | --- |
| Wan2.1 Official | Original implementation from the official Wan2.1 repository |
| FastVideo | Based on the official FastVideo repository, using the SageAttention2 backend optimization |
| LightX2V_1 | Replaces native attention with SageAttention2 and runs the DiT in BF16 with FP32 for a few sensitive layers, improving computational efficiency while preserving precision |
| LightX2V_2 | Unified BF16 computation, further reducing memory usage and computational overhead while maintaining generation quality |
| LightX2V_3 | Adds FP8 quantization to significantly lower precision requirements, combined with Tiling VAE to reduce memory usage |
| LightX2V_3-Distill | LightX2V_3 with a 4-step distillation model (infer_steps=4, enable_cfg=False), further reducing inference steps while maintaining generation quality |
| LightX2V_4 | LightX2V_3 plus TeaCache (teacache_thresh=0.2) cache reuse, accelerating inference by intelligently skipping redundant computation |
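For illustration, the distillation and TeaCache settings named above can be written as a config fragment. This is a sketch only: `infer_steps`, `enable_cfg`, and `teacache_thresh` come from the descriptions above, but the surrounding structure is an assumption — consult the JSON files under configs/bench for the real schema.

```python
import json

# Sketch of the distillation and TeaCache knobs named above.
# Anything beyond infer_steps / enable_cfg / teacache_thresh is an
# assumption; see configs/bench for the actual configuration schema.
distill_fragment = {
    "infer_steps": 4,     # 4-step distillation model
    "enable_cfg": False,  # classifier-free guidance disabled
}

teacache_fragment = {
    "teacache_thresh": 0.2,  # higher threshold -> more steps reused/skipped
}

print(json.dumps({**distill_fragment, **teacache_fragment}, indent=2))
```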

### 🖥️ RTX 4090 Environment Configuration Descriptions

| Configuration | Technical Features |
| --- | --- |
| Wan2GP(profile=3) | Based on the Wan2GP repository with MMGP optimization. The profile=3 setting targets RTX 3090/4090 machines with at least 32 GB RAM and 24 GB VRAM, adapting to limited RAM by trading off VRAM. Uses quantized models (480P and 720P) |
| LightX2V_5 | Replaces native attention with SageAttention2 and runs the DiT in FP8 with FP32 for a few sensitive layers; CPU offload is enabled, asynchronously moving DiT inference data to the CPU at block-level granularity to save VRAM |
| LightX2V_5-Distill | LightX2V_5 with a 4-step distillation model (infer_steps=4, enable_cfg=False), further reducing inference steps while maintaining generation quality |
| LightX2V_6 | LightX2V_3 with CPU offload enabled: sensitive layers run in FP32, and DiT inference data is asynchronously offloaded to the CPU at block-level granularity to save VRAM |
| LightX2V_6-Distill | LightX2V_6 with a 4-step distillation model (infer_steps=4, enable_cfg=False), further reducing inference steps while maintaining generation quality |
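Block-level CPU offload, as used by LightX2V_5/6, keeps only the DiT block that is currently executing on the GPU. The idea can be sketched in PyTorch as a simplified synchronous loop (LightX2V's actual implementation offloads asynchronously; the linear layers here are stand-ins for real DiT blocks):

```python
import torch

# Illustrative sketch of block-level CPU offload: weights live in CPU
# memory, and each block is moved to the accelerator only while it runs.
# This is a simplification, not LightX2V's actual offload implementation.
device = "cuda" if torch.cuda.is_available() else "cpu"

blocks = [torch.nn.Linear(64, 64) for _ in range(4)]  # stand-in DiT blocks

def forward_with_offload(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for block in blocks:
        block.to(device)   # load this block's weights onto the accelerator
        x = block(x)
        block.to("cpu")    # release accelerator memory before the next block
    return x

out = forward_with_offload(torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 64])
```

The trade-off is visible in the tables above: offload configurations use noticeably less VRAM at the cost of extra transfer time, which asynchronous block-level scheduling helps hide.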

πŸ“ Configuration Files Reference

Benchmark-related configuration files and execution scripts are available at:

| Type | Link | Description |
| --- | --- | --- |
| Configuration Files | configs/bench | JSON files with the various optimization configurations |
| Execution Scripts | scripts/bench | Benchmark execution scripts |

> 💡 **Tip**: Choose the optimization scheme that matches your hardware configuration to get the best performance.