LightX2V supports quantized inference for DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering model precision.
| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
|---|---|---|---|---|
| fp8-vllm | FP8 channel symmetric | FP8 channel dynamic symmetric | VLLM | H100/H200/H800, RTX 40 series, etc. |
| int8-vllm | INT8 channel symmetric | INT8 channel dynamic symmetric | VLLM | A100/A800, RTX 30/40 series, etc. |
| fp8-sgl | FP8 channel symmetric | FP8 channel dynamic symmetric | SGL | H100/H200/H800, RTX 40 series, etc. |
| int8-sgl | INT8 channel symmetric | INT8 channel dynamic symmetric | SGL | A100/A800, RTX 30/40 series, etc. |
| fp8-q8f | FP8 channel symmetric | FP8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| int8-q8f | INT8 channel symmetric | INT8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| int8-torchao | INT8 channel symmetric | INT8 channel dynamic symmetric | TorchAO | A100/A800, RTX 30/40 series, etc. |
| int4-g128-marlin | INT4 group symmetric | FP16 | Marlin | H200/H800/A100/A800, RTX 30/40 series, etc. |
| fp8-b128-deepgemm | FP8 block symmetric | FP8 group symmetric | DeepGemm | H100/H200/H800, RTX 40 series, etc. |
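The choice between FP8 and INT8 schemes depends mainly on whether the GPU has native FP8 support (compute capability 8.9 and above, i.e. Ada and Hopper). If you are unsure what your card supports, recent NVIDIA drivers can report the compute capability directly; this query is a driver feature, not part of LightX2V:

```bash
# Query GPU name and compute capability (requires a recent NVIDIA driver).
# 8.9 (Ada: RTX 40 series, L40S) and 9.0 (Hopper: H100/H200/H800) have native FP8,
# so the fp8-* schemes apply; A100/A800 (8.0) and RTX 30 series (8.6) should use int8-*.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```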
Download pre-quantized models from LightX2V model repositories:
### DIT Models

Download pre-quantized DIT models from Wan2.1-Distill-Models:

```bash
# Download DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
    --local-dir ./models \
    --include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```

### Encoder Models
Download pre-quantized T5 and CLIP models from Encoders-LightX2V:
```bash
# Download T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_t5_umt5-xxl-enc-fp8.pth"

# Download CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```

For detailed quantization tool usage, refer to the Model Conversion Documentation.
Supported DIT quantization modes (`dit_quant_scheme`): `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, `fp8-b128-deepgemm`
```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model" // Optional
}
```

💡 Tip: When there is only one DIT model in the script's `model_path`, `dit_quantized_ckpt` doesn't need to be specified separately.
Supported T5 quantization modes (`t5_quant_scheme`): `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`
```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model" // Optional
}
```

💡 Tip: When a quantized T5 model exists in the script's specified `model_path` (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`), `t5_quantized_ckpt` doesn't need to be specified separately.
Supported CLIP quantization modes (`clip_quant_scheme`): `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`
```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model" // Optional
}
```

💡 Tip: When a quantized CLIP model exists in the script's specified `model_path` (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`), `clip_quantized_ckpt` doesn't need to be specified separately.
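DIT, T5, and CLIP quantization can be enabled together in one config. A minimal combined sketch using the keys above (the scheme choices are illustrative, and per the tips above the optional `*_quantized_ckpt` paths are omitted):

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl"
}
```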
If GPU memory is still insufficient, quantization can be combined with parameter offloading to further reduce memory usage. Refer to the Parameter Offload Documentation; a combined sketch follows the list below:
- Wan2.1 Configuration: refer to the offload config files
- Wan2.2 Configuration: refer to the wan22 config files with the `4090` suffix
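As a rough illustration of how the two are combined in one config: the quantization keys below come from this document, while `cpu_offload` and `offload_granularity` are assumptions made here for the sketch. Take the exact offload keys and values from the Parameter Offload Documentation and the referenced config files:

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "int8-sgl",
    "cpu_offload": true,             // assumed key, see the offload documentation
    "offload_granularity": "block"   // assumed key, see the offload documentation
}
```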
After reading this document, you should be able to:

- ✅ Understand the quantization schemes supported by LightX2V
- ✅ Select an appropriate quantization strategy based on your hardware
- ✅ Correctly configure quantization parameters
- ✅ Obtain and use quantized models
- ✅ Optimize inference performance and memory usage
If you have other questions, feel free to ask in GitHub Issues.