
# Model Quantization Techniques

## 📖 Overview

LightX2V supports quantized inference for the DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering model precision. As a rough guide, weights stored in FP8/INT8 (1 byte per parameter) take half the memory of BF16/FP16 (2 bytes per parameter), and INT4 roughly a quarter, before accounting for quantization scales.


## 🔧 Quantization Modes

| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
|---|---|---|---|---|
| `fp8-vllm` | FP8 channel symmetric | FP8 channel dynamic symmetric | VLLM | H100/H200/H800, RTX 40 series, etc. |
| `int8-vllm` | INT8 channel symmetric | INT8 channel dynamic symmetric | VLLM | A100/A800, RTX 30/40 series, etc. |
| `fp8-sgl` | FP8 channel symmetric | FP8 channel dynamic symmetric | SGL | H100/H200/H800, RTX 40 series, etc. |
| `int8-sgl` | INT8 channel symmetric | INT8 channel dynamic symmetric | SGL | A100/A800, RTX 30/40 series, etc. |
| `fp8-q8f` | FP8 channel symmetric | FP8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| `int8-q8f` | INT8 channel symmetric | INT8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| `int8-torchao` | INT8 channel symmetric | INT8 channel dynamic symmetric | TorchAO | A100/A800, RTX 30/40 series, etc. |
| `int4-g128-marlin` | INT4 group symmetric | FP16 | Marlin | H200/H800/A100/A800, RTX 30/40 series, etc. |
| `fp8-b128-deepgemm` | FP8 block symmetric | FP8 group symmetric | DeepGemm | H100/H200/H800, RTX 40 series, etc. |
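
The channel-symmetric modes above share the same basic recipe: each output channel of a weight matrix gets one scale factor, chosen so that the channel's largest magnitude maps to the edge of the quantized range. The sketch below illustrates this for INT8 in plain PyTorch; it is a generic illustration of the technique, not LightX2V's actual conversion code.

```python
# Minimal sketch of per-channel ("channel symmetric") INT8 weight
# quantization -- a generic illustration, not LightX2V's converter.
import torch

def quantize_per_channel_int8(w: torch.Tensor):
    """Quantize a 2D weight [out_channels, in_channels] to INT8."""
    # One scale per output channel: the channel's max magnitude maps to 127.
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.squeeze(1)

def dequantize_per_channel(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate FP32 weight from INT8 values and scales."""
    return q.float() * scale.unsqueeze(1)

w = torch.randn(1024, 1024)
q, scale = quantize_per_channel_int8(w)
err = (dequantize_per_channel(q, scale) - w).abs().max().item()
print(f"stored bytes: {q.nelement()}, max abs error: {err:.5f}")
```

"Dynamic" in the activation column means the analogous scale computation happens on the fly at inference time, since activation ranges are not known in advance; the group and block schemes (`int4-g128-marlin`, `fp8-b128-deepgemm`) compute scales over fixed-size groups or blocks of elements instead of whole channels.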

## 🔧 Obtaining Quantized Models

### Method 1: Download Pre-Quantized Models

Download pre-quantized models from LightX2V model repositories:

#### DIT Models

Download pre-quantized DIT models from Wan2.1-Distill-Models:

```bash
# Download DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
    --local-dir ./models \
    --include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```

#### Encoder Models

Download pre-quantized T5 and CLIP models from Encoders-LightX2V:

```bash
# Download T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_t5_umt5-xxl-enc-fp8.pth"

# Download CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```

### Method 2: Self-Quantize Models

For detailed instructions on the quantization tools, refer to the Model Conversion Documentation.


## 🚀 Using Quantized Models

### DIT Model Quantization

#### Supported Quantization Modes

The DIT quantization scheme (`dit_quant_scheme`) supports: `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, and `fp8-b128-deepgemm`.

#### Configuration Example

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model"  // Optional
}
```

> 💡 **Tip**: When there is only one DIT model in the script's `model_path`, `dit_quantized_ckpt` does not need to be specified separately.

### T5 Model Quantization

#### Supported Quantization Modes

The T5 quantization scheme (`t5_quant_scheme`) supports: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, and `int8-torchao`.

#### Configuration Example

```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model"  // Optional
}
```

> 💡 **Tip**: When a quantized T5 model exists in the script's specified `model_path` (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`), `t5_quantized_ckpt` does not need to be specified separately.

### CLIP Model Quantization

#### Supported Quantization Modes

The CLIP quantization scheme (`clip_quant_scheme`) supports: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, and `int8-torchao`.

#### Configuration Example

```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model"  // Optional
}
```

> 💡 **Tip**: When a quantized CLIP model exists in the script's specified `model_path` (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`), `clip_quantized_ckpt` does not need to be specified separately.

### Performance Optimization Strategy

If GPU memory is still insufficient, quantization can be combined with parameter offloading to reduce memory usage further; refer to the Parameter Offload Documentation. A combined configuration sketch is shown below.
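
As an illustration, a combined configuration might look like this. The quantization fields are the ones documented above; the offload field names are assumptions made here for illustration, so take the exact keys from the Parameter Offload Documentation.

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "cpu_offload": true,            // Assumed key name; see Parameter Offload Documentation
    "offload_granularity": "block"  // Assumed key name; see Parameter Offload Documentation
}
```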


## 📚 Related Resources

- Configuration File Examples
- Run Scripts
- Tool Documentation
- Model Repositories


After reading this document, you should be able to:

- ✅ Understand the quantization schemes supported by LightX2V
- ✅ Select an appropriate quantization strategy for your hardware
- ✅ Correctly configure quantization parameters
- ✅ Obtain and use quantized models
- ✅ Optimize inference performance and memory usage

If you have other questions, feel free to ask in GitHub Issues.