LightX2V supports quantized inference for DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering model precision.
| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
|---|---|---|---|---|
| fp8-vllm | FP8 channel symmetric | FP8 channel dynamic symmetric | VLLM | H100/H200/H800, RTX 40 series, etc. |
| int8-vllm | INT8 channel symmetric | INT8 channel dynamic symmetric | VLLM | A100/A800, RTX 30/40 series, etc. |
| fp8-sgl | FP8 channel symmetric | FP8 channel dynamic symmetric | SGL | H100/H200/H800, RTX 40 series, etc. |
| int8-sgl | INT8 channel symmetric | INT8 channel dynamic symmetric | SGL | A100/A800, RTX 30/40 series, etc. |
| fp8-q8f | FP8 channel symmetric | FP8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| int8-q8f | INT8 channel symmetric | INT8 channel dynamic symmetric | Q8-Kernels | RTX 40 series, L40S, etc. |
| int8-torchao | INT8 channel symmetric | INT8 channel dynamic symmetric | TorchAO | A100/A800, RTX 30/40 series, etc. |
| int4-g128-marlin | INT4 group symmetric | FP16 | Marlin | H200/H800/A100/A800, RTX 30/40 series, etc. |
| fp8-b128-deepgemm | FP8 block symmetric | FP8 group symmetric | DeepGemm | H100/H200/H800, RTX 40 series, etc. |
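The choice between FP8 and INT8 schemes depends mainly on whether the GPU has native FP8 support (compute capability 8.9 and above, i.e. Ada and Hopper). If you are unsure what your card supports, recent NVIDIA drivers can report the compute capability directly; this query is a driver feature, not part of LightX2V:

```bash
# Query GPU name and compute capability (requires a recent NVIDIA driver).
# 8.9 (Ada: RTX 40 series, L40S) and 9.0 (Hopper: H100/H200/H800) have native FP8,
# so the fp8-* schemes apply; A100/A800 (8.0) and RTX 30 series (8.6) should use int8-*.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```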
Download pre-quantized models from LightX2V model repositories:
### DIT Models

Download pre-quantized DIT models from Wan2.1-Distill-Models:

```bash
# Download DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
    --local-dir ./models \
    --include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```

### Encoder Models
Download pre-quantized T5 and CLIP models from Encoders-LightX2V:
```bash
# Download T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_t5_umt5-xxl-enc-fp8.pth"

# Download CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```

For detailed quantization tool usage, refer to the Model Conversion Documentation.
Supported DIT quantization modes (`dit_quant_scheme`): `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, `fp8-b128-deepgemm`
```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model" // Optional
}
```

💡 Tip: When there is only one DIT model in the script's `model_path`, `dit_quantized_ckpt` doesn't need to be specified separately.
Supported T5 quantization modes (`t5_quant_scheme`): `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`
```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model" // Optional
}
```

💡 Tip: When a quantized T5 model exists in the script's specified `model_path` (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`), `t5_quantized_ckpt` doesn't need to be specified separately.
Supported CLIP quantization modes (`clip_quant_scheme`): `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`
```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model" // Optional
}
```

💡 Tip: When a quantized CLIP model exists in the script's specified `model_path` (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`), `clip_quantized_ckpt` doesn't need to be specified separately.
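DIT, T5, and CLIP quantization can be enabled together in one config. A minimal combined sketch using the keys above (the scheme choices are illustrative, and per the tips above the optional `*_quantized_ckpt` paths are omitted):

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl"
}
```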
If GPU memory is still insufficient, quantization can be combined with parameter offloading to further reduce memory usage. Refer to the Parameter Offload Documentation; a combined sketch follows the list below:
- Wan2.1 Configuration: refer to the offload config files
- Wan2.2 Configuration: refer to the wan22 config files with the `4090` suffix
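As a rough illustration of how the two are combined in one config: the quantization keys below come from this document, while `cpu_offload` and `offload_granularity` are assumptions made here for the sketch. Take the exact offload keys and values from the Parameter Offload Documentation and the referenced config files:

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "int8-sgl",
    "cpu_offload": true,             // assumed key, see the offload documentation
    "offload_granularity": "block"   // assumed key, see the offload documentation
}
```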
After reading this document, you should be able to:

- ✅ Understand the quantization schemes supported by LightX2V
- ✅ Select an appropriate quantization strategy based on your hardware
- ✅ Correctly configure quantization parameters
- ✅ Obtain and use quantized models
- ✅ Optimize inference performance and memory usage
If you have other questions, feel free to ask in GitHub Issues.