NOTE: stable-fast is still in beta and may be buggy. Feel free to try it out and share suggestions!
stable-fast is an ultra lightweight inference optimization library for HuggingFace Diffusers on NVIDIA GPUs.
stable-fast speeds up inference through the following key techniques and features:
- CUDNN Convolution Fusion: stable-fast implements a series of fully functional and fully compatible CUDNN convolution fusion operators for all kinds of combinations of `Conv + Bias + Add + Act` computation patterns.
- Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute with `fp16` precision, which is faster than PyTorch's defaults (read & write with `fp16` while computing with `fp32`).
- NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC `GroupNorm + GELU` operator with OpenAI's `triton`, which eliminates the need for memory format permutation operators.
- Fully Traced Model: stable-fast improves the `torch.jit.trace` interface to make it better suited to tracing complex models. Nearly every part of `StableDiffusionPipeline` can be traced and converted to TorchScript. It is more stable than `torch.compile`, has significantly lower CPU overhead, and supports ControlNet and LoRA.
- CUDA Graph: stable-fast can capture the UNet structure into CUDA Graph format, which can reduce the CPU overhead when the batch size is small.
- Fused Multihead Attention: stable-fast simply uses xformers and makes it compatible with TorchScript.
- Fast: stable-fast is specially optimized for HuggingFace Diffusers. In the benchmarks below, it is the fastest or close to the fastest of the libraries tested.
- Minimal: stable-fast works as a plugin framework for `PyTorch`. It utilizes existing `PyTorch` functionality and infrastructure and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
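To make the "Low Precision & Fused GEMM" point above more concrete, here is a minimal pure-Python sketch of what "compute with fp16" versus "read & write with fp16 while computing with fp32" means numerically. It is not stable-fast's kernel code: the helper names (`to_fp16`, `to_fp32`, `dot`) are hypothetical, rounding is emulated with the standard `struct` module, and the strictly sequential accumulation is an assumption chosen to make the effect easy to see. Real GPU kernels accumulate in blocks and will behave differently.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(xs, ys, round_compute):
    """Dot product of fp16 inputs.

    Every intermediate (product and running sum) is rounded with
    `round_compute`; the final result is always stored as fp16.
    """
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = round_compute(acc + round_compute(x * y))
    return to_fp16(acc)


n = 4096
xs = [to_fp16(0.1)] * n  # inputs are stored in fp16 in both modes
ys = [to_fp16(0.1)] * n

reference = sum(x * y for x, y in zip(xs, ys))  # double-precision reference
fp16_result = dot(xs, ys, to_fp16)  # compute with fp16
fp32_result = dot(xs, ys, to_fp32)  # compute with fp32, write fp16

print("reference     :", reference)
print("fp16 compute  :", fp16_result, "error", abs(fp16_result - reference))
print("fp32 compute  :", fp32_result, "error", abs(fp32_result - reference))
```

The sketch only shows the numerical side of the trade-off: fp16 intermediates are cheaper to compute but round more coarsely, which is why frameworks often default to fp32 computation.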
| Framework | Performance |
|---|---|
| Vanilla PyTorch | 23 it/s |
| AITemplate | 44 it/s |
| TensorRT | 52 it/s |
| OneFlow | 55 it/s |
| Stable Fast (with xformers & triton) | 60 it/s |
| Framework | Performance |
|---|---|
| Vanilla PyTorch | 16 it/s |
| AITemplate | 31 it/s |
| TensorRT | 33 it/s |
| OneFlow | 39 it/s |
| Stable Fast (with xformers & triton) | 38 it/s |
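The tables above report raw throughput. Relative speedup over vanilla PyTorch is a simple ratio, and the sketch below computes it from the figures in both tables. The names `table_1`, `table_2`, and `speedups` are placeholders introduced here for illustration only.

```python
# Throughput in it/s, copied from the two tables above.
table_1 = {
    "Vanilla PyTorch": 23,
    "AITemplate": 44,
    "TensorRT": 52,
    "OneFlow": 55,
    "Stable Fast (with xformers & triton)": 60,
}
table_2 = {
    "Vanilla PyTorch": 16,
    "AITemplate": 31,
    "TensorRT": 33,
    "OneFlow": 39,
    "Stable Fast (with xformers & triton)": 38,
}


def speedups(table, baseline="Vanilla PyTorch"):
    """Return each framework's throughput divided by the baseline's."""
    base = table[baseline]
    return {name: its / base for name, its in table.items()}


for label, table in (("table 1", table_1), ("table 2", table_2)):
    print(label)
    for name, ratio in speedups(table).items():
        print(f"  {name}: {ratio:.2f}x")
```

With these numbers, Stable Fast comes out at roughly 2.6x the vanilla PyTorch throughput in the first table and roughly 2.4x in the second.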
NOTE: stable-fast is currently only tested on Linux. You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).
# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas
# Install PyTorch with CUDA and other packages first
pip install torch diffusers xformers 'triton>=2.1.0'
# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)

NOTE: Any usage outside sfast.compilers is not guaranteed to be backward compatible.
NOTE: To get the best performance, xformers and OpenAI's triton>=2.1.0 need to be installed and enabled. You might need to build xformers from source to make it compatible with your PyTorch.
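Before enabling the optional backends, it can help to check whether they are importable in the current environment. The following is a small sketch using only the standard library. The function name `optional_backends` is hypothetical, and the check only reports whether each package can be found, not whether its version (for example `triton>=2.1.0`) or build matches your PyTorch.

```python
import importlib.util


def optional_backends(names=("xformers", "triton")):
    """Return {package_name: bool} telling whether each package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, available in optional_backends().items():
    status = "available" if available else "not installed"
    print(f"{name}: {status}")
```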
# TCMalloc is highly suggested to reduce CPU overhead
# https://github.com/google/tcmalloc
LD_PRELOAD=/path/to/libtcmalloc.so python3 ...

import packaging.version
import torch
if packaging.version.parse(torch.__version__) >= packaging.version.parse('1.12.0'):
    torch.backends.cuda.matmul.allow_tf32 = True

import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)


def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model

model = load_model()
config = CompilationConfig.Default()
# xformers and triton are suggested for achieving best performance.
# It might be slow for triton to generate, compile and fine-tune kernels.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('triton not installed, skip')
# CUDA Graph is suggested for small batch sizes.
# After capturing, the model only accepts one fixed image size.
# If you want the model to be dynamic, don't enable it.
config.enable_cuda_graph = True
compiled_model = compile(model, config)
kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detail face, lineart, monochrome, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=50,
    num_images_per_prompt=1,
)
# NOTE: Warm it up.
# The first call will trigger compilation and might be very slow.
# After the first call, it should be very fast.
output_image = compiled_model(**kwarg_inputs).images[0]
# Let's see the second call!
output_image = compiled_model(**kwarg_inputs).images[0]
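
To compare the warmed-up call against the first one, or to get a rough it/s figure, the call can be wrapped in a timer. The sketch below is one possible way to do this and is not part of stable-fast: `iterations_per_second` and `fake_pipeline_call` are hypothetical names, and the stand-in workload just sleeps so that the example runs without a GPU. Timing a whole pipeline call also includes work outside the denoising loop, so the result is not necessarily comparable to the figures in the tables above.

```python
import time


def iterations_per_second(run, num_steps, repeats=3):
    """Call `run()` `repeats` times and return num_steps / best wall time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()  # must return only after the work has finished
        best = min(best, time.perf_counter() - start)
    return num_steps / best


def fake_pipeline_call():
    """Stand-in for a warmed-up pipeline call (assumption: no GPU needed)."""
    time.sleep(0.01)


rate = iterations_per_second(fake_pipeline_call, num_steps=50)
print(f"{rate:.1f} it/s")
```

With the pipeline from the example above, the stand-in would be replaced by something like `lambda: compiled_model(**kwarg_inputs)` and `num_steps` by `kwarg_inputs['num_inference_steps']`, after the warm-up call has completed.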