For additional memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.
<Tip>
Attention slicing is useful even with a batch size of just 1, as long as the model uses more than one attention head. If there is more than one attention head, the *QK^T* attention matrix can be computed sequentially for each head, which can save a significant amount of memory.
</Tip>
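To make the memory argument concrete, here is a minimal sketch (not the diffusers implementation) of computing attention one head at a time, so that only a single *QK^T* matrix is materialized at once:

```Python
import torch

def per_head_attention(q, k, v):
    # q, k, v: tensors of shape (heads, seq_len, head_dim)
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for h in range(q.shape[0]):
        # Only this head's (seq_len, seq_len) attention matrix is in memory
        attn = torch.softmax((q[h] @ k[h].transpose(-1, -2)) * scale, dim=-1)
        out[h] = attn @ v[h]
    return out
```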
To perform the attention computation sequentially over each head, you only need to invoke [`~DiffusionPipeline.enable_attention_slicing`] in your pipeline before inference, as shown below:
```Python
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
prompt ="a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```
There's a small performance penalty (inference is about 10% slower), but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!
## Sliced VAE decode for larger batches
To decode large batches of images with limited VRAM, or to enable batches with 32 images or more, you can use sliced VAE decode that decodes the batch latents one image at a time.
You likely want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
To perform the VAE decode one image at a time, invoke [`~StableDiffusionPipeline.enable_vae_slicing`] in your pipeline before inference. For example:
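A minimal sketch, reusing the Stable Diffusion v1.5 setup from the attention slicing example above (the batch of 32 prompts is illustrative):

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_vae_slicing()
# Latents for the 32 images are decoded one at a time instead of in a single batch
images = pipe([prompt] * 32).images
```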
You may see a small performance boost in VAE decode on multi-image batches. There should be no performance impact on single-image batches.

## Tiled VAE decode and encode for large images
Tiled VAE processing makes it possible to work with large images on limited VRAM, for example, generating 4k images with 8GB of VRAM. The tiled VAE decoder splits the image into overlapping tiles, decodes each tile, and blends the outputs to compose the final image.
You want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
To use tiled VAE processing, invoke [`~StableDiffusionPipeline.enable_vae_tiling`] in your pipeline before inference. For example:
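A minimal sketch, assuming the same Stable Diffusion v1.5 checkpoint as above; the large resolution and step count are illustrative:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a beautiful landscape photograph"
pipe.enable_vae_tiling()
# The VAE decodes the large image in overlapping tiles instead of all at once
image = pipe(prompt, width=3840, height=2224, num_inference_steps=64).images[0]
```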