Commit f09ca90

DN6 and yiyixuxu authored
Multiple small fixes to Video Pipeline docs (huggingface#6805)
* update
* update
* update
* Update src/diffusers/pipelines/i2vgen_xl/pipeline_i2vgen_xl.py
  Co-authored-by: YiYi Xu <[email protected]>
* update
* update

---------

Co-authored-by: YiYi Xu <[email protected]>
1 parent a5fc62f commit f09ca90

7 files changed: +39 -34 lines changed

docs/source/en/api/pipelines/i2vgenxl.md

Lines changed: 6 additions & 6 deletions
@@ -18,11 +18,11 @@ The abstract from the paper is:

*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).*

-The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).
+The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).

<Tip>

-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage).
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage).

</Tip>

@@ -31,7 +31,7 @@ Sample output with I2VGenXL:
<table>
<tr>
<td><center>
-masterpiece, bestquality, sunset.
+library.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"
alt="library"
@@ -43,9 +43,9 @@ Sample output with I2VGenXL:
## Notes

* I2VGenXL always uses a `clip_skip` value of 1. This means it leverages the penultimate layer representations from the text encoder of CLIP.
-* It can generate videos of quality that is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
-* Unlike SVD, it additionally accepts text prompts as inputs.
-* It can generate higher resolution videos.
+* It can generate videos of quality that is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
+* Unlike SVD, it additionally accepts text prompts as inputs.
+* It can generate higher resolution videos.
* When using the [`DDIMScheduler`] (which is default for this pipeline), less than 50 steps for inference leads to bad results.

## I2VGenXLPipeline

docs/source/en/api/pipelines/pia.md

Lines changed: 3 additions & 3 deletions
@@ -70,7 +70,7 @@ Here are some sample outputs:
<table>
<tr>
<td><center>
-masterpiece, bestquality, sunset.
+cat in a field.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pia-default-output.gif"
alt="cat in a field"
@@ -119,7 +119,7 @@ image = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
image = image.resize((512, 512))
-prompt = "cat in a hat"
+prompt = "cat in a field"
negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"

generator = torch.Generator("cpu").manual_seed(0)
@@ -132,7 +132,7 @@ export_to_gif(frames, "pia-freeinit-animation.gif")
<table>
<tr>
<td><center>
-masterpiece, bestquality, sunset.
+cat in a field.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pia-freeinit-output-cat.gif"
alt="cat in a field"

docs/source/en/api/pipelines/text_to_video.md

Lines changed: 5 additions & 5 deletions
@@ -41,7 +41,7 @@ pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", tor
pipe = pipe.to("cuda")

prompt = "Spiderman is surfing"
-video_frames = pipe(prompt).frames
+video_frames = pipe(prompt).frames[0]
video_path = export_to_video(video_frames)
video_path
```
@@ -64,7 +64,7 @@ pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
-video_frames = pipe(prompt, num_frames=64).frames
+video_frames = pipe(prompt, num_frames=64).frames[0]
video_path = export_to_video(video_frames)
video_path
```
@@ -83,7 +83,7 @@ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
-video_frames = pipe(prompt, num_inference_steps=25).frames
+video_frames = pipe(prompt, num_inference_steps=25).frames[0]
video_path = export_to_video(video_frames)
video_path
```
@@ -130,7 +130,7 @@ pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
-video_frames = pipe(prompt, num_frames=24).frames
+video_frames = pipe(prompt, num_frames=24).frames[0]
video_path = export_to_video(video_frames)
video_path
```
@@ -148,7 +148,7 @@ pipe.enable_vae_slicing()
video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

-video_frames = pipe(prompt, video=video, strength=0.6).frames
+video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
video_path = export_to_video(video_frames)
video_path
```
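
The `.frames[0]` edits in the hunks above are consistent with the batch-first layout described in the updated output docstrings: `frames` holds one entry per generated video, so a single-prompt call would still yield a batch of length 1. Below is a minimal sketch of that indexing. It uses plain nested lists with placeholder strings instead of real pipeline output, and `make_fake_frames` is a hypothetical helper that is not part of diffusers.

```python
from typing import List


def make_fake_frames(batch_size: int, num_frames: int) -> List[List[str]]:
    """Build a stand-in for a batched `frames` output (assumption: batch-first).

    The outer list has one entry per video; each inner list has one
    placeholder string per frame.
    """
    return [
        [f"video{b}_frame{t}" for t in range(num_frames)]
        for b in range(batch_size)
    ]


# A single-prompt call still produces a batch, here of length 1.
frames = make_fake_frames(batch_size=1, num_frames=16)

# Indexing with [0] selects the first (and only) video in the batch,
# which is the flat frame sequence a video exporter would expect.
first_video = frames[0]
```

With this layout, passing `frames` directly to an exporter would hand it a list of videos rather than a list of frames, which is presumably why the docs now index the batch first.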

src/diffusers/pipelines/animatediff/pipeline_output.py

Lines changed: 7 additions & 6 deletions
@@ -11,12 +11,13 @@
@dataclass
class AnimateDiffPipelineOutput(BaseOutput):
r"""
-Output class for AnimateDiff pipelines.
+Output class for AnimateDiff pipelines.

-Args:
-frames (`List[List[PIL.Image.Image]]` or `torch.Tensor` or `np.ndarray`):
-List of PIL Images of length `batch_size` or torch.Tensor or np.ndarray of shape
-`(batch_size, num_frames, height, width, num_channels)`.
+Args:
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
+List of video outputs - It can be a nested list of length `batch_size,` with each sub-list containing denoised
+PIL image sequences of length `num_frames.` It can also be a NumPy array or Torch tensor of shape
+`(batch_size, num_frames, channels, height, width)`
"""

-frames: Union[List[List[PIL.Image.Image]], torch.Tensor, np.ndarray]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]
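
The nested-list case in the docstring above (outer length `batch_size`, inner length `num_frames`) can be illustrated with a small standalone sketch. `SketchVideoOutput` and `nested_layout` are hypothetical names introduced only for illustration. They are not the diffusers classes, and the strings stand in for PIL images.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class SketchVideoOutput:
    """Hypothetical stand-in for a video pipeline output (not the diffusers class)."""

    # Outer list: one entry per video (batch_size).
    # Inner list: one entry per frame (num_frames).
    frames: List[List[Any]]


def nested_layout(frames: List[List[Any]]) -> Tuple[int, int]:
    """Return (batch_size, num_frames), checking all videos have equal length."""
    if not frames:
        return (0, 0)
    lengths = {len(video) for video in frames}
    if len(lengths) != 1:
        raise ValueError(f"videos have differing frame counts: {sorted(lengths)}")
    return (len(frames), lengths.pop())


# Two videos of three frames each; strings are placeholders for images.
output = SketchVideoOutput(frames=[["f0", "f1", "f2"], ["g0", "g1", "g2"]])
layout = nested_layout(output.frames)
```

Here `layout` is `(2, 3)`, matching the `(batch_size, num_frames)` leading dimensions that the docstring also gives for the array and tensor forms.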

src/diffusers/pipelines/i2vgen_xl/pipeline_i2vgen_xl.py

Lines changed: 8 additions & 6 deletions
@@ -46,6 +46,7 @@
```py
>>> import torch
>>> from diffusers import I2VGenXLPipeline
+>>> from diffusers.utils import export_to_gif, load_image

>>> pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
>>> pipeline.enable_model_cpu_offload()
@@ -95,15 +96,16 @@ def tensor2vid(video: torch.Tensor, processor: "VaeImageProcessor", output_type:
@dataclass
class I2VGenXLPipelineOutput(BaseOutput):
r"""
-Output class for image-to-video pipeline.
+Output class for image-to-video pipeline.

-Args:
-frames (`List[np.ndarray]` or `torch.FloatTensor`)
-List of denoised frames (essentially images) as NumPy arrays of shape `(height, width, num_channels)` or as
-a `torch` tensor. The length of the list denotes the video length (the number of frames).
+Args:
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
+List of video outputs - It can be a nested list of length `batch_size,` with each sub-list containing denoised
+PIL image sequences of length `num_frames.` It can also be a NumPy array or Torch tensor of shape
+`(batch_size, num_frames, channels, height, width)`
"""

-frames: Union[List[np.ndarray], torch.FloatTensor]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]


class I2VGenXLPipeline(DiffusionPipeline):

src/diffusers/pipelines/pia/pipeline_pia.py

Lines changed: 2 additions & 2 deletions
@@ -200,13 +200,13 @@ class PIAPipelineOutput(BaseOutput):
Output class for PIAPipeline.

Args:
-frames (`torch.Tensor`, `np.ndarray`, or List[PIL.Image.Image]):
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
Nested list of length `batch_size` with denoised PIL image sequences of length `num_frames`,
NumPy array of shape `(batch_size, num_frames, channels, height, width,
Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
"""

-frames: Union[torch.Tensor, np.ndarray, PIL.Image.Image]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]


class PIAPipeline(DiffusionPipeline, TextualInversionLoaderMixin, IPAdapterMixin, LoraLoaderMixin):

src/diffusers/pipelines/text_to_video_synthesis/pipeline_output.py

Lines changed: 8 additions & 6 deletions
@@ -2,6 +2,7 @@
from typing import List, Union

import numpy as np
+import PIL
import torch

from ...utils import (
@@ -12,12 +13,13 @@
@dataclass
class TextToVideoSDPipelineOutput(BaseOutput):
"""
-Output class for text-to-video pipelines.
+Output class for text-to-video pipelines.

-Args:
-frames (`List[np.ndarray]` or `torch.FloatTensor`)
-List of denoised frames (essentially images) as NumPy arrays of shape `(height, width, num_channels)` or as
-a `torch` tensor. The length of the list denotes the video length (the number of frames).
+Args:
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
+List of video outputs - It can be a nested list of length `batch_size,` with each sub-list containing denoised
+PIL image sequences of length `num_frames.` It can also be a NumPy array or Torch tensor of shape
+`(batch_size, num_frames, channels, height, width)`
"""

-frames: Union[List[np.ndarray], torch.FloatTensor]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]
