Frames and channels are swapped in UNet3DConditionModel

### Describe the bug

The docstring for this model's `forward` states that `sample` should have the shape `(batch, num_frames, channel, height, width)`. However, before any permutation is applied, `forward` contains the line `num_frames = sample.shape[2]`, which reads the frame count from the dimension the docstring assigns to channels. These two statements contradict each other. The model works when frames are at dim=2 and channels at dim=1, but that layout contradicts the documentation.

### Reproduction

```
import torch
from diffusers import UNet3DConditionModel

model = UNet3DConditionModel(
    sample_size=(240, 320),
    in_channels=3,
    out_channels=3,
    layers_per_block=2,
    block_out_channels=(12,),
    norm_num_groups=2,
    down_block_types=("DownBlock3D",),
    up_block_types=("UpBlock3D",),
    cross_attention_dim=24,
    attention_head_dim=8,
)

# `sample` follows the layout given in the docstring:
# (batch, num_frames, channel, height, width)
model.forward(
    sample=torch.randn(1, 75, 3, 240, 320),
    timestep=500,
    encoder_hidden_states=torch.ones(1, 75, 24) * 3.0,
)
```
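
To make the mismatch concrete without running the model, here is a minimal sketch that works on plain shape tuples. The helpers `frames_read_by_forward` and `permute_shape` are hypothetical and not part of `diffusers`; the first only mimics the `num_frames = sample.shape[2]` line quoted in the description.

```python
# Hypothetical helper: mimics only the `num_frames = sample.shape[2]`
# line from the description, not the real forward pass.
def frames_read_by_forward(shape):
    return shape[2]


# Hypothetical helper: reorders a shape tuple the way a tensor
# permutation with the same axis order would.
def permute_shape(shape, order):
    return tuple(shape[i] for i in order)


# Layout stated in the docstring: batch, num_frames, channel, height, width
documented = (1, 75, 3, 240, 320)
# Layout that works in practice: batch, channel, num_frames, height, width
working = (1, 3, 75, 240, 320)

# With the documented layout, index 2 holds the channel count (3),
# so it would be taken as the number of frames.
assert frames_read_by_forward(documented) == 3

# With the working layout, index 2 holds the real frame count (75).
assert frames_read_by_forward(working) == 75

# Swapping axes 1 and 2 converts the documented layout to the working one.
assert permute_shape(documented, (0, 2, 1, 3, 4)) == working
```

On a real tensor, the same swap would presumably be `sample.permute(0, 2, 1, 3, 4)`, applied before calling the model. I have not checked whether the intended fix is to change the docstring or the code.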

### Logs

_No response_

### System Info

- `diffusers` version: 0.25.1
- Platform: Linux-6.7.0-0-MANJARO-x86_64-with-glibc2.38
- Python version: 3.11.6
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- Huggingface_hub version: 0.20.2
- Transformers version: 4.36.1
- Accelerate version: 0.25.0
- xFormers version: not installed
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False

### Who can help?

@DN6
