Frames and channels are swapped in UNet3DConditionModel #6657

@MK-2012

Describe the bug

The docstring of this model's forward method states that sample should have the shape (batch, num_frames, channel, height, width). However, before any permutation is applied, the forward code contains the line num_frames = sample.shape[2], which reads the frame count from dim=2. These two statements contradict each other. The model works when frames are at dim=2 and channels at dim=1, but that layout contradicts the documentation.
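The mismatch can be illustrated with plain tensors, without instantiating the model. The following is a minimal sketch, assuming only the two layouts described above; the small height and width and the variable names `documented` and `expected` are chosen here for illustration and do not come from the library.

```python
import torch

# Illustrative sizes (assumed, not taken from the library).
batch, num_frames, channels, height, width = 1, 75, 3, 24, 32

# Layout stated in the docstring: (batch, num_frames, channels, height, width).
documented = torch.randn(batch, num_frames, channels, height, width)

# Layout under which `sample.shape[2]` is the frame count:
# (batch, channels, num_frames, height, width).
expected = documented.permute(0, 2, 1, 3, 4)

# With the documented layout, dim=2 holds the channels, not the frames.
print(documented.shape[2])  # 3
# After swapping dims 1 and 2, dim=2 holds the frames.
print(expected.shape[2])  # 75
```

If the code path rather than the docstring reflects the intended behavior, swapping dims 1 and 2 as above before calling the model would be one possible workaround.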

Reproduction

import torch
from diffusers import UNet3DConditionModel

# Minimal single-block model used to reproduce the issue.
model = UNet3DConditionModel(
    sample_size=(240, 320),
    in_channels=3,
    out_channels=3,
    layers_per_block=2,
    block_out_channels=(12,),
    norm_num_groups=2,
    down_block_types=("DownBlock3D",),
    up_block_types=("UpBlock3D",),
    cross_attention_dim=24,
    attention_head_dim=8,
)

# Input laid out as the docstring describes:
# (batch, num_frames, channel, height, width).
model(
    sample=torch.randn(1, 75, 3, 240, 320),
    timestep=500,
    encoder_hidden_states=torch.ones(1, 75, 24) * 3.0,
)

Logs

No response

System Info

  • diffusers version: 0.25.1
  • Platform: Linux-6.7.0-0-MANJARO-x86_64-with-glibc2.38
  • Python version: 3.11.6
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Huggingface_hub version: 0.20.2
  • Transformers version: 4.36.1
  • Accelerate version: 0.25.0
  • xFormers version: not installed
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Who can help?

@DN6

Labels

bug
