diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml index b865f6c33d51..a0517725284e 100644 --- a/.github/ISSUE_TEMPLATE/bug-report.yml +++ b/.github/ISSUE_TEMPLATE/bug-report.yml @@ -57,50 +57,54 @@ body: description: | Your issue will be replied to more quickly if you can figure out the right person to tag with @. If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**. - + All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and a core maintainer will ping the right person. - + Please tag a maximum of 2 people. - Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...): + Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...): @sayakpaul @DN6 Questions on pipelines: - - Stable Diffusion @yiyixuxu @DN6 @sayakpaul - - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 - - Kandinsky @yiyixuxu - - ControlNet @sayakpaul @yiyixuxu @DN6 - - T2I Adapter @sayakpaul @yiyixuxu @DN6 - - IF @DN6 - - Text-to-Video / Video-to-Video @DN6 @sayakpaul - - Wuerstchen @DN6 + - Stable Diffusion @yiyixuxu @asomoza + - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 + - Stable Diffusion 3: @yiyixuxu @sayakpaul @DN6 @asomoza + - Kandinsky @yiyixuxu + - ControlNet @sayakpaul @yiyixuxu @DN6 + - T2I Adapter @sayakpaul @yiyixuxu @DN6 + - IF @DN6 + - Text-to-Video / Video-to-Video @DN6 @a-r-r-o-w + - Wuerstchen @DN6 - Other: @yiyixuxu @DN6 + - Improving generation quality: @asomoza Questions on models: - - UNet @DN6 @yiyixuxu @sayakpaul - - VAE @sayakpaul @DN6 @yiyixuxu - - Transformers/Attention @DN6 @yiyixuxu @sayakpaul @DN6 + - UNet @DN6 @yiyixuxu @sayakpaul + - VAE @sayakpaul @DN6 @yiyixuxu + - Transformers/Attention @DN6 @yiyixuxu @sayakpaul + + Questions on single file checkpoints: @DN6 - Questions on Schedulers: @yiyixuxu + Questions on Schedulers: @yiyixuxu - Questions on LoRA: @sayakpaul + Questions on LoRA: @sayakpaul - Questions on Textual Inversion: @sayakpaul + Questions on Textual Inversion: @sayakpaul - Questions on Training: - - DreamBooth @sayakpaul - - Text-to-Image Fine-tuning @sayakpaul - - Textual Inversion @sayakpaul - - ControlNet @sayakpaul + Questions on Training: + - DreamBooth @sayakpaul + - Text-to-Image Fine-tuning @sayakpaul + - Textual Inversion @sayakpaul + - ControlNet @sayakpaul - Questions on Tests: @DN6 @sayakpaul @yiyixuxu + Questions on Tests: @DN6 @sayakpaul @yiyixuxu Questions on Documentation: @stevhliu Questions on JAX- and MPS-related things: @pcuenca - Questions on audio pipelines: @DN6 - + Questions on audio pipelines: @sanchit-gandhi + + - placeholder: "@Username ..." diff --git a/.github/ISSUE_TEMPLATE/remote-vae-pilot-feedback.yml b/.github/ISSUE_TEMPLATE/remote-vae-pilot-feedback.yml new file mode 100644 index 000000000000..c94d3bed9738 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/remote-vae-pilot-feedback.yml @@ -0,0 +1,38 @@ +name: "\U0001F31F Remote VAE" +description: Feedback for remote VAE pilot +labels: [ "Remote VAE" ] + +body: + - type: textarea + id: positive + validations: + required: true + attributes: + label: Did you like the remote VAE solution? + description: | + If you liked it, we would appreciate it if you could elaborate what you liked. + + - type: textarea + id: feedback + validations: + required: true + attributes: + label: What can be improved about the current solution? + description: | + Let us know the things you would like to see improved. 
Note that we will work on optimizing the solution once the pilot is over and we have usage data.
+
+  - type: textarea
+    id: others
+    validations:
+      required: true
+    attributes:
+      label: What other VAEs would you like to see if the pilot goes well?
+      description: |
+        Provide a list of the VAEs you would like to see in the future if the pilot goes well.
+
+  - type: textarea
+    id: additional-info
+    attributes:
+      label: Notify the members of the team
+      description: |
+        Tag the following folks when submitting this feedback: @hlky @sayakpaul
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index a0337eaaaac5..e4b2b45a4ecd 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -38,9 +38,9 @@ members/contributors who may be interested in your PR.
 
 Core library:
 
-- Schedulers: @yiyixuxu
-- Pipelines: @sayakpaul @yiyixuxu @DN6
-- Training examples: @sayakpaul
+- Schedulers: @yiyixuxu
+- Pipelines and pipeline callbacks: @yiyixuxu and @asomoza
+- Training examples: @sayakpaul
 - Docs: @stevhliu and @sayakpaul
 - JAX and MPS: @pcuenca
 - Audio: @sanchit-gandhi
@@ -48,7 +48,8 @@ Core library:
 
 Integrations:
 
-- deepspeed: HF Trainer/Accelerate: @pacman100
+- deepspeed: HF Trainer/Accelerate: @SunMarc
+- PEFT: @sayakpaul @BenjaminBossan
 
 HF projects:
 
diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 718c67731bd5..cc97e043c139 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -7,20 +7,25 @@ on:
 
 env:
   DIFFUSERS_IS_CI: yes
+  HF_HUB_ENABLE_HF_TRANSFER: 1
   HF_HOME: /mnt/cache
   OMP_NUM_THREADS: 8
   MKL_NUM_THREADS: 8
+  BASE_PATH: benchmark_outputs
 
 jobs:
-  torch_pipelines_cuda_benchmark_tests:
-    name: Torch Core Pipelines CUDA Benchmarking Tests
+  torch_models_cuda_benchmark_tests:
+    env:
+      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_BENCHMARK }}
+    name: Torch Core Models CUDA Benchmarking Tests
     strategy:
       fail-fast: false
       max-parallel: 1
-    runs-on: [single-gpu, nvidia-gpu, a10, ci]
+    runs-on:
+      group: aws-g6e-4xlarge
     container:
       image: diffusers/diffusers-pytorch-cuda
-      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
+      options: --shm-size "16gb" --ipc host --gpus all
     steps:
       - name: Checkout diffusers
         uses: actions/checkout@v3
@@ -31,23 +36,54 @@ jobs:
           nvidia-smi
       - name: Install dependencies
         run: |
+          apt update
+          apt install -y libpq-dev postgresql-client
           python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
           python -m uv pip install -e [quality,test]
-          python -m uv pip install pandas peft
+          python -m uv pip install -r benchmarks/requirements.txt
       - name: Environment
         run: |
           python utils/print_env.py
       - name: Diffusers Benchmarking
         env:
-          HUGGING_FACE_HUB_TOKEN: ${{ secrets.DIFFUSERS_BOT_TOKEN }}
-          BASE_PATH: benchmark_outputs
+          HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
         run: |
-          export TOTAL_GPU_MEMORY=$(python -c "import torch; print(torch.cuda.get_device_properties(0).total_memory / (1024**3))")
-          cd benchmarks && mkdir ${BASE_PATH} && python run_all.py && python push_results.py
+          cd benchmarks && python run_all.py
+
+      - name: Push results to the Hub
+        env:
+          HF_TOKEN: ${{ secrets.DIFFUSERS_BOT_TOKEN }}
+        run: |
+          cd benchmarks && python push_results.py
+          mkdir $BASE_PATH && cp *.csv $BASE_PATH
       - name: Test suite reports artifacts
         if: ${{ always() }}
-        uses: actions/upload-artifact@v2
+        uses: actions/upload-artifact@v4
         with:
           name: benchmark_test_reports
-          path: benchmarks/benchmark_outputs
\ No newline at end of file
+          path: benchmarks/${{
env.BASE_PATH }} + + # TODO: enable this once the connection problem has been resolved. + - name: Update benchmarking results to DB + env: + PGDATABASE: metrics + PGHOST: ${{ secrets.DIFFUSERS_BENCHMARKS_PGHOST }} + PGUSER: transformers_benchmarks + PGPASSWORD: ${{ secrets.DIFFUSERS_BENCHMARKS_PGPASSWORD }} + BRANCH_NAME: ${{ github.head_ref || github.ref_name }} + run: | + git config --global --add safe.directory /__w/diffusers/diffusers + commit_id=$GITHUB_SHA + commit_msg=$(git show -s --format=%s "$commit_id" | cut -c1-70) + cd benchmarks && python populate_into_db.py "$BRANCH_NAME" "$commit_id" "$commit_msg" + + - name: Report success status + if: ${{ success() }} + run: | + pip install requests && python utils/notify_benchmarking_status.py --status=success + + - name: Report failure status + if: ${{ failure() }} + run: | + pip install requests && python utils/notify_benchmarking_status.py --status=failure \ No newline at end of file diff --git a/.github/workflows/build_docker_images.yml b/.github/workflows/build_docker_images.yml index 82ef885b240e..583853c6d649 100644 --- a/.github/workflows/build_docker_images.yml +++ b/.github/workflows/build_docker_images.yml @@ -20,26 +20,34 @@ env: jobs: test-build-docker-images: - runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ] + runs-on: + group: aws-general-8-plus if: github.event_name == 'pull_request' steps: - name: Set up Docker Buildx uses: docker/setup-buildx-action@v1 - + - name: Check out code uses: actions/checkout@v3 - + - name: Find Changed Dockerfiles id: file_changes uses: jitterbit/get-changed-files@v1 with: - format: 'space-delimited' + format: "space-delimited" token: ${{ secrets.GITHUB_TOKEN }} - + - name: Build Changed Docker Images + env: + CHANGED_FILES: ${{ steps.file_changes.outputs.all }} run: | - CHANGED_FILES="${{ steps.file_changes.outputs.all }}" - for FILE in $CHANGED_FILES; do + echo "$CHANGED_FILES" + for FILE in $CHANGED_FILES; do + # skip anything that isn't still on disk + if [[ ! -f "$FILE" ]]; then + echo "Skipping removed file $FILE" + continue + fi if [[ "$FILE" == docker/*Dockerfile ]]; then DOCKER_PATH="${FILE%/Dockerfile}" DOCKER_TAG=$(basename "$DOCKER_PATH") @@ -50,9 +58,10 @@ jobs: if: steps.file_changes.outputs.all != '' build-and-push-docker-images: - runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ] + runs-on: + group: aws-general-8-plus if: github.event_name != 'pull_request' - + permissions: contents: read packages: write @@ -63,12 +72,10 @@ jobs: image-name: - diffusers-pytorch-cpu - diffusers-pytorch-cuda - - diffusers-pytorch-compile-cuda + - diffusers-pytorch-cuda - diffusers-pytorch-xformers-cuda - - diffusers-flax-cpu - - diffusers-flax-tpu - - diffusers-onnxruntime-cpu - - diffusers-onnxruntime-cuda + - diffusers-pytorch-minimum-cuda + - diffusers-doc-builder steps: - name: Checkout repository @@ -90,24 +97,11 @@ jobs: - name: Post to a Slack channel id: slack - uses: slackapi/slack-github-action@6c661ce58804a1a20f6dc5fbee7f0381b469e001 + uses: huggingface/hf-workflows/.github/actions/post-slack@main with: # Slack channel id, channel name, or user id to post message. 
# See also: https://api.slack.com/methods/chat.postMessage#channels - channel-id: ${{ env.CI_SLACK_CHANNEL }} - # For posting a rich message using Block Kit - payload: | - { - "text": "${{ matrix.image-name }} Docker Image build result: ${{ job.status }}\n${{ github.event.head_commit.url }}", - "blocks": [ - { - "type": "section", - "text": { - "type": "mrkdwn", - "text": "${{ matrix.image-name }} Docker Image build result: ${{ job.status }}\n${{ github.event.head_commit.url }}" - } - } - ] - } - env: - SLACK_BOT_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} + slack_channel: ${{ env.CI_SLACK_CHANNEL }} + title: "🤗 Results of the ${{ matrix.image-name }} Docker Image build" + status: ${{ job.status }} + slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index d9054928ed8d..6d4193e3cccc 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -21,7 +21,7 @@ jobs: package: diffusers notebook_folder: diffusers_doc languages: en ko zh ja pt - + custom_container: diffusers/diffusers-doc-builder secrets: token: ${{ secrets.HUGGINGFACE_PUSH }} hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index 8e19d8fafbe3..52e075733163 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -20,3 +20,4 @@ jobs: install_libgl1: true package: diffusers languages: en ko zh ja pt + custom_container: diffusers/diffusers-doc-builder diff --git a/.github/workflows/mirror_community_pipeline.yml b/.github/workflows/mirror_community_pipeline.yml new file mode 100644 index 000000000000..9cf573312b34 --- /dev/null +++ b/.github/workflows/mirror_community_pipeline.yml @@ -0,0 +1,102 @@ +name: Mirror Community Pipeline + +on: + # Push changes on the main branch + push: + branches: + - main + paths: + - 'examples/community/**.py' + + # And on tag creation (e.g. `v0.28.1`) + tags: + - '*' + + # Manual trigger with ref input + workflow_dispatch: + inputs: + ref: + description: "Either 'main' or a tag ref" + required: true + default: 'main' + +jobs: + mirror_community_pipeline: + env: + SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_COMMUNITY_MIRROR }} + + runs-on: ubuntu-22.04 + steps: + # Checkout to correct ref + # If workflow dispatch + # If ref is 'main', set: + # CHECKOUT_REF=refs/heads/main + # PATH_IN_REPO=main + # Else it must be a tag. Set: + # CHECKOUT_REF=refs/tags/{tag} + # PATH_IN_REPO={tag} + # If not workflow dispatch + # If ref is 'refs/heads/main' => set 'main' + # Else it must be a tag => set {tag} + - name: Set checkout_ref and path_in_repo + run: | + if [ "${{ github.event_name }}" == "workflow_dispatch" ]; then + if [ -z "${{ github.event.inputs.ref }}" ]; then + echo "Error: Missing ref input" + exit 1 + elif [ "${{ github.event.inputs.ref }}" == "main" ]; then + echo "CHECKOUT_REF=refs/heads/main" >> $GITHUB_ENV + echo "PATH_IN_REPO=main" >> $GITHUB_ENV + else + echo "CHECKOUT_REF=refs/tags/${{ github.event.inputs.ref }}" >> $GITHUB_ENV + echo "PATH_IN_REPO=${{ github.event.inputs.ref }}" >> $GITHUB_ENV + fi + elif [ "${{ github.ref }}" == "refs/heads/main" ]; then + echo "CHECKOUT_REF=${{ github.ref }}" >> $GITHUB_ENV + echo "PATH_IN_REPO=main" >> $GITHUB_ENV + else + # e.g. 
refs/tags/v0.28.1 -> v0.28.1 + echo "CHECKOUT_REF=${{ github.ref }}" >> $GITHUB_ENV + echo "PATH_IN_REPO=$(echo ${{ github.ref }} | sed 's/^refs\/tags\///')" >> $GITHUB_ENV + fi + - name: Print env vars + run: | + echo "CHECKOUT_REF: ${{ env.CHECKOUT_REF }}" + echo "PATH_IN_REPO: ${{ env.PATH_IN_REPO }}" + - uses: actions/checkout@v3 + with: + ref: ${{ env.CHECKOUT_REF }} + + # Setup + install dependencies + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: "3.10" + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install --upgrade huggingface_hub + + # Check secret is set + - name: whoami + run: hf auth whoami + env: + HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }} + + # Push to HF! (under subfolder based on checkout ref) + # https://huggingface.co/datasets/diffusers/community-pipelines-mirror + - name: Mirror community pipeline to HF + run: hf upload diffusers/community-pipelines-mirror ./examples/community ${PATH_IN_REPO} --repo-type dataset + env: + PATH_IN_REPO: ${{ env.PATH_IN_REPO }} + HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }} + + - name: Report success status + if: ${{ success() }} + run: | + pip install requests && python utils/notify_community_pipelines_mirror.py --status=success + + - name: Report failure status + if: ${{ failure() }} + run: | + pip install requests && python utils/notify_community_pipelines_mirror.py --status=failure \ No newline at end of file diff --git a/.github/workflows/nightly_tests.yml b/.github/workflows/nightly_tests.yml index 2f73c66de829..479e5503eed2 100644 --- a/.github/workflows/nightly_tests.yml +++ b/.github/workflows/nightly_tests.yml @@ -7,19 +7,23 @@ on: env: DIFFUSERS_IS_CI: yes - HF_HOME: /mnt/cache + HF_HUB_ENABLE_HF_TRANSFER: 1 OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 PYTEST_TIMEOUT: 600 RUN_SLOW: yes RUN_NIGHTLY: yes - PIPELINE_USAGE_CUTOFF: 5000 + PIPELINE_USAGE_CUTOFF: 0 SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} + CONSOLIDATED_REPORT_PATH: consolidated_test_report.md jobs: setup_torch_cuda_pipeline_matrix: - name: Setup Torch Pipelines Matrix - runs-on: ubuntu-latest + name: Setup Torch Pipelines CUDA Slow Tests Matrix + runs-on: + group: aws-general-8-plus + container: + image: diffusers/diffusers-pytorch-cpu outputs: pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} steps: @@ -27,13 +31,9 @@ jobs: uses: actions/checkout@v3 with: fetch-depth: 2 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - name: Install dependencies run: | - pip install -e . 
+ pip install -e .[test] pip install huggingface_hub - name: Fetch Pipeline Matrix id: fetch_pipeline_matrix @@ -44,22 +44,24 @@ jobs: - name: Pipeline Tests Artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: test-pipelines.json path: reports run_nightly_tests_for_torch_pipelines: - name: Torch Pipelines CUDA Nightly Tests + name: Nightly Torch Pipelines CUDA Tests needs: setup_torch_cuda_pipeline_matrix strategy: fail-fast: false + max-parallel: 8 matrix: module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} - runs-on: [single-gpu, nvidia-gpu, t4, ci] + runs-on: + group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0 + options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v3 @@ -67,61 +69,53 @@ jobs: fetch-depth: 2 - name: NVIDIA-SMI run: nvidia-smi - - name: Install dependencies run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" python -m uv pip install -e [quality,test] - python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git python -m uv pip install pytest-reportlog - - name: Environment run: | python utils/print_env.py - - - name: Nightly PyTorch CUDA checkpoint (pipelines) tests + - name: Pipeline CUDA Test env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -s -v -k "not Flax and not Onnx" \ --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ - --report-log=tests_pipeline_${{ matrix.module }}_cuda.log \ + --report-log=tests_pipeline_${{ matrix.module }}_cuda.log \ tests/pipelines/${{ matrix.module }} - - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: pipeline_${{ matrix.module }}_test_reports path: reports - - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY run_nightly_tests_for_other_torch_modules: - name: Torch Non-Pipelines CUDA Nightly Tests - runs-on: docker-gpu + name: Nightly Torch CUDA Tests + runs-on: + group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0 + options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash strategy: + fail-fast: false + max-parallel: 2 matrix: - module: [models, schedulers, others, examples] + module: [models, schedulers, lora, others, single_file, examples] steps: - name: Checkout diffusers uses: actions/checkout@v3 @@ -132,283 +126,490 @@ jobs: run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" python -m uv pip install -e [quality,test] - python -m uv pip install 
accelerate@git+https://github.com/huggingface/accelerate.git + python -m uv pip install peft@git+https://github.com/huggingface/peft.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git python -m uv pip install pytest-reportlog - - name: Environment run: python utils/print_env.py - name: Run nightly PyTorch CUDA tests for non-pipeline modules - if: ${{ matrix.module != 'examples'}} + if: ${{ matrix.module != 'examples'}} env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -s -v -k "not Flax and not Onnx" \ --make-reports=tests_torch_${{ matrix.module }}_cuda \ - --report-log=tests_torch_${{ matrix.module }}_cuda.log \ + --report-log=tests_torch_${{ matrix.module }}_cuda.log \ tests/${{ matrix.module }} - name: Run nightly example tests with Torch if: ${{ matrix.module == 'examples' }} env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | - python -m uv pip install peft@git+https://github.com/huggingface/peft.git python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -s -v --make-reports=examples_torch_cuda \ - --report-log=examples_torch_cuda.log \ + --report-log=examples_torch_cuda.log \ examples/ - name: Failure short reports if: ${{ failure() }} run: | - cat reports/tests_torch_${{ matrix.module }}_cuda_stats.txt + cat reports/tests_torch_${{ matrix.module }}_cuda_stats.txt cat reports/tests_torch_${{ matrix.module }}_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: torch_${{ matrix.module }}_cuda_test_reports path: reports - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY + run_torch_compile_tests: + name: PyTorch Compile CUDA tests + + runs-on: + group: aws-g4dn-2xlarge - run_lora_nightly_tests: - name: Nightly LoRA Tests with PEFT and TORCH - runs-on: docker-gpu container: image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0 - defaults: - run: - shell: bash + options: --gpus all --shm-size "16gb" --ipc host + steps: - name: Checkout diffusers uses: actions/checkout@v3 with: fetch-depth: 2 + - name: NVIDIA-SMI + run: | + nvidia-smi - name: Install dependencies run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install peft@git+https://github.com/huggingface/peft.git - python -m uv pip install pytest-reportlog - + python -m uv pip install -e [quality,test,training] - name: Environment - run: python utils/print_env.py - - - name: Run nightly LoRA tests with PEFT and Torch + run: | + python utils/print_env.py + - name: Run torch compile tests on GPU env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} - 
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + RUN_COMPILE: yes run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_torch_lora_cuda \ - --report-log=tests_torch_lora_cuda.log \ - tests/lora - + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/ - name: Failure short reports if: ${{ failure() }} - run: | - cat reports/tests_torch_lora_cuda_stats.txt - cat reports/tests_torch_lora_cuda_failures_short.txt + run: cat reports/tests_torch_compile_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: - name: torch_lora_cuda_test_reports + name: torch_compile_test_reports path: reports - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY - - run_flax_tpu_tests: - name: Nightly Flax TPU Tests - runs-on: docker-tpu - if: github.event_name == 'schedule' - + run_big_gpu_torch_tests: + name: Torch tests on big GPU + strategy: + fail-fast: false + max-parallel: 2 + runs-on: + group: aws-g6e-xlarge-plus container: - image: diffusers/diffusers-flax-tpu - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --privileged + image: diffusers/diffusers-pytorch-cuda + options: --shm-size "16gb" --ipc host --gpus all + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + - name: NVIDIA-SMI + run: nvidia-smi + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install peft@git+https://github.com/huggingface/peft.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git + python -m uv pip install pytest-reportlog + - name: Environment + run: | + python utils/print_env.py + - name: Selected Torch CUDA Test on big GPU + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + BIG_GPU_MEMORY: 40 + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + -m "big_accelerator" \ + --make-reports=tests_big_gpu_torch_cuda \ + --report-log=tests_big_gpu_torch_cuda.log \ + tests/ + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_big_gpu_torch_cuda_stats.txt + cat reports/tests_big_gpu_torch_cuda_failures_short.txt + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: torch_cuda_big_gpu_test_reports + path: reports + + torch_minimum_version_cuda_tests: + name: Torch Minimum Version CUDA Tests + runs-on: + group: aws-g4dn-2xlarge + container: + image: diffusers/diffusers-pytorch-minimum-cuda + options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install 
accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install pytest-reportlog + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 - - name: Environment - run: python utils/print_env.py + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install peft@git+https://github.com/huggingface/peft.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - name: Run nightly Flax TPU tests - env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} - run: | - python -m pytest -n 0 \ - -s -v -k "Flax" \ - --make-reports=tests_flax_tpu \ - --report-log=tests_flax_tpu.log \ - tests/ + - name: Environment + run: | + python utils/print_env.py - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_flax_tpu_stats.txt - cat reports/tests_flax_tpu_failures_short.txt + - name: Run PyTorch CUDA tests + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + -s -v -k "not Flax and not Onnx" \ + --make-reports=tests_torch_minimum_version_cuda \ + tests/models/test_modeling_common.py \ + tests/pipelines/test_pipelines_common.py \ + tests/pipelines/test_pipeline_utils.py \ + tests/pipelines/test_pipelines.py \ + tests/pipelines/test_pipelines_auto.py \ + tests/schedulers/test_schedulers.py \ + tests/others - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v2 - with: - name: flax_tpu_test_reports - path: reports + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_torch_minimum_version_cuda_stats.txt + cat reports/tests_torch_minimum_version_cuda_failures_short.txt - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: torch_minimum_version_cuda_test_reports + path: reports - run_nightly_onnx_tests: - name: Nightly ONNXRuntime CUDA tests on Ubuntu - runs-on: docker-gpu + run_nightly_quantization_tests: + name: Torch quantization nightly tests + strategy: + fail-fast: false + max-parallel: 2 + matrix: + config: + - backend: "bitsandbytes" + test_location: "bnb" + additional_deps: ["peft"] + - backend: "gguf" + test_location: "gguf" + additional_deps: ["peft", "kernels"] + - backend: "torchao" + test_location: "torchao" + additional_deps: [] + - backend: "optimum_quanto" + test_location: "quanto" + additional_deps: [] + - backend: "nvidia_modelopt" + test_location: "modelopt" + additional_deps: [] + runs-on: + group: aws-g6e-xlarge-plus container: - image: diffusers/diffusers-onnxruntime-cuda - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: nvidia-smi - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install 
accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install pytest-reportlog - - - name: Environment - run: python utils/print_env.py - - - name: Run nightly ONNXRuntime CUDA tests - env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Onnx" \ - --make-reports=tests_onnx_cuda \ - --report-log=tests_onnx_cuda.log \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_onnx_cuda_stats.txt - cat reports/tests_onnx_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v2 - with: - name: ${{ matrix.config.report }}_test_reports - path: reports - - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY - - run_nightly_tests_apple_m1: - name: Nightly PyTorch MPS tests on MacOS - runs-on: [ self-hosted, apple-m1 ] - if: github.event_name == 'schedule' - + image: diffusers/diffusers-pytorch-cuda + options: --shm-size "20gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v3 with: fetch-depth: 2 - - - name: Clean checkout - shell: arch -arch arm64 bash {0} + - name: NVIDIA-SMI + run: nvidia-smi + - name: Install dependencies run: | - git clean -fxd - - - name: Setup miniconda - uses: ./.github/actions/setup-miniconda + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install -U ${{ matrix.config.backend }} + if [ "${{ join(matrix.config.additional_deps, ' ') }}" != "" ]; then + python -m uv pip install ${{ join(matrix.config.additional_deps, ' ') }} + fi + python -m uv pip install pytest-reportlog + - name: Environment + run: | + python utils/print_env.py + - name: ${{ matrix.config.backend }} quantization tests on GPU + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + BIG_GPU_MEMORY: 40 + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + --make-reports=tests_${{ matrix.config.backend }}_torch_cuda \ + --report-log=tests_${{ matrix.config.backend }}_torch_cuda.log \ + tests/quantization/${{ matrix.config.test_location }} + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_${{ matrix.config.backend }}_torch_cuda_stats.txt + cat reports/tests_${{ matrix.config.backend }}_torch_cuda_failures_short.txt + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 with: - python-version: 3.9 - + name: torch_cuda_${{ matrix.config.backend }}_reports + path: reports + + run_nightly_pipeline_level_quantization_tests: + name: Torch quantization nightly tests + strategy: + fail-fast: false + max-parallel: 2 + runs-on: + group: aws-g6e-xlarge-plus + container: + image: diffusers/diffusers-pytorch-cuda + options: --shm-size "20gb" --ipc host --gpus all + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + - name: NVIDIA-SMI + run: nvidia-smi - name: Install dependencies - shell: arch -arch arm64 bash {0} run: | - ${CONDA_RUN} python -m pip install --upgrade pip uv - ${CONDA_RUN} python -m uv pip install -e [quality,test] - ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio 
--extra-index-url https://download.pytorch.org/whl/cpu - ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate - ${CONDA_RUN} python -m uv pip install pytest-reportlog - + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install -U bitsandbytes optimum_quanto + python -m uv pip install pytest-reportlog - name: Environment - shell: arch -arch arm64 bash {0} run: | - ${CONDA_RUN} python utils/print_env.py - - - name: Run nightly PyTorch tests on M1 (MPS) - shell: arch -arch arm64 bash {0} + python utils/print_env.py + - name: Pipeline-level quantization tests on GPU env: - HF_HOME: /System/Volumes/Data/mnt/cache - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + BIG_GPU_MEMORY: 40 run: | - ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \ - --report-log=tests_torch_mps.log \ - tests/ - + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + --make-reports=tests_pipeline_level_quant_torch_cuda \ + --report-log=tests_pipeline_level_quant_torch_cuda.log \ + tests/quantization/test_pipeline_level_quantization.py - name: Failure short reports if: ${{ failure() }} - run: cat reports/tests_torch_mps_failures_short.txt - + run: | + cat reports/tests_pipeline_level_quant_torch_cuda_stats.txt + cat reports/tests_pipeline_level_quant_torch_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: - name: torch_mps_test_reports + name: torch_cuda_pipeline_level_quant_reports path: reports - - name: Generate Report and Notify Channel - if: always() + generate_consolidated_report: + name: Generate Consolidated Test Report + needs: [ + run_nightly_tests_for_torch_pipelines, + run_nightly_tests_for_other_torch_modules, + run_torch_compile_tests, + run_big_gpu_torch_tests, + run_nightly_quantization_tests, + run_nightly_pipeline_level_quantization_tests, + # run_nightly_onnx_tests, + torch_minimum_version_cuda_tests, + # run_flax_tpu_tests + ] + if: always() + runs-on: + group: aws-general-8-plus + container: + image: diffusers/diffusers-pytorch-cpu + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Create reports directory + run: mkdir -p combined_reports + + - name: Download all test reports + uses: actions/download-artifact@v4 + with: + path: artifacts + + - name: Prepare reports + run: | + # Move all report files to a single directory for processing + find artifacts -name "*.txt" -exec cp {} combined_reports/ \; + + - name: Install dependencies run: | + pip install -e .[test] pip install slack_sdk tabulate - python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY + + - name: Generate consolidated report + run: | + python utils/consolidated_test_report.py \ + --reports_dir combined_reports \ + --output_file $CONSOLIDATED_REPORT_PATH \ + --slack_channel_name diffusers-ci-nightly + + - name: Show consolidated report + run: | + cat $CONSOLIDATED_REPORT_PATH >> $GITHUB_STEP_SUMMARY + + - name: Upload consolidated report + uses: actions/upload-artifact@v4 + with: + name: consolidated_test_report + path: ${{ env.CONSOLIDATED_REPORT_PATH }} + +# M1 runner currently not well supported +# TODO: (Dhruv) add 
these back when we setup better testing for Apple Silicon
+# run_nightly_tests_apple_m1:
+#   name: Nightly PyTorch MPS tests on MacOS
+#   runs-on: [ self-hosted, apple-m1 ]
+#   if: github.event_name == 'schedule'
+#
+#   steps:
+#     - name: Checkout diffusers
+#       uses: actions/checkout@v3
+#       with:
+#         fetch-depth: 2
+#
+#     - name: Clean checkout
+#       shell: arch -arch arm64 bash {0}
+#       run: |
+#         git clean -fxd
+#     - name: Setup miniconda
+#       uses: ./.github/actions/setup-miniconda
+#       with:
+#         python-version: 3.9
+#
+#     - name: Install dependencies
+#       shell: arch -arch arm64 bash {0}
+#       run: |
+#         ${CONDA_RUN} python -m pip install --upgrade pip uv
+#         ${CONDA_RUN} python -m uv pip install -e [quality,test]
+#         ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
+#         ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate
+#         ${CONDA_RUN} python -m uv pip install pytest-reportlog
+#     - name: Environment
+#       shell: arch -arch arm64 bash {0}
+#       run: |
+#         ${CONDA_RUN} python utils/print_env.py
+#     - name: Run nightly PyTorch tests on M1 (MPS)
+#       shell: arch -arch arm64 bash {0}
+#       env:
+#         HF_HOME: /System/Volumes/Data/mnt/cache
+#         HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+#       run: |
+#         ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
+#           --report-log=tests_torch_mps.log \
+#           tests/
+#     - name: Failure short reports
+#       if: ${{ failure() }}
+#       run: cat reports/tests_torch_mps_failures_short.txt
+#
+#     - name: Test suite reports artifacts
+#       if: ${{ always() }}
+#       uses: actions/upload-artifact@v4
+#       with:
+#         name: torch_mps_test_reports
+#         path: reports
+#
+#     - name: Generate Report and Notify Channel
+#       if: always()
+#       run: |
+#         pip install slack_sdk tabulate
+#         python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
diff --git a/.github/workflows/notify_slack_about_release.yml b/.github/workflows/notify_slack_about_release.yml
index 95f2d0f917af..612ad4e24503 100644
--- a/.github/workflows/notify_slack_about_release.yml
+++ b/.github/workflows/notify_slack_about_release.yml
@@ -7,16 +7,16 @@ on:
 
 jobs:
   build:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
 
     steps:
       - uses: actions/checkout@v3
-
+
       - name: Setup Python
         uses: actions/setup-python@v4
         with:
           python-version: '3.8'
-
+
       - name: Notify Slack about the release
         env:
           SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
diff --git a/.github/workflows/pr_dependency_test.yml b/.github/workflows/pr_dependency_test.yml
index f21f09ef875e..d9350c09ac42 100644
--- a/.github/workflows/pr_dependency_test.yml
+++ b/.github/workflows/pr_dependency_test.yml
@@ -16,7 +16,7 @@ concurrency:
 
 jobs:
   check_dependencies:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python
@@ -33,4 +33,3 @@ jobs:
         run: |
           python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
           pytest tests/others/test_dependencies.py
-
\ No newline at end of file
diff --git a/.github/workflows/pr_flax_dependency_test.yml b/.github/workflows/pr_flax_dependency_test.yml
deleted file mode 100644
index bbad72929917..000000000000
--- a/.github/workflows/pr_flax_dependency_test.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-name: Run Flax dependency tests
-
-on:
-  pull_request:
-    branches:
-      - main
-    paths:
-      - "src/diffusers/**.py"
-  push:
-    branches:
-      - main
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
-  cancel-in-progress: true
-
-jobs:
-  check_flax_dependencies:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: "3.8"
-      - name: Install dependencies
-        run: |
-          python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
-          python -m pip install --upgrade pip uv
-          python -m uv pip install -e .
- python -m uv pip install "jax[cpu]>=0.2.16,!=0.3.2" - python -m uv pip install "flax>=0.4.1" - python -m uv pip install "jaxlib>=0.1.65" - python -m uv pip install pytest - - name: Check for soft dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - pytest tests/others/test_dependencies.py diff --git a/.github/workflows/pr_test_peft_backend.yml b/.github/workflows/pr_modular_tests.yml similarity index 53% rename from .github/workflows/pr_test_peft_backend.yml rename to .github/workflows/pr_modular_tests.yml index ca91535f0274..75258771e4dc 100644 --- a/.github/workflows/pr_test_peft_backend.yml +++ b/.github/workflows/pr_modular_tests.yml @@ -1,12 +1,24 @@ -name: Fast tests for PRs - PEFT backend +name: Fast PR tests for Modular on: pull_request: - branches: - - main + branches: [main] paths: - - "src/diffusers/**.py" - - "tests/**.py" + - "src/diffusers/modular_pipelines/**.py" + - "src/diffusers/models/modeling_utils.py" + - "src/diffusers/models/model_loading_utils.py" + - "src/diffusers/pipelines/pipeline_utils.py" + - "src/diffusers/pipeline_loading_utils.py" + - "src/diffusers/loaders/lora_base.py" + - "src/diffusers/loaders/lora_pipeline.py" + - "src/diffusers/loaders/peft.py" + - "tests/modular_pipelines/**.py" + - ".github/**.yml" + - "utils/**.py" + - "setup.py" + push: + branches: + - ci-* concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} @@ -14,19 +26,20 @@ concurrency: env: DIFFUSERS_IS_CI: yes + HF_HUB_ENABLE_HF_TRANSFER: 1 OMP_NUM_THREADS: 4 MKL_NUM_THREADS: 4 PYTEST_TIMEOUT: 60 jobs: check_code_quality: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v4 with: - python-version: "3.8" + python-version: "3.10" - name: Install dependencies run: | python -m pip install --upgrade pip @@ -40,13 +53,13 @@ jobs: check_repository_consistency: needs: check_code_quality - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v4 with: - python-version: "3.8" + python-version: "3.10" - name: Install dependencies run: | python -m pip install --upgrade pip @@ -55,6 +68,7 @@ jobs: run: | python utils/check_copies.py python utils/check_dummies.py + python utils/check_support_list.py make deps_table_check_updated - name: Check if failure if: ${{ failure() }} @@ -66,15 +80,20 @@ jobs: strategy: fail-fast: false matrix: - lib-versions: ["main", "latest"] + config: + - name: Fast PyTorch Modular Pipeline CPU tests + framework: pytorch_pipelines + runner: aws-highmemory-32-plus + image: diffusers/diffusers-pytorch-cpu + report: torch_cpu_modular_pipelines + name: ${{ matrix.config.name }} - name: LoRA - ${{ matrix.lib-versions }} - - runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ] + runs-on: + group: ${{ matrix.config.runner }} container: - image: diffusers/diffusers-pytorch-cpu + image: ${{ matrix.config.image }} options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: @@ -91,23 +110,32 @@ jobs: run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" python -m uv pip install -e [quality,test] - if [ "${{ matrix.lib-versions }}" == "main" ]; then - python -m uv pip install -U peft@git+https://github.com/huggingface/peft.git - python -m uv pip install -U transformers@git+https://github.com/huggingface/transformers.git - python -m uv pip install -U 
accelerate@git+https://github.com/huggingface/accelerate.git - else - python -m uv pip install -U peft transformers accelerate - fi + pip uninstall transformers -y && pip uninstall huggingface_hub -y && python -m uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps - name: Environment run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" python utils/print_env.py - - name: Run fast PyTorch LoRA CPU tests with PEFT backend + - name: Run fast PyTorch Pipeline CPU tests + if: ${{ matrix.config.framework == 'pytorch_pipelines' }} run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v \ + python -m pytest -n 8 --max-worker-restart=0 --dist=loadfile \ + -s -v -k "not Flax and not Onnx" \ --make-reports=tests_${{ matrix.config.report }} \ - tests/lora/ + tests/modular_pipelines + + - name: Failure short reports + if: ${{ failure() }} + run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: pr_${{ matrix.config.framework }}_${{ matrix.config.report }}_test_reports + path: reports + + diff --git a/.github/workflows/pr_style_bot.yml b/.github/workflows/pr_style_bot.yml new file mode 100644 index 000000000000..c60004720783 --- /dev/null +++ b/.github/workflows/pr_style_bot.yml @@ -0,0 +1,17 @@ +name: PR Style Bot + +on: + issue_comment: + types: [created] + +permissions: + contents: write + pull-requests: write + +jobs: + style: + uses: huggingface/huggingface_hub/.github/workflows/style-bot-action.yml@main + with: + python_quality_dependencies: "[quality]" + secrets: + bot_token: ${{ secrets.HF_STYLE_BOT_ACTION }} \ No newline at end of file diff --git a/.github/workflows/pr_test_fetcher.yml b/.github/workflows/pr_test_fetcher.yml index 4dbb118c6092..b032bb842786 100644 --- a/.github/workflows/pr_test_fetcher.yml +++ b/.github/workflows/pr_test_fetcher.yml @@ -15,7 +15,8 @@ concurrency: jobs: setup_pr_tests: name: Setup PR Tests - runs-on: docker-cpu + runs-on: + group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ @@ -73,7 +74,8 @@ jobs: max-parallel: 2 matrix: modules: ${{ fromJson(needs.setup_pr_tests.outputs.matrix) }} - runs-on: docker-cpu + runs-on: + group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ @@ -123,12 +125,13 @@ jobs: config: - name: Hub tests for models, schedulers, and pipelines framework: hub_tests_pytorch - runner: docker-cpu + runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_hub name: ${{ matrix.config.name }} - runs-on: ${{ matrix.config.runner }} + runs-on: + group: ${{ matrix.config.runner }} container: image: ${{ matrix.config.image }} options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ @@ -168,7 +171,7 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: pr_${{ matrix.config.report }}_test_reports path: reports diff --git a/.github/workflows/pr_tests.yml b/.github/workflows/pr_tests.yml index 
aa4afebb9cc1..1543b264b0cc 100644 --- a/.github/workflows/pr_tests.yml +++ b/.github/workflows/pr_tests.yml @@ -2,8 +2,7 @@ name: Fast tests for PRs on: pull_request: - branches: - - main + branches: [main] paths: - "src/diffusers/**.py" - "benchmarks/**.py" @@ -12,6 +11,7 @@ on: - "tests/**.py" - ".github/**.yml" - "utils/**.py" + - "setup.py" push: branches: - ci-* @@ -22,13 +22,14 @@ concurrency: env: DIFFUSERS_IS_CI: yes + HF_HUB_ENABLE_HF_TRANSFER: 1 OMP_NUM_THREADS: 4 MKL_NUM_THREADS: 4 PYTEST_TIMEOUT: 60 jobs: check_code_quality: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v3 - name: Set up Python @@ -48,7 +49,7 @@ jobs: check_repository_consistency: needs: check_code_quality - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v3 - name: Set up Python @@ -63,6 +64,7 @@ jobs: run: | python utils/check_copies.py python utils/check_dummies.py + python utils/check_support_list.py make deps_table_check_updated - name: Check if failure if: ${{ failure() }} @@ -77,28 +79,24 @@ jobs: config: - name: Fast PyTorch Pipeline CPU tests framework: pytorch_pipelines - runner: [ self-hosted, intel-cpu, 32-cpu, 256-ram, ci ] + runner: aws-highmemory-32-plus image: diffusers/diffusers-pytorch-cpu report: torch_cpu_pipelines - name: Fast PyTorch Models & Schedulers CPU tests framework: pytorch_models - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] + runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_cpu_models_schedulers - - name: Fast Flax CPU tests - framework: flax - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] - image: diffusers/diffusers-flax-cpu - report: flax_cpu - name: PyTorch Example CPU tests framework: pytorch_examples - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] + runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_example_cpu name: ${{ matrix.config.name }} - runs-on: ${{ matrix.config.runner }} + runs-on: + group: ${{ matrix.config.runner }} container: image: ${{ matrix.config.image }} @@ -118,7 +116,8 @@ jobs: run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" python -m uv pip install -e [quality,test] - python -m uv pip install accelerate + pip uninstall transformers -y && pip uninstall huggingface_hub -y && python -m uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps - name: Environment run: | @@ -143,21 +142,11 @@ jobs: --make-reports=tests_${{ matrix.config.report }} \ tests/models tests/schedulers tests/others - - name: Run fast Flax TPU tests - if: ${{ matrix.config.framework == 'flax' }} - run: | - apt-get update && apt-get install libsndfile1-dev libgl1 -y - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Flax" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests - - name: Run example PyTorch CPU tests if: ${{ matrix.config.framework == 'pytorch_examples' }} run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install peft + python -m uv pip install peft timm python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ --make-reports=tests_${{ matrix.config.report }} \ examples @@ -168,9 +157,9 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 
+ uses: actions/upload-artifact@v4 with: - name: pr_${{ matrix.config.report }}_test_reports + name: pr_${{ matrix.config.framework }}_${{ matrix.config.report }}_test_reports path: reports run_staging_tests: @@ -181,7 +170,8 @@ jobs: config: - name: Hub tests for models, schedulers, and pipelines framework: hub_tests_pytorch - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] + runner: + group: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_hub @@ -228,7 +218,72 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: pr_${{ matrix.config.report }}_test_reports path: reports + + run_lora_tests: + needs: [check_code_quality, check_repository_consistency] + strategy: + fail-fast: false + + name: LoRA tests with PEFT main + + runs-on: + group: aws-general-8-plus + + container: + image: diffusers/diffusers-pytorch-cpu + options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ + + defaults: + run: + shell: bash + + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + # TODO (sayakpaul, DN6): revisit `--no-deps` + python -m pip install -U peft@git+https://github.com/huggingface/peft.git --no-deps + python -m uv pip install -U tokenizers + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps + pip uninstall transformers -y && pip uninstall huggingface_hub -y && python -m uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git + + - name: Environment + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python utils/print_env.py + + - name: Run fast PyTorch LoRA tests with PEFT + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ + -s -v \ + --make-reports=tests_peft_main \ + tests/lora/ + python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ + -s -v \ + --make-reports=tests_models_lora_peft_main \ + tests/models/ -k "lora" + + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_peft_main_failures_short.txt + cat reports/tests_models_lora_peft_main_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: pr_main_test_reports + path: reports + diff --git a/.github/workflows/pr_tests_gpu.yml b/.github/workflows/pr_tests_gpu.yml new file mode 100644 index 000000000000..89b6abe20d1e --- /dev/null +++ b/.github/workflows/pr_tests_gpu.yml @@ -0,0 +1,297 @@ +name: Fast GPU Tests on PR + +on: + pull_request: + branches: main + paths: + - "src/diffusers/models/modeling_utils.py" + - "src/diffusers/models/model_loading_utils.py" + - "src/diffusers/pipelines/pipeline_utils.py" + - "src/diffusers/pipeline_loading_utils.py" + - "src/diffusers/loaders/lora_base.py" + - "src/diffusers/loaders/lora_pipeline.py" + - "src/diffusers/loaders/peft.py" + - "tests/pipelines/test_pipelines_common.py" + - "tests/models/test_modeling_common.py" + - "examples/**/*.py" + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} + cancel-in-progress: true + +env: + DIFFUSERS_IS_CI: yes + 
OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + HF_HUB_ENABLE_HF_TRANSFER: 1 + PYTEST_TIMEOUT: 600 + PIPELINE_USAGE_CUTOFF: 1000000000 # set high cutoff so that only always-test pipelines run + +jobs: + check_code_quality: + runs-on: ubuntu-22.04 + steps: + - uses: actions/checkout@v3 + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: "3.8" + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install .[quality] + - name: Check quality + run: make quality + - name: Check if failure + if: ${{ failure() }} + run: | + echo "Quality check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY + + check_repository_consistency: + needs: check_code_quality + runs-on: ubuntu-22.04 + steps: + - uses: actions/checkout@v3 + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: "3.8" + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install .[quality] + - name: Check repo consistency + run: | + python utils/check_copies.py + python utils/check_dummies.py + python utils/check_support_list.py + make deps_table_check_updated + - name: Check if failure + if: ${{ failure() }} + run: | + echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY + + setup_torch_cuda_pipeline_matrix: + needs: [check_code_quality, check_repository_consistency] + name: Setup Torch Pipelines CUDA Slow Tests Matrix + runs-on: + group: aws-general-8-plus + container: + image: diffusers/diffusers-pytorch-cpu + outputs: + pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + - name: Environment + run: | + python utils/print_env.py + - name: Fetch Pipeline Matrix + id: fetch_pipeline_matrix + run: | + matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) + echo $matrix + echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT + - name: Pipeline Tests Artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: test-pipelines.json + path: reports + + torch_pipelines_cuda_tests: + name: Torch Pipelines CUDA Tests + needs: setup_torch_cuda_pipeline_matrix + strategy: + fail-fast: false + max-parallel: 8 + matrix: + module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} + runs-on: + group: aws-g4dn-2xlarge + container: + image: diffusers/diffusers-pytorch-cuda + options: --shm-size "16gb" --ipc host --gpus all + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: NVIDIA-SMI + run: | + nvidia-smi + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git + pip uninstall transformers -y && pip uninstall huggingface_hub -y && python -m uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git + + - name: Environment + run: | + python 
utils/print_env.py + - name: Extract tests + id: extract_tests + run: | + pattern=$(python utils/extract_tests_from_mixin.py --type pipeline) + echo "$pattern" > /tmp/test_pattern.txt + echo "pattern_file=/tmp/test_pattern.txt" >> $GITHUB_OUTPUT + + - name: PyTorch CUDA checkpoint tests on Ubuntu + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + run: | + if [ "${{ matrix.module }}" = "ip_adapters" ]; then + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + -s -v -k "not Flax and not Onnx" \ + --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ + tests/pipelines/${{ matrix.module }} + else + pattern=$(cat ${{ steps.extract_tests.outputs.pattern_file }}) + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + -s -v -k "not Flax and not Onnx and $pattern" \ + --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ + tests/pipelines/${{ matrix.module }} + fi + + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt + cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: pipeline_${{ matrix.module }}_test_reports + path: reports + + torch_cuda_tests: + name: Torch CUDA Tests + needs: [check_code_quality, check_repository_consistency] + runs-on: + group: aws-g4dn-2xlarge + container: + image: diffusers/diffusers-pytorch-cuda + options: --shm-size "16gb" --ipc host --gpus all + defaults: + run: + shell: bash + strategy: + fail-fast: false + max-parallel: 4 + matrix: + module: [models, schedulers, lora, others] + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install peft@git+https://github.com/huggingface/peft.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git + pip uninstall transformers -y && pip uninstall huggingface_hub -y && python -m uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git + + - name: Environment + run: | + python utils/print_env.py + + - name: Extract tests + id: extract_tests + run: | + pattern=$(python utils/extract_tests_from_mixin.py --type ${{ matrix.module }}) + echo "$pattern" > /tmp/test_pattern.txt + echo "pattern_file=/tmp/test_pattern.txt" >> $GITHUB_OUTPUT + + - name: Run PyTorch CUDA tests + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + run: | + pattern=$(cat ${{ steps.extract_tests.outputs.pattern_file }}) + if [ -z "$pattern" ]; then + python -m pytest -n 1 -sv --max-worker-restart=0 --dist=loadfile -k "not Flax and not Onnx" tests/${{ matrix.module }} \ + --make-reports=tests_torch_cuda_${{ matrix.module }} + else + python -m pytest -n 1 -sv --max-worker-restart=0 --dist=loadfile -k "not Flax and not Onnx and $pattern" tests/${{ matrix.module }} \ + --make-reports=tests_torch_cuda_${{ matrix.module }} + fi + + - name: Failure short reports + if: ${{ failure() }} + run: | + cat 
reports/tests_torch_cuda_${{ matrix.module }}_stats.txt + cat reports/tests_torch_cuda_${{ matrix.module }}_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: torch_cuda_test_reports_${{ matrix.module }} + path: reports + + run_examples_tests: + name: Examples PyTorch CUDA tests on Ubuntu + needs: [check_code_quality, check_repository_consistency] + runs-on: + group: aws-g4dn-2xlarge + + container: + image: diffusers/diffusers-pytorch-cuda + options: --gpus all --shm-size "16gb" --ipc host + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: NVIDIA-SMI + run: | + nvidia-smi + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + pip uninstall transformers -y && pip uninstall huggingface_hub -y && python -m uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git + python -m uv pip install -e [quality,test,training] + + - name: Environment + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python utils/print_env.py + + - name: Run example tests on GPU + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install timm + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_cuda examples/ + + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/examples_torch_cuda_stats.txt + cat reports/examples_torch_cuda_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: examples_test_reports + path: reports + diff --git a/.github/workflows/pr_torch_dependency_test.yml b/.github/workflows/pr_torch_dependency_test.yml index 16a7724fe744..c39d5eca2d9a 100644 --- a/.github/workflows/pr_torch_dependency_test.yml +++ b/.github/workflows/pr_torch_dependency_test.yml @@ -16,7 +16,7 @@ concurrency: jobs: check_torch_dependencies: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v3 - name: Set up Python diff --git a/.github/workflows/push_tests.yml b/.github/workflows/push_tests.yml index 36f011407901..6896e0145cbb 100644 --- a/.github/workflows/push_tests.yml +++ b/.github/workflows/push_tests.yml @@ -1,6 +1,7 @@ -name: Slow Tests on main +name: Fast GPU Tests on main on: + workflow_dispatch: push: branches: - main @@ -11,17 +12,19 @@ on: env: DIFFUSERS_IS_CI: yes - HF_HOME: /mnt/cache OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 + HF_HUB_ENABLE_HF_TRANSFER: 1 PYTEST_TIMEOUT: 600 - RUN_SLOW: yes PIPELINE_USAGE_CUTOFF: 50000 jobs: setup_torch_cuda_pipeline_matrix: name: Setup Torch Pipelines CUDA Slow Tests Matrix - runs-on: ubuntu-latest + runs-on: + group: aws-general-8-plus + container: + image: diffusers/diffusers-pytorch-cpu outputs: pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} steps: @@ -29,14 +32,13 @@ jobs: uses: actions/checkout@v3 with: fetch-depth: 2 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - name: Install dependencies run: | - pip install -e . 
- pip install huggingface_hub + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + - name: Environment + run: | + python utils/print_env.py - name: Fetch Pipeline Matrix id: fetch_pipeline_matrix run: | @@ -45,22 +47,24 @@ jobs: echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - name: Pipeline Tests Artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: test-pipelines.json path: reports torch_pipelines_cuda_tests: - name: Torch Pipelines CUDA Slow Tests + name: Torch Pipelines CUDA Tests needs: setup_torch_cuda_pipeline_matrix strategy: fail-fast: false + max-parallel: 8 matrix: module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} - runs-on: [single-gpu, nvidia-gpu, t4, ci] + runs-on: + group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0 + options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v3 @@ -71,16 +75,15 @@ jobs: nvidia-smi - name: Install dependencies run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" python -m uv pip install -e [quality,test] - python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - name: Environment run: | python utils/print_env.py - - name: Slow PyTorch CUDA checkpoint tests on Ubuntu + - name: PyTorch CUDA checkpoint tests on Ubuntu env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | @@ -93,26 +96,28 @@ jobs: run: | cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: pipeline_${{ matrix.module }}_test_reports path: reports torch_cuda_tests: name: Torch CUDA Tests - runs-on: docker-gpu + runs-on: + group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0 + options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash strategy: + fail-fast: false + max-parallel: 2 matrix: - module: [models, schedulers, lora, others] + module: [models, schedulers, lora, others, single_file] steps: - name: Checkout diffusers uses: actions/checkout@v3 @@ -121,194 +126,48 @@ jobs: - name: Install dependencies run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" python -m uv pip install -e [quality,test] - python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git + python -m uv pip install peft@git+https://github.com/huggingface/peft.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - name: Environment run: | python utils/print_env.py - - name: Run slow PyTorch CUDA tests + - name: Run PyTorch CUDA tests env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ 
secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_torch_cuda \ + --make-reports=tests_torch_cuda_${{ matrix.module }} \ tests/${{ matrix.module }} - name: Failure short reports if: ${{ failure() }} run: | - cat reports/tests_torch_cuda_stats.txt - cat reports/tests_torch_cuda_failures_short.txt + cat reports/tests_torch_cuda_${{ matrix.module }}_stats.txt + cat reports/tests_torch_cuda_${{ matrix.module }}_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: - name: torch_cuda_test_reports - path: reports - - peft_cuda_tests: - name: PEFT CUDA Tests - runs-on: docker-gpu - container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0 - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install peft@git+https://github.com/huggingface/peft.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run slow PEFT CUDA tests - env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx and not PEFTLoRALoading" \ - --make-reports=tests_peft_cuda \ - tests/lora/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_peft_cuda_stats.txt - cat reports/tests_peft_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v2 - with: - name: torch_peft_test_reports - path: reports - - flax_tpu_tests: - name: Flax TPU Tests - runs-on: docker-tpu - container: - image: diffusers/diffusers-flax-tpu - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --privileged - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run slow Flax TPU tests - env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} - run: | - python -m pytest -n 0 \ - -s -v -k "Flax" \ - --make-reports=tests_flax_tpu \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_flax_tpu_stats.txt - cat reports/tests_flax_tpu_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v2 - with: - name: flax_tpu_test_reports - path: reports - - onnx_cuda_tests: - name: ONNX CUDA Tests - runs-on: docker-gpu - container: - image: 
diffusers/diffusers-onnxruntime-cuda - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0 - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run slow ONNXRuntime CUDA tests - env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Onnx" \ - --make-reports=tests_onnx_cuda \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_onnx_cuda_stats.txt - cat reports/tests_onnx_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v2 - with: - name: onnx_cuda_test_reports + name: torch_cuda_test_reports_${{ matrix.module }} path: reports run_torch_compile_tests: name: PyTorch Compile CUDA tests - runs-on: docker-gpu + runs-on: + group: aws-g4dn-2xlarge container: - image: diffusers/diffusers-pytorch-compile-cuda - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ + image: diffusers/diffusers-pytorch-cuda + options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers @@ -328,7 +187,8 @@ jobs: python utils/print_env.py - name: Run example tests on GPU env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + RUN_COMPILE: yes run: | python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/ - name: Failure short reports @@ -337,7 +197,7 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: torch_compile_test_reports path: reports @@ -345,11 +205,12 @@ jobs: run_xformers_tests: name: PyTorch xformers CUDA tests - runs-on: docker-gpu + runs-on: + group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-xformers-cuda - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ + options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers @@ -369,7 +230,7 @@ jobs: python utils/print_env.py - name: Run example tests on GPU env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/ - name: Failure short reports @@ -378,7 +239,7 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: torch_xformers_test_reports path: reports @@ -386,12 +247,12 @@ jobs: run_examples_tests: name: Examples PyTorch CUDA tests on Ubuntu - runs-on: docker-gpu + runs-on: + group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - + options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v3 @@ -401,7 +262,6 @@ jobs: - name: NVIDIA-SMI run: | nvidia-smi - - 
name: Install dependencies run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" @@ -414,9 +274,10 @@ jobs: - name: Run example tests on GPU env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install timm python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_cuda examples/ - name: Failure short reports @@ -427,7 +288,7 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: examples_test_reports - path: reports \ No newline at end of file + path: reports diff --git a/.github/workflows/push_tests_fast.yml b/.github/workflows/push_tests_fast.yml index 7c50da7b5c34..e274cb021892 100644 --- a/.github/workflows/push_tests_fast.yml +++ b/.github/workflows/push_tests_fast.yml @@ -18,6 +18,7 @@ env: HF_HOME: /mnt/cache OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 + HF_HUB_ENABLE_HF_TRANSFER: 1 PYTEST_TIMEOUT: 600 RUN_SLOW: no @@ -29,28 +30,19 @@ jobs: config: - name: Fast PyTorch CPU tests on Ubuntu framework: pytorch - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] + runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_cpu - - name: Fast Flax CPU tests on Ubuntu - framework: flax - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] - image: diffusers/diffusers-flax-cpu - report: flax_cpu - - name: Fast ONNXRuntime CPU tests on Ubuntu - framework: onnxruntime - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] - image: diffusers/diffusers-onnxruntime-cpu - report: onnx_cpu - name: PyTorch Example CPU tests on Ubuntu framework: pytorch_examples - runner: [ self-hosted, intel-cpu, 8-cpu, ci ] + runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_example_cpu name: ${{ matrix.config.name }} - runs-on: ${{ matrix.config.runner }} + runs-on: + group: ${{ matrix.config.runner }} container: image: ${{ matrix.config.image }} @@ -85,29 +77,11 @@ jobs: --make-reports=tests_${{ matrix.config.report }} \ tests/ - - name: Run fast Flax TPU tests - if: ${{ matrix.config.framework == 'flax' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Flax" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests/ - - - name: Run fast ONNXRuntime CPU tests - if: ${{ matrix.config.framework == 'onnxruntime' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Onnx" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests/ - - name: Run example PyTorch CPU tests if: ${{ matrix.config.framework == 'pytorch_examples' }} run: | python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install peft + python -m uv pip install peft timm python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ --make-reports=tests_${{ matrix.config.report }} \ examples @@ -118,7 +92,7 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: pr_${{ matrix.config.report }}_test_reports path: reports diff --git a/.github/workflows/push_tests_mps.yml b/.github/workflows/push_tests_mps.yml index 3a14f856346b..eb6c0da22541 100644 --- a/.github/workflows/push_tests_mps.yml +++ 
b/.github/workflows/push_tests_mps.yml @@ -1,18 +1,14 @@ name: Fast mps tests on main on: - push: - branches: - - main - paths: - - "src/diffusers/**.py" - - "tests/**.py" + workflow_dispatch: env: DIFFUSERS_IS_CI: yes HF_HOME: /mnt/cache OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 + HF_HUB_ENABLE_HF_TRANSFER: 1 PYTEST_TIMEOUT: 600 RUN_SLOW: no @@ -23,7 +19,7 @@ concurrency: jobs: run_fast_tests_apple_m1: name: Fast PyTorch MPS tests on MacOS - runs-on: [ self-hosted, apple-m1 ] + runs-on: macos-13-xlarge steps: - name: Checkout diffusers @@ -45,7 +41,7 @@ jobs: shell: arch -arch arm64 bash {0} run: | ${CONDA_RUN} python -m pip install --upgrade pip uv - ${CONDA_RUN} python -m uv pip install -e [quality,test] + ${CONDA_RUN} python -m uv pip install -e ".[quality,test]" ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git ${CONDA_RUN} python -m uv pip install transformers --upgrade @@ -59,7 +55,7 @@ jobs: shell: arch -arch arm64 bash {0} env: HF_HOME: /System/Volumes/Data/mnt/cache - HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }} + HF_TOKEN: ${{ secrets.HF_TOKEN }} run: | ${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/ @@ -69,7 +65,7 @@ jobs: - name: Test suite reports artifacts if: ${{ always() }} - uses: actions/upload-artifact@v2 + uses: actions/upload-artifact@v4 with: name: pr_torch_mps_test_reports path: reports diff --git a/.github/workflows/pypi_publish.yaml b/.github/workflows/pypi_publish.yaml index 54e9afe6d9b7..dc36b6b024c5 100644 --- a/.github/workflows/pypi_publish.yaml +++ b/.github/workflows/pypi_publish.yaml @@ -10,7 +10,7 @@ on: jobs: find-and-checkout-latest-branch: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 outputs: latest_branch: ${{ steps.set_latest_branch.outputs.latest_branch }} steps: @@ -29,46 +29,46 @@ jobs: LATEST_BRANCH=$(python utils/fetch_latest_release_branch.py) echo "Latest branch: $LATEST_BRANCH" echo "latest_branch=$LATEST_BRANCH" >> $GITHUB_ENV - + - name: Set latest branch output id: set_latest_branch run: echo "::set-output name=latest_branch::${{ env.latest_branch }}" release: needs: find-and-checkout-latest-branch - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - name: Checkout Repo uses: actions/checkout@v3 with: ref: ${{ needs.find-and-checkout-latest-branch.outputs.latest_branch }} - + - name: Setup Python uses: actions/setup-python@v4 with: python-version: "3.8" - + - name: Install dependencies run: | python -m pip install --upgrade pip pip install -U setuptools wheel twine pip install -U torch --index-url https://download.pytorch.org/whl/cpu pip install -U transformers - + - name: Build the dist files run: python setup.py bdist_wheel && python setup.py sdist - + - name: Publish to the test PyPI env: TWINE_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }} TWINE_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }} - run: twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/ + run: twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/ - name: Test installing diffusers and importing run: | pip install diffusers && pip uninstall diffusers -y - pip install -i https://testpypi.python.org/pypi diffusers + pip install -i https://test.pypi.org/simple/ diffusers python -c "from diffusers import __version__; print(__version__)" python -c "from diffusers import DiffusionPipeline; pipe = 
DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()" python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')" diff --git a/.github/workflows/release_tests_fast.yml b/.github/workflows/release_tests_fast.yml new file mode 100644 index 000000000000..81a34f7a464d --- /dev/null +++ b/.github/workflows/release_tests_fast.yml @@ -0,0 +1,351 @@ +# Duplicate workflow to push_tests.yml that is meant to run on release/patch branches as a final check +# Creating a duplicate workflow here is simpler than adding complex path/branch parsing logic to push_tests.yml +# Needs to be updated if push_tests.yml updated +name: (Release) Fast GPU Tests on main + +on: + push: + branches: + - "v*.*.*-release" + - "v*.*.*-patch" + +env: + DIFFUSERS_IS_CI: yes + OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + PYTEST_TIMEOUT: 600 + PIPELINE_USAGE_CUTOFF: 50000 + +jobs: + setup_torch_cuda_pipeline_matrix: + name: Setup Torch Pipelines CUDA Slow Tests Matrix + runs-on: + group: aws-general-8-plus + container: + image: diffusers/diffusers-pytorch-cpu + outputs: + pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + - name: Environment + run: | + python utils/print_env.py + - name: Fetch Pipeline Matrix + id: fetch_pipeline_matrix + run: | + matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) + echo $matrix + echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT + - name: Pipeline Tests Artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: test-pipelines.json + path: reports + + torch_pipelines_cuda_tests: + name: Torch Pipelines CUDA Tests + needs: setup_torch_cuda_pipeline_matrix + strategy: + fail-fast: false + max-parallel: 8 + matrix: + module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} + runs-on: + group: aws-g4dn-2xlarge + container: + image: diffusers/diffusers-pytorch-cuda + options: --shm-size "16gb" --ipc host --gpus all + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + - name: NVIDIA-SMI + run: | + nvidia-smi + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git + - name: Environment + run: | + python utils/print_env.py + - name: Slow PyTorch CUDA checkpoint tests on Ubuntu + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + -s -v -k "not Flax and not Onnx" \ + --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ + tests/pipelines/${{ matrix.module }} + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt + cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt + - name: Test suite reports artifacts + if: ${{ always() }} + uses: 
actions/upload-artifact@v4 + with: + name: pipeline_${{ matrix.module }}_test_reports + path: reports + + torch_cuda_tests: + name: Torch CUDA Tests + runs-on: + group: aws-g4dn-2xlarge + container: + image: diffusers/diffusers-pytorch-cuda + options: --shm-size "16gb" --ipc host --gpus all + defaults: + run: + shell: bash + strategy: + fail-fast: false + max-parallel: 2 + matrix: + module: [models, schedulers, lora, others, single_file] + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install peft@git+https://github.com/huggingface/peft.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git + + - name: Environment + run: | + python utils/print_env.py + + - name: Run PyTorch CUDA tests + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + -s -v -k "not Flax and not Onnx" \ + --make-reports=tests_torch_${{ matrix.module }}_cuda \ + tests/${{ matrix.module }} + + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_torch_${{ matrix.module }}_cuda_stats.txt + cat reports/tests_torch_${{ matrix.module }}_cuda_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: torch_cuda_${{ matrix.module }}_test_reports + path: reports + + torch_minimum_version_cuda_tests: + name: Torch Minimum Version CUDA Tests + runs-on: + group: aws-g4dn-2xlarge + container: + image: diffusers/diffusers-pytorch-minimum-cuda + options: --shm-size "16gb" --ipc host --gpus all + defaults: + run: + shell: bash + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install peft@git+https://github.com/huggingface/peft.git + pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git + + - name: Environment + run: | + python utils/print_env.py + + - name: Run PyTorch CUDA tests + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms + CUBLAS_WORKSPACE_CONFIG: :16:8 + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ + -s -v -k "not Flax and not Onnx" \ + --make-reports=tests_torch_minimum_cuda \ + tests/models/test_modeling_common.py \ + tests/pipelines/test_pipelines_common.py \ + tests/pipelines/test_pipeline_utils.py \ + tests/pipelines/test_pipelines.py \ + tests/pipelines/test_pipelines_auto.py \ + tests/schedulers/test_schedulers.py \ + tests/others + + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/tests_torch_minimum_version_cuda_stats.txt + cat reports/tests_torch_minimum_version_cuda_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: torch_minimum_version_cuda_test_reports + path: 
reports + + run_torch_compile_tests: + name: PyTorch Compile CUDA tests + + runs-on: + group: aws-g4dn-2xlarge + + container: + image: diffusers/diffusers-pytorch-cuda + options: --gpus all --shm-size "16gb" --ipc host + + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: NVIDIA-SMI + run: | + nvidia-smi + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test,training] + - name: Environment + run: | + python utils/print_env.py + - name: Run torch compile tests on GPU + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + RUN_COMPILE: yes + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/ + - name: Failure short reports + if: ${{ failure() }} + run: cat reports/tests_torch_compile_cuda_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: torch_compile_test_reports + path: reports + + run_xformers_tests: + name: PyTorch xformers CUDA tests + + runs-on: + group: aws-g4dn-2xlarge + + container: + image: diffusers/diffusers-pytorch-xformers-cuda + options: --gpus all --shm-size "16gb" --ipc host + + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: NVIDIA-SMI + run: | + nvidia-smi + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test,training] + - name: Environment + run: | + python utils/print_env.py + - name: Run example tests on GPU + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + run: | + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/ + - name: Failure short reports + if: ${{ failure() }} + run: cat reports/tests_torch_xformers_cuda_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: torch_xformers_test_reports + path: reports + + run_examples_tests: + name: Examples PyTorch CUDA tests on Ubuntu + + runs-on: + group: aws-g4dn-2xlarge + + container: + image: diffusers/diffusers-pytorch-cuda + options: --gpus all --shm-size "16gb" --ipc host + + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: NVIDIA-SMI + run: | + nvidia-smi + + - name: Install dependencies + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test,training] + + - name: Environment + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python utils/print_env.py + + - name: Run example tests on GPU + env: + HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install timm + python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_cuda examples/ + + - name: Failure short reports + if: ${{ failure() }} + run: | + cat reports/examples_torch_cuda_stats.txt + cat reports/examples_torch_cuda_failures_short.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v4 + with: + name: examples_test_reports + path: reports diff --git a/.github/workflows/run_tests_from_a_pr.yml 
b/.github/workflows/run_tests_from_a_pr.yml new file mode 100644 index 000000000000..c8eee8dbbc33 --- /dev/null +++ b/.github/workflows/run_tests_from_a_pr.yml @@ -0,0 +1,74 @@ +name: Check running SLOW tests from a PR (only GPU) + +on: + workflow_dispatch: + inputs: + docker_image: + default: 'diffusers/diffusers-pytorch-cuda' + description: 'Name of the Docker image' + required: true + pr_number: + description: 'PR number to test on' + required: true + test: + description: 'Tests to run (e.g.: `tests/models`).' + required: true + +env: + DIFFUSERS_IS_CI: yes + IS_GITHUB_CI: "1" + HF_HOME: /mnt/cache + OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + PYTEST_TIMEOUT: 600 + RUN_SLOW: yes + +jobs: + run_tests: + name: "Run a test on our runner from a PR" + runs-on: + group: aws-g4dn-2xlarge + container: + image: ${{ github.event.inputs.docker_image }} + options: --gpus all --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + + steps: + - name: Validate test files input + id: validate_test_files + env: + PY_TEST: ${{ github.event.inputs.test }} + run: | + if [[ ! "$PY_TEST" =~ ^tests/ ]]; then + echo "Error: The input string must start with 'tests/'." + exit 1 + fi + + if [[ ! "$PY_TEST" =~ ^tests/(models|pipelines|lora) ]]; then + echo "Error: The input string must contain either 'models', 'pipelines', or 'lora' after 'tests/'." + exit 1 + fi + + if [[ "$PY_TEST" == *";"* ]]; then + echo "Error: The input string must not contain ';'." + exit 1 + fi + echo "$PY_TEST" + + shell: bash -e {0} + + - name: Checkout PR branch + uses: actions/checkout@v4 + with: + ref: refs/pull/${{ inputs.pr_number }}/head + + - name: Install pytest + run: | + python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" + python -m uv pip install -e [quality,test] + python -m uv pip install peft + + - name: Run tests + env: + PY_TEST: ${{ github.event.inputs.test }} + run: | + pytest "$PY_TEST" diff --git a/.github/workflows/ssh-pr-runner.yml b/.github/workflows/ssh-pr-runner.yml new file mode 100644 index 000000000000..49fa9c0ad24d --- /dev/null +++ b/.github/workflows/ssh-pr-runner.yml @@ -0,0 +1,40 @@ +name: SSH into PR runners + +on: + workflow_dispatch: + inputs: + docker_image: + description: 'Name of the Docker image' + required: true + +env: + IS_GITHUB_CI: "1" + HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} + HF_HOME: /mnt/cache + DIFFUSERS_IS_CI: yes + OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + RUN_SLOW: yes + +jobs: + ssh_runner: + name: "SSH" + runs-on: + group: aws-highmemory-32-plus + container: + image: ${{ github.event.inputs.docker_image }} + options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --privileged + + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Tailscale # In order to be able to SSH when a test fails + uses: huggingface/tailscale-action@main + with: + authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }} + slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }} + slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} + waitForSSH: true diff --git a/.github/workflows/ssh-runner.yml b/.github/workflows/ssh-runner.yml new file mode 100644 index 000000000000..917eb5b1b31a --- /dev/null +++ b/.github/workflows/ssh-runner.yml @@ -0,0 +1,52 @@ +name: SSH into GPU runners + +on: + workflow_dispatch: + inputs: + runner_type: + description: 'Type of runner to test (aws-g6-4xlarge-plus: a10, aws-g4dn-2xlarge: t4, aws-g6e-xlarge-plus: L40)' + type: choice + required: true + options: + - 
aws-g6-4xlarge-plus + - aws-g4dn-2xlarge + - aws-g6e-xlarge-plus + docker_image: + description: 'Name of the Docker image' + required: true + +env: + IS_GITHUB_CI: "1" + HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} + HF_HOME: /mnt/cache + DIFFUSERS_IS_CI: yes + OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + RUN_SLOW: yes + +jobs: + ssh_runner: + name: "SSH" + runs-on: + group: "${{ github.event.inputs.runner_type }}" + container: + image: ${{ github.event.inputs.docker_image }} + options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus all --privileged + + steps: + - name: Checkout diffusers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: NVIDIA-SMI + run: | + nvidia-smi + + - name: Tailscale # In order to be able to SSH when a test fails + uses: huggingface/tailscale-action@main + with: + authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }} + slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }} + slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} + waitForSSH: true diff --git a/.github/workflows/stale.yml b/.github/workflows/stale.yml index ff609ee76946..27450ed4c7f2 100644 --- a/.github/workflows/stale.yml +++ b/.github/workflows/stale.yml @@ -8,7 +8,10 @@ jobs: close_stale_issues: name: Close Stale Issues if: github.repository == 'huggingface/diffusers' - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 + permissions: + issues: write + pull-requests: write env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} steps: diff --git a/.github/workflows/trufflehog.yml b/.github/workflows/trufflehog.yml new file mode 100644 index 000000000000..4743dc352455 --- /dev/null +++ b/.github/workflows/trufflehog.yml @@ -0,0 +1,18 @@ +on: + push: + +name: Secret Leaks + +jobs: + trufflehog: + runs-on: ubuntu-22.04 + steps: + - name: Checkout code + uses: actions/checkout@v4 + with: + fetch-depth: 0 + - name: Secret Scanning + uses: trufflesecurity/trufflehog@main + with: + extra_args: --results=verified,unknown + diff --git a/.github/workflows/typos.yml b/.github/workflows/typos.yml index fbd051b4da0d..6d2f2fc8dd9a 100644 --- a/.github/workflows/typos.yml +++ b/.github/workflows/typos.yml @@ -5,7 +5,7 @@ on: jobs: build: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v3 diff --git a/.github/workflows/update_metadata.yml b/.github/workflows/update_metadata.yml index 33d162ef8d1f..92aea0369ba8 100644 --- a/.github/workflows/update_metadata.yml +++ b/.github/workflows/update_metadata.yml @@ -25,6 +25,6 @@ jobs: - name: Update metadata env: - HUGGING_FACE_HUB_TOKEN: ${{ secrets.DIFFUSERS_BOT_TOKEN }} + HF_TOKEN: ${{ secrets.SAYAK_HF_TOKEN }} run: | python utils/update_metadata.py --commit_sha ${{ github.sha }} diff --git a/.gitignore b/.gitignore index 9d74fe840449..15617d5fdc74 100644 --- a/.gitignore +++ b/.gitignore @@ -175,4 +175,4 @@ tags .ruff_cache # wandb -wandb +wandb \ No newline at end of file diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 887e4dd43c45..ec18df882641 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,4 +1,4 @@ - + +# Outpainting + +Outpainting extends an image beyond its original boundaries, allowing you to add, replace, or modify visual elements in an image while preserving the original image. Like [inpainting](../using-diffusers/inpaint), you want to fill the white area (in this case, the area outside of the original image) with new visual elements while keeping the original image (represented by a mask of black pixels). 
There are a couple of ways to outpaint, such as with a [ControlNet](https://hf.co/blog/OzzyGT/outpainting-controlnet) or with [Differential Diffusion](https://hf.co/blog/OzzyGT/outpainting-differential-diffusion). + +This guide will show you how to outpaint with an inpainting model, ControlNet, and a ZoeDepth estimator. + +Before you begin, make sure you have the [controlnet_aux](https://github.com/huggingface/controlnet_aux) library installed so you can use the ZoeDepth estimator. + +```py +!pip install -q controlnet_aux +``` + +## Image preparation + +Start by picking an image to outpaint with and remove the background with a Space like [BRIA-RMBG-1.4](https://hf.co/spaces/briaai/BRIA-RMBG-1.4). + + + +For example, remove the background from this image of a pair of shoes. + +
+<!-- comparison images (HTML figure markup lost in extraction): "original image" and "background removed" -->
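+If you prefer to remove the background in code rather than with a Space, a library such as rembg can produce a similar RGBA cutout. This is only a minimal sketch under the assumption that you install rembg yourself (it is not a dependency of this guide) and have a local copy of the photo saved as "shoes.png" (hypothetical filename):
+
+```py
+# assumes `pip install rembg`; rembg is a third-party tool, not part of diffusers
+from rembg import remove
+from PIL import Image
+
+shoes = Image.open("shoes.png")
+no_background = remove(shoes)  # returns an RGBA image with a transparent background
+no_background.save("no-background-shoes.png")
+```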
+ +[Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) models work best with 1024x1024 images, but you can resize the image to any size as long as your hardware has enough memory to support it. The transparent background in the image should also be replaced with a white background. Create a function (like the one below) that scales and pastes the image onto a white background. + +```py +import random + +import requests +import torch +from controlnet_aux import ZoeDetector +from PIL import Image, ImageOps + +from diffusers import ( + AutoencoderKL, + ControlNetModel, + StableDiffusionXLControlNetPipeline, + StableDiffusionXLInpaintPipeline, +) + +def scale_and_paste(original_image): + aspect_ratio = original_image.width / original_image.height + + if original_image.width > original_image.height: + new_width = 1024 + new_height = round(new_width / aspect_ratio) + else: + new_height = 1024 + new_width = round(new_height * aspect_ratio) + + resized_original = original_image.resize((new_width, new_height), Image.LANCZOS) + white_background = Image.new("RGBA", (1024, 1024), "white") + x = (1024 - new_width) // 2 + y = (1024 - new_height) // 2 + white_background.paste(resized_original, (x, y), resized_original) + + return resized_original, white_background + +original_image = Image.open( + requests.get( + "https://huggingface.co/datasets/stevhliu/testing-images/resolve/main/no-background-jordan.png", + stream=True, + ).raw +).convert("RGBA") +resized_img, white_bg_image = scale_and_paste(original_image) +``` + +To avoid adding unwanted extra details, use the ZoeDepth estimator to provide additional guidance during generation and to ensure the shoes remain consistent with the original image. + +```py +zoe = ZoeDetector.from_pretrained("lllyasviel/Annotators") +image_zoe = zoe(white_bg_image, detect_resolution=512, image_resolution=1024) +image_zoe +``` + +
+ +
+ +## Outpaint + +Once your image is ready, you can generate content in the white area around the shoes with [controlnet-inpaint-dreamer-sdxl](https://hf.co/destitech/controlnet-inpaint-dreamer-sdxl), a SDXL ControlNet trained for inpainting. + +Load the inpainting ControlNet, ZoeDepth model, VAE and pass them to the [`StableDiffusionXLControlNetPipeline`]. Then you can create an optional `generate_image` function (for convenience) to outpaint an initial image. + +```py +controlnets = [ + ControlNetModel.from_pretrained( + "destitech/controlnet-inpaint-dreamer-sdxl", torch_dtype=torch.float16, variant="fp16" + ), + ControlNetModel.from_pretrained( + "diffusers/controlnet-zoe-depth-sdxl-1.0", torch_dtype=torch.float16 + ), +] +vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda") +pipeline = StableDiffusionXLControlNetPipeline.from_pretrained( + "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnets, vae=vae +).to("cuda") + +def generate_image(prompt, negative_prompt, inpaint_image, zoe_image, seed: int = None): + if seed is None: + seed = random.randint(0, 2**32 - 1) + + generator = torch.Generator(device="cpu").manual_seed(seed) + + image = pipeline( + prompt, + negative_prompt=negative_prompt, + image=[inpaint_image, zoe_image], + guidance_scale=6.5, + num_inference_steps=25, + generator=generator, + controlnet_conditioning_scale=[0.5, 0.8], + control_guidance_end=[0.9, 0.6], + ).images[0] + + return image + +prompt = "nike air jordans on a basketball court" +negative_prompt = "" + +temp_image = generate_image(prompt, negative_prompt, white_bg_image, image_zoe, 908097) +``` + +Paste the original image over the initial outpainted image. You'll improve the outpainted background in a later step. + +```py +x = (1024 - resized_img.width) // 2 +y = (1024 - resized_img.height) // 2 +temp_image.paste(resized_img, (x, y), resized_img) +temp_image +``` + +
+ +
+ +> [!TIP] +> Now is a good time to free up some memory if you're running low! +> +> ```py +> pipeline=None +> torch.cuda.empty_cache() +> ``` + +Now that you have an initial outpainted image, load the [`StableDiffusionXLInpaintPipeline`] with the [RealVisXL](https://hf.co/SG161222/RealVisXL_V4.0) model to generate the final outpainted image with better quality. + +```py +pipeline = StableDiffusionXLInpaintPipeline.from_pretrained( + "OzzyGT/RealVisXL_V4.0_inpainting", + torch_dtype=torch.float16, + variant="fp16", + vae=vae, +).to("cuda") +``` + +Prepare a mask for the final outpainted image. To create a more natural transition between the original image and the outpainted background, blur the mask to help it blend better. + +```py +mask = Image.new("L", temp_image.size) +mask.paste(resized_img.split()[3], (x, y)) +mask = ImageOps.invert(mask) +final_mask = mask.point(lambda p: p > 128 and 255) +mask_blurred = pipeline.mask_processor.blur(final_mask, blur_factor=20) +mask_blurred +``` + +
+ +
+ +Create a better prompt and pass it to the `generate_outpaint` function to generate the final outpainted image. Again, paste the original image over the final outpainted background. + +```py +def generate_outpaint(prompt, negative_prompt, image, mask, seed: int = None): + if seed is None: + seed = random.randint(0, 2**32 - 1) + + generator = torch.Generator(device="cpu").manual_seed(seed) + + image = pipeline( + prompt, + negative_prompt=negative_prompt, + image=image, + mask_image=mask, + guidance_scale=10.0, + strength=0.8, + num_inference_steps=30, + generator=generator, + ).images[0] + + return image + +prompt = "high quality photo of nike air jordans on a basketball court, highly detailed" +negative_prompt = "" + +final_image = generate_outpaint(prompt, negative_prompt, temp_image, mask_blurred, 7688778) +x = (1024 - resized_img.width) // 2 +y = (1024 - resized_img.height) // 2 +final_image.paste(resized_img, (x, y), resized_img) +final_image +``` + +
+ +
diff --git a/docs/source/en/api/activations.md b/docs/source/en/api/activations.md index 3bef28a5ab0d..3d8e14dd50e2 100644 --- a/docs/source/en/api/activations.md +++ b/docs/source/en/api/activations.md @@ -1,4 +1,4 @@ - + +# Caching methods + +Cache methods speedup diffusion transformers by storing and reusing intermediate outputs of specific layers, such as attention and feedforward layers, instead of recalculating them at each inference step. + +## CacheMixin + +[[autodoc]] CacheMixin + +## PyramidAttentionBroadcastConfig + +[[autodoc]] PyramidAttentionBroadcastConfig + +[[autodoc]] apply_pyramid_attention_broadcast + +## FasterCacheConfig + +[[autodoc]] FasterCacheConfig + +[[autodoc]] apply_faster_cache + +### FirstBlockCacheConfig + +[[autodoc]] FirstBlockCacheConfig + +[[autodoc]] apply_first_block_cache diff --git a/docs/source/en/api/configuration.md b/docs/source/en/api/configuration.md index 31d70232a95c..328e109e1e4c 100644 --- a/docs/source/en/api/configuration.md +++ b/docs/source/en/api/configuration.md @@ -1,4 +1,4 @@ - + +# SD3Transformer2D + +This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder or a text encoder and SD3Transformer2DModel, check [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead. + +The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs. + +> [!TIP] +> To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide. + +## SD3Transformer2DLoadersMixin + +[[autodoc]] loaders.transformer_sd3.SD3Transformer2DLoadersMixin + - all + - _load_ip_adapter_weights \ No newline at end of file diff --git a/docs/source/en/api/loaders/unet.md b/docs/source/en/api/loaders/unet.md index d8cfab64221b..7450e03e580b 100644 --- a/docs/source/en/api/loaders/unet.md +++ b/docs/source/en/api/loaders/unet.md @@ -1,4 +1,4 @@ - + +# AllegroTransformer3DModel + +A Diffusion Transformer model for 3D data from [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI. + +The model can be loaded with the following code snippet. + +```python +from diffusers import AllegroTransformer3DModel + +transformer = AllegroTransformer3DModel.from_pretrained("rhymes-ai/Allegro", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") +``` + +## AllegroTransformer3DModel + +[[autodoc]] AllegroTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/asymmetricautoencoderkl.md b/docs/source/en/api/models/asymmetricautoencoderkl.md index 2023dcf97f9d..fbadf9bd4009 100644 --- a/docs/source/en/api/models/asymmetricautoencoderkl.md +++ b/docs/source/en/api/models/asymmetricautoencoderkl.md @@ -1,4 +1,4 @@ - + +# AuraFlowTransformer2DModel + +A Transformer model for image-like data from [AuraFlow](https://blog.fal.ai/auraflow/). 
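+
+The model can be loaded with a snippet like the one below (a sketch following the pattern of the other model pages; it assumes the [`fal/AuraFlow`](https://huggingface.co/fal/AuraFlow) checkpoint exposes a `transformer` subfolder).
+
+```python
+import torch
+from diffusers import AuraFlowTransformer2DModel
+
+transformer = AuraFlowTransformer2DModel.from_pretrained("fal/AuraFlow", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
+```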
+ +## AuraFlowTransformer2DModel + +[[autodoc]] AuraFlowTransformer2DModel diff --git a/docs/source/en/api/models/auto_model.md b/docs/source/en/api/models/auto_model.md new file mode 100644 index 000000000000..376dd12d12c4 --- /dev/null +++ b/docs/source/en/api/models/auto_model.md @@ -0,0 +1,29 @@ + + +# AutoModel + +The `AutoModel` is designed to make it easy to load a checkpoint without needing to know the specific model class. `AutoModel` automatically retrieves the correct model class from the checkpoint `config.json` file. + +```python +from diffusers import AutoModel, AutoPipelineForText2Image + +unet = AutoModel.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet") +pipe = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", unet=unet) +``` + + +## AutoModel + +[[autodoc]] AutoModel + - all + - from_pretrained diff --git a/docs/source/en/api/models/autoencoder_dc.md b/docs/source/en/api/models/autoencoder_dc.md new file mode 100644 index 000000000000..fd53ec0ef66f --- /dev/null +++ b/docs/source/en/api/models/autoencoder_dc.md @@ -0,0 +1,72 @@ + + +# AutoencoderDC + +The 2D Autoencoder model used in [SANA](https://huggingface.co/papers/2410.10629) and introduced in [DCAE](https://huggingface.co/papers/2410.10733) by authors Junyu Chen\*, Han Cai\*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han from MIT HAN Lab. + +The abstract from the paper is: + +*We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at [this https URL](https://github.com/mit-han-lab/efficientvit).* + +The following DCAE models are released and supported in Diffusers. 
+
+| Diffusers format | Original format |
+|:----------------:|:---------------:|
+| [`mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-sana-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0)
+| [`mit-han-lab/dc-ae-f32c32-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0)
+| [`mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0)
+| [`mit-han-lab/dc-ae-f64c128-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0)
+| [`mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0)
+| [`mit-han-lab/dc-ae-f128c512-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0)
+| [`mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0)
+
+This model was contributed by [lawrence-cj](https://github.com/lawrence-cj).
+
+Load a model in Diffusers format with [`~ModelMixin.from_pretrained`].
+
+```python
+from diffusers import AutoencoderDC
+
+ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32).to("cuda")
+```
+
+## Load a model in Diffusers via `from_single_file`
+
+```python
+from diffusers import AutoencoderDC
+
+ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0/blob/main/model.safetensors"
+model = AutoencoderDC.from_single_file(ckpt_path)
+```
+
+The `AutoencoderDC` model has `in` and `mix` single file checkpoint variants with matching checkpoint keys but different scaling factors. Diffusers cannot automatically infer the correct config for these variants from the checkpoint alone and defaults to the `mix` variant config. To load an `in` variant checkpoint, override this by passing the matching config file with the `config` argument.
+ +```python +from diffusers import AutoencoderDC + +ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0/blob/main/model.safetensors" +model = AutoencoderDC.from_single_file(ckpt_path, config="mit-han-lab/dc-ae-f128c512-in-1.0-diffusers") +``` + + +## AutoencoderDC + +[[autodoc]] AutoencoderDC + - encode + - decode + - all + +## DecoderOutput + +[[autodoc]] models.autoencoders.vae.DecoderOutput + diff --git a/docs/source/en/api/models/autoencoder_kl_hunyuan_video.md b/docs/source/en/api/models/autoencoder_kl_hunyuan_video.md new file mode 100644 index 000000000000..b173f9f51e36 --- /dev/null +++ b/docs/source/en/api/models/autoencoder_kl_hunyuan_video.md @@ -0,0 +1,32 @@ + + +# AutoencoderKLHunyuanVideo + +The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo](https://github.com/Tencent/HunyuanVideo/), which was introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent. + +The model can be loaded with the following code snippet. + +```python +from diffusers import AutoencoderKLHunyuanVideo + +vae = AutoencoderKLHunyuanVideo.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder="vae", torch_dtype=torch.float16) +``` + +## AutoencoderKLHunyuanVideo + +[[autodoc]] AutoencoderKLHunyuanVideo + - decode + - all + +## DecoderOutput + +[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/docs/source/en/api/models/autoencoder_kl_wan.md b/docs/source/en/api/models/autoencoder_kl_wan.md new file mode 100644 index 000000000000..341e5e9a8736 --- /dev/null +++ b/docs/source/en/api/models/autoencoder_kl_wan.md @@ -0,0 +1,32 @@ + + +# AutoencoderKLWan + +The 3D variational autoencoder (VAE) model with KL loss used in [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team. + +The model can be loaded with the following code snippet. + +```python +from diffusers import AutoencoderKLWan + +vae = AutoencoderKLWan.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32) +``` + +## AutoencoderKLWan + +[[autodoc]] AutoencoderKLWan + - decode + - all + +## DecoderOutput + +[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/docs/source/en/api/models/autoencoder_oobleck.md b/docs/source/en/api/models/autoencoder_oobleck.md new file mode 100644 index 000000000000..2f9184ad7301 --- /dev/null +++ b/docs/source/en/api/models/autoencoder_oobleck.md @@ -0,0 +1,38 @@ + + +# AutoencoderOobleck + +The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms. + +The abstract from the paper is: + +*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. 
Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*
+
+## AutoencoderOobleck
+
+[[autodoc]] AutoencoderOobleck
+  - decode
+  - encode
+  - all
+
+## OobleckDecoderOutput
+
+[[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput
+
+## AutoencoderOobleckOutput
+
+[[autodoc]] models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput
diff --git a/docs/source/en/api/models/autoencoder_tiny.md b/docs/source/en/api/models/autoencoder_tiny.md
index 25fe2b7a8ab9..19603f3a88a5 100644
--- a/docs/source/en/api/models/autoencoder_tiny.md
+++ b/docs/source/en/api/models/autoencoder_tiny.md
@@ -1,4 +1,4 @@
-
+
+# AutoencoderKLAllegro
+
+The 3D variational autoencoder (VAE) model with KL loss used in [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI.
+
+The model can be loaded with the following code snippet.
+
+```python
+from diffusers import AutoencoderKLAllegro
+
+vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32).to("cuda")
+```
+
+## AutoencoderKLAllegro
+
+[[autodoc]] AutoencoderKLAllegro
+  - decode
+  - encode
+  - all
+
+## AutoencoderKLOutput
+
+[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
+
+## DecoderOutput
+
+[[autodoc]] models.autoencoders.vae.DecoderOutput
diff --git a/docs/source/en/api/models/autoencoderkl_cogvideox.md b/docs/source/en/api/models/autoencoderkl_cogvideox.md
new file mode 100644
index 000000000000..2c5411a0647c
--- /dev/null
+++ b/docs/source/en/api/models/autoencoderkl_cogvideox.md
@@ -0,0 +1,37 @@
+
+
+# AutoencoderKLCogVideoX
+
+The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.
+
+The model can be loaded with the following code snippet.
+
+```python
+from diffusers import AutoencoderKLCogVideoX
+
+vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda")
+```
+
+## AutoencoderKLCogVideoX
+
+[[autodoc]] AutoencoderKLCogVideoX
+  - decode
+  - encode
+  - all
+
+## AutoencoderKLOutput
+
+[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
+
+## DecoderOutput
+
+[[autodoc]] models.autoencoders.vae.DecoderOutput
diff --git a/docs/source/en/api/models/autoencoderkl_cosmos.md b/docs/source/en/api/models/autoencoderkl_cosmos.md
new file mode 100644
index 000000000000..24f070ac4ef3
--- /dev/null
+++ b/docs/source/en/api/models/autoencoderkl_cosmos.md
@@ -0,0 +1,40 @@
+
+
+# AutoencoderKLCosmos
+
+The variational autoencoder (VAE) model used in Cosmos, released as part of [Cosmos Tokenizers](https://github.com/NVIDIA/Cosmos-Tokenizer) by NVIDIA.
+
+Supported models:
+- [nvidia/Cosmos-1.0-Tokenizer-CV8x8x8](https://huggingface.co/nvidia/Cosmos-1.0-Tokenizer-CV8x8x8)
+
+The model can be loaded with the following code snippet.
+
+```python
+from diffusers import AutoencoderKLCosmos
+
+vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")
+```
+
+## AutoencoderKLCosmos
+
+[[autodoc]] AutoencoderKLCosmos
+  - decode
+  - encode
+  - all
+
+## AutoencoderKLOutput
+
+[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
+
+## DecoderOutput
+
+[[autodoc]] models.autoencoders.vae.DecoderOutput
diff --git a/docs/source/en/api/models/autoencoderkl_ltx_video.md b/docs/source/en/api/models/autoencoderkl_ltx_video.md
new file mode 100644
index 000000000000..9c2384ca53a1
--- /dev/null
+++ b/docs/source/en/api/models/autoencoderkl_ltx_video.md
@@ -0,0 +1,37 @@
+
+
+# AutoencoderKLLTXVideo
+
+The 3D variational autoencoder (VAE) model with KL loss used in [LTX](https://huggingface.co/Lightricks/LTX-Video) was introduced by Lightricks.
+
+The model can be loaded with the following code snippet.
+
+```python
+from diffusers import AutoencoderKLLTXVideo
+
+vae = AutoencoderKLLTXVideo.from_pretrained("Lightricks/LTX-Video", subfolder="vae", torch_dtype=torch.float32).to("cuda")
+```
+
+## AutoencoderKLLTXVideo
+
+[[autodoc]] AutoencoderKLLTXVideo
+  - decode
+  - encode
+  - all
+
+## AutoencoderKLOutput
+
+[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
+
+## DecoderOutput
+
+[[autodoc]] models.autoencoders.vae.DecoderOutput
diff --git a/docs/source/en/api/models/autoencoderkl_magvit.md b/docs/source/en/api/models/autoencoderkl_magvit.md
new file mode 100644
index 000000000000..7c1060ddd435
--- /dev/null
+++ b/docs/source/en/api/models/autoencoderkl_magvit.md
@@ -0,0 +1,37 @@
+
+
+# AutoencoderKLMagvit
+
+The 3D variational autoencoder (VAE) model with KL loss used in [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) was introduced by Alibaba PAI.
+
+The model can be loaded with the following code snippet.
+
+```python
+from diffusers import AutoencoderKLMagvit
+
+vae = AutoencoderKLMagvit.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="vae", torch_dtype=torch.float16).to("cuda")
+```
+
+## AutoencoderKLMagvit
+
+[[autodoc]] AutoencoderKLMagvit
+  - decode
+  - encode
+  - all
+
+## AutoencoderKLOutput
+
+[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
+
+## DecoderOutput
+
+[[autodoc]] models.autoencoders.vae.DecoderOutput
diff --git a/docs/source/en/api/models/autoencoderkl_mochi.md b/docs/source/en/api/models/autoencoderkl_mochi.md
new file mode 100644
index 000000000000..fef6645a18fa
--- /dev/null
+++ b/docs/source/en/api/models/autoencoderkl_mochi.md
@@ -0,0 +1,32 @@
+
+
+# AutoencoderKLMochi
+
+The 3D variational autoencoder (VAE) model with KL loss used in [Mochi](https://github.com/genmoai/models) was introduced in [Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Genmo.
+
+The model can be loaded with the following code snippet.
+ +```python +from diffusers import AutoencoderKLMochi + +vae = AutoencoderKLMochi.from_pretrained("genmo/mochi-1-preview", subfolder="vae", torch_dtype=torch.float32).to("cuda") +``` + +## AutoencoderKLMochi + +[[autodoc]] AutoencoderKLMochi + - decode + - all + +## DecoderOutput + +[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/docs/source/en/api/models/autoencoderkl_qwenimage.md b/docs/source/en/api/models/autoencoderkl_qwenimage.md new file mode 100644 index 000000000000..0e176448e158 --- /dev/null +++ b/docs/source/en/api/models/autoencoderkl_qwenimage.md @@ -0,0 +1,35 @@ + + +# AutoencoderKLQwenImage + +The model can be loaded with the following code snippet. + +```python +from diffusers import AutoencoderKLQwenImage + +vae = AutoencoderKLQwenImage.from_pretrained("Qwen/QwenImage-20B", subfolder="vae") +``` + +## AutoencoderKLQwenImage + +[[autodoc]] AutoencoderKLQwenImage + - decode + - encode + - all + +## AutoencoderKLOutput + +[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput + +## DecoderOutput + +[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/docs/source/en/api/models/bria_transformer.md b/docs/source/en/api/models/bria_transformer.md new file mode 100644 index 000000000000..9df7eeb6ffcd --- /dev/null +++ b/docs/source/en/api/models/bria_transformer.md @@ -0,0 +1,19 @@ + + +# BriaTransformer2DModel + +A modified flux Transformer model from [Bria](https://huggingface.co/briaai/BRIA-3.2) + +## BriaTransformer2DModel + +[[autodoc]] BriaTransformer2DModel diff --git a/docs/source/en/api/models/chroma_transformer.md b/docs/source/en/api/models/chroma_transformer.md new file mode 100644 index 000000000000..681e81f7a584 --- /dev/null +++ b/docs/source/en/api/models/chroma_transformer.md @@ -0,0 +1,19 @@ + + +# ChromaTransformer2DModel + +A modified flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma) + +## ChromaTransformer2DModel + +[[autodoc]] ChromaTransformer2DModel diff --git a/docs/source/en/api/models/cogvideox_transformer3d.md b/docs/source/en/api/models/cogvideox_transformer3d.md new file mode 100644 index 000000000000..5d50e5dca651 --- /dev/null +++ b/docs/source/en/api/models/cogvideox_transformer3d.md @@ -0,0 +1,30 @@ + + +# CogVideoXTransformer3DModel + +A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI. + +The model can be loaded with the following code snippet. 
+ +```python +from diffusers import CogVideoXTransformer3DModel + +transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda") +``` + +## CogVideoXTransformer3DModel + +[[autodoc]] CogVideoXTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/cogview3plus_transformer2d.md b/docs/source/en/api/models/cogview3plus_transformer2d.md new file mode 100644 index 000000000000..1fe574a7fb2f --- /dev/null +++ b/docs/source/en/api/models/cogview3plus_transformer2d.md @@ -0,0 +1,30 @@ + + +# CogView3PlusTransformer2DModel + +A Diffusion Transformer model for 2D data from [CogView3Plus](https://github.com/THUDM/CogView3) was introduced in [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) by Tsinghua University & ZhipuAI. + +The model can be loaded with the following code snippet. + +```python +from diffusers import CogView3PlusTransformer2DModel + +transformer = CogView3PlusTransformer2DModel.from_pretrained("THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") +``` + +## CogView3PlusTransformer2DModel + +[[autodoc]] CogView3PlusTransformer2DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/cogview4_transformer2d.md b/docs/source/en/api/models/cogview4_transformer2d.md new file mode 100644 index 000000000000..e87fbc680968 --- /dev/null +++ b/docs/source/en/api/models/cogview4_transformer2d.md @@ -0,0 +1,30 @@ + + +# CogView4Transformer2DModel + +A Diffusion Transformer model for 2D data from [CogView4]() + +The model can be loaded with the following code snippet. + +```python +from diffusers import CogView4Transformer2DModel + +transformer = CogView4Transformer2DModel.from_pretrained("THUDM/CogView4-6B", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") +``` + +## CogView4Transformer2DModel + +[[autodoc]] CogView4Transformer2DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/consisid_transformer3d.md b/docs/source/en/api/models/consisid_transformer3d.md new file mode 100644 index 000000000000..0531d475d2fb --- /dev/null +++ b/docs/source/en/api/models/consisid_transformer3d.md @@ -0,0 +1,30 @@ + + +# ConsisIDTransformer3DModel + +A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) was introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://huggingface.co/papers/2411.17440) by Peking University & University of Rochester & etc. + +The model can be loaded with the following code snippet. 
+ +```python +from diffusers import ConsisIDTransformer3DModel + +transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") +``` + +## ConsisIDTransformer3DModel + +[[autodoc]] ConsisIDTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/consistency_decoder_vae.md b/docs/source/en/api/models/consistency_decoder_vae.md index 94a64820ebb1..fe039df7f9bf 100644 --- a/docs/source/en/api/models/consistency_decoder_vae.md +++ b/docs/source/en/api/models/consistency_decoder_vae.md @@ -1,4 +1,4 @@ - -# ControlNet +# ControlNetModel The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. @@ -21,7 +21,7 @@ The abstract from the paper is: ## Loading from the original format By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded -from the original format using [`FromOriginalControlnetMixin.from_single_file`] as follows: +from the original format using [`FromOriginalModelMixin.from_single_file`] as follows: ```py from diffusers import StableDiffusionControlNetPipeline, ControlNetModel @@ -29,7 +29,7 @@ from diffusers import StableDiffusionControlNetPipeline, ControlNetModel url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth" # can also be a local path controlnet = ControlNetModel.from_single_file(url) -url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path +url = "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet) ``` @@ -39,12 +39,4 @@ pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=contro ## ControlNetOutput -[[autodoc]] models.controlnet.ControlNetOutput - -## FlaxControlNetModel - -[[autodoc]] FlaxControlNetModel - -## FlaxControlNetOutput - -[[autodoc]] models.controlnet_flax.FlaxControlNetOutput +[[autodoc]] models.controlnets.controlnet.ControlNetOutput diff --git a/docs/source/en/api/models/controlnet_flux.md b/docs/source/en/api/models/controlnet_flux.md new file mode 100644 index 000000000000..6b230d90fba3 --- /dev/null +++ b/docs/source/en/api/models/controlnet_flux.md @@ -0,0 +1,45 @@ + + +# FluxControlNetModel + +FluxControlNetModel is an implementation of ControlNet for Flux.1. + +The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. + +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. 
ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +## Loading from the original format + +By default the [`FluxControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`]. + +```py +from diffusers import FluxControlNetPipeline +from diffusers.models import FluxControlNetModel, FluxMultiControlNetModel + +controlnet = FluxControlNetModel.from_pretrained("InstantX/FLUX.1-dev-Controlnet-Canny") +pipe = FluxControlNetPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", controlnet=controlnet) + +controlnet = FluxControlNetModel.from_pretrained("InstantX/FLUX.1-dev-Controlnet-Canny") +controlnet = FluxMultiControlNetModel([controlnet]) +pipe = FluxControlNetPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", controlnet=controlnet) +``` + +## FluxControlNetModel + +[[autodoc]] FluxControlNetModel + +## FluxControlNetOutput + +[[autodoc]] models.controlnet_flux.FluxControlNetOutput \ No newline at end of file diff --git a/docs/source/en/api/models/controlnet_hunyuandit.md b/docs/source/en/api/models/controlnet_hunyuandit.md new file mode 100644 index 000000000000..2ea5ab4b88d4 --- /dev/null +++ b/docs/source/en/api/models/controlnet_hunyuandit.md @@ -0,0 +1,37 @@ + + +# HunyuanDiT2DControlNetModel + +HunyuanDiT2DControlNetModel is an implementation of ControlNet for [Hunyuan-DiT](https://huggingface.co/papers/2405.08748). + +ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. + +With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. + +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. 
Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan). + +## Example For Loading HunyuanDiT2DControlNetModel + +```py +from diffusers import HunyuanDiT2DControlNetModel +import torch +controlnet = HunyuanDiT2DControlNetModel.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Pose", torch_dtype=torch.float16) +``` + +## HunyuanDiT2DControlNetModel + +[[autodoc]] HunyuanDiT2DControlNetModel \ No newline at end of file diff --git a/docs/source/en/api/models/controlnet_sana.md b/docs/source/en/api/models/controlnet_sana.md new file mode 100644 index 000000000000..fe7b184c901f --- /dev/null +++ b/docs/source/en/api/models/controlnet_sana.md @@ -0,0 +1,29 @@ + + +# SanaControlNetModel + +The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. + +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +This model was contributed by [ishan24](https://huggingface.co/ishan24). ❤️ +The original codebase can be found at [NVlabs/Sana](https://github.com/NVlabs/Sana), and you can find official ControlNet checkpoints on [Efficient-Large-Model's](https://huggingface.co/Efficient-Large-Model) Hub profile. + +## SanaControlNetModel +[[autodoc]] SanaControlNetModel + +## SanaControlNetOutput +[[autodoc]] models.controlnets.controlnet_sana.SanaControlNetOutput + diff --git a/docs/source/en/api/models/controlnet_sd3.md b/docs/source/en/api/models/controlnet_sd3.md new file mode 100644 index 000000000000..f665dde3a007 --- /dev/null +++ b/docs/source/en/api/models/controlnet_sd3.md @@ -0,0 +1,42 @@ + + +# SD3ControlNetModel + +SD3ControlNetModel is an implementation of ControlNet for Stable Diffusion 3. + +The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. 
+ +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +## Loading from the original format + +By default the [`SD3ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`]. + +```py +from diffusers import StableDiffusion3ControlNetPipeline +from diffusers.models import SD3ControlNetModel, SD3MultiControlNetModel + +controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny") +pipe = StableDiffusion3ControlNetPipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet) +``` + +## SD3ControlNetModel + +[[autodoc]] SD3ControlNetModel + +## SD3ControlNetOutput + +[[autodoc]] models.controlnets.controlnet_sd3.SD3ControlNetOutput + diff --git a/docs/source/en/api/models/controlnet_sparsectrl.md b/docs/source/en/api/models/controlnet_sparsectrl.md new file mode 100644 index 000000000000..b9e81dc57eeb --- /dev/null +++ b/docs/source/en/api/models/controlnet_sparsectrl.md @@ -0,0 +1,46 @@ + + +# SparseControlNetModel + +SparseControlNetModel is an implementation of ControlNet for [AnimateDiff](https://huggingface.co/papers/2307.04725). + +ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. + +The SparseCtrl version of ControlNet was introduced in [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://huggingface.co/papers/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. + +The abstract from the paper is: + +*The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. 
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).* + +## Example for loading SparseControlNetModel + +```python +import torch +from diffusers import SparseControlNetModel + +# fp32 variant in float16 +# 1. Scribble checkpoint +controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16) + +# 2. RGB checkpoint +controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-rgb", torch_dtype=torch.float16) + +# For loading fp16 variant, pass `variant="fp16"` as an additional parameter +``` + +## SparseControlNetModel + +[[autodoc]] SparseControlNetModel + +## SparseControlNetOutput + +[[autodoc]] models.controlnet_sparsectrl.SparseControlNetOutput diff --git a/docs/source/en/api/models/controlnet_union.md b/docs/source/en/api/models/controlnet_union.md new file mode 100644 index 000000000000..466718269758 --- /dev/null +++ b/docs/source/en/api/models/controlnet_union.md @@ -0,0 +1,35 @@ + + +# ControlNetUnionModel + +ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL. + +The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation. + +*We design a new architecture that can support 10+ control types in condition text-to-image generation and can generate high resolution images visually comparable with midjourney. The network is based on the original ControlNet architecture, we propose two new modules to: 1 Extend the original ControlNet to support different image conditions using the same network parameter. 2 Support multiple conditions input without increasing computation offload, which is especially important for designers who want to edit image in detail, different conditions use the same condition encoder, without adding extra computations or parameters.* + +## Loading + +By default the [`ControlNetUnionModel`] should be loaded with [`~ModelMixin.from_pretrained`]. + +```py +from diffusers import StableDiffusionXLControlNetUnionPipeline, ControlNetUnionModel + +controlnet = ControlNetUnionModel.from_pretrained("xinsir/controlnet-union-sdxl-1.0") +pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet) +``` + +## ControlNetUnionModel + +[[autodoc]] ControlNetUnionModel + diff --git a/docs/source/en/api/models/cosmos_transformer3d.md b/docs/source/en/api/models/cosmos_transformer3d.md new file mode 100644 index 000000000000..eb52c82e5f05 --- /dev/null +++ b/docs/source/en/api/models/cosmos_transformer3d.md @@ -0,0 +1,30 @@ + + +# CosmosTransformer3DModel + +A Diffusion Transformer model for 3D video-like data was introduced in [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA. + +The model can be loaded with the following code snippet. 
+ +```python +from diffusers import CosmosTransformer3DModel + +transformer = CosmosTransformer3DModel.from_pretrained("nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## CosmosTransformer3DModel + +[[autodoc]] CosmosTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/dit_transformer2d.md b/docs/source/en/api/models/dit_transformer2d.md new file mode 100644 index 000000000000..640bd31feeef --- /dev/null +++ b/docs/source/en/api/models/dit_transformer2d.md @@ -0,0 +1,19 @@ + + +# DiTTransformer2DModel + +A Transformer model for image-like data from [DiT](https://huggingface.co/papers/2212.09748). + +## DiTTransformer2DModel + +[[autodoc]] DiTTransformer2DModel diff --git a/docs/source/en/api/models/easyanimate_transformer3d.md b/docs/source/en/api/models/easyanimate_transformer3d.md new file mode 100644 index 000000000000..66670eb632d4 --- /dev/null +++ b/docs/source/en/api/models/easyanimate_transformer3d.md @@ -0,0 +1,30 @@ + + +# EasyAnimateTransformer3DModel + +A Diffusion Transformer model for 3D data from [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) was introduced by Alibaba PAI. + +The model can be loaded with the following code snippet. + +```python +from diffusers import EasyAnimateTransformer3DModel + +transformer = EasyAnimateTransformer3DModel.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16).to("cuda") +``` + +## EasyAnimateTransformer3DModel + +[[autodoc]] EasyAnimateTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/flux_transformer.md b/docs/source/en/api/models/flux_transformer.md new file mode 100644 index 000000000000..d1ccb1a242b3 --- /dev/null +++ b/docs/source/en/api/models/flux_transformer.md @@ -0,0 +1,19 @@ + + +# FluxTransformer2DModel + +A Transformer model for image-like data from [Flux](https://blackforestlabs.ai/announcing-black-forest-labs/). + +## FluxTransformer2DModel + +[[autodoc]] FluxTransformer2DModel diff --git a/docs/source/en/api/models/hidream_image_transformer.md b/docs/source/en/api/models/hidream_image_transformer.md new file mode 100644 index 000000000000..b562c2f8543c --- /dev/null +++ b/docs/source/en/api/models/hidream_image_transformer.md @@ -0,0 +1,46 @@ + + +# HiDreamImageTransformer2DModel + +A Transformer model for image-like data from [HiDream-I1](https://huggingface.co/HiDream-ai). + +The model can be loaded with the following code snippet. 
+ +```python +from diffusers import HiDreamImageTransformer2DModel + +transformer = HiDreamImageTransformer2DModel.from_pretrained("HiDream-ai/HiDream-I1-Full", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## Loading GGUF quantized checkpoints for HiDream-I1 + +GGUF checkpoints for the `HiDreamImageTransformer2DModel` can be loaded using `~FromOriginalModelMixin.from_single_file` + +```python +import torch +from diffusers import GGUFQuantizationConfig, HiDreamImageTransformer2DModel + +ckpt_path = "https://huggingface.co/city96/HiDream-I1-Dev-gguf/blob/main/hidream-i1-dev-Q2_K.gguf" +transformer = HiDreamImageTransformer2DModel.from_single_file( + ckpt_path, + quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), + torch_dtype=torch.bfloat16 +) +``` + +## HiDreamImageTransformer2DModel + +[[autodoc]] HiDreamImageTransformer2DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/hunyuan_transformer2d.md b/docs/source/en/api/models/hunyuan_transformer2d.md new file mode 100644 index 000000000000..4e2d38f3233a --- /dev/null +++ b/docs/source/en/api/models/hunyuan_transformer2d.md @@ -0,0 +1,20 @@ + + +# HunyuanDiT2DModel + +A Diffusion Transformer model for 2D data from [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT). + +## HunyuanDiT2DModel + +[[autodoc]] HunyuanDiT2DModel + diff --git a/docs/source/en/api/models/hunyuan_video_transformer_3d.md b/docs/source/en/api/models/hunyuan_video_transformer_3d.md new file mode 100644 index 000000000000..77d30e5553bc --- /dev/null +++ b/docs/source/en/api/models/hunyuan_video_transformer_3d.md @@ -0,0 +1,30 @@ + + +# HunyuanVideoTransformer3DModel + +A Diffusion Transformer model for 3D video-like data was introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent. + +The model can be loaded with the following code snippet. + +```python +from diffusers import HunyuanVideoTransformer3DModel + +transformer = HunyuanVideoTransformer3DModel.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## HunyuanVideoTransformer3DModel + +[[autodoc]] HunyuanVideoTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/latte_transformer3d.md b/docs/source/en/api/models/latte_transformer3d.md new file mode 100644 index 000000000000..6182f403ea48 --- /dev/null +++ b/docs/source/en/api/models/latte_transformer3d.md @@ -0,0 +1,19 @@ + + +## LatteTransformer3DModel + +A Diffusion Transformer model for 3D data from [Latte](https://github.com/Vchitect/Latte). + +## LatteTransformer3DModel + +[[autodoc]] LatteTransformer3DModel diff --git a/docs/source/en/api/models/ltx_video_transformer3d.md b/docs/source/en/api/models/ltx_video_transformer3d.md new file mode 100644 index 000000000000..5a2a1af9d821 --- /dev/null +++ b/docs/source/en/api/models/ltx_video_transformer3d.md @@ -0,0 +1,30 @@ + + +# LTXVideoTransformer3DModel + +A Diffusion Transformer model for 3D data from [LTX](https://huggingface.co/Lightricks/LTX-Video) was introduced by Lightricks. + +The model can be loaded with the following code snippet. 
+ +```python +from diffusers import LTXVideoTransformer3DModel + +transformer = LTXVideoTransformer3DModel.from_pretrained("Lightricks/LTX-Video", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") +``` + +## LTXVideoTransformer3DModel + +[[autodoc]] LTXVideoTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/lumina2_transformer2d.md b/docs/source/en/api/models/lumina2_transformer2d.md new file mode 100644 index 000000000000..2c85325f73f5 --- /dev/null +++ b/docs/source/en/api/models/lumina2_transformer2d.md @@ -0,0 +1,30 @@ + + +# Lumina2Transformer2DModel + +A Diffusion Transformer model for 3D video-like data was introduced in [Lumina Image 2.0](https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0) by Alpha-VLLM. + +The model can be loaded with the following code snippet. + +```python +from diffusers import Lumina2Transformer2DModel + +transformer = Lumina2Transformer2DModel.from_pretrained("Alpha-VLLM/Lumina-Image-2.0", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## Lumina2Transformer2DModel + +[[autodoc]] Lumina2Transformer2DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/lumina_nextdit2d.md b/docs/source/en/api/models/lumina_nextdit2d.md new file mode 100644 index 000000000000..1b898b2cda76 --- /dev/null +++ b/docs/source/en/api/models/lumina_nextdit2d.md @@ -0,0 +1,20 @@ + + +# LuminaNextDiT2DModel + +A Next Version of Diffusion Transformer model for 2D data from [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X). + +## LuminaNextDiT2DModel + +[[autodoc]] LuminaNextDiT2DModel + diff --git a/docs/source/en/api/models/mochi_transformer3d.md b/docs/source/en/api/models/mochi_transformer3d.md new file mode 100644 index 000000000000..6c8c21ce91cd --- /dev/null +++ b/docs/source/en/api/models/mochi_transformer3d.md @@ -0,0 +1,30 @@ + + +# MochiTransformer3DModel + +A Diffusion Transformer model for 3D video-like data was introduced in [Mochi-1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Genmo. + +The model can be loaded with the following code snippet. + +```python +from diffusers import MochiTransformer3DModel + +transformer = MochiTransformer3DModel.from_pretrained("genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.float16).to("cuda") +``` + +## MochiTransformer3DModel + +[[autodoc]] MochiTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/omnigen_transformer.md b/docs/source/en/api/models/omnigen_transformer.md new file mode 100644 index 000000000000..a2584bc8e76d --- /dev/null +++ b/docs/source/en/api/models/omnigen_transformer.md @@ -0,0 +1,30 @@ + + +# OmniGenTransformer2DModel + +A Transformer model that accepts multimodal instructions to generate images for [OmniGen](https://github.com/VectorSpaceLab/OmniGen/). + +The abstract from the paper is: + +*The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. 
OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.* + +```python +import torch +from diffusers import OmniGenTransformer2DModel + +transformer = OmniGenTransformer2DModel.from_pretrained("Shitao/OmniGen-v1-diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## OmniGenTransformer2DModel + +[[autodoc]] OmniGenTransformer2DModel diff --git a/docs/source/en/api/models/overview.md b/docs/source/en/api/models/overview.md index 62e75f26b5b0..eb9722739f99 100644 --- a/docs/source/en/api/models/overview.md +++ b/docs/source/en/api/models/overview.md @@ -1,4 +1,4 @@ - + +# PixArtTransformer2DModel + +A Transformer model for image-like data from [PixArt-Alpha](https://huggingface.co/papers/2310.00426) and [PixArt-Sigma](https://huggingface.co/papers/2403.04692). + +## PixArtTransformer2DModel + +[[autodoc]] PixArtTransformer2DModel diff --git a/docs/source/en/api/models/prior_transformer.md b/docs/source/en/api/models/prior_transformer.md index 21b3918571de..72bb418abb0a 100644 --- a/docs/source/en/api/models/prior_transformer.md +++ b/docs/source/en/api/models/prior_transformer.md @@ -1,4 +1,4 @@ - -# Prior Transformer +# PriorTransformer The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process. diff --git a/docs/source/en/api/models/qwenimage_transformer2d.md b/docs/source/en/api/models/qwenimage_transformer2d.md new file mode 100644 index 000000000000..c78623084e1c --- /dev/null +++ b/docs/source/en/api/models/qwenimage_transformer2d.md @@ -0,0 +1,28 @@ + + +# QwenImageTransformer2DModel + +The model can be loaded with the following code snippet. 
+ +```python +from diffusers import QwenImageTransformer2DModel + +transformer = QwenImageTransformer2DModel.from_pretrained("Qwen/QwenImage-20B", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## QwenImageTransformer2DModel + +[[autodoc]] QwenImageTransformer2DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/sana_transformer2d.md b/docs/source/en/api/models/sana_transformer2d.md new file mode 100644 index 000000000000..e3e5fde3a79e --- /dev/null +++ b/docs/source/en/api/models/sana_transformer2d.md @@ -0,0 +1,34 @@ + + +# SanaTransformer2DModel + +A Diffusion Transformer model for 2D data from [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) was introduced from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han. + +The abstract from the paper is: + +*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.* + +The model can be loaded with the following code snippet. + +```python +from diffusers import SanaTransformer2DModel + +transformer = SanaTransformer2DModel.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## SanaTransformer2DModel + +[[autodoc]] SanaTransformer2DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/sd3_transformer2d.md b/docs/source/en/api/models/sd3_transformer2d.md new file mode 100644 index 000000000000..f4fc4c65826c --- /dev/null +++ b/docs/source/en/api/models/sd3_transformer2d.md @@ -0,0 +1,19 @@ + + +# SD3 Transformer Model + +The Transformer model introduced in [Stable Diffusion 3](https://hf.co/papers/2403.03206). Its novelty lies in the MMDiT transformer block. 
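+
+The model can be loaded with the following code snippet; a short sketch that assumes the [`stabilityai/stable-diffusion-3-medium-diffusers`](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) checkpoint and its `transformer` subfolder.
+
+```python
+import torch
+from diffusers import SD3Transformer2DModel
+
+# Load only the MMDiT transformer component of Stable Diffusion 3
+transformer = SD3Transformer2DModel.from_pretrained(
+    "stabilityai/stable-diffusion-3-medium-diffusers", subfolder="transformer", torch_dtype=torch.float16
+)
+```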
+ +## SD3Transformer2DModel + +[[autodoc]] SD3Transformer2DModel \ No newline at end of file diff --git a/docs/source/en/api/models/skyreels_v2_transformer_3d.md b/docs/source/en/api/models/skyreels_v2_transformer_3d.md new file mode 100644 index 000000000000..c1c8c2c7bcce --- /dev/null +++ b/docs/source/en/api/models/skyreels_v2_transformer_3d.md @@ -0,0 +1,30 @@ + + +# SkyReelsV2Transformer3DModel + +A Diffusion Transformer model for 3D video-like data was introduced in [SkyReels-V2](https://github.com/SkyworkAI/SkyReels-V2) by the Skywork AI. + +The model can be loaded with the following code snippet. + +```python +from diffusers import SkyReelsV2Transformer3DModel + +transformer = SkyReelsV2Transformer3DModel.from_pretrained("Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## SkyReelsV2Transformer3DModel + +[[autodoc]] SkyReelsV2Transformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/stable_audio_transformer.md b/docs/source/en/api/models/stable_audio_transformer.md new file mode 100644 index 000000000000..50a936b43ef4 --- /dev/null +++ b/docs/source/en/api/models/stable_audio_transformer.md @@ -0,0 +1,19 @@ + + +# StableAudioDiTModel + +A Transformer model for audio waveforms from [Stable Audio Open](https://huggingface.co/papers/2407.14358). + +## StableAudioDiTModel + +[[autodoc]] StableAudioDiTModel diff --git a/docs/source/en/api/models/stable_cascade_unet.md b/docs/source/en/api/models/stable_cascade_unet.md new file mode 100644 index 000000000000..36bb38a8cb04 --- /dev/null +++ b/docs/source/en/api/models/stable_cascade_unet.md @@ -0,0 +1,19 @@ + + +# StableCascadeUNet + +A UNet model from the [Stable Cascade pipeline](../pipelines/stable_cascade.md). + +## StableCascadeUNet + +[[autodoc]] models.unets.unet_stable_cascade.StableCascadeUNet diff --git a/docs/source/en/api/models/transformer2d.md b/docs/source/en/api/models/transformer2d.md index a56a1379a27d..d8e0a858b0e7 100644 --- a/docs/source/en/api/models/transformer2d.md +++ b/docs/source/en/api/models/transformer2d.md @@ -1,4 +1,4 @@ - -# Transformer2D +# Transformer2DModel A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs. @@ -22,11 +22,8 @@ When the input is **continuous**: When the input is **discrete**: - - -It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked. - - +> [!TIP] +> It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked. 1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings. 2. Apply the Transformer blocks in the standard way. @@ -38,4 +35,4 @@ It is assumed one of the input classes is the masked latent pixel. 
The predicted ## Transformer2DModelOutput -[[autodoc]] models.transformers.transformer_2d.Transformer2DModelOutput +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/models/transformer_temporal.md b/docs/source/en/api/models/transformer_temporal.md index ff5a497c03f5..e89afeeffeb3 100644 --- a/docs/source/en/api/models/transformer_temporal.md +++ b/docs/source/en/api/models/transformer_temporal.md @@ -1,4 +1,4 @@ - -# Transformer Temporal +# TransformerTemporalModel A Transformer model for video-like data. diff --git a/docs/source/en/api/models/unet-motion.md b/docs/source/en/api/models/unet-motion.md index 9396f6477bf1..5977be7d9a4e 100644 --- a/docs/source/en/api/models/unet-motion.md +++ b/docs/source/en/api/models/unet-motion.md @@ -1,4 +1,4 @@ - + +# WanTransformer3DModel + +A Diffusion Transformer model for 3D video-like data was introduced in [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team. + +The model can be loaded with the following code snippet. + +```python +from diffusers import WanTransformer3DModel + +transformer = WanTransformer3DModel.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) +``` + +## WanTransformer3DModel + +[[autodoc]] WanTransformer3DModel + +## Transformer2DModelOutput + +[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/docs/source/en/api/modular_diffusers/guiders.md b/docs/source/en/api/modular_diffusers/guiders.md new file mode 100644 index 000000000000..a24eb7220749 --- /dev/null +++ b/docs/source/en/api/modular_diffusers/guiders.md @@ -0,0 +1,39 @@ +# Guiders + +Guiders are components in Modular Diffusers that control how the diffusion process is guided during generation. They implement various guidance techniques to improve generation quality and control. 
+ +## BaseGuidance + +[[autodoc]] diffusers.guiders.guider_utils.BaseGuidance + +## ClassifierFreeGuidance + +[[autodoc]] diffusers.guiders.classifier_free_guidance.ClassifierFreeGuidance + +## ClassifierFreeZeroStarGuidance + +[[autodoc]] diffusers.guiders.classifier_free_zero_star_guidance.ClassifierFreeZeroStarGuidance + +## SkipLayerGuidance + +[[autodoc]] diffusers.guiders.skip_layer_guidance.SkipLayerGuidance + +## SmoothedEnergyGuidance + +[[autodoc]] diffusers.guiders.smoothed_energy_guidance.SmoothedEnergyGuidance + +## PerturbedAttentionGuidance + +[[autodoc]] diffusers.guiders.perturbed_attention_guidance.PerturbedAttentionGuidance + +## AdaptiveProjectedGuidance + +[[autodoc]] diffusers.guiders.adaptive_projected_guidance.AdaptiveProjectedGuidance + +## AutoGuidance + +[[autodoc]] diffusers.guiders.auto_guidance.AutoGuidance + +## TangentialClassifierFreeGuidance + +[[autodoc]] diffusers.guiders.tangential_classifier_free_guidance.TangentialClassifierFreeGuidance diff --git a/docs/source/en/api/modular_diffusers/pipeline.md b/docs/source/en/api/modular_diffusers/pipeline.md new file mode 100644 index 000000000000..f60261ea6672 --- /dev/null +++ b/docs/source/en/api/modular_diffusers/pipeline.md @@ -0,0 +1,5 @@ +# Pipeline + +## ModularPipeline + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.ModularPipeline diff --git a/docs/source/en/api/modular_diffusers/pipeline_blocks.md b/docs/source/en/api/modular_diffusers/pipeline_blocks.md new file mode 100644 index 000000000000..8ad581e679ac --- /dev/null +++ b/docs/source/en/api/modular_diffusers/pipeline_blocks.md @@ -0,0 +1,17 @@ +# Pipeline blocks + +## ModularPipelineBlocks + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.ModularPipelineBlocks + +## SequentialPipelineBlocks + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.SequentialPipelineBlocks + +## LoopSequentialPipelineBlocks + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.LoopSequentialPipelineBlocks + +## AutoPipelineBlocks + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.AutoPipelineBlocks \ No newline at end of file diff --git a/docs/source/en/api/modular_diffusers/pipeline_components.md b/docs/source/en/api/modular_diffusers/pipeline_components.md new file mode 100644 index 000000000000..2d8e10aef6d8 --- /dev/null +++ b/docs/source/en/api/modular_diffusers/pipeline_components.md @@ -0,0 +1,17 @@ +# Components and configs + +## ComponentSpec + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.ComponentSpec + +## ConfigSpec + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.ConfigSpec + +## ComponentsManager + +[[autodoc]] diffusers.modular_pipelines.components_manager.ComponentsManager + +## InsertableDict + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline_utils.InsertableDict \ No newline at end of file diff --git a/docs/source/en/api/modular_diffusers/pipeline_states.md b/docs/source/en/api/modular_diffusers/pipeline_states.md new file mode 100644 index 000000000000..341d18ecb41c --- /dev/null +++ b/docs/source/en/api/modular_diffusers/pipeline_states.md @@ -0,0 +1,9 @@ +# Pipeline states + +## PipelineState + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.PipelineState + +## BlockState + +[[autodoc]] diffusers.modular_pipelines.modular_pipeline.BlockState \ No newline at end of file diff --git a/docs/source/en/api/normalization.md b/docs/source/en/api/normalization.md index ef4b694a4d85..fa703b19871b 100644 --- a/docs/source/en/api/normalization.md +++ 
b/docs/source/en/api/normalization.md @@ -1,4 +1,4 @@ - + +# Parallelism + +Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times. Refer to the [Distributed inference](../training/distributed_inference) guide to learn more. + +## ParallelConfig + +[[autodoc]] ParallelConfig + +## ContextParallelConfig + +[[autodoc]] ContextParallelConfig + +[[autodoc]] hooks.apply_context_parallel diff --git a/docs/source/en/api/pipelines/allegro.md b/docs/source/en/api/pipelines/allegro.md new file mode 100644 index 000000000000..a981fb1f94f7 --- /dev/null +++ b/docs/source/en/api/pipelines/allegro.md @@ -0,0 +1,76 @@ + + +# Allegro + +[Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) from RhymesAI, by Yuan Zhou, Qiuyue Wang, Yuxuan Cai, Huan Yang. + +The abstract from the paper is: + +*Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models. However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: https://github.com/rhymes-ai/Allegro , Model: https://huggingface.co/rhymes-ai/Allegro , Gallery: https://rhymes.ai/allegro_gallery .* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +## Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on video quality depending on the video model. + +Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`AllegroPipeline`] for inference with bitsandbytes.
+ +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AllegroTransformer3DModel, AllegroPipeline +from diffusers.utils import export_to_video +from transformers import BitsAndBytesConfig, T5EncoderModel + +quant_config = BitsAndBytesConfig(load_in_8bit=True) +text_encoder_8bit = T5EncoderModel.from_pretrained( + "rhymes-ai/Allegro", + subfolder="text_encoder", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = AllegroTransformer3DModel.from_pretrained( + "rhymes-ai/Allegro", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +pipeline = AllegroPipeline.from_pretrained( + "rhymes-ai/Allegro", + text_encoder=text_encoder_8bit, + transformer=transformer_8bit, + torch_dtype=torch.float16, + device_map="balanced", +) + +prompt = ( + "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, " + "the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this " + "location might be a popular spot for docking fishing boats." +) +video = pipeline(prompt, guidance_scale=7.5, max_sequence_length=512).frames[0] +export_to_video(video, "harbor.mp4", fps=15) +``` + +## AllegroPipeline + +[[autodoc]] AllegroPipeline + - all + - __call__ + +## AllegroPipelineOutput + +[[autodoc]] pipelines.allegro.pipeline_output.AllegroPipelineOutput diff --git a/docs/source/en/api/pipelines/amused.md b/docs/source/en/api/pipelines/amused.md index d25e20dec55e..ad292abca2cc 100644 --- a/docs/source/en/api/pipelines/amused.md +++ b/docs/source/en/api/pipelines/amused.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # aMUSEd aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2401.01808) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. -Amused is a lightweight text to image model based off of the [MUSE](https://arxiv.org/abs/2301.00704) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once. +Amused is a lightweight text-to-image model based on the [MUSE](https://huggingface.co/papers/2301.00704) architecture. Amused is particularly useful in applications that require a lightweight and fast model, such as generating many images quickly at once. -Amused is a vqvae token based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with muse, it uses the smaller text encoder CLIP-L/14 instead of t5-xxl. Due to its small parameter count and few forward pass generation process, amused can generate many images quickly. This benefit is seen particularly at larger batch sizes. +Amused is a VQ-VAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with MUSE, it uses the smaller text encoder CLIP-L/14 instead of T5-XXL. Due to its small parameter count and few-forward-pass generation process, Amused can generate many images quickly. This benefit is seen particularly at larger batch sizes.
The abstract from the paper is: diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md index 913529e6ebdc..f0188f3c36fb 100644 --- a/docs/source/en/api/pipelines/animatediff.md +++ b/docs/source/en/api/pipelines/animatediff.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Attend-and-Excite Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation. @@ -20,11 +23,8 @@ The abstract from the paper is: You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite). - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## StableDiffusionAttendAndExcitePipeline diff --git a/docs/source/en/api/pipelines/audioldm.md b/docs/source/en/api/pipelines/audioldm.md index 95d41b9569f5..c8073a14ef0a 100644 --- a/docs/source/en/api/pipelines/audioldm.md +++ b/docs/source/en/api/pipelines/audioldm.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # AudioLDM AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM @@ -35,11 +38,8 @@ During inference: * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
- - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## AudioLDMPipeline [[autodoc]] AudioLDMPipeline diff --git a/docs/source/en/api/pipelines/audioldm2.md b/docs/source/en/api/pipelines/audioldm2.md index ac4459c60706..45a9002ea070 100644 --- a/docs/source/en/api/pipelines/audioldm2.md +++ b/docs/source/en/api/pipelines/audioldm2.md @@ -1,4 +1,4 @@ - + +# AuraFlow + +AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark. + +It was developed by the Fal team, and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/). + +> [!TIP] +> AuraFlow can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. + +## Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on image quality depending on the model. + +Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`AuraFlowPipeline`] for inference with bitsandbytes.
+ +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AuraFlowTransformer2DModel, AuraFlowPipeline +from transformers import BitsAndBytesConfig, T5EncoderModel + +quant_config = BitsAndBytesConfig(load_in_8bit=True) +text_encoder_8bit = T5EncoderModel.from_pretrained( + "fal/AuraFlow", + subfolder="text_encoder", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = AuraFlowTransformer2DModel.from_pretrained( + "fal/AuraFlow", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +pipeline = AuraFlowPipeline.from_pretrained( + "fal/AuraFlow", + text_encoder=text_encoder_8bit, + transformer=transformer_8bit, + torch_dtype=torch.float16, + device_map="balanced", +) + +prompt = "a tiny astronaut hatching from an egg on the moon" +image = pipeline(prompt).images[0] +image.save("auraflow.png") +``` + +Loading [GGUF checkpoints](https://huggingface.co/docs/diffusers/quantization/gguf) is also supported: + +```py +import torch +from diffusers import ( + AuraFlowPipeline, + GGUFQuantizationConfig, + AuraFlowTransformer2DModel, +) + +transformer = AuraFlowTransformer2DModel.from_single_file( + "https://huggingface.co/city96/AuraFlow-v0.3-gguf/blob/main/aura_flow_0.3-Q2_K.gguf", + quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), + torch_dtype=torch.bfloat16, +) + +pipeline = AuraFlowPipeline.from_pretrained( + "fal/AuraFlow-v0.3", + transformer=transformer, + torch_dtype=torch.bfloat16, +) + +prompt = "a cute pony in a field of flowers" +image = pipeline(prompt).images[0] +image.save("auraflow.png") +``` + +## Support for `torch.compile()` + +AuraFlow can be compiled with `torch.compile()` to reduce inference latency, even across different resolutions. First, install PyTorch nightly following the instructions from [here](https://pytorch.org/). The snippet below shows the changes needed to enable this: + +```diff ++ torch.fx.experimental._config.use_duck_shape = False ++ pipeline.transformer = torch.compile( + pipeline.transformer, fullgraph=True, dynamic=True +) +``` + +Setting `use_duck_shape` to `False` instructs the compiler not to reuse the same symbolic variable for input sizes that happen to be equal. For more details, check out [this comment](https://github.com/huggingface/diffusers/pull/11327#discussion_r2047659790). + +This enables speed improvements ranging from 100% (at low resolutions) to 30% (at 1536x1536 resolution). + +Thanks to [AstraliteHeart](https://github.com/huggingface/diffusers/pull/11297/), who helped us rewrite the [`AuraFlowTransformer2DModel`] class so that the above works for different resolutions ([PR](https://github.com/huggingface/diffusers/pull/11297/)). + +## AuraFlowPipeline + +[[autodoc]] AuraFlowPipeline + - all + - __call__ \ No newline at end of file diff --git a/docs/source/en/api/pipelines/auto_pipeline.md b/docs/source/en/api/pipelines/auto_pipeline.md index ce1e18ed5ea1..a2bd1c5c3a72 100644 --- a/docs/source/en/api/pipelines/auto_pipeline.md +++ b/docs/source/en/api/pipelines/auto_pipeline.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.
+ # BLIP-Diffusion -BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation. +BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://huggingface.co/papers/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation. The abstract from the paper is: @@ -23,11 +26,8 @@ The original codebase can be found at [salesforce/LAVIS](https://github.com/sale `BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/). - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## BlipDiffusionPipeline diff --git a/docs/source/en/api/pipelines/bria_3_2.md b/docs/source/en/api/pipelines/bria_3_2.md new file mode 100644 index 000000000000..059fa01f9f83 --- /dev/null +++ b/docs/source/en/api/pipelines/bria_3_2.md @@ -0,0 +1,44 @@ + + +# Bria 3.2 + +Bria 3.2 is the next-generation commercial-ready text-to-image model. With just 4 billion parameters, it provides exceptional aesthetics and text rendering, evaluated to provide results on par with leading open-source models while outperforming other licensed models. +In addition to being built entirely on licensed data, 3.2 provides several advantages for enterprise and commercial use: + +- Efficient Compute: the model is 3x smaller than equivalent models on the market (4B parameters vs. 12B parameters for other open-source models) +- Architecture Consistency: same architecture as 3.1, ideal for users looking to upgrade without disruption. +- Fine-tuning Speedup: 2x faster fine-tuning on L40S and A100. + +Original model checkpoints for Bria 3.2 can be found [here](https://huggingface.co/briaai/BRIA-3.2). +The GitHub repo for Bria 3.2 can be found [here](https://github.com/Bria-AI/BRIA-3.2). + +If you want to learn more about the Bria platform and get free trial access, please visit [bria.ai](https://bria.ai). + + +## Usage + +_As the model is gated, before using it with diffusers you first need to go to the [Bria 3.2 Hugging Face page](https://huggingface.co/briaai/BRIA-3.2), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you've accepted the gate._ + +Use the command below to log in: + +```bash +hf auth login +``` + + +## BriaPipeline + +[[autodoc]] BriaPipeline + - all + - __call__ + diff --git a/docs/source/en/api/pipelines/chroma.md b/docs/source/en/api/pipelines/chroma.md new file mode 100644 index 000000000000..df03fbb325d7 --- /dev/null +++ b/docs/source/en/api/pipelines/chroma.md @@ -0,0 +1,100 @@ + + +# Chroma + +
+ LoRA + MPS +
+ +Chroma is a text-to-image generation model based on Flux. + +Original model checkpoints for Chroma can be found [here](https://huggingface.co/lodestones/Chroma). + +> [!TIP] +> Chroma can use all the same optimizations as Flux. + +## Inference + +The Diffusers version of Chroma is based on the [`unlocked-v37`](https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors) version of the original model, which is available in the [Chroma repository](https://huggingface.co/lodestones/Chroma). + +```python +import torch +from diffusers import ChromaPipeline + +pipe = ChromaPipeline.from_pretrained("lodestones/Chroma", torch_dtype=torch.bfloat16) +pipe.enable_model_cpu_offload() + +prompt = [ + "A high-fashion close-up portrait of a blonde woman in clear sunglasses. The image uses a bold teal and red color split for dramatic lighting. The background is a simple teal-green. The photo is sharp and well-composed, and is designed for viewing with anaglyph 3D glasses for optimal effect. It looks professionally done." +] +negative_prompt = ["low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"] + +image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + generator=torch.Generator("cpu").manual_seed(433), + num_inference_steps=40, + guidance_scale=3.0, + num_images_per_prompt=1, +).images[0] +image.save("chroma.png") +``` + +## Loading from a single file + +To use updated model checkpoints that are not in the Diffusers format, you can use the `ChromaTransformer2DModel` class to load the model from a single file in the original format. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community. + +The following example demonstrates how to run Chroma from a single file. + +```python +import torch +from diffusers import ChromaTransformer2DModel, ChromaPipeline + +model_id = "lodestones/Chroma" +dtype = torch.bfloat16 + +transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors", torch_dtype=dtype) + +pipe = ChromaPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=dtype) +pipe.enable_model_cpu_offload() + +prompt = [ + "A high-fashion close-up portrait of a blonde woman in clear sunglasses. The image uses a bold teal and red color split for dramatic lighting. The background is a simple teal-green. The photo is sharp and well-composed, and is designed for viewing with anaglyph 3D glasses for optimal effect. It looks professionally done." +] +negative_prompt = ["low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"] + +image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + generator=torch.Generator("cpu").manual_seed(433), + num_inference_steps=40, + guidance_scale=3.0, +).images[0] + +image.save("chroma-single-file.png") +``` + +## ChromaPipeline + +[[autodoc]] ChromaPipeline + - all + - __call__ + +## ChromaImg2ImgPipeline + +[[autodoc]] ChromaImg2ImgPipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/cogvideox.md b/docs/source/en/api/pipelines/cogvideox.md new file mode 100644 index 000000000000..ec673e0763c5 --- /dev/null +++ b/docs/source/en/api/pipelines/cogvideox.md @@ -0,0 +1,217 @@ +
+
+ + LoRA + +
+
+ +# CogVideoX + +[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos. + +You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection. + +> [!TIP] +> Click on the CogVideoX models in the right sidebar for more examples of other video generation tasks. + +The example below demonstrates how to generate a video optimized for memory or inference speed. + + + + +Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. + +The quantized CogVideoX 5B model below requires ~16GB of VRAM. + +```py +import torch +from diffusers import CogVideoXPipeline, AutoModel +from diffusers.quantizers import PipelineQuantizationConfig +from diffusers.utils import export_to_video + +# quantize weights to int8 with torchao +pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="torchao", + quant_kwargs={"quant_type": "int8wo"}, + components_to_quantize=["transformer"] +) + +# fp8 layerwise weight-casting +transformer = AutoModel.from_pretrained( + "THUDM/CogVideoX-5b", + subfolder="transformer", + torch_dtype=torch.bfloat16 +) +transformer.enable_layerwise_casting( + storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16 +) + +pipeline = CogVideoXPipeline.from_pretrained( + "THUDM/CogVideoX-5b", + transformer=transformer, + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16 +) + +# model-offloading (handles device placement, so no explicit .to("cuda") is needed) +pipeline.enable_model_cpu_offload() + +prompt = """ +A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. +The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. +Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, +with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting. +""" + +video = pipeline( + prompt=prompt, + guidance_scale=6, + num_inference_steps=50 +).frames[0] +export_to_video(video, "output.mp4", fps=8) +``` + + + + +[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. + +The average inference time with torch.compile on an 80GB A100 is 76.27 seconds compared to 96.89 seconds for an uncompiled model.
+ +```py +import torch +from diffusers import CogVideoXPipeline +from diffusers.utils import export_to_video + +pipeline = CogVideoXPipeline.from_pretrained( + "THUDM/CogVideoX-2b", + torch_dtype=torch.float16 +).to("cuda") + +# torch.compile +pipeline.transformer.to(memory_format=torch.channels_last) +pipeline.transformer = torch.compile( + pipeline.transformer, mode="max-autotune", fullgraph=True +) + +prompt = """ +A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. +The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. +Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, +with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting. +""" + +video = pipeline( + prompt=prompt, + guidance_scale=6, + num_inference_steps=50 +).frames[0] +export_to_video(video, "output.mp4", fps=8) +``` + + + + +## Notes + +- CogVideoX supports LoRAs with [`~loaders.CogVideoXLoraLoaderMixin.load_lora_weights`]. + +
+ +Show example code + + ```py + import torch + from diffusers import CogVideoXPipeline + from diffusers.utils import export_to_video + + pipeline = CogVideoXPipeline.from_pretrained( + "THUDM/CogVideoX-5b", + torch_dtype=torch.bfloat16 + ) + + # load LoRA weights + pipeline.load_lora_weights("finetrainers/CogVideoX-1.5-crush-smol-v0", adapter_name="crush-lora") + pipeline.set_adapters("crush-lora", 0.9) + + # model-offloading (handles device placement, so no explicit .to("cuda") is needed) + pipeline.enable_model_cpu_offload() + + prompt = """ + PIKA_CRUSH A large metal cylinder is seen pressing down on a pile of Oreo cookies, flattening them as if they were under a hydraulic press. + """ + negative_prompt = "inconsistent motion, blurry motion, worse quality, degenerate outputs, deformed outputs" + + video = pipeline( + prompt=prompt, + negative_prompt=negative_prompt, + num_frames=81, + height=480, + width=768, + num_inference_steps=50 + ).frames[0] + export_to_video(video, "output.mp4", fps=16) + ```
+ +- The text-to-video (T2V) checkpoints work best with a resolution of 1360x768 because that was the resolution they were pretrained on. + +- The image-to-video (I2V) checkpoints work with multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. Both height and width must be divisible by 16. + +- Both T2V and I2V checkpoints work best with 81 and 161 frames. It is recommended to export the generated video at 16fps. + +- Refer to the table below to view memory usage when various memory-saving techniques are enabled. + + | method | memory usage (enabled) | memory usage (disabled) | + |---|---|---| + | enable_model_cpu_offload | 19GB | 33GB | + | enable_sequential_cpu_offload | <4GB | ~33GB (very slow inference speed) | + | enable_tiling | 11GB (with enable_model_cpu_offload) | --- | + +## CogVideoXPipeline + +[[autodoc]] CogVideoXPipeline + - all + - __call__ + +## CogVideoXImageToVideoPipeline + +[[autodoc]] CogVideoXImageToVideoPipeline + - all + - __call__ + +## CogVideoXVideoToVideoPipeline + +[[autodoc]] CogVideoXVideoToVideoPipeline + - all + - __call__ + +## CogVideoXFunControlPipeline + +[[autodoc]] CogVideoXFunControlPipeline + - all + - __call__ + +## CogVideoXPipelineOutput + +[[autodoc]] pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput diff --git a/docs/source/en/api/pipelines/cogview3.md b/docs/source/en/api/pipelines/cogview3.md new file mode 100644 index 000000000000..5ee02e1a7039 --- /dev/null +++ b/docs/source/en/api/pipelines/cogview3.md @@ -0,0 +1,37 @@ + + +# CogView3Plus + +[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) from Tsinghua University & ZhipuAI, by Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang. + +The abstract from the paper is: + +*Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges, in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM).
The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM). + +## CogView3PlusPipeline + +[[autodoc]] CogView3PlusPipeline + - all + - __call__ + +## CogView3PipelineOutput + +[[autodoc]] pipelines.cogview3.pipeline_output.CogView3PipelineOutput diff --git a/docs/source/en/api/pipelines/cogview4.md b/docs/source/en/api/pipelines/cogview4.md new file mode 100644 index 000000000000..7857dc8c9476 --- /dev/null +++ b/docs/source/en/api/pipelines/cogview4.md @@ -0,0 +1,31 @@ + + +# CogView4 + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM). + +## CogView4Pipeline + +[[autodoc]] CogView4Pipeline + - all + - __call__ + +## CogView4PipelineOutput + +[[autodoc]] pipelines.cogview4.pipeline_output.CogView4PipelineOutput diff --git a/docs/source/en/api/pipelines/consisid.md b/docs/source/en/api/pipelines/consisid.md new file mode 100644 index 000000000000..bba047292413 --- /dev/null +++ b/docs/source/en/api/pipelines/consisid.md @@ -0,0 +1,61 @@ + + +# ConsisID + +
+ LoRA +
+ +[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://huggingface.co/papers/2411.17440) from Peking University, University of Rochester, et al., by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan. + +The abstract from the paper is: + +*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh). + +There are two official ConsisID checkpoints for identity-preserving text-to-video.
+ +| checkpoints | recommended inference dtype | +|:---:|:---:| +| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 | +| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 | + +### Memory optimization + +ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it impossible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script. + +| Feature (applied on top of the previous) | Max Memory Allocated | Max Memory Reserved | +| :----------------------------- | :------------------- | :------------------ | +| - | 37 GB | 44 GB | +| enable_model_cpu_offload | 22 GB | 25 GB | +| enable_sequential_cpu_offload | 16 GB | 22 GB | +| vae.enable_slicing | 16 GB | 22 GB | +| vae.enable_tiling | 5 GB | 7 GB | + +## ConsisIDPipeline + +[[autodoc]] ConsisIDPipeline + - all + - __call__ + +## ConsisIDPipelineOutput + +[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput diff --git a/docs/source/en/api/pipelines/consistency_models.md b/docs/source/en/api/pipelines/consistency_models.md index 680abaad420b..4f7b2f0fb501 100644 --- a/docs/source/en/api/pipelines/consistency_models.md +++ b/docs/source/en/api/pipelines/consistency_models.md @@ -1,4 +1,4 @@ - + +# FluxControlInpaint + +
+ LoRA +
+ +FluxControlInpaintPipeline is an implementation of inpainting for the Flux.1 Depth/Canny models. The pipeline takes an image and a mask as input and returns the inpainted image. + +FLUX.1 Depth and Canny [dev] is a 12 billion parameter rectified flow transformer capable of generating an image based on a text description while following the structure of a given input image. **This is not a ControlNet model**. + +| Control type | Developer | Link | +| -------- | ---------- | ---- | +| Depth | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev) | +| Canny | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev) | + + +> [!TIP] +> Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c). + +```python +import torch +from diffusers import FluxControlInpaintPipeline +from diffusers.models.transformers import FluxTransformer2DModel +from transformers import T5EncoderModel +from diffusers.utils import load_image, make_image_grid +from image_gen_aux import DepthPreprocessor # https://github.com/huggingface/image_gen_aux +from PIL import Image +import numpy as np + +pipe = FluxControlInpaintPipeline.from_pretrained( + "black-forest-labs/FLUX.1-Depth-dev", + torch_dtype=torch.bfloat16, +) +# use following lines if you have GPU constraints +# --------------------------------------------------------------- +transformer = FluxTransformer2DModel.from_pretrained( + "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="transformer", torch_dtype=torch.bfloat16 +) +text_encoder_2 = T5EncoderModel.from_pretrained( + "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="text_encoder_2", torch_dtype=torch.bfloat16 +) +pipe.transformer = transformer +pipe.text_encoder_2 = text_encoder_2 +pipe.enable_model_cpu_offload() +# --------------------------------------------------------------- +# skip the explicit device placement below if you enabled model CPU offloading above +pipe.to("cuda") + +prompt = "a blue robot singing opera with human-like expressions" +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") + +head_mask = np.zeros_like(image) +head_mask[65:580,300:642] = 255 +mask_image = Image.fromarray(head_mask) + +processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf") +control_image = processor(image)[0].convert("RGB") + +output = pipe( + prompt=prompt, + image=image, + control_image=control_image, + mask_image=mask_image, + num_inference_steps=30, + strength=0.9, + guidance_scale=10.0, + generator=torch.Generator().manual_seed(42), +).images[0] +make_image_grid([image, control_image, mask_image, output.resize(image.size)], rows=1, cols=4).save("output.png") +``` + +## FluxControlInpaintPipeline +[[autodoc]] FluxControlInpaintPipeline + - all + - __call__ + + +## FluxPipelineOutput +[[autodoc]]
pipelines.flux.pipeline_output.FluxPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/controlnet.md b/docs/source/en/api/pipelines/controlnet.md index 6b00902cf296..afc0a4653e07 100644 --- a/docs/source/en/api/pipelines/controlnet.md +++ b/docs/source/en/api/pipelines/controlnet.md @@ -1,4 +1,4 @@ - + +# ControlNet with Flux.1 + +
+ LoRA +
+ +FluxControlNetPipeline is an implementation of ControlNet for Flux.1. + +ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. + +With a ControlNet model, you can provide an additional control image to condition and control Flux.1 generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. + +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +This ControlNet code is implemented by [The InstantX Team](https://huggingface.co/InstantX). You can find pre-trained checkpoints for Flux-ControlNet in the table below: + + +| ControlNet type | Developer | Link | +| -------- | ---------- | ---- | +| Canny | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny) | +| Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Depth) | +| Union | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Union) | + +XLabs ControlNets are also supported, which were contributed by the [XLabs team](https://huggingface.co/XLabs-AI). + +| ControlNet type | Developer | Link | +| -------- | ---------- | ---- | +| Canny | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-canny-diffusers) | +| Depth | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-depth-diffusers) | +| HED | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-hed-diffusers) | + + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
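+
+The snippet below is a minimal sketch of how one of the Canny checkpoints from the tables above could be used; the control image path is a placeholder, so substitute your own preprocessed canny edge map.
+
+```py
+import torch
+from diffusers import FluxControlNetModel, FluxControlNetPipeline
+from diffusers.utils import load_image
+
+# load one of the ControlNet checkpoints listed above
+controlnet = FluxControlNetModel.from_pretrained(
+    "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16
+)
+pipe = FluxControlNetPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
+)
+pipe.enable_model_cpu_offload()
+
+# placeholder: replace with your own canny edge map
+control_image = load_image("path/to/canny_edge_map.png")
+
+image = pipe(
+    prompt="a photorealistic render of a futuristic city at dusk",
+    control_image=control_image,
+    controlnet_conditioning_scale=0.6,
+    num_inference_steps=28,
+    guidance_scale=3.5,
+).images[0]
+image.save("flux-controlnet.png")
+```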
+ +## FluxControlNetPipeline +[[autodoc]] FluxControlNetPipeline + - all + - __call__ + + +## FluxPipelineOutput +[[autodoc]] pipelines.flux.pipeline_output.FluxPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/controlnet_hunyuandit.md b/docs/source/en/api/pipelines/controlnet_hunyuandit.md new file mode 100644 index 000000000000..88dc2de10a64 --- /dev/null +++ b/docs/source/en/api/pipelines/controlnet_hunyuandit.md @@ -0,0 +1,33 @@ + + +# ControlNet with Hunyuan-DiT + +HunyuanDiTControlNetPipeline is an implementation of ControlNet for [Hunyuan-DiT](https://huggingface.co/papers/2405.08748). + +ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. + +With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. + +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan). + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +## HunyuanDiTControlNetPipeline +[[autodoc]] HunyuanDiTControlNetPipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/controlnet_sana.md b/docs/source/en/api/pipelines/controlnet_sana.md new file mode 100644 index 000000000000..d170d28e63ed --- /dev/null +++ b/docs/source/en/api/pipelines/controlnet_sana.md @@ -0,0 +1,36 @@ + + +# ControlNet + +
+ LoRA +
+ +ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. + +With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. + +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +This pipeline was contributed by [ishan24](https://huggingface.co/ishan24). ❤️ +The original codebase can be found at [NVlabs/Sana](https://github.com/NVlabs/Sana), and you can find official ControlNet checkpoints on [Efficient-Large-Model's](https://huggingface.co/Efficient-Large-Model) Hub profile. + +## SanaControlNetPipeline +[[autodoc]] SanaControlNetPipeline + - all + - __call__ + +## SanaPipelineOutput +[[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/controlnet_sd3.md b/docs/source/en/api/pipelines/controlnet_sd3.md new file mode 100644 index 000000000000..8cdada9edf43 --- /dev/null +++ b/docs/source/en/api/pipelines/controlnet_sd3.md @@ -0,0 +1,55 @@ + + +# ControlNet with Stable Diffusion 3 + +
+ LoRA +
+ +StableDiffusion3ControlNetPipeline is an implementation of ControlNet for Stable Diffusion 3. + +ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. + +With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. + +The abstract from the paper is: + +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* + +This controlnet code is mainly implemented by [The InstantX Team](https://huggingface.co/InstantX). The inpainting-related code was developed by [The Alimama Creative Team](https://huggingface.co/alimama-creative). You can find pre-trained checkpoints for SD3-ControlNet in the table below: + + +| ControlNet type | Developer | Link | +| -------- | ---------- | ---- | +| Canny | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Canny) | +| Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Depth) | +| Pose | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Pose) | +| Tile | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Tile) | +| Inpainting | [The AlimamaCreative Team](https://huggingface.co/alimama-creative) | [link](https://huggingface.co/alimama-creative/SD3-Controlnet-Inpainting) | + + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
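+
+As a minimal sketch, one of the checkpoints from the table above can be plugged into the pipeline like this; the control image path is a placeholder, so substitute your own preprocessed control image:
+
+```py
+import torch
+from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline
+from diffusers.utils import load_image
+
+# load one of the SD3 ControlNet checkpoints listed above
+controlnet = SD3ControlNetModel.from_pretrained(
+    "InstantX/SD3-Controlnet-Canny", torch_dtype=torch.float16
+)
+pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-3-medium-diffusers",
+    controlnet=controlnet,
+    torch_dtype=torch.float16,
+)
+pipe.enable_model_cpu_offload()
+
+# placeholder: replace with your own canny edge map
+control_image = load_image("path/to/canny_edge_map.png")
+
+image = pipe(
+    prompt="a photo of a corgi wearing sunglasses",
+    control_image=control_image,
+    controlnet_conditioning_scale=0.7,
+).images[0]
+image.save("sd3-controlnet.png")
+```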
+ +## StableDiffusion3ControlNetPipeline +[[autodoc]] StableDiffusion3ControlNetPipeline + - all + - __call__ + +## StableDiffusion3ControlNetInpaintingPipeline +[[autodoc]] pipelines.controlnet_sd3.pipeline_stable_diffusion_3_controlnet_inpainting.StableDiffusion3ControlNetInpaintingPipeline + - all + - __call__ + +## StableDiffusion3PipelineOutput +[[autodoc]] pipelines.stable_diffusion_3.pipeline_output.StableDiffusion3PipelineOutput diff --git a/docs/source/en/api/pipelines/controlnet_sdxl.md b/docs/source/en/api/pipelines/controlnet_sdxl.md index 2de7cbff6ebc..89fc1c389798 100644 --- a/docs/source/en/api/pipelines/controlnet_sdxl.md +++ b/docs/source/en/api/pipelines/controlnet_sdxl.md @@ -1,4 +1,4 @@ - + +# ControlNetUnion + +
+ LoRA +
+ +ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL. + +The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation. + +*We design a new architecture that can support 10+ control types in condition text-to-image generation and can generate high resolution images visually comparable with midjourney. The network is based on the original ControlNet architecture, we propose two new modules to: 1 Extend the original ControlNet to support different image conditions using the same network parameter. 2 Support multiple conditions input without increasing computation offload, which is especially important for designers who want to edit image in detail, different conditions use the same condition encoder, without adding extra computations or parameters.* + + +## StableDiffusionXLControlNetUnionPipeline +[[autodoc]] StableDiffusionXLControlNetUnionPipeline + - all + - __call__ + +## StableDiffusionXLControlNetUnionImg2ImgPipeline +[[autodoc]] StableDiffusionXLControlNetUnionImg2ImgPipeline + - all + - __call__ + +## StableDiffusionXLControlNetUnionInpaintPipeline +[[autodoc]] StableDiffusionXLControlNetUnionInpaintPipeline + - all + - __call__ diff --git a/examples/research_projects/controlnetxs/README.md b/docs/source/en/api/pipelines/controlnetxs.md similarity index 55% rename from examples/research_projects/controlnetxs/README.md rename to docs/source/en/api/pipelines/controlnetxs.md index 72ed91c01db2..d44fb0cf0fdf 100644 --- a/examples/research_projects/controlnetxs/README.md +++ b/docs/source/en/api/pipelines/controlnetxs.md @@ -1,5 +1,24 @@ + + +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # ControlNet-XS +
+ LoRA +
+ ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results. Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. @@ -12,5 +31,13 @@ Here's the overview from the [project page](https://vislearn.github.io/ControlNe This model was contributed by [UmerHA](https://twitter.com/UmerHAdil). ❤️ +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +## StableDiffusionControlNetXSPipeline +[[autodoc]] StableDiffusionControlNetXSPipeline + - all + - __call__ -> 🧠 Make sure to check out the Schedulers [guide](https://huggingface.co/docs/diffusers/main/en/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. \ No newline at end of file +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/examples/research_projects/controlnetxs/README_sdxl.md b/docs/source/en/api/pipelines/controlnetxs_sdxl.md similarity index 53% rename from examples/research_projects/controlnetxs/README_sdxl.md rename to docs/source/en/api/pipelines/controlnetxs_sdxl.md index d401c1e76698..7ae0e2a2a178 100644 --- a/examples/research_projects/controlnetxs/README_sdxl.md +++ b/docs/source/en/api/pipelines/controlnetxs_sdxl.md @@ -1,3 +1,18 @@ + + +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # ControlNet-XS with Stable Diffusion XL ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results. @@ -12,4 +27,16 @@ Here's the overview from the [project page](https://vislearn.github.io/ControlNe This model was contributed by [UmerHA](https://twitter.com/UmerHAdil). ❤️ -> 🧠 Make sure to check out the Schedulers [guide](https://huggingface.co/docs/diffusers/main/en/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
\ No newline at end of file +> [!WARNING] +> 🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve! + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +## StableDiffusionXLControlNetXSPipeline +[[autodoc]] StableDiffusionXLControlNetXSPipeline + - all + - __call__ + +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/cosmos.md b/docs/source/en/api/pipelines/cosmos.md new file mode 100644 index 000000000000..fb9453480e74 --- /dev/null +++ b/docs/source/en/api/pipelines/cosmos.md @@ -0,0 +1,79 @@ + + +# Cosmos + +[Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA. + +*Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +## Loading original format checkpoints + +Original format checkpoints that have not been converted to diffusers-expected format can be loaded using the `from_single_file` method. + +```python +import torch +from diffusers import Cosmos2TextToImagePipeline, CosmosTransformer3DModel + +model_id = "nvidia/Cosmos-Predict2-2B-Text2Image" +transformer = CosmosTransformer3DModel.from_single_file( + "https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/blob/main/model.pt", + torch_dtype=torch.bfloat16, +).to("cuda") +pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16) +pipe.to("cuda") + +prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. 
The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess." +negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality." + +output = pipe( + prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1) +).images[0] +output.save("output.png") +``` + +## CosmosTextToWorldPipeline + +[[autodoc]] CosmosTextToWorldPipeline + - all + - __call__ + +## CosmosVideoToWorldPipeline + +[[autodoc]] CosmosVideoToWorldPipeline + - all + - __call__ + +## Cosmos2TextToImagePipeline + +[[autodoc]] Cosmos2TextToImagePipeline + - all + - __call__ + +## Cosmos2VideoToWorldPipeline + +[[autodoc]] Cosmos2VideoToWorldPipeline + - all + - __call__ + +## CosmosPipelineOutput + +[[autodoc]] pipelines.cosmos.pipeline_output.CosmosPipelineOutput + +## CosmosImagePipelineOutput + +[[autodoc]] pipelines.cosmos.pipeline_output.CosmosImagePipelineOutput diff --git a/docs/source/en/api/pipelines/dance_diffusion.md b/docs/source/en/api/pipelines/dance_diffusion.md index efba3c3763a4..0434f6319592 100644 --- a/docs/source/en/api/pipelines/dance_diffusion.md +++ b/docs/source/en/api/pipelines/dance_diffusion.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Dance Diffusion [Dance Diffusion](https://github.com/Harmonai-org/sample-generator) is by Zach Evans. @@ -17,11 +20,8 @@ specific language governing permissions and limitations under the License. Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org). - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## DanceDiffusionPipeline [[autodoc]] DanceDiffusionPipeline diff --git a/docs/source/en/api/pipelines/ddim.md b/docs/source/en/api/pipelines/ddim.md index 6802da739cd5..3e8cbae4fb60 100644 --- a/docs/source/en/api/pipelines/ddim.md +++ b/docs/source/en/api/pipelines/ddim.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. 
However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # DiffEdit [DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. diff --git a/docs/source/en/api/pipelines/dit.md b/docs/source/en/api/pipelines/dit.md index 1d04458d9cb9..16d0c999619d 100644 --- a/docs/source/en/api/pipelines/dit.md +++ b/docs/source/en/api/pipelines/dit.md @@ -1,4 +1,4 @@ - + +# EasyAnimate +[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI. + +The description from it's GitHub page: +*EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.* + +This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found [here](https://huggingface.co/alibaba-pai). The original weights can be found under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai). + +There are two official EasyAnimate checkpoints for text-to-video and video-to-video. + +| checkpoints | recommended inference dtype | +|:---:|:---:| +| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) | torch.float16 | +| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 | + +There is one official EasyAnimate checkpoints available for image-to-video and video-to-video. + +| checkpoints | recommended inference dtype | +|:---:|:---:| +| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 | + +There are two official EasyAnimate checkpoints available for control-to-video. + +| checkpoints | recommended inference dtype | +|:---:|:---:| +| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) | torch.float16 | +| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) | torch.float16 | + +For the EasyAnimateV5.1 series: +- Text-to-video (T2V) and Image-to-video (I2V) works for multiple resolutions. The width and height can vary from 256 to 1024. +- Both T2V and I2V models support generation with 1~49 frames and work best at this value. Exporting videos at 8 FPS is recommended. + +## Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. + +Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`EasyAnimatePipeline`] for inference with bitsandbytes. 
+ +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline +from diffusers.utils import export_to_video + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained( + "alibaba-pai/EasyAnimateV5.1-12b-zh", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +pipeline = EasyAnimatePipeline.from_pretrained( + "alibaba-pai/EasyAnimateV5.1-12b-zh", + transformer=transformer_8bit, + torch_dtype=torch.float16, + device_map="balanced", +) + +prompt = "A cat walks on the grass, realistic style." +negative_prompt = "bad detailed" +video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0] +export_to_video(video, "cat.mp4", fps=8) +``` + +## EasyAnimatePipeline + +[[autodoc]] EasyAnimatePipeline + - all + - __call__ + +## EasyAnimatePipelineOutput + +[[autodoc]] pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput diff --git a/docs/source/en/api/pipelines/flux.md b/docs/source/en/api/pipelines/flux.md new file mode 100644 index 000000000000..1a89de98e48f --- /dev/null +++ b/docs/source/en/api/pipelines/flux.md @@ -0,0 +1,714 @@ + + +# Flux + +
+ LoRA + MPS +
+ +Flux is a series of text-to-image generation models based on diffusion transformers. To know more about Flux, check out the original [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/) by the creators of Flux, Black Forest Labs. + +Original model checkpoints for Flux can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux). + +> [!TIP] +> Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c). +> +> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. + +Flux comes in the following variants: + +| model type | model id | +|:----------:|:--------:| +| Timestep-distilled | [`black-forest-labs/FLUX.1-schnell`](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | +| Guidance-distilled | [`black-forest-labs/FLUX.1-dev`](https://huggingface.co/black-forest-labs/FLUX.1-dev) | +| Fill Inpainting/Outpainting (Guidance-distilled) | [`black-forest-labs/FLUX.1-Fill-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev) | +| Canny Control (Guidance-distilled) | [`black-forest-labs/FLUX.1-Canny-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev) | +| Depth Control (Guidance-distilled) | [`black-forest-labs/FLUX.1-Depth-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev) | +| Canny Control (LoRA) | [`black-forest-labs/FLUX.1-Canny-dev-lora`](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev-lora) | +| Depth Control (LoRA) | [`black-forest-labs/FLUX.1-Depth-dev-lora`](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev-lora) | +| Redux (Adapter) | [`black-forest-labs/FLUX.1-Redux-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev) | +| Kontext | [`black-forest-labs/FLUX.1-kontext`](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev) | + +All checkpoints have different usage which we detail below. + +### Timestep-distilled + +* `max_sequence_length` cannot be more than 256. +* `guidance_scale` needs to be 0. +* As this is a timestep-distilled model, it benefits from fewer sampling steps. + +```python +import torch +from diffusers import FluxPipeline + +pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) +pipe.enable_model_cpu_offload() + +prompt = "A cat holding a sign that says hello world" +out = pipe( + prompt=prompt, + guidance_scale=0., + height=768, + width=1360, + num_inference_steps=4, + max_sequence_length=256, +).images[0] +out.save("image.png") +``` + +### Guidance-distilled + +* The guidance-distilled variant takes about 50 sampling steps for good-quality generation. +* It doesn't have any limitations around the `max_sequence_length`. 
+ +```python +import torch +from diffusers import FluxPipeline + +pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16) +pipe.enable_model_cpu_offload() + +prompt = "a tiny astronaut hatching from an egg on the moon" +out = pipe( + prompt=prompt, + guidance_scale=3.5, + height=768, + width=1360, + num_inference_steps=50, +).images[0] +out.save("image.png") +``` + +### Fill Inpainting/Outpainting + +* Flux Fill pipeline does not require `strength` as an input like regular inpainting pipelines. +* It supports both inpainting and outpainting. + +```python +import torch +from diffusers import FluxFillPipeline +from diffusers.utils import load_image + +image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/cup.png") +mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/cup_mask.png") + +repo_id = "black-forest-labs/FLUX.1-Fill-dev" +pipe = FluxFillPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16).to("cuda") + +image = pipe( + prompt="a white paper cup", + image=image, + mask_image=mask, + height=1632, + width=1232, + max_sequence_length=512, + generator=torch.Generator("cpu").manual_seed(0) +).images[0] +image.save(f"output.png") +``` + +### Canny Control + +**Note:** `black-forest-labs/Flux.1-Canny-dev` is _not_ a [`ControlNetModel`] model. ControlNet models are a separate component from the UNet/Transformer whose residuals are added to the actual underlying model. Canny Control is an alternate architecture that achieves effectively the same results as a ControlNet model would, by using channel-wise concatenation with input control condition and ensuring the transformer learns structure control by following the condition as closely as possible. + +```python +# !pip install -U controlnet-aux +import torch +from controlnet_aux import CannyDetector +from diffusers import FluxControlPipeline +from diffusers.utils import load_image + +pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", torch_dtype=torch.bfloat16).to("cuda") + +prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." +control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") + +processor = CannyDetector() +control_image = processor(control_image, low_threshold=50, high_threshold=200, detect_resolution=1024, image_resolution=1024) + +image = pipe( + prompt=prompt, + control_image=control_image, + height=1024, + width=1024, + num_inference_steps=50, + guidance_scale=30.0, +).images[0] +image.save("output.png") +``` + +Canny Control is also possible with a LoRA variant of this condition. The usage is as follows: + +```python +# !pip install -U controlnet-aux +import torch +from controlnet_aux import CannyDetector +from diffusers import FluxControlPipeline +from diffusers.utils import load_image + +pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda") +pipe.load_lora_weights("black-forest-labs/FLUX.1-Canny-dev-lora") + +prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." 
+control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") + +processor = CannyDetector() +control_image = processor(control_image, low_threshold=50, high_threshold=200, detect_resolution=1024, image_resolution=1024) + +image = pipe( + prompt=prompt, + control_image=control_image, + height=1024, + width=1024, + num_inference_steps=50, + guidance_scale=30.0, +).images[0] +image.save("output.png") +``` + +### Depth Control + +**Note:** `black-forest-labs/Flux.1-Depth-dev` is _not_ a ControlNet model. [`ControlNetModel`] models are a separate component from the UNet/Transformer whose residuals are added to the actual underlying model. Depth Control is an alternate architecture that achieves effectively the same results as a ControlNet model would, by using channel-wise concatenation with input control condition and ensuring the transformer learns structure control by following the condition as closely as possible. + +```python +# !pip install git+https://github.com/huggingface/image_gen_aux +import torch +from diffusers import FluxControlPipeline, FluxTransformer2DModel +from diffusers.utils import load_image +from image_gen_aux import DepthPreprocessor + +pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Depth-dev", torch_dtype=torch.bfloat16).to("cuda") + +prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." +control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") + +processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf") +control_image = processor(control_image)[0].convert("RGB") + +image = pipe( + prompt=prompt, + control_image=control_image, + height=1024, + width=1024, + num_inference_steps=30, + guidance_scale=10.0, + generator=torch.Generator().manual_seed(42), +).images[0] +image.save("output.png") +``` + +Depth Control is also possible with a LoRA variant of this condition. The usage is as follows: + +```python +# !pip install git+https://github.com/huggingface/image_gen_aux +import torch +from diffusers import FluxControlPipeline, FluxTransformer2DModel +from diffusers.utils import load_image +from image_gen_aux import DepthPreprocessor + +pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda") +pipe.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora") + +prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." +control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") + +processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf") +control_image = processor(control_image)[0].convert("RGB") + +image = pipe( + prompt=prompt, + control_image=control_image, + height=1024, + width=1024, + num_inference_steps=30, + guidance_scale=10.0, + generator=torch.Generator().manual_seed(42), +).images[0] +image.save("output.png") +``` + +### Redux + +* Flux Redux pipeline is an adapter for FLUX.1 base models. It can be used with both flux-dev and flux-schnell, for image-to-image generation. +* You can first use the `FluxPriorReduxPipeline` to get the `prompt_embeds` and `pooled_prompt_embeds`, and then feed them into the `FluxPipeline` for image-to-image generation. 
+* When using `FluxPriorReduxPipeline` with a base pipeline, you can set `text_encoder=None` and `text_encoder_2=None` in the base pipeline to save VRAM.
+
+```python
+import torch
+from diffusers import FluxPriorReduxPipeline, FluxPipeline
+from diffusers.utils import load_image
+device = "cuda"
+dtype = torch.bfloat16
+
+repo_redux = "black-forest-labs/FLUX.1-Redux-dev"
+repo_base = "black-forest-labs/FLUX.1-dev"
+pipe_prior_redux = FluxPriorReduxPipeline.from_pretrained(repo_redux, torch_dtype=dtype).to(device)
+pipe = FluxPipeline.from_pretrained(
+    repo_base,
+    text_encoder=None,
+    text_encoder_2=None,
+    torch_dtype=torch.bfloat16
+).to(device)
+
+image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy/img5.png")
+pipe_prior_output = pipe_prior_redux(image)
+images = pipe(
+    guidance_scale=2.5,
+    num_inference_steps=50,
+    generator=torch.Generator("cpu").manual_seed(0),
+    **pipe_prior_output,
+).images
+images[0].save("flux-redux.png")
+```
+
+### Kontext
+
+Flux Kontext is a model that allows in-context control of the image generation process, enabling editing, refinement, relighting, style transfer, character customization, and more.
+
+```python
+import torch
+from diffusers import FluxKontextPipeline
+from diffusers.utils import load_image
+
+pipe = FluxKontextPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
+)
+pipe.to("cuda")
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png").convert("RGB")
+prompt = "Make Pikachu hold a sign that says 'Black Forest Labs is awesome', yarn art style, detailed, vibrant colors"
+image = pipe(
+    image=image,
+    prompt=prompt,
+    guidance_scale=2.5,
+    generator=torch.Generator().manual_seed(42),
+).images[0]
+image.save("flux-kontext.png")
+```
+
+Flux Kontext comes with an integrity safety checker, which should be run after the image generation step. To run the safety checker, install the official repository from [black-forest-labs/flux](https://github.com/black-forest-labs/flux) and add the following code:
+
+```python
+import numpy as np
+import torch
+from flux.content_filters import PixtralContentFilter
+
+# ... pipeline invocation to generate images (`image` is the PIL image returned above)
+
+integrity_checker = PixtralContentFilter(torch.device("cuda"))
+# Scale the image to [-1, 1] and convert it to an NCHW tensor before checking it
+image_ = np.array(image) / 255.0
+image_ = 2 * image_ - 1
+image_ = torch.from_numpy(image_).to("cuda", dtype=torch.float32).unsqueeze(0).permute(0, 3, 1, 2)
+if integrity_checker.test_image(image_):
+    raise ValueError("Your image has been flagged. Choose another prompt/image or try again.")
+```
+
+### Kontext Inpainting
+
+`FluxKontextInpaintPipeline` enables image modification within a fixed mask region. It currently supports both text-based conditioning and image-reference conditioning.
+ + + + +```python +import torch +from diffusers import FluxKontextInpaintPipeline +from diffusers.utils import load_image + +prompt = "Change the yellow dinosaur to green one" +img_url = ( + "https://github.com/ZenAI-Vietnam/Flux-Kontext-pipelines/blob/main/assets/dinosaur_input.jpeg?raw=true" +) +mask_url = ( + "https://github.com/ZenAI-Vietnam/Flux-Kontext-pipelines/blob/main/assets/dinosaur_mask.png?raw=true" +) + +source = load_image(img_url) +mask = load_image(mask_url) + +pipe = FluxKontextInpaintPipeline.from_pretrained( + "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +image = pipe(prompt=prompt, image=source, mask_image=mask, strength=1.0).images[0] +image.save("kontext_inpainting_normal.png") +``` + + + +```python +import torch +from diffusers import FluxKontextInpaintPipeline +from diffusers.utils import load_image + +pipe = FluxKontextInpaintPipeline.from_pretrained( + "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +prompt = "Replace this ball" +img_url = "https://images.pexels.com/photos/39362/the-ball-stadion-football-the-pitch-39362.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500" +mask_url = "https://github.com/ZenAI-Vietnam/Flux-Kontext-pipelines/blob/main/assets/ball_mask.png?raw=true" +image_reference_url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTah3x6OL_ECMBaZ5ZlJJhNsyC-OSMLWAI-xw&s" + +source = load_image(img_url) +mask = load_image(mask_url) +image_reference = load_image(image_reference_url) + +mask = pipe.mask_processor.blur(mask, blur_factor=12) +image = pipe( + prompt=prompt, image=source, mask_image=mask, image_reference=image_reference, strength=1.0 +).images[0] +image.save("kontext_inpainting_ref.png") +``` + + + +## Combining Flux Turbo LoRAs with Flux Control, Fill, and Redux + +We can combine Flux Turbo LoRAs with Flux Control and other pipelines like Fill and Redux to enable few-steps' inference. The example below shows how to do that for Flux Control LoRA for depth and turbo LoRA from [`ByteDance/Hyper-SD`](https://hf.co/ByteDance/Hyper-SD). + +```py +from diffusers import FluxControlPipeline +from image_gen_aux import DepthPreprocessor +from diffusers.utils import load_image +from huggingface_hub import hf_hub_download +import torch + +control_pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16) +control_pipe.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora", adapter_name="depth") +control_pipe.load_lora_weights( + hf_hub_download("ByteDance/Hyper-SD", "Hyper-FLUX.1-dev-8steps-lora.safetensors"), adapter_name="hyper-sd" +) +control_pipe.set_adapters(["depth", "hyper-sd"], adapter_weights=[0.85, 0.125]) +control_pipe.enable_model_cpu_offload() + +prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." 
+control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") + +processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf") +control_image = processor(control_image)[0].convert("RGB") + +image = control_pipe( + prompt=prompt, + control_image=control_image, + height=1024, + width=1024, + num_inference_steps=8, + guidance_scale=10.0, + generator=torch.Generator().manual_seed(42), +).images[0] +image.save("output.png") +``` + +## Note about `unload_lora_weights()` when using Flux LoRAs + +When unloading the Control LoRA weights, call `pipe.unload_lora_weights(reset_to_overwritten_params=True)` to reset the `pipe.transformer` completely back to its original form. The resultant pipeline can then be used with methods like [`DiffusionPipeline.from_pipe`]. More details about this argument are available in [this PR](https://github.com/huggingface/diffusers/pull/10397). + +## IP-Adapter + +> [!TIP] +> Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work. + +An IP-Adapter lets you prompt Flux with images, in addition to the text prompt. This is especially useful when describing complex concepts that are difficult to articulate through text alone and you have reference images. + +```python +import torch +from diffusers import FluxPipeline +from diffusers.utils import load_image + +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16 +).to("cuda") + +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flux_ip_adapter_input.jpg").resize((1024, 1024)) + +pipe.load_ip_adapter( + "XLabs-AI/flux-ip-adapter", + weight_name="ip_adapter.safetensors", + image_encoder_pretrained_model_name_or_path="openai/clip-vit-large-patch14" +) +pipe.set_ip_adapter_scale(1.0) + +image = pipe( + width=1024, + height=1024, + prompt="wearing sunglasses", + negative_prompt="", + true_cfg_scale=4.0, + generator=torch.Generator().manual_seed(4444), + ip_adapter_image=image, +).images[0] + +image.save('flux_ip_adapter_output.jpg') +``` + +
+
+*IP-Adapter examples with prompt "wearing sunglasses"*
+
+ +## Optimize + +Flux is a very large model and requires ~50GB of RAM/VRAM to load all the modeling components. Enable some of the optimizations below to lower the memory requirements. + +### Group offloading + +[Group offloading](../../optimization/memory#group-offloading) lowers VRAM usage by offloading groups of internal layers rather than the whole model or weights. You need to use [`~hooks.apply_group_offloading`] on all the model components of a pipeline. The `offload_type` parameter allows you to toggle between block and leaf-level offloading. Setting it to `leaf_level` offloads the lowest leaf-level parameters to the CPU instead of offloading at the module-level. + +On CUDA devices that support asynchronous data streaming, set `use_stream=True` to overlap data transfer and computation to accelerate inference. + +> [!TIP] +> It is possible to mix block and leaf-level offloading for different components in a pipeline. + +```py +import torch +from diffusers import FluxPipeline +from diffusers.hooks import apply_group_offloading + +model_id = "black-forest-labs/FLUX.1-dev" +dtype = torch.bfloat16 +pipe = FluxPipeline.from_pretrained( + model_id, + torch_dtype=dtype, +) + +apply_group_offloading( + pipe.transformer, + offload_type="leaf_level", + offload_device=torch.device("cpu"), + onload_device=torch.device("cuda"), + use_stream=True, +) +apply_group_offloading( + pipe.text_encoder, + offload_device=torch.device("cpu"), + onload_device=torch.device("cuda"), + offload_type="leaf_level", + use_stream=True, +) +apply_group_offloading( + pipe.text_encoder_2, + offload_device=torch.device("cpu"), + onload_device=torch.device("cuda"), + offload_type="leaf_level", + use_stream=True, +) +apply_group_offloading( + pipe.vae, + offload_device=torch.device("cpu"), + onload_device=torch.device("cuda"), + offload_type="leaf_level", + use_stream=True, +) + +prompt="A cat wearing sunglasses and working as a lifeguard at pool." + +generator = torch.Generator().manual_seed(181201) +image = pipe( + prompt, + width=576, + height=1024, + num_inference_steps=30, + generator=generator +).images[0] +image +``` + +### Running FP16 inference + +Flux can generate high-quality images with FP16 (i.e. to accelerate inference on Turing/Volta GPUs) but produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing text encoders to run with FP32 inference thus removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details. + +FP16 inference code: +```python +import torch +from diffusers import FluxPipeline + +pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) # can replace schnell with dev +# to run on low vram GPUs (i.e. between 4 and 32 GB VRAM) +pipe.enable_sequential_cpu_offload() +pipe.vae.enable_slicing() +pipe.vae.enable_tiling() + +pipe.to(torch.float16) # casting here instead of in the pipeline constructor because doing so in the constructor loads all models into CPU memory at once + +prompt = "A cat holding a sign that says hello world" +out = pipe( + prompt=prompt, + guidance_scale=0., + height=768, + width=1360, + num_inference_steps=4, + max_sequence_length=256, +).images[0] +out.save("image.png") +``` + +### Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. 
However, quantization may have varying impact on image quality depending on the model.
+
+Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`FluxPipeline`] for inference with bitsandbytes.
+
+```py
+import torch
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline
+from transformers import BitsAndBytesConfig, T5EncoderModel
+
+quant_config = BitsAndBytesConfig(load_in_8bit=True)
+text_encoder_8bit = T5EncoderModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="text_encoder_2",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
+transformer_8bit = FluxTransformer2DModel.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    subfolder="transformer",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+pipeline = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    text_encoder_2=text_encoder_8bit,
+    transformer=transformer_8bit,
+    torch_dtype=torch.float16,
+    device_map="balanced",
+)
+
+prompt = "a tiny astronaut hatching from an egg on the moon"
+image = pipeline(prompt, guidance_scale=3.5, height=768, width=1360, num_inference_steps=50).images[0]
+image.save("flux.png")
+```
+
+## Single File Loading for the `FluxTransformer2DModel`
+
+The `FluxTransformer2DModel` supports loading checkpoints in the original format shipped by Black Forest Labs. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community.
+
+> [!TIP]
+> `FP8` inference can be brittle depending on the GPU type, CUDA version, and `torch` version that you are using. It is recommended that you use the `optimum-quanto` library in order to run FP8 inference on your machine.
+
+The following example demonstrates how to run Flux with less than 16GB of VRAM.
+ +First install `optimum-quanto` + +```shell +pip install optimum-quanto +``` + +Then run the following example + +```python +import torch +from diffusers import FluxTransformer2DModel, FluxPipeline +from transformers import T5EncoderModel, CLIPTextModel +from optimum.quanto import freeze, qfloat8, quantize + +bfl_repo = "black-forest-labs/FLUX.1-dev" +dtype = torch.bfloat16 + +transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype) +quantize(transformer, weights=qfloat8) +freeze(transformer) + +text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype) +quantize(text_encoder_2, weights=qfloat8) +freeze(text_encoder_2) + +pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype) +pipe.transformer = transformer +pipe.text_encoder_2 = text_encoder_2 + +pipe.enable_model_cpu_offload() + +prompt = "A cat holding a sign that says hello world" +image = pipe( + prompt, + guidance_scale=3.5, + output_type="pil", + num_inference_steps=20, + generator=torch.Generator("cpu").manual_seed(0) +).images[0] + +image.save("flux-fp8-dev.png") +``` + +## FluxPipeline + +[[autodoc]] FluxPipeline + - all + - __call__ + +## FluxImg2ImgPipeline + +[[autodoc]] FluxImg2ImgPipeline + - all + - __call__ + +## FluxInpaintPipeline + +[[autodoc]] FluxInpaintPipeline + - all + - __call__ + + +## FluxControlNetInpaintPipeline + +[[autodoc]] FluxControlNetInpaintPipeline + - all + - __call__ + +## FluxControlNetImg2ImgPipeline + +[[autodoc]] FluxControlNetImg2ImgPipeline + - all + - __call__ + +## FluxControlPipeline + +[[autodoc]] FluxControlPipeline + - all + - __call__ + +## FluxControlImg2ImgPipeline + +[[autodoc]] FluxControlImg2ImgPipeline + - all + - __call__ + +## FluxPriorReduxPipeline + +[[autodoc]] FluxPriorReduxPipeline + - all + - __call__ + +## FluxFillPipeline + +[[autodoc]] FluxFillPipeline + - all + - __call__ + +## FluxKontextPipeline + +[[autodoc]] FluxKontextPipeline + - all + - __call__ + +## FluxKontextInpaintPipeline + +[[autodoc]] FluxKontextInpaintPipeline + - all + - __call__ \ No newline at end of file diff --git a/docs/source/en/api/pipelines/framepack.md b/docs/source/en/api/pipelines/framepack.md new file mode 100644 index 000000000000..a25cfe24a4ba --- /dev/null +++ b/docs/source/en/api/pipelines/framepack.md @@ -0,0 +1,206 @@ + + +# Framepack + +
+ LoRA +
+
+[Packing Input Frame Context in Next-Frame Prediction Models for Video Generation](https://huggingface.co/papers/2504.12626) by Lvmin Zhang and Maneesh Agrawala.
+
+*We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.*
+
+> [!TIP]
+> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+## Available models
+
+| Model name | Description |
+|:---|:---|
+| [`lllyasviel/FramePackI2V_HY`](https://huggingface.co/lllyasviel/FramePackI2V_HY) | Trained with the "inverted anti-drifting" strategy as described in the paper. Inference requires setting `sampling_type="inverted_anti_drifting"` when running the pipeline. |
+| [`lllyasviel/FramePack_F1_I2V_HY_20250503`](https://huggingface.co/lllyasviel/FramePack_F1_I2V_HY_20250503) | Trained with a novel anti-drifting strategy, but inference is performed with the "vanilla" strategy as described in the paper. Inference requires setting `sampling_type="vanilla"` when running the pipeline. |
+
+## Usage
+
+Refer to the pipeline documentation for basic usage examples. The following section contains examples of offloading, different sampling methods, quantization, and more.
+
+### First and last frame to video
+
+The following example shows how to use Framepack with start and end image controls, using the inverted anti-drifting sampling model.
+
+```python
+import torch
+from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
+from diffusers.utils import export_to_video, load_image
+from transformers import SiglipImageProcessor, SiglipVisionModel
+
+transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
+    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
+)
+feature_extractor = SiglipImageProcessor.from_pretrained(
+    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
+)
+image_encoder = SiglipVisionModel.from_pretrained(
+    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
+)
+pipe = HunyuanVideoFramepackPipeline.from_pretrained(
+    "hunyuanvideo-community/HunyuanVideo",
+    transformer=transformer,
+    feature_extractor=feature_extractor,
+    image_encoder=image_encoder,
+    torch_dtype=torch.float16,
+)
+
+# Enable memory optimizations
+pipe.enable_model_cpu_offload()
+pipe.vae.enable_tiling()
+
+prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings.
The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective." +first_image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png" +) +last_image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png" +) +output = pipe( + image=first_image, + last_image=last_image, + prompt=prompt, + height=512, + width=512, + num_frames=91, + num_inference_steps=30, + guidance_scale=9.0, + generator=torch.Generator().manual_seed(0), + sampling_type="inverted_anti_drifting", +).frames[0] +export_to_video(output, "output.mp4", fps=30) +``` + +### Vanilla sampling + +The following example shows how to use Framepack with the F1 model trained with vanilla sampling but new regulation approach for anti-drifting. + +```python +import torch +from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel +from diffusers.utils import export_to_video, load_image +from transformers import SiglipImageProcessor, SiglipVisionModel + +transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained( + "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16 +) +feature_extractor = SiglipImageProcessor.from_pretrained( + "lllyasviel/flux_redux_bfl", subfolder="feature_extractor" +) +image_encoder = SiglipVisionModel.from_pretrained( + "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16 +) +pipe = HunyuanVideoFramepackPipeline.from_pretrained( + "hunyuanvideo-community/HunyuanVideo", + transformer=transformer, + feature_extractor=feature_extractor, + image_encoder=image_encoder, + torch_dtype=torch.float16, +) + +# Enable memory optimizations +pipe.enable_model_cpu_offload() +pipe.vae.enable_tiling() + +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png" +) +output = pipe( + image=image, + prompt="A penguin dancing in the snow", + height=832, + width=480, + num_frames=91, + num_inference_steps=30, + guidance_scale=9.0, + generator=torch.Generator().manual_seed(0), + sampling_type="vanilla", +).frames[0] +export_to_video(output, "output.mp4", fps=30) +``` + +### Group offloading + +Group offloading ([`~hooks.apply_group_offloading`]) provides aggressive memory optimizations for offloading internal parts of any model to the CPU, with possibly no additional overhead to generation time. If you have very low VRAM available, this approach may be suitable for you depending on the amount of CPU RAM available. 
+ +```python +import torch +from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel +from diffusers.hooks import apply_group_offloading +from diffusers.utils import export_to_video, load_image +from transformers import SiglipImageProcessor, SiglipVisionModel + +transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained( + "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16 +) +feature_extractor = SiglipImageProcessor.from_pretrained( + "lllyasviel/flux_redux_bfl", subfolder="feature_extractor" +) +image_encoder = SiglipVisionModel.from_pretrained( + "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16 +) +pipe = HunyuanVideoFramepackPipeline.from_pretrained( + "hunyuanvideo-community/HunyuanVideo", + transformer=transformer, + feature_extractor=feature_extractor, + image_encoder=image_encoder, + torch_dtype=torch.float16, +) + +# Enable group offloading +onload_device = torch.device("cuda") +offload_device = torch.device("cpu") +list(map( + lambda x: apply_group_offloading(x, onload_device, offload_device, offload_type="leaf_level", use_stream=True, low_cpu_mem_usage=True), + [pipe.text_encoder, pipe.text_encoder_2, pipe.transformer] +)) +pipe.image_encoder.to(onload_device) +pipe.vae.to(onload_device) +pipe.vae.enable_tiling() + +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png" +) +output = pipe( + image=image, + prompt="A penguin dancing in the snow", + height=832, + width=480, + num_frames=91, + num_inference_steps=30, + guidance_scale=9.0, + generator=torch.Generator().manual_seed(0), + sampling_type="vanilla", +).frames[0] +print(f"Max memory: {torch.cuda.max_memory_allocated() / 1024**3:.3f} GB") +export_to_video(output, "output.mp4", fps=30) +``` + +## HunyuanVideoFramepackPipeline + +[[autodoc]] HunyuanVideoFramepackPipeline + - all + - __call__ + +## HunyuanVideoPipelineOutput + +[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput + diff --git a/docs/source/en/api/pipelines/hidream.md b/docs/source/en/api/pipelines/hidream.md new file mode 100644 index 000000000000..acfcef93e0ad --- /dev/null +++ b/docs/source/en/api/pipelines/hidream.md @@ -0,0 +1,40 @@ + + +# HiDreamImage + +[HiDream-I1](https://huggingface.co/HiDream-ai) by HiDream.ai + +> [!TIP] +> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. + +## Available models + +The following models are available for the [`HiDreamImagePipeline`](text-to-image) pipeline: + +| Model name | Description | +|:---|:---| +| [`HiDream-ai/HiDream-I1-Full`](https://huggingface.co/HiDream-ai/HiDream-I1-Full) | - | +| [`HiDream-ai/HiDream-I1-Dev`](https://huggingface.co/HiDream-ai/HiDream-I1-Dev) | - | +| [`HiDream-ai/HiDream-I1-Fast`](https://huggingface.co/HiDream-ai/HiDream-I1-Fast) | - | + +## HiDreamImagePipeline + +[[autodoc]] HiDreamImagePipeline + - all + - __call__ + +## HiDreamImagePipelineOutput + +[[autodoc]] pipelines.hidream_image.pipeline_output.HiDreamImagePipelineOutput diff --git a/docs/source/en/api/pipelines/hunyuan_video.md b/docs/source/en/api/pipelines/hunyuan_video.md new file mode 100644 index 000000000000..cdd81495b621 --- /dev/null +++ b/docs/source/en/api/pipelines/hunyuan_video.md @@ -0,0 +1,188 @@ + + +
+
+ + LoRA + +
+
+ +# HunyuanVideo + +[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate. + +You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization. + +> [!TIP] +> Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks. +> +> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers. + +The example below demonstrates how to generate a video optimized for memory or inference speed. + + + + +Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. + +The quantized HunyuanVideo model below requires ~14GB of VRAM. + +```py +import torch +from diffusers import AutoModel, HunyuanVideoPipeline +from diffusers.quantizers import PipelineQuantizationConfig +from diffusers.utils import export_to_video + +# quantize weights to int4 with bitsandbytes +pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={ + "load_in_4bit": True, + "bnb_4bit_quant_type": "nf4", + "bnb_4bit_compute_dtype": torch.bfloat16 + }, + components_to_quantize="transformer" +) + +pipeline = HunyuanVideoPipeline.from_pretrained( + "hunyuanvideo-community/HunyuanVideo", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, +) + +# model-offloading and tiling +pipeline.enable_model_cpu_offload() +pipeline.vae.enable_tiling() + +prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys." +video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0] +export_to_video(video, "output.mp4", fps=15) +``` + + + + +[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. 
+ +```py +import torch +from diffusers import AutoModel, HunyuanVideoPipeline +from diffusers.quantizers import PipelineQuantizationConfig +from diffusers.utils import export_to_video + +# quantize weights to int4 with bitsandbytes +pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={ + "load_in_4bit": True, + "bnb_4bit_quant_type": "nf4", + "bnb_4bit_compute_dtype": torch.bfloat16 + }, + components_to_quantize="transformer" +) + +pipeline = HunyuanVideoPipeline.from_pretrained( + "hunyuanvideo-community/HunyuanVideo", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, +) + +# model-offloading and tiling +pipeline.enable_model_cpu_offload() +pipeline.vae.enable_tiling() + +# torch.compile +pipeline.transformer.to(memory_format=torch.channels_last) +pipeline.transformer = torch.compile( + pipeline.transformer, mode="max-autotune", fullgraph=True +) + +prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys." +video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0] +export_to_video(video, "output.mp4", fps=15) +``` + + + + +## Notes + +- HunyuanVideo supports LoRAs with [`~loaders.HunyuanVideoLoraLoaderMixin.load_lora_weights`]. + +
+ Show example code + + ```py + import torch + from diffusers import AutoModel, HunyuanVideoPipeline + from diffusers.quantizers import PipelineQuantizationConfig + from diffusers.utils import export_to_video + + # quantize weights to int4 with bitsandbytes + pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={ + "load_in_4bit": True, + "bnb_4bit_quant_type": "nf4", + "bnb_4bit_compute_dtype": torch.bfloat16 + }, + components_to_quantize="transformer" + ) + + pipeline = HunyuanVideoPipeline.from_pretrained( + "hunyuanvideo-community/HunyuanVideo", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, + ) + + # load LoRA weights + pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie") + pipeline.set_adapters("steamboat-willie", 0.9) + + # model-offloading and tiling + pipeline.enable_model_cpu_offload() + pipeline.vae.enable_tiling() + + # use "In the style of SWR" to trigger the LoRA + prompt = """ + In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys. + """ + video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0] + export_to_video(video, "output.mp4", fps=15) + ``` + +
+
+- Refer to the table below for recommended inference values.
+
+  | parameter | recommended value |
+  |---|---|
+  | text encoder dtype | `torch.float16` |
+  | transformer dtype | `torch.bfloat16` |
+  | vae dtype | `torch.float16` |
+  | `num_frames (k)` | 4 * `k` + 1 |
+
+- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution videos.
+
+## HunyuanVideoPipeline
+
+[[autodoc]] HunyuanVideoPipeline
+  - all
+  - __call__
+
+## HunyuanVideoPipelineOutput
+
+[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput
diff --git a/docs/source/en/api/pipelines/hunyuandit.md b/docs/source/en/api/pipelines/hunyuandit.md
new file mode 100644
index 000000000000..3f4db66c6c94
--- /dev/null
+++ b/docs/source/en/api/pipelines/hunyuandit.md
@@ -0,0 +1,95 @@
+
+
+# Hunyuan-DiT
+![chinese elements understanding](https://github.com/gnobitab/diffusers-hunyuan/assets/1157982/39b99036-c3cb-4f16-bb1a-40ec25eda573)
+
+[Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding](https://huggingface.co/papers/2405.08748) from Tencent Hunyuan.
+
+The abstract from the paper is:
+
+*We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.*
+
+
+You can find the original codebase at [Tencent/HunyuanDiT](https://github.com/Tencent/HunyuanDiT) and all the available checkpoints at [Tencent-Hunyuan](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT).
+
+**Highlights**: HunyuanDiT supports Chinese/English-to-image, multi-resolution generation.
+
+HunyuanDiT has the following components:
+* It uses a diffusion transformer as the backbone
+* It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder
+
+> [!TIP]
+> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+> [!TIP]
+> You can further improve generation quality by passing the generated image from [`HunyuanDiTPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model.
+
+## Optimization
+
+You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the [Speed up inference](../../optimization/fp16) and [Reduce memory usage](../../optimization/memory) guides.
+
+### Inference
+
+Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
+ +First, load the pipeline: + +```python +from diffusers import HunyuanDiTPipeline +import torch + +pipeline = HunyuanDiTPipeline.from_pretrained( + "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16 +).to("cuda") +``` + +Then change the memory layout of the pipelines `transformer` and `vae` components to `torch.channels-last`: + +```python +pipeline.transformer.to(memory_format=torch.channels_last) +pipeline.vae.to(memory_format=torch.channels_last) +``` + +Finally, compile the components and run inference: + +```python +pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True) +pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True) + +image = pipeline(prompt="一个宇航员在骑马").images[0] +``` + +The [benchmark](https://gist.github.com/sayakpaul/29d3a14905cfcbf611fe71ebd22e9b23) results on a 80GB A100 machine are: + +```bash +With torch.compile(): Average inference time: 12.470 seconds. +Without torch.compile(): Average inference time: 20.570 seconds. +``` + +### Memory optimization + +By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GBs of GPU VRAM. Refer to [this script](https://gist.github.com/sayakpaul/3154605f6af05b98a41081aaba5ca43e) for details. + +Furthermore, you can use the [`~HunyuanDiT2DModel.enable_forward_chunking`] method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime. + +```diff ++ pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1) +``` + + +## HunyuanDiTPipeline + +[[autodoc]] HunyuanDiTPipeline + - all + - __call__ + diff --git a/docs/source/en/api/pipelines/i2vgenxl.md b/docs/source/en/api/pipelines/i2vgenxl.md index cafffaac3bd6..711a5625f99c 100644 --- a/docs/source/en/api/pipelines/i2vgenxl.md +++ b/docs/source/en/api/pipelines/i2vgenxl.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # I2VGen-XL [I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://hf.co/papers/2311.04145.pdf) by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. @@ -20,11 +23,8 @@ The abstract from the paper is: The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/). - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage). 
-
-
+> [!TIP]
+> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the [Reduce memory usage](../../using-diffusers/svd#reduce-memory-usage) section.
 
 Sample output with I2VGenXL:
 
@@ -47,6 +47,7 @@ Sample output with I2VGenXL:
 * Unlike SVD, it additionally accepts text prompts as inputs.
 * It can generate higher resolution videos.
 * When using the [`DDIMScheduler`] (which is default for this pipeline), less than 50 steps for inference leads to bad results.
+* This implementation is a 1-stage variant of I2VGen-XL. The main figure in the [I2VGen-XL](https://huggingface.co/papers/2311.04145) paper shows a 2-stage variant; however, the 1-stage variant works well. See [this discussion](https://github.com/huggingface/diffusers/discussions/7952) for more details.
 
 ## I2VGenXLPipeline
 [[autodoc]] I2VGenXLPipeline
diff --git a/docs/source/en/api/pipelines/kandinsky.md b/docs/source/en/api/pipelines/kandinsky.md
index 9ea3cd4a1718..7717f2db69a5 100644
--- a/docs/source/en/api/pipelines/kandinsky.md
+++ b/docs/source/en/api/pipelines/kandinsky.md
@@ -1,4 +1,4 @@
-
+
+# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis
+
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  <img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white"/>
+</div>
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/kolors_header_collage.png)
+
+Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](https://github.com/Kwai-Kolors/Kolors). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf).
+
+The abstract from the technical report is:
+
+*We present Kolors, a latent diffusion model for text-to-image synthesis, characterized by its profound understanding of both English and Chinese, as well as an impressive degree of photorealism. There are three key insights contributing to the development of Kolors. Firstly, unlike large language model T5 used in Imagen and Stable Diffusion 3, Kolors is built upon the General Language Model (GLM), which enhances its comprehension capabilities in both English and Chinese. Moreover, we employ a multimodal large language model to recaption the extensive training dataset for fine-grained text understanding. These strategies significantly improve Kolors’ ability to comprehend intricate semantics, particularly those involving multiple entities, and enable its advanced text rendering capabilities. Secondly, we divide the training of Kolors into two phases: the concept learning phase with broad knowledge and the quality improvement phase with specifically curated high-aesthetic data. Furthermore, we investigate the critical role of the noise schedule and introduce a novel schedule to optimize high-resolution image generation. These strategies collectively enhance the visual appeal of the generated high-resolution images. Lastly, we propose a category-balanced benchmark KolorsPrompts, which serves as a guide for the training and evaluation of Kolors. Consequently, even when employing the commonly used U-Net backbone, Kolors has demonstrated remarkable performance in human evaluations, surpassing the existing open-source models and achieving Midjourney-v6 level performance, especially in terms of visual appeal. We will release the code and weights of Kolors at https://github.com/Kwai-Kolors/Kolors, and hope that it will benefit future research and applications in the visual generation community.*
+
+## Usage Example
+
+```python
+import torch
+
+from diffusers import DPMSolverMultistepScheduler, KolorsPipeline
+
+pipe = KolorsPipeline.from_pretrained("Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16")
+pipe.to("cuda")
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
+
+# prompt (Chinese): "A photo of a ladybug, macro, zoom, high quality, cinematic, holding a sign that reads 'Kolors (可图)'"
+image = pipe(
+    prompt='一张瓢虫的照片,微距,变焦,高质量,电影,拿着一个牌子,写着"可图"',
+    negative_prompt="",
+    guidance_scale=6.5,
+    num_inference_steps=25,
+).images[0]
+
+image.save("kolors_sample.png")
+```
+
+### IP Adapter
+
+Kolors requires a dedicated IP Adapter, and it uses [Openai-CLIP-336](https://huggingface.co/openai/clip-vit-large-patch14-336) as its image encoder.
+
+> [!TIP]
+> Using an IP Adapter with Kolors requires more than 24GB of VRAM. To use it, we recommend using [`~DiffusionPipeline.enable_model_cpu_offload`] on consumer GPUs.
+ +> [!TIP] +> While Kolors is integrated in Diffusers, you need to load the image encoder from a revision to use the safetensor files. You can still use the main branch of the original repository if you're comfortable loading pickle checkpoints. + +```python +import torch +from transformers import CLIPVisionModelWithProjection + +from diffusers import DPMSolverMultistepScheduler, KolorsPipeline +from diffusers.utils import load_image + +image_encoder = CLIPVisionModelWithProjection.from_pretrained( + "Kwai-Kolors/Kolors-IP-Adapter-Plus", + subfolder="image_encoder", + low_cpu_mem_usage=True, + torch_dtype=torch.float16, + revision="refs/pr/4", +) + +pipe = KolorsPipeline.from_pretrained( + "Kwai-Kolors/Kolors-diffusers", image_encoder=image_encoder, torch_dtype=torch.float16, variant="fp16" +) +pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True) + +pipe.load_ip_adapter( + "Kwai-Kolors/Kolors-IP-Adapter-Plus", + subfolder="", + weight_name="ip_adapter_plus_general.safetensors", + revision="refs/pr/4", + image_encoder_folder=None, +) +pipe.enable_model_cpu_offload() + +ipa_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/cat_square.png") + +image = pipe( + prompt="best quality, high quality", + negative_prompt="", + guidance_scale=6.5, + num_inference_steps=25, + ip_adapter_image=ipa_image, +).images[0] + +image.save("kolors_ipa_sample.png") +``` + +## KolorsPipeline + +[[autodoc]] KolorsPipeline + +- all +- __call__ + +## KolorsImg2ImgPipeline + +[[autodoc]] KolorsImg2ImgPipeline + +- all +- __call__ + diff --git a/docs/source/en/api/pipelines/latent_consistency_models.md b/docs/source/en/api/pipelines/latent_consistency_models.md index 4d944510445c..54e81fbe2519 100644 --- a/docs/source/en/api/pipelines/latent_consistency_models.md +++ b/docs/source/en/api/pipelines/latent_consistency_models.md @@ -1,4 +1,4 @@ - + +# Latte + +![latte text-to-video](https://github.com/Vchitect/Latte/blob/52bc0029899babbd6e9250384c83d8ed2670ff7a/visuals/latte.gif?raw=true) + +[Latte: Latent Diffusion Transformer for Video Generation](https://huggingface.co/papers/2401.03048) from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University. + +The abstract from the paper is: + +*We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. 
We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.*
+
+**Highlights**: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks: [FaceForensics](https://huggingface.co/papers/1803.09179), [SkyTimelapse](https://huggingface.co/papers/1709.07592), [UCF101](https://huggingface.co/papers/1212.0402) and [Taichi-HD](https://huggingface.co/papers/2003.00196). To prepare and download the datasets for evaluation, please refer to [this guide](https://github.com/Vchitect/Latte/blob/main/docs/datasets_evaluation.md).
+
+This pipeline was contributed by [maxin-cn](https://github.com/maxin-cn). The original codebase can be found [here](https://github.com/Vchitect/Latte). The original weights can be found under [hf.co/maxin-cn](https://huggingface.co/maxin-cn).
+
+> [!TIP]
+> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+### Inference
+
+Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
+
+First, load the pipeline:
+
+```python
+import torch
+from diffusers import LattePipeline
+
+pipeline = LattePipeline.from_pretrained(
+    "maxin-cn/Latte-1", torch_dtype=torch.float16
+).to("cuda")
+```
+
+Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:
+
+```python
+pipeline.transformer.to(memory_format=torch.channels_last)
+pipeline.vae.to(memory_format=torch.channels_last)
+```
+
+Finally, compile the components and run inference:
+
+```python
+pipeline.transformer = torch.compile(pipeline.transformer)
+pipeline.vae.decode = torch.compile(pipeline.vae.decode)
+
+video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
+```
+
+The [benchmark](https://gist.github.com/a-r-r-o-w/4e1694ca46374793c0361d740a99ff19) results on an 80GB A100 machine are:
+
+```
+Without torch.compile(): Average inference time: 16.246 seconds.
+With torch.compile(): Average inference time: 14.573 seconds.
+```
+
+## Quantization
+
+Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
+
+Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LattePipeline`] for inference with bitsandbytes.
+ +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LatteTransformer3DModel, LattePipeline +from diffusers.utils import export_to_gif +from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel + +quant_config = BitsAndBytesConfig(load_in_8bit=True) +text_encoder_8bit = T5EncoderModel.from_pretrained( + "maxin-cn/Latte-1", + subfolder="text_encoder", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = LatteTransformer3DModel.from_pretrained( + "maxin-cn/Latte-1", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +pipeline = LattePipeline.from_pretrained( + "maxin-cn/Latte-1", + text_encoder=text_encoder_8bit, + transformer=transformer_8bit, + torch_dtype=torch.float16, + device_map="balanced", +) + +prompt = "A small cactus with a happy face in the Sahara desert." +video = pipeline(prompt).frames[0] +export_to_gif(video, "latte.gif") +``` + +## LattePipeline + +[[autodoc]] LattePipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/ledits_pp.md b/docs/source/en/api/pipelines/ledits_pp.md index 7b957b1a6337..103bcf379890 100644 --- a/docs/source/en/api/pipelines/ledits_pp.md +++ b/docs/source/en/api/pipelines/ledits_pp.md @@ -12,24 +12,24 @@ specific language governing permissions and limitations under the License. # LEDITS++ +
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
+LEDITS++ was proposed in [LEDITS++: Limitless Image Editing using Text-to-Image Models](https://huggingface.co/papers/2311.16711) by Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos.
 
 The abstract from the paper is:
 
 *Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .*
 
-
-
-You can find additional information about LEDITS++ on the [project page](https://leditsplusplus-project.static.hf.space/index.html) and try it out in a [demo](https://huggingface.co/spaces/editing-images/leditsplusplus).
-
-
+> [!TIP]
+> You can find additional information about LEDITS++ on the [project page](https://leditsplusplus-project.static.hf.space/index.html) and try it out in a [demo](https://huggingface.co/spaces/editing-images/leditsplusplus).
 
-
-Due to some backward compatability issues with the current diffusers implementation of [`~schedulers.DPMSolverMultistepScheduler`] this implementation of LEdits++ can no longer guarantee perfect inversion.
-This issue is unlikely to have any noticeable effects on applied use-cases. However, we provide an alternative implementation that guarantees perfect inversion in a dedicated [GitHub repo](https://github.com/ml-research/ledits_pp).
-
+> [!WARNING]
+> Due to some backward compatibility issues with the current diffusers implementation of [`~schedulers.DPMSolverMultistepScheduler`], this implementation of LEDITS++ can no longer guarantee perfect inversion.
+> This issue is unlikely to have any noticeable effects on applied use-cases. However, we provide an alternative implementation that guarantees perfect inversion in a dedicated [GitHub repo](https://github.com/ml-research/ledits_pp).
 
-We provide two distinct pipelines based on different pre-trained models.
+We provide two distinct pipelines based on different pre-trained models.
 
 ## LEditsPPPipelineStableDiffusion
 [[autodoc]] pipelines.ledits_pp.LEditsPPPipelineStableDiffusion
diff --git a/docs/source/en/api/pipelines/ltx_video.md b/docs/source/en/api/pipelines/ltx_video.md
new file mode 100644
index 000000000000..d87c57ced790
--- /dev/null
+++ b/docs/source/en/api/pipelines/ltx_video.md
@@ -0,0 +1,414 @@
+
+
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  <img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white"/>
+</div>
+
+# LTX-Video
+
+[LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast, real-time generation of high-resolution videos from text and images. The main feature of LTX-Video is the Video-VAE. The Video-VAE has a higher pixel-to-latent compression ratio (1:192), which enables more efficient video data processing and faster generation. To prevent finer details from being lost during generation, the Video-VAE decoder performs the latent-to-pixel conversion *and* the last denoising step.
+
+You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.
+
+> [!TIP]
+> Click on the LTX-Video models in the right sidebar for more examples of other video generation tasks.
+
+The example below demonstrates how to generate a video optimized for memory or inference speed.
+
+<hfoptions id="usage">
+<hfoption id="memory">
+
+Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
+
+The LTX-Video model below requires ~10GB of VRAM.
+
+```py
+import torch
+from diffusers import LTXPipeline, AutoModel
+from diffusers.hooks import apply_group_offloading
+from diffusers.utils import export_to_video
+
+# fp8 layerwise weight-casting
+transformer = AutoModel.from_pretrained(
+    "Lightricks/LTX-Video",
+    subfolder="transformer",
+    torch_dtype=torch.bfloat16
+)
+transformer.enable_layerwise_casting(
+    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
+)
+
+pipeline = LTXPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, torch_dtype=torch.bfloat16)
+
+# group-offloading
+onload_device = torch.device("cuda")
+offload_device = torch.device("cpu")
+pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)
+apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
+apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")
+
+prompt = """
+A woman with long brown hair and light skin smiles at another woman with long blonde hair.
+The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek.
+The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and
+natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
+"""
+negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+
+video = pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=768,
+    height=512,
+    num_frames=161,
+    decode_timestep=0.03,
+    decode_noise_scale=0.025,
+    num_inference_steps=50,
+).frames[0]
+export_to_video(video, "output.mp4", fps=24)
+```
+
+</hfoption>
+<hfoption id="inference speed">
+
+[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.
+
+```py
+import torch
+from diffusers import LTXPipeline
+from diffusers.utils import export_to_video
+
+pipeline = LTXPipeline.from_pretrained(
+    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
+)
+
+# torch.compile
+pipeline.transformer.to(memory_format=torch.channels_last)
+pipeline.transformer = torch.compile(
+    pipeline.transformer, mode="max-autotune", fullgraph=True
+)
+
+prompt = """
+A woman with long brown hair and light skin smiles at another woman with long blonde hair.
+The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek.
+The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and
+natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
+"""
+negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+
+video = pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=768,
+    height=512,
+    num_frames=161,
+    decode_timestep=0.03,
+    decode_noise_scale=0.025,
+    num_inference_steps=50,
+).frames[0]
+export_to_video(video, "output.mp4", fps=24)
+```
+
+</hfoption>
+</hfoptions>
+
+## Notes
+
+- Refer to the following recommended settings for generation from the [LTX-Video](https://github.com/Lightricks/LTX-Video) repository.
+
+  - The recommended dtype for the transformer, VAE, and text encoder is `torch.bfloat16`. The VAE and text encoder can also be `torch.float32` or `torch.float16`.
+  - For guidance-distilled variants of LTX-Video, set `guidance_scale` to `1.0`. The `guidance_scale` for any other model should be set higher, like `5.0`, for good generation quality.
+  - For timestep-aware VAE variants (LTX-Video 0.9.1 and above), set `decode_timestep` to `0.05` and `image_cond_noise_scale` to `0.025`.
+  - For variants that support interpolation between multiple conditioning images and videos (LTX-Video 0.9.5 and above), use similar images and videos for the best results. Divergence from the conditioning inputs may lead to abrupt transitions in the generated video.
+
+- LTX-Video 0.9.7 includes a spatial latent upscaler and a 13B parameter transformer. During inference, a low resolution video is quickly generated first and then upscaled and refined.
+
+  <details>
+  <summary>Show example code</summary>
+
+  ```py
+  import torch
+  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
+  from diffusers.utils import export_to_video, load_video
+
+  pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
+  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16)
+  pipeline.to("cuda")
+  pipe_upsample.to("cuda")
+  pipeline.vae.enable_tiling()
+
+  def round_to_nearest_resolution_acceptable_by_vae(height, width):
+      height = height - (height % pipeline.vae_spatial_compression_ratio)
+      width = width - (width % pipeline.vae_spatial_compression_ratio)
+      return height, width
+
+  video = load_video(
+      "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
+  )[:21]  # only use the first 21 frames as conditioning
+  condition1 = LTXVideoCondition(video=video, frame_index=0)
+
+  prompt = """
+  The video depicts a winding mountain road covered in snow, with a single vehicle
+  traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation.
+  The landscape is characterized by rugged terrain and a river visible in the distance.
+  The scene captures the solitude and beauty of a winter drive through a mountainous region.
+  """
+  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+  expected_height, expected_width = 768, 1152
+  downscale_factor = 2 / 3
+  num_frames = 161
+
+  # 1. Generate video at smaller resolution
+  # Text-only conditioning is also supported without the need to pass `conditions`
+  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+  latents = pipeline(
+      conditions=[condition1],
+      prompt=prompt,
+      negative_prompt=negative_prompt,
+      width=downscaled_width,
+      height=downscaled_height,
+      num_frames=num_frames,
+      num_inference_steps=30,
+      decode_timestep=0.05,
+      decode_noise_scale=0.025,
+      image_cond_noise_scale=0.0,
+      guidance_scale=5.0,
+      guidance_rescale=0.7,
+      generator=torch.Generator().manual_seed(0),
+      output_type="latent",
+  ).frames
+
+  # 2. Upscale generated video using latent upsampler with fewer inference steps
+  # The available latent upsampler upscales the height/width by 2x
+  upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+  upscaled_latents = pipe_upsample(
+      latents=latents,
+      output_type="latent"
+  ).frames
+
+  # 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
+  video = pipeline(
+      conditions=[condition1],
+      prompt=prompt,
+      negative_prompt=negative_prompt,
+      width=upscaled_width,
+      height=upscaled_height,
+      num_frames=num_frames,
+      denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
+      num_inference_steps=10,
+      latents=upscaled_latents,
+      decode_timestep=0.05,
+      decode_noise_scale=0.025,
+      image_cond_noise_scale=0.0,
+      guidance_scale=5.0,
+      guidance_rescale=0.7,
+      generator=torch.Generator().manual_seed(0),
+      output_type="pil",
+  ).frames[0]
+
+  # 4. Downscale the video to the expected resolution
+  video = [frame.resize((expected_width, expected_height)) for frame in video]
+
+  export_to_video(video, "output.mp4", fps=24)
+  ```
+
+  </details>
+
+- The LTX-Video 0.9.7 distilled model is guidance- and timestep-distilled to speed up generation. It requires `guidance_scale` to be set to `1.0` and `num_inference_steps` to be set between `4` and `10` for good generation quality. You should also use the following custom timesteps for the best results.
+
+  - Base model inference to prepare for upscaling: `[1000, 993, 987, 981, 975, 909, 725, 0.03]`.
+  - Upscaling: `[1000, 909, 725, 421, 0]`.
+
+  <details>
+  <summary>Show example code</summary>
+
+  ```py
+  import torch
+  from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+  from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
+  from diffusers.utils import export_to_video, load_video
+
+  pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16)
+  pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16)
+  pipeline.to("cuda")
+  pipe_upsample.to("cuda")
+  pipeline.vae.enable_tiling()
+
+  def round_to_nearest_resolution_acceptable_by_vae(height, width):
+      height = height - (height % pipeline.vae_spatial_compression_ratio)
+      width = width - (width % pipeline.vae_spatial_compression_ratio)
+      return height, width
+
+  prompt = """
+  artistic anatomical 3d render, ultra quality, human half full male body with transparent
+  skin revealing structure instead of organs, muscular, intricate creative patterns,
+  monochromatic with backlighting, lightning mesh, scientific concept art, blending biology
+  with botany, surreal and ethereal quality, unreal engine 5, ray tracing, ultra realistic,
+  16K UHD, rich details. camera zooms out in a rotating fashion
+  """
+  negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
+  expected_height, expected_width = 768, 1152
+  downscale_factor = 2 / 3
+  num_frames = 161
+
+  # 1. Generate video at smaller resolution
+  downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+  downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+  latents = pipeline(
+      prompt=prompt,
+      negative_prompt=negative_prompt,
+      width=downscaled_width,
+      height=downscaled_height,
+      num_frames=num_frames,
+      timesteps=[1000, 993, 987, 981, 975, 909, 725, 0.03],
+      decode_timestep=0.05,
+      decode_noise_scale=0.025,
+      image_cond_noise_scale=0.0,
+      guidance_scale=1.0,
+      guidance_rescale=0.7,
+      generator=torch.Generator().manual_seed(0),
+      output_type="latent",
+  ).frames
+
+  # 2. Upscale generated video using latent upsampler with fewer inference steps
+  # The available latent upsampler upscales the height/width by 2x
+  upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+  upscaled_latents = pipe_upsample(
+      latents=latents,
+      adain_factor=1.0,
+      output_type="latent"
+  ).frames
+
+  # 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
+  video = pipeline(
+      prompt=prompt,
+      negative_prompt=negative_prompt,
+      width=upscaled_width,
+      height=upscaled_height,
+      num_frames=num_frames,
+      denoise_strength=0.999,  # Effectively, 4 inference steps out of 5
+      timesteps=[1000, 909, 725, 421, 0],
+      latents=upscaled_latents,
+      decode_timestep=0.05,
+      decode_noise_scale=0.025,
+      image_cond_noise_scale=0.0,
+      guidance_scale=1.0,
+      guidance_rescale=0.7,
+      generator=torch.Generator().manual_seed(0),
+      output_type="pil",
+  ).frames[0]
+
+  # 4. Downscale the video to the expected resolution
+  video = [frame.resize((expected_width, expected_height)) for frame in video]
+
+  export_to_video(video, "output.mp4", fps=24)
+  ```
+
+  </details>
+
+- LTX-Video supports LoRAs with [`~loaders.LTXVideoLoraLoaderMixin.load_lora_weights`].
+
+  <details>
+  <summary>Show example code</summary>
+
+  ```py
+  import torch
+  from diffusers import LTXConditionPipeline
+  from diffusers.utils import export_to_video, load_image
+
+  pipeline = LTXConditionPipeline.from_pretrained(
+      "Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16
+  )
+
+  pipeline.load_lora_weights("Lightricks/LTX-Video-Cakeify-LoRA", adapter_name="cakeify")
+  pipeline.set_adapters("cakeify")
+
+  # use "CAKEIFY" to trigger the LoRA
+  prompt = "CAKEIFY a person using a knife to cut a cake shaped like a Pikachu plushie"
+  image = load_image("https://huggingface.co/Lightricks/LTX-Video-Cakeify-LoRA/resolve/main/assets/images/pikachu.png")
+
+  video = pipeline(
+      prompt=prompt,
+      image=image,
+      width=576,
+      height=576,
+      num_frames=161,
+      decode_timestep=0.03,
+      decode_noise_scale=0.025,
+      num_inference_steps=50,
+  ).frames[0]
+  export_to_video(video, "output.mp4", fps=26)
+  ```
+
+  </details>
+
+- LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [`loaders.FromOriginalModelMixin.from_single_file`] or [`loaders.FromSingleFileMixin.from_single_file`].
+
+  <details>
+  <summary>Show example code</summary>
+
+  ```py
+  import torch
+  from diffusers.utils import export_to_video
+  from diffusers import LTXPipeline, AutoModel, GGUFQuantizationConfig
+
+  transformer = AutoModel.from_single_file(
+      "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf",
+      quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
+      torch_dtype=torch.bfloat16
+  )
+  pipeline = LTXPipeline.from_pretrained(
+      "Lightricks/LTX-Video",
+      transformer=transformer,
+      torch_dtype=torch.bfloat16
+  )
+  ```
+
+  </details>
+
+## LTXPipeline
+
+[[autodoc]] LTXPipeline
+  - all
+  - __call__
+
+## LTXImageToVideoPipeline
+
+[[autodoc]] LTXImageToVideoPipeline
+  - all
+  - __call__
+
+## LTXConditionPipeline
+
+[[autodoc]] LTXConditionPipeline
+  - all
+  - __call__
+
+## LTXLatentUpsamplePipeline
+
+[[autodoc]] LTXLatentUpsamplePipeline
+  - all
+  - __call__
+
+## LTXPipelineOutput
+
+[[autodoc]] pipelines.ltx.pipeline_output.LTXPipelineOutput
diff --git a/docs/source/en/api/pipelines/lumina.md b/docs/source/en/api/pipelines/lumina.md
new file mode 100644
index 000000000000..0a236d213d6c
--- /dev/null
+++ b/docs/source/en/api/pipelines/lumina.md
@@ -0,0 +1,127 @@
+
+
+# Lumina-T2X
+![concepts](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9f52eabb-07dc-4881-8257-6d8a5f2a0a5a)
+
+[Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT](https://github.com/Alpha-VLLM/Lumina-T2X/blob/main/assets/lumina-next.pdf) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory.
+
+The abstract from the paper is:
+
+*Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers (Flag-DiT) that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling.*
+
+**Highlights**: Lumina-Next is a next-generation Diffusion Transformer that significantly enhances text-to-image generation, multilingual generation, and multitask performance by introducing the Next-DiT architecture, 3D RoPE, and frequency- and time-aware RoPE, among other improvements.
+
+Lumina-Next has the following components:
+* It improves sampling efficiency with fewer and faster sampling steps.
+* It uses Next-DiT as the transformer backbone, with sandwich normalization, 3D RoPE, and grouped-query attention.
+
+* It uses a Frequency- and Time-Aware Scaled RoPE.
+
+---
+
+[Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers](https://huggingface.co/papers/2405.05945) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory.
+
+The abstract from the paper is:
+
+*Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.*
+
+
+You can find the original codebase at [Alpha-VLLM](https://github.com/Alpha-VLLM/Lumina-T2X) and all the available checkpoints at [Alpha-VLLM Lumina Family](https://huggingface.co/collections/Alpha-VLLM/lumina-family-66423205bedb81171fd0644b).
+
+**Highlights**: Lumina-T2X supports Any Modality, Resolution, and Duration.
+
+Lumina-T2X has the following components:
+* It uses a Flow-based Large Diffusion Transformer as the backbone
+* It supports any modality with one backbone and corresponding encoders and decoders.
+
+This pipeline was contributed by [PommesPeter](https://github.com/PommesPeter). The original codebase can be found [here](https://github.com/Alpha-VLLM/Lumina-T2X). The original weights can be found under [hf.co/Alpha-VLLM](https://huggingface.co/Alpha-VLLM).
+
+> [!TIP]
+> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+### Inference (Text-to-Image)
+
+Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
+
+First, load the pipeline:
+
+```python
+from diffusers import LuminaPipeline
+import torch
+
+pipeline = LuminaPipeline.from_pretrained(
+    "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
+).to("cuda")
+```
+
+Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:
+
+```python
+pipeline.transformer.to(memory_format=torch.channels_last)
+pipeline.vae.to(memory_format=torch.channels_last)
+```
+
+Finally, compile the components and run inference:
+
+```python
+pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
+pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)
+
+image = pipeline(prompt="Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. Background shows an industrial revolution cityscape with smoky skies and tall, metal structures").images[0]
+```
+
+## Quantization
+
+Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.
+
+Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LuminaPipeline`] for inference with bitsandbytes.
+
+```py
+import torch
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LuminaNextDiT2DModel, LuminaPipeline
+from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
+
+quant_config = BitsAndBytesConfig(load_in_8bit=True)
+text_encoder_8bit = T5EncoderModel.from_pretrained(
+    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
+    subfolder="text_encoder",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
+transformer_8bit = LuminaNextDiT2DModel.from_pretrained(
+    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
+    subfolder="transformer",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+pipeline = LuminaPipeline.from_pretrained(
+    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
+    text_encoder=text_encoder_8bit,
+    transformer=transformer_8bit,
+    torch_dtype=torch.float16,
+    device_map="balanced",
+)
+
+prompt = "a tiny astronaut hatching from an egg on the moon"
+image = pipeline(prompt).images[0]
+image.save("lumina.png")
+```
+
+## LuminaPipeline
+
+[[autodoc]] LuminaPipeline
+  - all
+  - __call__
+
diff --git a/docs/source/en/api/pipelines/lumina2.md b/docs/source/en/api/pipelines/lumina2.md
new file mode 100644
index 000000000000..0c4e793404fe
--- /dev/null
+++ b/docs/source/en/api/pipelines/lumina2.md
@@ -0,0 +1,84 @@
+
+
+# Lumina2
+
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ +[Lumina Image 2.0: A Unified and Efficient Image Generative Model](https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0) is a 2 billion parameter flow-based diffusion transformer capable of generating diverse images from text descriptions. + +The abstract from the paper is: + +*We introduce Lumina-Image 2.0, an advanced text-to-image model that surpasses previous state-of-the-art methods across multiple benchmarks, while also shedding light on its potential to evolve into a generalist vision intelligence model. Lumina-Image 2.0 exhibits three key properties: (1) Unification – it adopts a unified architecture that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and facilitating task expansion. Besides, since high-quality captioners can provide semantically better-aligned text-image training pairs, we introduce a unified captioning system, UniCaptioner, which generates comprehensive and precise captions for the model. This not only accelerates model convergence but also enhances prompt adherence, variable-length prompt handling, and task generalization via prompt templates. (2) Efficiency – to improve the efficiency of the unified architecture, we develop a set of optimization techniques that improve semantic learning and fine-grained texture generation during training while incorporating inference-time acceleration strategies without compromising image quality. (3) Transparency – we open-source all training details, code, and models to ensure full reproducibility, aiming to bridge the gap between well-resourced closed-source research teams and independent developers.* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
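+
+For a quick start, the sketch below shows basic text-to-image generation. It is an illustrative example rather than part of the original model card; the repo id, prompt, and call pattern are reused from the loading examples that follow.
+
+```python
+import torch
+from diffusers import Lumina2Pipeline
+
+pipe = Lumina2Pipeline.from_pretrained(
+    "Alpha-VLLM/Lumina-Image-2.0", torch_dtype=torch.bfloat16
+)
+# offload submodules to CPU between forward passes to lower VRAM usage
+pipe.enable_model_cpu_offload()
+
+image = pipe(
+    "a cat holding a sign that says hello",
+    generator=torch.Generator("cpu").manual_seed(0),
+).images[0]
+image.save("lumina2-t2i.png")
+```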
+
+## Using Single File loading with Lumina Image 2.0
+
+Single file loading for Lumina Image 2.0 is available for the `Lumina2Transformer2DModel`.
+
+```python
+import torch
+from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline
+
+ckpt_path = "https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0/blob/main/consolidated.00-of-01.pth"
+transformer = Lumina2Transformer2DModel.from_single_file(
+    ckpt_path, torch_dtype=torch.bfloat16
+)
+
+pipe = Lumina2Pipeline.from_pretrained(
+    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
+)
+pipe.enable_model_cpu_offload()
+image = pipe(
+    "a cat holding a sign that says hello",
+    generator=torch.Generator("cpu").manual_seed(0),
+).images[0]
+image.save("lumina-single-file.png")
+```
+
+## Using GGUF Quantized Checkpoints with Lumina Image 2.0
+
+GGUF Quantized checkpoints for the `Lumina2Transformer2DModel` can be loaded via `from_single_file` with the `GGUFQuantizationConfig`.
+
+```python
+import torch
+from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline, GGUFQuantizationConfig
+
+ckpt_path = "https://huggingface.co/calcuis/lumina-gguf/blob/main/lumina2-q4_0.gguf"
+transformer = Lumina2Transformer2DModel.from_single_file(
+    ckpt_path,
+    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
+    torch_dtype=torch.bfloat16,
+)
+
+pipe = Lumina2Pipeline.from_pretrained(
+    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
+)
+pipe.enable_model_cpu_offload()
+image = pipe(
+    "a cat holding a sign that says hello",
+    generator=torch.Generator("cpu").manual_seed(0),
+).images[0]
+image.save("lumina-gguf.png")
+```
+
+## Lumina2Pipeline
+
+[[autodoc]] Lumina2Pipeline
+  - all
+  - __call__
diff --git a/docs/source/en/api/pipelines/marigold.md b/docs/source/en/api/pipelines/marigold.md
new file mode 100644
index 000000000000..81e103afeb64
--- /dev/null
+++ b/docs/source/en/api/pipelines/marigold.md
@@ -0,0 +1,122 @@
+
+
+# Marigold Computer Vision
+
+![marigold](https://marigoldmonodepth.github.io/images/teaser_collage_compressed.jpg)
+
+Marigold was proposed in
+[Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145),
+a CVPR 2024 Oral paper by
+[Bingxin Ke](http://www.kebingxin.com/),
+[Anton Obukhov](https://www.obukhov.ai/),
+[Shengyu Huang](https://shengyuh.github.io/),
+[Nando Metzger](https://nandometzger.github.io/),
+[Rodrigo Caye Daudt](https://rcdaudt.github.io/), and
+[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
+The core idea is to **repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional
+computer vision tasks**.
+This approach was explored by fine-tuning Stable Diffusion for **Monocular Depth Estimation**, as demonstrated in the
+teaser above.
+
+Marigold was later extended in the follow-up paper,
+[Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis](https://huggingface.co/papers/2312.02145),
+authored by
+[Bingxin Ke](http://www.kebingxin.com/),
+[Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US),
+[Tianfu Wang](https://tianfwang.github.io/),
+[Nando Metzger](https://nandometzger.github.io/),
+[Shengyu Huang](https://shengyuh.github.io/),
+[Bo Li](https://www.linkedin.com/in/bobboli0202/),
+[Anton Obukhov](https://www.obukhov.ai/), and
+[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
+This work expanded Marigold to support new modalities such as **Surface Normals** and **Intrinsic Image Decomposition** +(IID), introduced a training protocol for **Latent Consistency Models** (LCM), and demonstrated **High-Resolution** (HR) +processing capability. + +> [!TIP] +> The early Marigold models (`v1-0` and earlier) were optimized for best results with at least 10 inference steps. +> LCM models were later developed to enable high-quality inference in just 1 to 4 steps. +> Marigold models `v1-1` and later use the DDIM scheduler to achieve optimal +> results in as few as 1 to 4 steps. + +## Available Pipelines + +Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a +corresponding prediction. +Currently, the following computer vision tasks are implemented: + +| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities | +|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | +| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | +| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1),
[prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) | + +## Available Checkpoints + +All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face. +They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train +new model checkpoints. +The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps. + +| Checkpoint | Modality | Comment | +|-----------------------------------------------------------------------------------------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. | +| [prs-eth/marigold-normals-v0-1](https://huggingface.co/prs-eth/marigold-normals-v0-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen space camera, with values in the range from -1 to 1. | +| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition is comprised of Albedo and two BRDF material properties: Roughness and Metallicity. | +| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | HyperSim decomposition of an image  \\(I\\)  is comprised of Albedo  \\(A\\), Diffuse shading  \\(S\\), and Non-diffuse residual  \\(R\\):  \\(I = A*S+R\\). | + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff +> between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to +> efficiently load the same components into multiple pipelines. +> Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section +> [here](../../using-diffusers/svd#reduce-memory-usage). + +> [!WARNING] +> Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint. +> The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases. +> To accommodate this, the `num_inference_steps` parameter in the pipeline's `__call__` method defaults to `None` (see the +> API reference). +> Unless set explicitly, it inherits the value from the `default_denoising_steps` field in the checkpoint configuration +> file (`model_index.json`). +> This ensures high-quality predictions when invoking the pipeline with only the `image` argument. + +See also Marigold [usage examples](../../using-diffusers/marigold_usage). 
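+
+As a quick reference for the APIs documented below, here is a minimal depth prediction sketch. The checkpoint comes from the table above; the input image URL is an assumed placeholder, so substitute your own image.
+
+```python
+import torch
+from diffusers import MarigoldDepthPipeline
+from diffusers.utils import load_image
+
+pipe = MarigoldDepthPipeline.from_pretrained(
+    "prs-eth/marigold-depth-v1-1", torch_dtype=torch.float16, variant="fp16"
+).to("cuda")
+
+# assumed example input image; replace with your own
+image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+# num_inference_steps defaults to the checkpoint's default_denoising_steps
+depth = pipe(image)
+
+# colorize the affine-invariant depth prediction and save it
+vis = pipe.image_processor.visualize_depth(depth.prediction)
+vis[0].save("depth_vis.png")
+```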
+ +## Marigold Depth Prediction API + +[[autodoc]] MarigoldDepthPipeline + - __call__ + +[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput + +[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth + +## Marigold Normals Estimation API +[[autodoc]] MarigoldNormalsPipeline + - __call__ + +[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput + +[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals + +## Marigold Intrinsic Image Decomposition API + +[[autodoc]] MarigoldIntrinsicsPipeline + - __call__ + +[[autodoc]] pipelines.marigold.pipeline_marigold_intrinsics.MarigoldIntrinsicsOutput + +[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics diff --git a/docs/source/en/api/pipelines/mochi.md b/docs/source/en/api/pipelines/mochi.md new file mode 100644 index 000000000000..f19a9bd575c1 --- /dev/null +++ b/docs/source/en/api/pipelines/mochi.md @@ -0,0 +1,276 @@ + + +# Mochi 1 Preview + +
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
+> [!TIP]
+> Only a research preview of the model weights is available at the moment.
+
+[Mochi 1](https://huggingface.co/genmo/mochi-1-preview) is a video generation model by Genmo with a strong focus on prompt adherence and motion quality. The model features a 10B parameter Asymmetric Diffusion Transformer (AsymmDiT) architecture, and uses non-square QKV and output projection layers to reduce inference memory requirements. A single T5-XXL model is used to encode prompts.
+
+*Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. The model is released under a permissive Apache 2.0 license.*
+
+> [!TIP]
+> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+## Quantization
+
+Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
+
+Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`MochiPipeline`] for inference with bitsandbytes.
+
+```py
+import torch
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, MochiTransformer3DModel, MochiPipeline
+from diffusers.utils import export_to_video
+from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
+
+quant_config = BitsAndBytesConfig(load_in_8bit=True)
+text_encoder_8bit = T5EncoderModel.from_pretrained(
+    "genmo/mochi-1-preview",
+    subfolder="text_encoder",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
+transformer_8bit = MochiTransformer3DModel.from_pretrained(
+    "genmo/mochi-1-preview",
+    subfolder="transformer",
+    quantization_config=quant_config,
+    torch_dtype=torch.float16,
+)
+
+pipeline = MochiPipeline.from_pretrained(
+    "genmo/mochi-1-preview",
+    text_encoder=text_encoder_8bit,
+    transformer=transformer_8bit,
+    torch_dtype=torch.float16,
+    device_map="balanced",
+)
+
+video = pipeline(
+    "Close-up of a cat's eye, with the galaxy reflected in the cat's eye. Ultra high resolution 4k.",
+    num_inference_steps=28,
+    guidance_scale=3.5
+).frames[0]
+export_to_video(video, "cat.mp4")
+```
+
+## Generating videos with Mochi-1 Preview
+
+The following example will download the full precision `mochi-1-preview` weights and produce the highest quality results but will require at least 42GB VRAM to run.
+
+```python
+import torch
+from diffusers import MochiPipeline
+from diffusers.utils import export_to_video
+
+pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")
+
+# Enable memory savings
+pipe.enable_model_cpu_offload()
+pipe.enable_vae_tiling()
+
+prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
+ +with torch.autocast("cuda", torch.bfloat16, cache_enabled=False): + frames = pipe(prompt, num_frames=85).frames[0] + +export_to_video(frames, "mochi.mp4", fps=30) +``` + +## Using a lower precision variant to save memory + +The following example will use the `bfloat16` variant of the model and requires 22GB VRAM to run. There is a slight drop in the quality of the generated video as a result. + +```python +import torch +from diffusers import MochiPipeline +from diffusers.utils import export_to_video + +pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16) + +# Enable memory savings +pipe.enable_model_cpu_offload() +pipe.enable_vae_tiling() + +prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k." +frames = pipe(prompt, num_frames=85).frames[0] + +export_to_video(frames, "mochi.mp4", fps=30) +``` + +## Reproducing the results from the Genmo Mochi repo + +The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. In order to run inference in the same way as the original implementation, please refer to the following example. + +> [!TIP] +> The original Mochi implementation zeros out empty prompts. However, enabling this option and placing the entire pipeline under autocast can lead to numerical overflows with the T5 text encoder. +> +> When enabling `force_zeros_for_empty_prompt`, it is recommended to run the text encoding step outside the autocast context in full precision. + +> [!TIP] +> Decoding the latents in full precision is very memory intensive. You will need at least 70GB VRAM to generate the 163 frames in this example. To reduce memory, either reduce the number of frames or run the decoding step in `torch.bfloat16`. + +```python +import torch +from torch.nn.attention import SDPBackend, sdpa_kernel + +from diffusers import MochiPipeline +from diffusers.utils import export_to_video +from diffusers.video_processor import VideoProcessor + +pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", force_zeros_for_empty_prompt=True) +pipe.enable_vae_tiling() +pipe.enable_model_cpu_offload() + +prompt = "An aerial shot of a parade of elephants walking across the African savannah. The camera showcases the herd and the surrounding landscape." 
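+# As recommended in the tips above, encode the prompt outside the autocast context
+# and in full precision to avoid numerical overflow in the T5 text encoder.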
+ +with torch.no_grad(): + prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask = ( + pipe.encode_prompt(prompt=prompt) + ) + +with torch.autocast("cuda", torch.bfloat16): + with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION): + frames = pipe( + prompt_embeds=prompt_embeds, + prompt_attention_mask=prompt_attention_mask, + negative_prompt_embeds=negative_prompt_embeds, + negative_prompt_attention_mask=negative_prompt_attention_mask, + guidance_scale=4.5, + num_inference_steps=64, + height=480, + width=848, + num_frames=163, + generator=torch.Generator("cuda").manual_seed(0), + output_type="latent", + return_dict=False, + )[0] + +video_processor = VideoProcessor(vae_scale_factor=8) +has_latents_mean = hasattr(pipe.vae.config, "latents_mean") and pipe.vae.config.latents_mean is not None +has_latents_std = hasattr(pipe.vae.config, "latents_std") and pipe.vae.config.latents_std is not None +if has_latents_mean and has_latents_std: + latents_mean = ( + torch.tensor(pipe.vae.config.latents_mean).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype) + ) + latents_std = ( + torch.tensor(pipe.vae.config.latents_std).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype) + ) + frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean +else: + frames = frames / pipe.vae.config.scaling_factor + +with torch.no_grad(): + video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0] + +video = video_processor.postprocess_video(video)[0] +export_to_video(video, "mochi.mp4", fps=30) +``` + +## Running inference with multiple GPUs + +It is possible to split the large Mochi transformer across multiple GPUs using the `device_map` and `max_memory` options in `from_pretrained`. In the following example we split the model across two GPUs, each with 24GB of VRAM. + +```python +import torch +from diffusers import MochiPipeline, MochiTransformer3DModel +from diffusers.utils import export_to_video + +model_id = "genmo/mochi-1-preview" +transformer = MochiTransformer3DModel.from_pretrained( + model_id, + subfolder="transformer", + device_map="auto", + max_memory={0: "24GB", 1: "24GB"} +) + +pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer) +pipe.enable_model_cpu_offload() +pipe.enable_vae_tiling() + +with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False): + frames = pipe( + prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.", + negative_prompt="", + height=480, + width=848, + num_frames=85, + num_inference_steps=50, + guidance_scale=4.5, + num_videos_per_prompt=1, + generator=torch.Generator(device="cuda").manual_seed(0), + max_sequence_length=256, + output_type="pil", + ).frames[0] + +export_to_video(frames, "output.mp4", fps=30) +``` + +## Using single file loading with the Mochi Transformer + +You can use `from_single_file` to load the Mochi transformer in its original format. + +> [!TIP] +> Diffusers currently doesn't support using the FP8 scaled versions of the Mochi single file checkpoints. 
+ +```python +import torch +from diffusers import MochiPipeline, MochiTransformer3DModel +from diffusers.utils import export_to_video + +model_id = "genmo/mochi-1-preview" + +ckpt_path = "https://huggingface.co/Comfy-Org/mochi_preview_repackaged/blob/main/split_files/diffusion_models/mochi_preview_bf16.safetensors" + +transformer = MochiTransformer3DModel.from_pretrained(ckpt_path, torch_dtype=torch.bfloat16) + +pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer) +pipe.enable_model_cpu_offload() +pipe.enable_vae_tiling() + +with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False): + frames = pipe( + prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.", + negative_prompt="", + height=480, + width=848, + num_frames=85, + num_inference_steps=50, + guidance_scale=4.5, + num_videos_per_prompt=1, + generator=torch.Generator(device="cuda").manual_seed(0), + max_sequence_length=256, + output_type="pil", + ).frames[0] + +export_to_video(frames, "output.mp4", fps=30) +``` + +## MochiPipeline + +[[autodoc]] MochiPipeline + - all + - __call__ + +## MochiPipelineOutput + +[[autodoc]] pipelines.mochi.pipeline_output.MochiPipelineOutput diff --git a/docs/source/en/api/pipelines/musicldm.md b/docs/source/en/api/pipelines/musicldm.md index 3ffb6541405d..1a83e5932ed4 100644 --- a/docs/source/en/api/pipelines/musicldm.md +++ b/docs/source/en/api/pipelines/musicldm.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # MusicLDM MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov. @@ -40,11 +43,8 @@ During inference: * Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. * The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument. - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
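+
+The following is a minimal text-to-audio sketch that exercises the `num_waveforms_per_prompt` and `audio_length_in_s` arguments described above (the `ucsd-reach/musicldm` checkpoint id and the 16 kHz sampling rate are assumptions based on the official release):
+
+```py
+import scipy
+import torch
+from diffusers import MusicLDMPipeline
+
+pipe = MusicLDMPipeline.from_pretrained("ucsd-reach/musicldm", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
+# Generate several candidate waveforms; they are returned ranked from best to worst
+audios = pipe(
+    prompt,
+    num_inference_steps=200,
+    audio_length_in_s=10.0,
+    num_waveforms_per_prompt=3,
+).audios
+
+# Assumed 16 kHz output sampling rate
+scipy.io.wavfile.write("techno.wav", rate=16000, data=audios[0])
+```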
## MusicLDMPipeline [[autodoc]] MusicLDMPipeline diff --git a/docs/source/en/api/pipelines/omnigen.md b/docs/source/en/api/pipelines/omnigen.md new file mode 100644 index 000000000000..4fac5c789a25 --- /dev/null +++ b/docs/source/en/api/pipelines/omnigen.md @@ -0,0 +1,77 @@ + + +# OmniGen + +[OmniGen: Unified Image Generation](https://huggingface.co/papers/2409.11340) from BAAI, by Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, Zheng Liu. + +The abstract from the paper is: + +*The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +This pipeline was contributed by [staoxiao](https://github.com/staoxiao). The original codebase can be found [here](https://github.com/VectorSpaceLab/OmniGen). The original weights can be found under [hf.co/shitao](https://huggingface.co/Shitao/OmniGen-v1). + +## Inference + +First, load the pipeline: + +```python +import torch +from diffusers import OmniGenPipeline + +pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16) +pipe.to("cuda") +``` + +For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image. +You can try setting the `height` and `width` parameters to generate images with different size. + +```python +prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. 
Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD." +image = pipe( + prompt=prompt, + height=1024, + width=1024, + guidance_scale=3, + generator=torch.Generator(device="cpu").manual_seed(111), +).images[0] +image.save("output.png") +``` + +OmniGen supports multimodal inputs. +When the input includes an image, you need to add a placeholder `<|image_1|>` in the text prompt to represent the image. +It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image. + +```python +prompt="<|image_1|> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola." +input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")] +image = pipe( + prompt=prompt, + input_images=input_images, + guidance_scale=2, + img_guidance_scale=1.6, + use_input_image_size_as_output=True, + generator=torch.Generator(device="cpu").manual_seed(222)).images[0] +image.save("output.png") +``` + +## OmniGenPipeline + +[[autodoc]] OmniGenPipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/overview.md b/docs/source/en/api/pipelines/overview.md index cd1232a90d6e..ce883931df21 100644 --- a/docs/source/en/api/pipelines/overview.md +++ b/docs/source/en/api/pipelines/overview.md @@ -1,4 +1,4 @@ - + +# Perturbed-Attention Guidance + +
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ +[Perturbed-Attention Guidance (PAG)](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules. + +PAG was introduced in [Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance](https://huggingface.co/papers/2403.17377) by Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin and Seungryong Kim. + +The abstract from the paper is: + +*Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.* + +PAG can be used by specifying the `pag_applied_layers` as a parameter when instantiating a PAG pipeline. It can be a single string or a list of strings. Each string can be a unique layer identifier or a regular expression to identify one or more layers. + +- Full identifier as a normal string: `down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor` +- Full identifier as a RegEx: `down_blocks.2.(attentions|motion_modules).0.transformer_blocks.0.attn1.processor` +- Partial identifier as a RegEx: `down_blocks.2`, or `attn1` +- List of identifiers (can be combo of strings and ReGex): `["blocks.1", "blocks.(14|20)", r"down_blocks\.(2,3)"]` + +> [!WARNING] +> Since RegEx is supported as a way for matching layer identifiers, it is crucial to use it correctly otherwise there might be unexpected behaviour. The recommended way to use PAG is by specifying layers as `blocks.{layer_index}` and `blocks.({layer_index_1|layer_index_2|...})`. Using it in any other way, while doable, may bypass our basic validation checks and give you unexpected results. 
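+
+For example, the following is a minimal sketch of enabling PAG on an SDXL text-to-image pipeline (the checkpoint, layer choice, and `pag_scale` value are illustrative):
+
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+
+# Enable PAG at load time and pick the attention layers it perturbs
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    enable_pag=True,
+    pag_applied_layers=["mid"],
+    torch_dtype=torch.float16,
+).to("cuda")
+
+image = pipeline(
+    prompt="an insect robot preparing a delicious meal, anime style",
+    guidance_scale=7.0,
+    pag_scale=3.0,  # strength of the perturbed-attention guidance
+).images[0]
+image.save("pag_sdxl.png")
+```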
+ +## AnimateDiffPAGPipeline +[[autodoc]] AnimateDiffPAGPipeline + - all + - __call__ + +## HunyuanDiTPAGPipeline +[[autodoc]] HunyuanDiTPAGPipeline + - all + - __call__ + +## KolorsPAGPipeline +[[autodoc]] KolorsPAGPipeline + - all + - __call__ + +## StableDiffusionPAGInpaintPipeline +[[autodoc]] StableDiffusionPAGInpaintPipeline + - all + - __call__ + +## StableDiffusionPAGPipeline +[[autodoc]] StableDiffusionPAGPipeline + - all + - __call__ + +## StableDiffusionPAGImg2ImgPipeline +[[autodoc]] StableDiffusionPAGImg2ImgPipeline + - all + - __call__ + +## StableDiffusionControlNetPAGPipeline +[[autodoc]] StableDiffusionControlNetPAGPipeline + +## StableDiffusionControlNetPAGInpaintPipeline +[[autodoc]] StableDiffusionControlNetPAGInpaintPipeline + - all + - __call__ + +## StableDiffusionXLPAGPipeline +[[autodoc]] StableDiffusionXLPAGPipeline + - all + - __call__ + +## StableDiffusionXLPAGImg2ImgPipeline +[[autodoc]] StableDiffusionXLPAGImg2ImgPipeline + - all + - __call__ + +## StableDiffusionXLPAGInpaintPipeline +[[autodoc]] StableDiffusionXLPAGInpaintPipeline + - all + - __call__ + +## StableDiffusionXLControlNetPAGPipeline +[[autodoc]] StableDiffusionXLControlNetPAGPipeline + - all + - __call__ + +## StableDiffusionXLControlNetPAGImg2ImgPipeline +[[autodoc]] StableDiffusionXLControlNetPAGImg2ImgPipeline + - all + - __call__ + +## StableDiffusion3PAGPipeline +[[autodoc]] StableDiffusion3PAGPipeline + - all + - __call__ + +## StableDiffusion3PAGImg2ImgPipeline +[[autodoc]] StableDiffusion3PAGImg2ImgPipeline + - all + - __call__ + +## PixArtSigmaPAGPipeline +[[autodoc]] PixArtSigmaPAGPipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/paint_by_example.md b/docs/source/en/api/pipelines/paint_by_example.md index effd608873fd..02bf6db7265d 100644 --- a/docs/source/en/api/pipelines/paint_by_example.md +++ b/docs/source/en/api/pipelines/paint_by_example.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Paint by Example [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen. @@ -24,11 +27,8 @@ The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https:// Paint by Example is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images. - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
- - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## PaintByExamplePipeline [[autodoc]] PaintByExamplePipeline diff --git a/docs/source/en/api/pipelines/panorama.md b/docs/source/en/api/pipelines/panorama.md index b34008ad830f..b65e05dd0b51 100644 --- a/docs/source/en/api/pipelines/panorama.md +++ b/docs/source/en/api/pipelines/panorama.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # MultiDiffusion +
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ [MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. The abstract from the paper is: @@ -35,11 +42,8 @@ For example, without circular padding, there is a stitching artifact (default): But with circular padding, the right and the left parts are matching (`circular_padding=True`): ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20circular_padding.png) - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## StableDiffusionPanoramaPipeline [[autodoc]] StableDiffusionPanoramaPipeline diff --git a/docs/source/en/api/pipelines/pia.md b/docs/source/en/api/pipelines/pia.md index 8ba78252c99b..eebfa4d4f8a6 100644 --- a/docs/source/en/api/pipelines/pia.md +++ b/docs/source/en/api/pipelines/pia.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Image-to-Video Generation with PIA (Personalized Image Animator) +
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ ## Overview -[PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://arxiv.org/abs/2312.13964) by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen +[PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://huggingface.co/papers/2312.13964) by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and the compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment within and allows for a stronger focus on aligning with motion-related guidance. @@ -80,15 +87,12 @@ Here are some sample outputs: - - -If you plan on using a scheduler that can clip samples, make sure to disable it by setting `clip_sample=False` in the scheduler as this can also have an adverse effect on generated samples. Additionally, the PIA checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`. - - +> [!TIP] +> If you plan on using a scheduler that can clip samples, make sure to disable it by setting `clip_sample=False` in the scheduler as this can also have an adverse effect on generated samples. Additionally, the PIA checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`. ## Using FreeInit -[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://arxiv.org/abs/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu. +[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://huggingface.co/papers/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu. FreeInit is an effective method that improves temporal consistency and overall quality of videos generated using video-diffusion-models without any addition training. It can be applied to PIA, AnimateDiff, ModelScope, VideoCrafter and various other video generation models seamlessly at inference time, and works by iteratively refining the latent-initialization noise. More details can be found it the paper. @@ -142,11 +146,8 @@ export_to_gif(frames, "pia-freeinit-animation.gif") - - -FreeInit is not really free - the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. 
Setting the `use_fast_sampling` parameter to `True` can improve the overall performance (at the cost of lower quality compared to when `use_fast_sampling=False` but still better results than vanilla video generation models). - - +> [!WARNING] +> FreeInit is not really free - the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. Setting the `use_fast_sampling` parameter to `True` can improve the overall performance (at the cost of lower quality compared to when `use_fast_sampling=False` but still better results than vanilla video generation models). ## PIAPipeline diff --git a/docs/source/en/api/pipelines/pix2pix.md b/docs/source/en/api/pipelines/pix2pix.md index 52767a90b214..84eb0cb5e5d3 100644 --- a/docs/source/en/api/pipelines/pix2pix.md +++ b/docs/source/en/api/pipelines/pix2pix.md @@ -1,4 +1,4 @@ - + +# PixArt-Σ + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage_sigma.jpg) + +[PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation](https://huggingface.co/papers/2403.04692) is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. + +The abstract from the paper is: + +*In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the ‘weaker’ baseline to a ‘stronger’ model via incorporating higher quality data, a process we term “weak-to-strong training”. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ’s capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of highquality visual content in industries such as film and gaming.* + +You can find the original codebase at [PixArt-alpha/PixArt-sigma](https://github.com/PixArt-alpha/PixArt-sigma) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha). + +Some notes about this pipeline: + +* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](https://hf.co/docs/transformers/model_doc/dit). +* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. +* It is good at producing high-resolution images at different aspect ratios. 
To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-sigma/blob/master/diffusion/data/datasets/utils.py). +* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as PixArt-α, Stable Diffusion XL, Playground V2.0 and DALL-E 3, while being more efficient than them. +* It shows the ability of generating super high resolution images, such as 2048px or even 4K. +* It shows that text-to-image models can grow from a weak model to a stronger one through several improvements (VAEs, datasets, and so on.) + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +> [!TIP] +> You can further improve generation quality by passing the generated image from [`PixArtSigmaPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model. + +## Inference with under 8GB GPU VRAM + +Run the [`PixArtSigmaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example. + +First, install the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library: + +```bash +pip install -U bitsandbytes +``` + +Then load the text encoder in 8-bit: + +```python +from transformers import T5EncoderModel +from diffusers import PixArtSigmaPipeline +import torch + +text_encoder = T5EncoderModel.from_pretrained( + "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", + subfolder="text_encoder", + load_in_8bit=True, + device_map="auto", +) +pipe = PixArtSigmaPipeline.from_pretrained( + "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", + text_encoder=text_encoder, + transformer=None, + device_map="balanced" +) +``` + +Now, use the `pipe` to encode a prompt: + +```python +with torch.no_grad(): + prompt = "cute cat" + prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt) +``` + +Since text embeddings have been computed, remove the `text_encoder` and `pipe` from the memory, and free up some GPU VRAM: + +```python +import gc + +def flush(): + gc.collect() + torch.cuda.empty_cache() + +del text_encoder +del pipe +flush() +``` + +Then compute the latents with the prompt embeddings as inputs: + +```python +pipe = PixArtSigmaPipeline.from_pretrained( + "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", + text_encoder=None, + torch_dtype=torch.float16, +).to("cuda") + +latents = pipe( + negative_prompt=None, + prompt_embeds=prompt_embeds, + negative_prompt_embeds=negative_embeds, + prompt_attention_mask=prompt_attention_mask, + negative_prompt_attention_mask=negative_prompt_attention_mask, + num_images_per_prompt=1, + output_type="latent", +).images + +del pipe.transformer +flush() +``` + +> [!TIP] +> Notice that while initializing `pipe`, you're setting `text_encoder` to `None` so that it's not loaded. 
+ +Once the latents are computed, pass it off to the VAE to decode into a real image: + +```python +with torch.no_grad(): + image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0] +image = pipe.image_processor.postprocess(image, output_type="pil")[0] +image.save("cat.png") +``` + +By deleting components you aren't using and flushing the GPU VRAM, you should be able to run [`PixArtSigmaPipeline`] with under 8GB GPU VRAM. + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/8bits_cat.png) + +If you want a report of your memory-usage, run this [script](https://gist.github.com/sayakpaul/3ae0f847001d342af27018a96f467e4e). + +> [!WARNING] +> Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It's recommended to compare the outputs with and without 8-bit. + +While loading the `text_encoder`, you set `load_in_8bit` to `True`. You could also specify `load_in_4bit` to bring your memory requirements down even further to under 7GB. + +## PixArtSigmaPipeline + +[[autodoc]] PixArtSigmaPipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/qwenimage.md b/docs/source/en/api/pipelines/qwenimage.md new file mode 100644 index 000000000000..27cc4802f601 --- /dev/null +++ b/docs/source/en/api/pipelines/qwenimage.md @@ -0,0 +1,161 @@ + + +# QwenImage + +
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ +Qwen-Image from the Qwen team is an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. Experiments show strong general capabilities in both image generation and editing, with exceptional performance in text rendering, especially for Chinese. + +Qwen-Image comes in the following variants: + +| model type | model id | +|:----------:|:--------:| +| Qwen-Image | [`Qwen/Qwen-Image`](https://huggingface.co/Qwen/Qwen-Image) | +| Qwen-Image-Edit | [`Qwen/Qwen-Image-Edit`](https://huggingface.co/Qwen/Qwen-Image-Edit) | +| Qwen-Image-Edit Plus | [Qwen/Qwen-Image-Edit-2509](https://huggingface.co/Qwen/Qwen-Image-Edit-2509) | + +> [!TIP] +> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. + +## LoRA for faster inference + +Use a LoRA from `lightx2v/Qwen-Image-Lightning` to speed up inference by reducing the +number of steps. Refer to the code snippet below: + +
+Code + +```py +from diffusers import DiffusionPipeline, FlowMatchEulerDiscreteScheduler +import torch +import math + +ckpt_id = "Qwen/Qwen-Image" + +# From +# https://github.com/ModelTC/Qwen-Image-Lightning/blob/342260e8f5468d2f24d084ce04f55e101007118b/generate_with_diffusers.py#L82C9-L97C10 +scheduler_config = { + "base_image_seq_len": 256, + "base_shift": math.log(3), # We use shift=3 in distillation + "invert_sigmas": False, + "max_image_seq_len": 8192, + "max_shift": math.log(3), # We use shift=3 in distillation + "num_train_timesteps": 1000, + "shift": 1.0, + "shift_terminal": None, # set shift_terminal to None + "stochastic_sampling": False, + "time_shift_type": "exponential", + "use_beta_sigmas": False, + "use_dynamic_shifting": True, + "use_exponential_sigmas": False, + "use_karras_sigmas": False, +} +scheduler = FlowMatchEulerDiscreteScheduler.from_config(scheduler_config) +pipe = DiffusionPipeline.from_pretrained( + ckpt_id, scheduler=scheduler, torch_dtype=torch.bfloat16 +).to("cuda") +pipe.load_lora_weights( + "lightx2v/Qwen-Image-Lightning", weight_name="Qwen-Image-Lightning-8steps-V1.0.safetensors" +) + +prompt = "a tiny astronaut hatching from an egg on the moon, Ultra HD, 4K, cinematic composition." +negative_prompt = " " +image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + width=1024, + height=1024, + num_inference_steps=8, + true_cfg_scale=1.0, + generator=torch.manual_seed(0), +).images[0] +image.save("qwen_fewsteps.png") +``` + +
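+
+Note that the scheduler overrides above mirror the Lightning distillation setup (`shift=3`, hence `base_shift` and `max_shift` set to `math.log(3)`), and `true_cfg_scale=1.0` keeps classifier-free guidance disabled, as few-step distilled models are typically run without it.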
+ +> [!TIP] +> The `guidance_scale` parameter in the pipeline is there to support future guidance-distilled models when they come up. Note that passing `guidance_scale` to the pipeline is ineffective. To enable classifier-free guidance, please pass `true_cfg_scale` and `negative_prompt` (even an empty negative prompt like " ") should enable classifier-free guidance computations. + +## Multi-image reference with QwenImageEditPlusPipeline + +With [`QwenImageEditPlusPipeline`], one can provide multiple images as input reference. + +``` +import torch +from PIL import Image +from diffusers import QwenImageEditPlusPipeline +from diffusers.utils import load_image + +pipe = QwenImageEditPlusPipeline.from_pretrained( + "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16 +).to("cuda") + +image_1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/grumpy.jpg") +image_2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peng.png") +image = pipe( + image=[image_1, image_2], + prompt="put the penguin and the cat at a game show called "Qwen Edit Plus Games"", + num_inference_steps=50 +).images[0] +``` + +## QwenImagePipeline + +[[autodoc]] QwenImagePipeline + - all + - __call__ + +## QwenImageImg2ImgPipeline + +[[autodoc]] QwenImageImg2ImgPipeline + - all + - __call__ + +## QwenImageInpaintPipeline + +[[autodoc]] QwenImageInpaintPipeline + - all + - __call__ + +## QwenImageEditPipeline + +[[autodoc]] QwenImageEditPipeline + - all + - __call__ + +## QwenImageEditInpaintPipeline + +[[autodoc]] QwenImageEditInpaintPipeline + - all + - __call__ + +## QwenImageControlNetPipeline + +[[autodoc]] QwenImageControlNetPipeline + - all + - __call__ + +## QwenImageEditPlusPipeline + +[[autodoc]] QwenImageEditPlusPipeline + - all + - __call__ + +## QwenImagePipelineOutput + +[[autodoc]] pipelines.qwenimage.pipeline_output.QwenImagePipelineOutput \ No newline at end of file diff --git a/docs/source/en/api/pipelines/sana.md b/docs/source/en/api/pipelines/sana.md new file mode 100644 index 000000000000..a948620f96cb --- /dev/null +++ b/docs/source/en/api/pipelines/sana.md @@ -0,0 +1,106 @@ + + +# SanaPipeline + +
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+  <img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white"/>
+</div>
+ +[SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han. + +The abstract from the paper is: + +*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj) and [chenjy2003](https://github.com/chenjy2003). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model). 
+ +Available models: + +| Model | Recommended dtype | +|:-----:|:-----------------:| +| [`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | `torch.bfloat16` | +| [`Efficient-Large-Model/Sana_1600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_diffusers) | `torch.float16` | +| [`Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers) | `torch.float16` | +| [`Efficient-Large-Model/Sana_1600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) | `torch.float16` | +| [`Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers) | `torch.float16` | +| [`Efficient-Large-Model/Sana_600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_diffusers) | `torch.float16` | +| [`Efficient-Large-Model/Sana_600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px_diffusers) | `torch.float16` | + +Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-673efba2a57ed99843f11f9e) collection for more information. + +Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype. + +> [!TIP] +> Make sure to pass the `variant` argument for downloaded checkpoints to use lower disk space. Set it to `"fp16"` for models with recommended dtype as `torch.float16`, and `"bf16"` for models with recommended dtype as `torch.bfloat16`. By default, `torch.float32` weights are downloaded, which use twice the amount of disk storage. Additionally, `torch.float32` weights can be downcasted on-the-fly by specifying the `torch_dtype` argument. Read about it in the [docs](https://huggingface.co/docs/diffusers/v0.31.0/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained). + +## Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. + +Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaPipeline`] for inference with bitsandbytes. 
+ +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaTransformer2DModel, SanaPipeline +from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel + +quant_config = BitsAndBytesConfig(load_in_8bit=True) +text_encoder_8bit = AutoModel.from_pretrained( + "Efficient-Large-Model/Sana_1600M_1024px_diffusers", + subfolder="text_encoder", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = SanaTransformer2DModel.from_pretrained( + "Efficient-Large-Model/Sana_1600M_1024px_diffusers", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +pipeline = SanaPipeline.from_pretrained( + "Efficient-Large-Model/Sana_1600M_1024px_diffusers", + text_encoder=text_encoder_8bit, + transformer=transformer_8bit, + torch_dtype=torch.float16, + device_map="balanced", +) + +prompt = "a tiny astronaut hatching from an egg on the moon" +image = pipeline(prompt).images[0] +image.save("sana.png") +``` + +## SanaPipeline + +[[autodoc]] SanaPipeline + - all + - __call__ + +## SanaPAGPipeline + +[[autodoc]] SanaPAGPipeline + - all + - __call__ + +## SanaPipelineOutput + +[[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput diff --git a/docs/source/en/api/pipelines/sana_sprint.md b/docs/source/en/api/pipelines/sana_sprint.md new file mode 100644 index 000000000000..357d7e406dd4 --- /dev/null +++ b/docs/source/en/api/pipelines/sana_sprint.md @@ -0,0 +1,131 @@ + + +# SANA-Sprint + +
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ +[SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation](https://huggingface.co/papers/2503.09641) from NVIDIA, MIT HAN Lab, and Hugging Face by Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Enze Xie, Song Han + +The abstract from the paper is: + +*This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024×1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.* + +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. + +This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj), [shuchen Xue](https://github.com/scxue) and [Enze Xie](https://github.com/xieenze). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model/). + +Available models: + +| Model | Recommended dtype | +|:-------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------:| +| [`Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers) | `torch.bfloat16` | +| [`Efficient-Large-Model/Sana_Sprint_0.6B_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_Sprint_0.6B_1024px_diffusers) | `torch.bfloat16` | + +Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-sprint-67d6810d65235085b3b17c76) collection for more information. + +Note: The recommended dtype mentioned is for the transformer weights. The text encoder must stay in `torch.bfloat16` and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. 
Please refer to the inference example below to see how to load the model with the recommended dtype. + + +## Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. + +Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaSprintPipeline`] for inference with bitsandbytes. + +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaTransformer2DModel, SanaSprintPipeline +from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel + +quant_config = BitsAndBytesConfig(load_in_8bit=True) +text_encoder_8bit = AutoModel.from_pretrained( + "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", + subfolder="text_encoder", + quantization_config=quant_config, + torch_dtype=torch.bfloat16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = SanaTransformer2DModel.from_pretrained( + "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.bfloat16, +) + +pipeline = SanaSprintPipeline.from_pretrained( + "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", + text_encoder=text_encoder_8bit, + transformer=transformer_8bit, + torch_dtype=torch.bfloat16, + device_map="balanced", +) + +prompt = "a tiny astronaut hatching from an egg on the moon" +image = pipeline(prompt).images[0] +image.save("sana.png") +``` + +## Setting `max_timesteps` + +Users can tweak the `max_timesteps` value for experimenting with the visual quality of the generated outputs. The default `max_timesteps` value was obtained with an inference-time search process. For more details about it, check out the paper. + +## Image to Image + +The [`SanaSprintImg2ImgPipeline`] is a pipeline for image-to-image generation. It takes an input image and a prompt, and generates a new image based on the input image and the prompt. + +```py +import torch +from diffusers import SanaSprintImg2ImgPipeline +from diffusers.utils.loading_utils import load_image + +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png" +) + +pipe = SanaSprintImg2ImgPipeline.from_pretrained( + "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", + torch_dtype=torch.bfloat16) +pipe.to("cuda") + +image = pipe( + prompt="a cute pink bear", + image=image, + strength=0.5, + height=832, + width=480 +).images[0] +image.save("output.png") +``` + +## SanaSprintPipeline + +[[autodoc]] SanaSprintPipeline + - all + - __call__ + +## SanaSprintImg2ImgPipeline + +[[autodoc]] SanaSprintImg2ImgPipeline + - all + - __call__ + + +## SanaPipelineOutput + +[[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput diff --git a/docs/source/en/api/pipelines/self_attention_guidance.md b/docs/source/en/api/pipelines/self_attention_guidance.md index e56aae2a775b..8d411598ae6d 100644 --- a/docs/source/en/api/pipelines/self_attention_guidance.md +++ b/docs/source/en/api/pipelines/self_attention_guidance.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. 
If you run into any issues, reinstall the last Diffusers version that supported this model. + # Self-Attention Guidance [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://huggingface.co/papers/2210.00939) is by Susung Hong et al. @@ -20,11 +23,8 @@ The abstract from the paper is: You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb). - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## StableDiffusionSAGPipeline [[autodoc]] StableDiffusionSAGPipeline diff --git a/docs/source/en/api/pipelines/semantic_stable_diffusion.md b/docs/source/en/api/pipelines/semantic_stable_diffusion.md index 19a0a8116989..dda428e80f8f 100644 --- a/docs/source/en/api/pipelines/semantic_stable_diffusion.md +++ b/docs/source/en/api/pipelines/semantic_stable_diffusion.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Semantic Guidance Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation. @@ -19,11 +22,8 @@ The abstract from the paper is: *Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. 
We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.* - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## SemanticStableDiffusionPipeline [[autodoc]] SemanticStableDiffusionPipeline diff --git a/docs/source/en/api/pipelines/shap_e.md b/docs/source/en/api/pipelines/shap_e.md index 9f9155c79e89..3e505894ca80 100644 --- a/docs/source/en/api/pipelines/shap_e.md +++ b/docs/source/en/api/pipelines/shap_e.md @@ -1,4 +1,4 @@ - + +
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ +# SkyReels-V2: Infinite-length Film Generative model + +[SkyReels-V2](https://huggingface.co/papers/2504.13074) by the SkyReels Team from Skywork AI. + +*Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at [this https URL](https://github.com/SkyworkAI/SkyReels-V2).* + +You can find all the original SkyReels-V2 checkpoints under the [Skywork](https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9) organization. + +The following SkyReels-V2 models are supported in Diffusers: +- [SkyReels-V2 DF 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers) +- [SkyReels-V2 DF 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-540P-Diffusers) +- [SkyReels-V2 DF 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P-Diffusers) +- [SkyReels-V2 T2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-540P-Diffusers) +- [SkyReels-V2 T2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-720P-Diffusers) +- [SkyReels-V2 I2V 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P-Diffusers) +- [SkyReels-V2 I2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-540P-Diffusers) +- [SkyReels-V2 I2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P-Diffusers) +- [SkyReels-V2 FLF2V 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-FLF2V-1.3B-540P-Diffusers) + +> [!TIP] +> Click on the SkyReels-V2 models in the right sidebar for more examples of video generation. 
+ +### A _Visual_ Demonstration + +The example below has the following parameters: + +- `base_num_frames=97` +- `num_frames=97` +- `num_inference_steps=30` +- `ar_step=5` +- `causal_block_size=5` + +With `vae_scale_factor_temporal=4`, expect `5` blocks of `5` frames each as calculated by: + +`num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each` + +And the maximum context length in the latent space is calculated with `base_num_latent_frames`: + +`base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 -> 25//5 = 5 blocks` + +Asynchronous Processing Timeline: +```text +┌─────────────────────────────────────────────────────────────────┐ +│ Steps: 1 6 11 16 21 26 31 36 41 46 50 │ +│ Block 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │ +│ Block 2: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │ +│ Block 3: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │ +│ Block 4: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │ +│ Block 5: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │ +└─────────────────────────────────────────────────────────────────┘ +``` + +For Long Videos (`num_frames` > `base_num_frames`): +`base_num_frames` acts as the "sliding window size" for processing long videos. + +Example: `257`-frame video with `base_num_frames=97`, `overlap_history=17` +```text +┌──── Iteration 1 (frames 1-97) ────┐ +│ Processing window: 97 frames │ → 5 blocks, +│ Generates: frames 1-97 │ async processing +└───────────────────────────────────┘ + ┌────── Iteration 2 (frames 81-177) ──────┐ + │ Processing window: 97 frames │ + │ Overlap: 17 frames (81-97) from prev │ → 5 blocks, + │ Generates: frames 98-177 │ async processing + └─────────────────────────────────────────┘ + ┌────── Iteration 3 (frames 161-257) ──────┐ + │ Processing window: 97 frames │ + │ Overlap: 17 frames (161-177) from prev │ → 5 blocks, + │ Generates: frames 178-257 │ async processing + └──────────────────────────────────────────┘ +``` + +Each iteration independently runs the asynchronous processing with its own `5` blocks. +`base_num_frames` controls: +1. Memory usage (larger window = more VRAM) +2. Model context length (must match training constraints) +3. Number of blocks per iteration (`base_num_latent_frames // causal_block_size`) + +Each block takes `30` steps to complete denoising. +Block N starts at step: `1 + (N-1) x ar_step` +Total steps: `30 + (5-1) x 5 = 50` steps + + +Synchronous mode (`ar_step=0`) would process all blocks/frames simultaneously: +```text +┌──────────────────────────────────────────────┐ +│ Steps: 1 ... 30 │ +│ All blocks: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │ +└──────────────────────────────────────────────┘ +``` +Total steps: `30` steps + + +An example on how the step matrix is constructed for asynchronous processing: +Given the parameters: (`num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5`) +``` +- num_latent_frames = (97 frames - 1) // (4 temporal downsampling) + 1 = 25 +- step_template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948, + 941, 932, 922, 912, 901, 888, 874, 859, 841, 822, + 799, 773, 743, 708, 666, 615, 551, 470, 363, 216] +``` + +The algorithm creates a `50x25` `step_matrix` where: +``` +- Row 1: [999×5, 999×5, 999×5, 999×5, 999×5] +- Row 2: [995×5, 999×5, 999×5, 999×5, 999×5] +- Row 3: [991×5, 999×5, 999×5, 999×5, 999×5] +- ... +- Row 7: [969×5, 995×5, 999×5, 999×5, 999×5] +- ... +- Row 21: [799×5, 888×5, 941×5, 975×5, 999×5] +- ... +- Row 35: [ 0×5, 216×5, 666×5, 822×5, 901×5] +- ... +- Row 42: [ 0×5, 0×5, 0×5, 551×5, 773×5] +- ... 
+- Row 50: [ 0×5, 0×5, 0×5, 0×5, 216×5] +``` + +Detailed Row `6` Analysis: +``` +- step_matrix[5]: [ 975×5, 999×5, 999×5, 999×5, 999×5] +- step_index[5]: [ 6×5, 1×5, 0×5, 0×5, 0×5] +- step_update_mask[5]: [True×5, True×5, False×5, False×5, False×5] +- valid_interval[5]: (0, 25) +``` + +Key Pattern: Block `i` lags behind Block `i-1` by exactly `ar_step=5` timesteps, creating the +staggered "diffusion forcing" effect where later blocks condition on cleaner earlier blocks. + + +### Text-to-Video Generation + +The example below demonstrates how to generate a video from text. + + + + +Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. + +From the original repo: +>You can use --ar_step 5 to enable asynchronous inference. When asynchronous inference, --causal_block_size 5 is recommended while it is not supposed to be set for synchronous generation... Asynchronous inference will take more steps to diffuse the whole sequence which means it will be SLOWER than synchronous mode. In our experiments, asynchronous inference may improve the instruction following and visual consistent performance. + +```py +import torch +from diffusers import AutoModel, SkyReelsV2DiffusionForcingPipeline, UniPCMultistepScheduler +from diffusers.utils import export_to_video + + +model_id = "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers" +vae = AutoModel.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) + +pipeline = SkyReelsV2DiffusionForcingPipeline.from_pretrained( + model_id, + vae=vae, + torch_dtype=torch.bfloat16, +) +pipeline.to("cuda") +flow_shift = 8.0 # 8.0 for T2V, 5.0 for I2V +pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift) + +prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." + +output = pipeline( + prompt=prompt, + num_inference_steps=30, + height=544, # 720 for 720P + width=960, # 1280 for 720P + num_frames=97, + base_num_frames=97, # 121 for 720P + ar_step=5, # Controls asynchronous inference (0 for synchronous mode) + causal_block_size=5, # Number of frames in each block for asynchronous processing + overlap_history=None, # Number of frames to overlap for smooth transitions in long videos; 17 for long video generations + addnoise_condition=20, # Improves consistency in long video generation +).frames[0] +export_to_video(output, "video.mp4", fps=24, quality=8) +``` + + + + +### First-Last-Frame-to-Video Generation + +The example below demonstrates how to use the image-to-video pipeline to generate a video using a text description, a starting frame, and an ending frame. 
+ + + + +```python +import numpy as np +import torch +import torchvision.transforms.functional as TF +from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline, UniPCMultistepScheduler +from diffusers.utils import export_to_video, load_image + + +model_id = "Skywork/SkyReels-V2-DF-1.3B-720P-Diffusers" +vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) +pipeline = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained( + model_id, vae=vae, torch_dtype=torch.bfloat16 +) +pipeline.to("cuda") +flow_shift = 5.0 # 8.0 for T2V, 5.0 for I2V +pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift) + +first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png") +last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png") + +def aspect_ratio_resize(image, pipeline, max_area=720 * 1280): + aspect_ratio = image.height / image.width + mod_value = pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1] + height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value + width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value + image = image.resize((width, height)) + return image, height, width + +def center_crop_resize(image, height, width): + # Calculate resize ratio to match first frame dimensions + resize_ratio = max(width / image.width, height / image.height) + + # Resize the image + width = round(image.width * resize_ratio) + height = round(image.height * resize_ratio) + size = [width, height] + image = TF.center_crop(image, size) + + return image, height, width + +first_frame, height, width = aspect_ratio_resize(first_frame, pipeline) +if last_frame.size != first_frame.size: + last_frame, _, _ = center_crop_resize(last_frame, height, width) + +prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective." + +output = pipeline( + image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.0 +).frames[0] +export_to_video(output, "video.mp4", fps=24, quality=8) +``` + + + + + +### Video-to-Video Generation + + + + +`SkyReelsV2DiffusionForcingVideoToVideoPipeline` extends a given video. 
+ +```python +import numpy as np +import torch +import torchvision.transforms.functional as TF +from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline, UniPCMultistepScheduler +from diffusers.utils import export_to_video, load_video + + +model_id = "Skywork/SkyReels-V2-DF-1.3B-720P-Diffusers" +vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) +pipeline = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained( + model_id, vae=vae, torch_dtype=torch.bfloat16 +) +pipeline.to("cuda") +flow_shift = 5.0 # 8.0 for T2V, 5.0 for I2V +pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift) + +video = load_video("input_video.mp4") + +prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective." + +output = pipeline( + video=video, prompt=prompt, height=720, width=1280, guidance_scale=5.0, overlap_history=17, + num_inference_steps=30, num_frames=257, base_num_frames=121#, ar_step=5, causal_block_size=5, +).frames[0] +export_to_video(output, "video.mp4", fps=24, quality=8) +# Total frames will be the number of frames of the given video + 257 +``` + + + + +## Notes + +- SkyReels-V2 supports LoRAs with [`~loaders.SkyReelsV2LoraLoaderMixin.load_lora_weights`]. + +`SkyReelsV2Pipeline` and `SkyReelsV2ImageToVideoPipeline` are also available without Diffusion Forcing framework applied. + + +## SkyReelsV2DiffusionForcingPipeline + +[[autodoc]] SkyReelsV2DiffusionForcingPipeline + - all + - __call__ + +## SkyReelsV2DiffusionForcingImageToVideoPipeline + +[[autodoc]] SkyReelsV2DiffusionForcingImageToVideoPipeline + - all + - __call__ + +## SkyReelsV2DiffusionForcingVideoToVideoPipeline + +[[autodoc]] SkyReelsV2DiffusionForcingVideoToVideoPipeline + - all + - __call__ + +## SkyReelsV2Pipeline + +[[autodoc]] SkyReelsV2Pipeline + - all + - __call__ + +## SkyReelsV2ImageToVideoPipeline + +[[autodoc]] SkyReelsV2ImageToVideoPipeline + - all + - __call__ + +## SkyReelsV2PipelineOutput + +[[autodoc]] pipelines.skyreels_v2.pipeline_output.SkyReelsV2PipelineOutput diff --git a/docs/source/en/api/pipelines/stable_audio.md b/docs/source/en/api/pipelines/stable_audio.md new file mode 100644 index 000000000000..82763a52a942 --- /dev/null +++ b/docs/source/en/api/pipelines/stable_audio.md @@ -0,0 +1,93 @@ + + +# Stable Audio + +Stable Audio was proposed in [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Zach Evans et al. . it takes a text prompt as input and predicts the corresponding sound or music sample. + +Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder. + +Stable Audio is trained on a corpus of around 48k audio recordings, where around 47k are from Freesound and the rest are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train the autoencoder and the DiT. 
+ +The abstract of the paper is the following: +*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.* + +This pipeline was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe). The original codebase can be found at [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools). + +## Tips + +When constructing a prompt, keep in mind: + +* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno"). +* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality". + +During inference: + +* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. +* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. + +## Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. + +Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`StableAudioPipeline`] for inference with bitsandbytes. + +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, StableAudioDiTModel, StableAudioPipeline +from diffusers.utils import export_to_video +from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel + +quant_config = BitsAndBytesConfig(load_in_8bit=True) +text_encoder_8bit = T5EncoderModel.from_pretrained( + "stabilityai/stable-audio-open-1.0", + subfolder="text_encoder", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = StableAudioDiTModel.from_pretrained( + "stabilityai/stable-audio-open-1.0", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +pipeline = StableAudioPipeline.from_pretrained( + "stabilityai/stable-audio-open-1.0", + text_encoder=text_encoder_8bit, + transformer=transformer_8bit, + torch_dtype=torch.float16, + device_map="balanced", +) + +prompt = "The sound of a hammer hitting a wooden surface." +negative_prompt = "Low quality." 
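+
+# NOTE: assumed additions -- the original snippet uses `generator` and `sf`
+# below without defining them, so we import soundfile and seed a generator here.
+import soundfile as sf
+
+generator = torch.Generator("cuda").manual_seed(0)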
+audio = pipeline( + prompt, + negative_prompt=negative_prompt, + num_inference_steps=200, + audio_end_in_s=10.0, + num_waveforms_per_prompt=3, + generator=generator, +).audios + +output = audio[0].T.float().cpu().numpy() +sf.write("hammer.wav", output, pipeline.vae.sampling_rate) +``` + + +## StableAudioPipeline +[[autodoc]] StableAudioPipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/stable_cascade.md b/docs/source/en/api/pipelines/stable_cascade.md index 93a94d66c109..70de6776e98f 100644 --- a/docs/source/en/api/pipelines/stable_cascade.md +++ b/docs/source/en/api/pipelines/stable_cascade.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # GLIGEN (Grounded Language-to-Image Generation) The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] and [`StableDiffusionGLIGENTextImagePipeline`] can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes with [`StableDiffusionGLIGENPipeline`], if input images are given, [`StableDiffusionGLIGENTextImagePipeline`] can insert objects described by text at the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It's trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs. @@ -18,13 +21,10 @@ The abstract from the [paper](https://huggingface.co/papers/2301.07093) is: *Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN’s zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.* - - -Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently! - -If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organizations! - - +> [!TIP] +> Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently! 
+> +> If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organizations! [`StableDiffusionGLIGENPipeline`] was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful) and [`StableDiffusionGLIGENTextImagePipeline`] was contributed by [Nguyễn Công Tú Anh](https://github.com/tuanh123789). diff --git a/docs/source/en/api/pipelines/stable_diffusion/image_variation.md b/docs/source/en/api/pipelines/stable_diffusion/image_variation.md index 57dd2f0d5b39..b1b7146b336f 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/image_variation.md +++ b/docs/source/en/api/pipelines/stable_diffusion/image_variation.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # K-Diffusion -[k-diffusion](https://github.com/crowsonkb/k-diffusion) is a popular library created by [Katherine Crowson](https://github.com/crowsonkb/). We provide `StableDiffusionKDiffusionPipeline` and `StableDiffusionXLKDiffusionPipeline` that allow you to run Stable DIffusion with samplers from k-diffusion. +[k-diffusion](https://github.com/crowsonkb/k-diffusion) is a popular library created by [Katherine Crowson](https://github.com/crowsonkb/). We provide `StableDiffusionKDiffusionPipeline` and `StableDiffusionXLKDiffusionPipeline` that allow you to run Stable DIffusion with samplers from k-diffusion. Note that most the samplers from k-diffusion are implemented in Diffusers and we recommend using existing schedulers. You can find a mapping between k-diffusion samplers and schedulers in Diffusers [here](https://huggingface.co/docs/diffusers/api/schedulers/overview) diff --git a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md index 9abccd6e1347..4f0521740cab 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md +++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Text-to-(RGB, depth) -LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. +
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ +LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. Two checkpoints are available for use: -- [ldm3d-original](https://huggingface.co/Intel/ldm3d). The original checkpoint used in the [paper](https://arxiv.org/pdf/2305.10853.pdf) -- [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c). The new version of LDM3D using 4 channels inputs instead of 6-channels inputs and finetuned on higher resolution images. +- [ldm3d-original](https://huggingface.co/Intel/ldm3d). The original checkpoint used in the [paper](https://huggingface.co/papers/2305.10853) +- [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c). The new version of LDM3D using 4 channels inputs instead of 6-channels inputs and finetuned on higher resolution images. The abstract from the paper is: *This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).* - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - - +> [!TIP] +> Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! ## StableDiffusionLDM3DPipeline @@ -44,7 +48,7 @@ Make sure to check out the Stable Diffusion [Tips](overview#tips) section to lea # Upscaler -[LDM3D-VR](https://arxiv.org/pdf/2311.03226.pdf) is an extended version of LDM3D. +[LDM3D-VR](https://huggingface.co/papers/2311.03226) is an extended version of LDM3D. The abstract from the paper is: *Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. 
Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods* diff --git a/docs/source/en/api/pipelines/stable_diffusion/overview.md b/docs/source/en/api/pipelines/stable_diffusion/overview.md index 2509030e2fb5..7e6e16c347db 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/overview.md +++ b/docs/source/en/api/pipelines/stable_diffusion/overview.md @@ -1,4 +1,4 @@ - + +# Stable Diffusion 3 + +
+ LoRA + MPS +
+ +Stable Diffusion 3 (SD3) was proposed in [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://huggingface.co/papers/2403.03206) by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. + +The abstract from the paper is: + +*Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.* + + +## Usage Example + +_As the model is gated, before using it with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form and accept the gate. Once you are in, you need to login so that your system knows you’ve accepted the gate._ + +Use the command below to log in: + +```bash +hf auth login +``` + +> [!TIP] +> The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary in order for it to run on most commodity hardware. Please use the `torch.float16` data type for additional memory savings. + +```python +import torch +from diffusers import StableDiffusion3Pipeline + +pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16) +pipe.to("cuda") + +image = pipe( + prompt="a photo of a cat holding a sign that says hello world", + negative_prompt="", + num_inference_steps=28, + height=1024, + width=1024, + guidance_scale=7.0, +).images[0] + +image.save("sd3_hello_world.png") +``` + +**Note:** Stable Diffusion 3.5 can also be run using the SD3 pipeline, and all mentioned optimizations and techniques apply to it as well. In total there are three official models in the SD3 family: +- [`stabilityai/stable-diffusion-3-medium-diffusers`](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) +- [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large) +- [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large-turbo) + +## Image Prompting with IP-Adapters + +An IP-Adapter lets you prompt SD3 with images, in addition to the text prompt. 
This is especially useful when describing complex concepts that are difficult to articulate through text alone and you have reference images. To load and use an IP-Adapter, you need: + +- `image_encoder`: Pre-trained vision model used to obtain image features, usually a CLIP image encoder. +- `feature_extractor`: Image processor that prepares the input image for the chosen `image_encoder`. +- `ip_adapter_id`: Checkpoint containing parameters of image cross attention layers and image projection. + +IP-Adapters are trained for a specific model architecture, so they also work in finetuned variations of the base model. You can use the [`~SD3IPAdapterMixin.set_ip_adapter_scale`] function to adjust how strongly the output aligns with the image prompt. The higher the value, the more closely the model follows the image prompt. A default value of 0.5 is typically a good balance, ensuring the model considers both the text and image prompts equally. + +```python +import torch +from PIL import Image + +from diffusers import StableDiffusion3Pipeline +from transformers import SiglipVisionModel, SiglipImageProcessor + +image_encoder_id = "google/siglip-so400m-patch14-384" +ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter" + +feature_extractor = SiglipImageProcessor.from_pretrained( + image_encoder_id, + torch_dtype=torch.float16 +) +image_encoder = SiglipVisionModel.from_pretrained( + image_encoder_id, + torch_dtype=torch.float16 +).to( "cuda") + +pipe = StableDiffusion3Pipeline.from_pretrained( + "stabilityai/stable-diffusion-3.5-large", + torch_dtype=torch.float16, + feature_extractor=feature_extractor, + image_encoder=image_encoder, +).to("cuda") + +pipe.load_ip_adapter(ip_adapter_id) +pipe.set_ip_adapter_scale(0.6) + +ref_img = Image.open("image.jpg").convert('RGB') + +image = pipe( + width=1024, + height=1024, + prompt="a cat", + negative_prompt="lowres, low quality, worst quality", + num_inference_steps=24, + guidance_scale=5.0, + ip_adapter_image=ref_img +).images[0] + +image.save("result.jpg") +``` + +
+ +
IP-Adapter examples with prompt "a cat"
+
+ + +> [!TIP] +> Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work. + + +## Memory Optimisations for SD3 + +SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware. + +### Running Inference with Model Offloading + +The most basic memory optimization available in Diffusers allows you to offload the components of the model to CPU during inference in order to save memory, while seeing a slight increase in inference latency. Model offloading will only move a model component onto the GPU when it needs to be executed, while keeping the remaining components on the CPU. + +```python +import torch +from diffusers import StableDiffusion3Pipeline + +pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16) +pipe.enable_model_cpu_offload() + +image = pipe( + prompt="a photo of a cat holding a sign that says hello world", + negative_prompt="", + num_inference_steps=28, + height=1024, + width=1024, + guidance_scale=7.0, +).images[0] + +image.save("sd3_hello_world.png") +``` + +### Dropping the T5 Text Encoder during Inference + +Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance. + +```python +import torch +from diffusers import StableDiffusion3Pipeline + +pipe = StableDiffusion3Pipeline.from_pretrained( + "stabilityai/stable-diffusion-3-medium-diffusers", + text_encoder_3=None, + tokenizer_3=None, + torch_dtype=torch.float16 +) +pipe.to("cuda") + +image = pipe( + prompt="a photo of a cat holding a sign that says hello world", + negative_prompt="", + num_inference_steps=28, + height=1024, + width=1024, + guidance_scale=7.0, +).images[0] + +image.save("sd3_hello_world-no-T5.png") +``` + +### Using a Quantized Version of the T5 Text Encoder + +We can leverage the `bitsandbytes` library to load and quantize the T5-XXL text encoder to 8-bit precision. This allows you to keep using all three text encoders while only slightly impacting performance. + +First install the `bitsandbytes` library. + +```shell +pip install bitsandbytes +``` + +Then load the T5-XXL model using the `BitsAndBytesConfig`. + +```python +import torch +from diffusers import StableDiffusion3Pipeline +from transformers import T5EncoderModel, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(load_in_8bit=True) + +model_id = "stabilityai/stable-diffusion-3-medium-diffusers" +text_encoder = T5EncoderModel.from_pretrained( + model_id, + subfolder="text_encoder_3", + quantization_config=quantization_config, +) +pipe = StableDiffusion3Pipeline.from_pretrained( + model_id, + text_encoder_3=text_encoder, + device_map="balanced", + torch_dtype=torch.float16 +) + +image = pipe( + prompt="a photo of a cat holding a sign that says hello world", + negative_prompt="", + num_inference_steps=28, + height=1024, + width=1024, + guidance_scale=7.0, +).images[0] + +image.save("sd3_hello_world-8bit-T5.png") +``` + +You can find the end-to-end script [here](https://gist.github.com/sayakpaul/82acb5976509851f2db1a83456e504f1). 
+ +## Performance Optimizations for SD3 + +### Using Torch Compile to Speed Up Inference + +Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. The following code snippet demonstrates how to compile the Transformer and VAE components of the SD3 pipeline. + +```python +import torch +from diffusers import StableDiffusion3Pipeline + +torch.set_float32_matmul_precision("high") + +torch._inductor.config.conv_1x1_as_mm = True +torch._inductor.config.coordinate_descent_tuning = True +torch._inductor.config.epilogue_fusion = False +torch._inductor.config.coordinate_descent_check_all_directions = True + +pipe = StableDiffusion3Pipeline.from_pretrained( + "stabilityai/stable-diffusion-3-medium-diffusers", + torch_dtype=torch.float16 +).to("cuda") +pipe.set_progress_bar_config(disable=True) + +pipe.transformer.to(memory_format=torch.channels_last) +pipe.vae.to(memory_format=torch.channels_last) + +pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True) +pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True) + +# Warm Up +prompt = "a photo of a cat holding a sign that says hello world" +for _ in range(3): + _ = pipe(prompt=prompt, generator=torch.manual_seed(1)) + +# Run Inference +image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0] +image.save("sd3_hello_world.png") +``` + +Check out the full script [here](https://gist.github.com/sayakpaul/508d89d7aad4f454900813da5d42ca97). + +## Quantization + +Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. + +Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`StableDiffusion3Pipeline`] for inference with bitsandbytes. + +```py +import torch +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline +from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel + +quant_config = BitsAndBytesConfig(load_in_8bit=True) +text_encoder_8bit = T5EncoderModel.from_pretrained( + "stabilityai/stable-diffusion-3.5-large", + subfolder="text_encoder_3", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_8bit = SD3Transformer2DModel.from_pretrained( + "stabilityai/stable-diffusion-3.5-large", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +pipeline = StableDiffusion3Pipeline.from_pretrained( + "stabilityai/stable-diffusion-3.5-large", + text_encoder=text_encoder_8bit, + transformer=transformer_8bit, + torch_dtype=torch.float16, + device_map="balanced", +) + +prompt = "a tiny astronaut hatching from an egg on the moon" +image = pipeline(prompt, num_inference_steps=28, guidance_scale=7.0).images[0] +image.save("sd3.png") +``` + +## Using Long Prompts with the T5 Text Encoder + +By default, the T5 Text Encoder prompt uses a maximum sequence length of `256`. This can be adjusted by setting the `max_sequence_length` to accept fewer or more tokens. 
Keep in mind that longer sequences require additional resources and result in longer generation times, such as during batch inference. + +```python +prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree. As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight" + +image = pipe( + prompt=prompt, + negative_prompt="", + num_inference_steps=28, + guidance_scale=4.5, + max_sequence_length=512, +).images[0] +``` + +### Sending a different prompt to the T5 Text Encoder + +You can send a different prompt to the CLIP Text Encoders and the T5 Text Encoder to prevent the prompt from being truncated by the CLIP Text Encoders and to improve generation. + +> [!TIP] +> The prompt with the CLIP Text Encoders is still truncated to the 77 token limit. + +```python +prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. A river of warm, melted butter, pancake-like foliage in the background, a towering pepper mill standing in for a tree." + +prompt_3 = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree. As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight" + +image = pipe( + prompt=prompt, + prompt_3=prompt_3, + negative_prompt="", + num_inference_steps=28, + guidance_scale=4.5, + max_sequence_length=512, +).images[0] +``` + +## Tiny AutoEncoder for Stable Diffusion 3 + +Tiny AutoEncoder for Stable Diffusion (TAESD3) is a tiny distilled version of Stable Diffusion 3's VAE by [Ollin Boer Bohan](https://github.com/madebyollin/taesd) that can decode [`StableDiffusion3Pipeline`] latents almost instantly. 
+ +To use with Stable Diffusion 3: + +```python +import torch +from diffusers import StableDiffusion3Pipeline, AutoencoderTiny + +pipe = StableDiffusion3Pipeline.from_pretrained( + "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16 +) +pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16) +pipe = pipe.to("cuda") + +prompt = "slice of delicious New York-style berry cheesecake" +image = pipe(prompt, num_inference_steps=25).images[0] +image.save("cheesecake.png") +``` + +## Loading the original checkpoints via `from_single_file` + +The `SD3Transformer2DModel` and `StableDiffusion3Pipeline` classes support loading the original checkpoints via the `from_single_file` method. This method allows you to load the original checkpoint files that were used to train the models. + +## Loading the original checkpoints for the `SD3Transformer2DModel` + +```python +from diffusers import SD3Transformer2DModel + +model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors") +``` + +## Loading the single checkpoint for the `StableDiffusion3Pipeline` + +### Loading the single file checkpoint without T5 + +```python +import torch +from diffusers import StableDiffusion3Pipeline + +pipe = StableDiffusion3Pipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors", + torch_dtype=torch.float16, + text_encoder_3=None +) +pipe.enable_model_cpu_offload() + +image = pipe("a picture of a cat holding a sign that says hello world").images[0] +image.save('sd3-single-file.png') +``` + +### Loading the single file checkpoint with T5 + +> [!TIP] +> The following example loads a checkpoint stored in a 8-bit floating point format which requires PyTorch 2.3 or later. + +```python +import torch +from diffusers import StableDiffusion3Pipeline + +pipe = StableDiffusion3Pipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors", + torch_dtype=torch.float16, +) +pipe.enable_model_cpu_offload() + +image = pipe("a picture of a cat holding a sign that says hello world").images[0] +image.save('sd3-single-file-t5-fp8.png') +``` + +### Loading the single file checkpoint for the Stable Diffusion 3.5 Transformer Model + +```python +import torch +from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline + +transformer = SD3Transformer2DModel.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo/blob/main/sd3.5_large.safetensors", + torch_dtype=torch.bfloat16, +) +pipe = StableDiffusion3Pipeline.from_pretrained( + "stabilityai/stable-diffusion-3.5-large", + transformer=transformer, + torch_dtype=torch.bfloat16, +) +pipe.enable_model_cpu_offload() +image = pipe("a cat holding a sign that says hello world").images[0] +image.save("sd35.png") +``` + +## StableDiffusion3Pipeline + +[[autodoc]] StableDiffusion3Pipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md index 97c11bfe23bb..151b0b8a6507 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. 
However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Safe Stable Diffusion Safe Stable Diffusion was proposed in [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105) and mitigates inappropriate degeneration from Stable Diffusion models because they're trained on unfiltered web-crawled datasets. For instance Stable Diffusion may unexpectedly generate nudity, violence, images depicting self-harm, and otherwise offensive content. Safe Stable Diffusion is an extension of Stable Diffusion that drastically reduces this type of content. @@ -42,11 +45,8 @@ There are 4 configurations (`SafetyConfig.WEAK`, `SafetyConfig.MEDIUM`, `SafetyC >>> out = pipeline(prompt=prompt, **SafetyConfig.MAX) ``` - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - - +> [!TIP] +> Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! ## StableDiffusionPipelineSafe diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index c5433c0783ba..6863d408b5fd 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -1,4 +1,4 @@ - - - -🧪 This pipeline is for research purposes only. - - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # Text-to-video -[ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang. +
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ +[ModelScope Text-to-Video Technical Report](https://huggingface.co/papers/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang. The abstract from the paper is: @@ -173,11 +174,8 @@ Video generation is memory-intensive and one way to reduce your memory usage is Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage. - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## TextToVideoSDPipeline [[autodoc]] TextToVideoSDPipeline diff --git a/docs/source/en/api/pipelines/text_to_video_zero.md b/docs/source/en/api/pipelines/text_to_video_zero.md index 1f8688a722d0..50e7620760f3 100644 --- a/docs/source/en/api/pipelines/text_to_video_zero.md +++ b/docs/source/en/api/pipelines/text_to_video_zero.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # Text2Video-Zero +
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+ [Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com). Text2Video-Zero enables zero-shot video generation using either: @@ -30,7 +37,7 @@ Our key modifications include (i) enriching the latent codes of the generated fr Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.* -You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero). +You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://huggingface.co/papers/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero). ## Usage example @@ -40,8 +47,9 @@ To generate a video from prompt, run the following Python code: ```python import torch from diffusers import TextToVideoZeroPipeline +import imageio -model_id = "runwayml/stable-diffusion-v1-5" +model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") prompt = "A panda is playing guitar on times square" @@ -50,9 +58,9 @@ result = [(r * 255).astype("uint8") for r in result] imageio.mimsave("video.mp4", result, fps=4) ``` You can change these parameters in the pipeline call: -* Motion field strength (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1): +* Motion field strength (see the [paper](https://huggingface.co/papers/2303.13439), Sect. 3.3.1): * `motion_field_strength_x` and `motion_field_strength_y`. Default: `motion_field_strength_x=12`, `motion_field_strength_y=12` -* `T` and `T'` (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1) +* `T` and `T'` (see the [paper](https://huggingface.co/papers/2303.13439), Sect. 3.3.1) * `t0` and `t1` in the range `{0, ..., num_inference_steps}`. Default: `t0=45`, `t1=48` * Video length: * `video_length`, the number of frames video_length to be generated. 
Default: `video_length=8` @@ -63,7 +71,7 @@ import torch from diffusers import TextToVideoZeroPipeline import numpy as np -model_id = "runwayml/stable-diffusion-v1-5" +model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") seed = 0 video_length = 24 #24 ÷ 4fps = 6 seconds @@ -137,7 +145,7 @@ To generate a video from prompt with additional pose control from diffusers import StableDiffusionControlNetPipeline, ControlNetModel from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor - model_id = "runwayml/stable-diffusion-v1-5" + model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16) pipe = StableDiffusionControlNetPipeline.from_pretrained( model_id, controlnet=controlnet, torch_dtype=torch.float16 @@ -155,28 +163,28 @@ To generate a video from prompt with additional pose control imageio.mimsave("video.mp4", result, fps=4) ``` - #### SDXL Support - + Since our attention processor also works with SDXL, it can be utilized to generate a video from prompt using ControlNet models powered by SDXL: ```python import torch from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor - + controlnet_model_id = 'thibaud/controlnet-openpose-sdxl-1.0' model_id = 'stabilityai/stable-diffusion-xl-base-1.0' - + controlnet = ControlNetModel.from_pretrained(controlnet_model_id, torch_dtype=torch.float16) pipe = StableDiffusionControlNetPipeline.from_pretrained( model_id, controlnet=controlnet, torch_dtype=torch.float16 ).to('cuda') - + # Set the attention processor pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) - + # fix latents for all frames latents = torch.randn((1, 4, 128, 128), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1) - + prompt = "Darth Vader dancing in a desert" result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images imageio.mimsave("video.mp4", result, fps=4) @@ -281,11 +289,8 @@ can run with custom [DreamBooth](../../training/dreambooth) models, as shown bel You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth). - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## TextToVideoZeroPipeline [[autodoc]] TextToVideoZeroPipeline diff --git a/docs/source/en/api/pipelines/unclip.md b/docs/source/en/api/pipelines/unclip.md index f379ffd63f53..7c5c2b0d9ab9 100644 --- a/docs/source/en/api/pipelines/unclip.md +++ b/docs/source/en/api/pipelines/unclip.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # unCLIP [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo). @@ -17,11 +20,8 @@ The abstract from the paper is following: You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch). - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - +> [!TIP] +> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## UnCLIPPipeline [[autodoc]] UnCLIPPipeline diff --git a/docs/source/en/api/pipelines/unidiffuser.md b/docs/source/en/api/pipelines/unidiffuser.md index 553a6d300152..2ff700e4b8be 100644 --- a/docs/source/en/api/pipelines/unidiffuser.md +++ b/docs/source/en/api/pipelines/unidiffuser.md @@ -1,4 +1,4 @@ - +> [!WARNING] +> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. + # UniDiffuser +
+ LoRA +
+
+The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.

The abstract from the paper is:
@@ -20,11 +27,8 @@ The abstract from the paper is:

You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml).

-
-
-There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become `NaNs`. This issue can be mitigated by switching to PyTorch 2.X.
-
-
+> [!WARNING]
+> There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become `NaNs`. This issue can be mitigated by switching to PyTorch 2.X.

This pipeline was contributed by [dg845](https://github.com/dg845). ❤️

@@ -190,11 +194,8 @@ final_prompt = sample.text[0]
 print(final_prompt)
 ```

-
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
-
-
+> [!TIP]
+> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## UniDiffuserPipeline
[[autodoc]] UniDiffuserPipeline
diff --git a/docs/source/en/api/pipelines/value_guided_sampling.md b/docs/source/en/api/pipelines/value_guided_sampling.md
index d21dbf04d7ee..d050ea309ca5 100644
--- a/docs/source/en/api/pipelines/value_guided_sampling.md
+++ b/docs/source/en/api/pipelines/value_guided_sampling.md
@@ -1,4 +1,4 @@
-
+
+# VisualCloze
+
+[VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning](https://huggingface.co/papers/2504.07960) is an innovative in-context-learning-based universal image generation framework that offers these key capabilities:
+1. Support for various in-domain tasks
+2. Generalization to unseen tasks through in-context learning
+3. Unification of multiple tasks into one step, generating both the target image and intermediate results
+4. Support for reverse-engineering conditions from target images
+
+## Overview
+
+The abstract from the paper is:
+
+*Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation.
Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shared a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures. The codes, dataset, and models are available at https://visualcloze.github.io.*
+
+## Inference
+
+### Model loading
+
+VisualCloze is a two-stage cascade pipeline, containing `VisualClozeGenerationPipeline` and `VisualClozeUpsamplingPipeline`.
+- In `VisualClozeGenerationPipeline`, each image is downsampled before concatenating images into a grid layout, avoiding excessively high resolutions. VisualCloze releases two models suitable for diffusers, i.e., [VisualClozePipeline-384](https://huggingface.co/VisualCloze/VisualClozePipeline-384) and [VisualClozePipeline-512](https://huggingface.co/VisualCloze/VisualClozePipeline-512), which downsample images to resolutions of 384 and 512, respectively.
+- `VisualClozeUpsamplingPipeline` uses [SDEdit](https://huggingface.co/papers/2108.01073) to enable high-resolution image synthesis.
+
+The `VisualClozePipeline` integrates both stages to support convenient end-to-end sampling, while also allowing users to utilize each pipeline independently as needed.
+
+### Input Specifications
+
+#### Task and Content Prompts
+- Task prompt: Required to describe the generation task intention
+- Content prompt: Optional description or caption of the target image
+- When the content prompt is not needed, pass `None`
+- For batch inference, pass `List[str|None]`
+
+#### Image Input Format
+- Format: `List[List[Image|None]]`
+- Structure:
+  - All rows except the last represent in-context examples
+  - Last row represents the current query (target image set to `None`)
+- For batch inference, pass `List[List[List[Image|None]]]`
+
+#### Resolution Control
+- Default behavior:
+  - Initial generation in the first stage: area of ${pipe.resolution}^2$
+  - Upsampling in the second stage: 3x factor
+- Custom resolution: Adjust using `upsampling_height` and `upsampling_width` parameters
+
+### Examples
+
+For comprehensive examples covering a wide range of tasks, please refer to the [Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze) and [GitHub Repository](https://github.com/lzyhha/VisualCloze). Below are simple examples for three cases: mask-to-image conversion, edge detection, and subject-driven generation.
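+Before the task-specific examples, here is a minimal sketch of how the batched input structure described above maps onto a single call. It reuses the mask-to-image assets from the first example below, and the duplicated query is purely illustrative of the batch shapes:
+
+```python
+# A minimal sketch of batched VisualCloze inputs. The same query is repeated
+# twice only to show the shapes; in practice each batch entry is independent.
+import torch
+from diffusers import VisualClozePipeline
+from diffusers.utils import load_image
+
+pipe = VisualClozePipeline.from_pretrained(
+    "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16
+).to("cuda")
+
+# One query: in-context example row(s) first, query row (target = None) last
+query = [
+    [
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg'),
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg'),
+    ],
+    [
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg'),
+        None,  # the image to generate
+    ],
+]
+
+task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
+
+# Batch of two queries: image is List[List[List[Image|None]]], prompts are List[str|None]
+results = pipe(
+    task_prompt=[task_prompt, task_prompt],
+    content_prompt=[None, None],
+    image=[query, query],
+    guidance_scale=30,
+    num_inference_steps=30,
+).images
+```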
+ +#### Example for mask2image + +```python +import torch +from diffusers import VisualClozePipeline +from diffusers.utils import load_image + +pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16) +pipe.to("cuda") + +# Load in-context images (make sure the paths are correct and accessible) +image_paths = [ + # in-context examples + [ + load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg'), + load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg'), + ], + # query with the target image + [ + load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg'), + None, # No image needed for the target image + ], +] + +# Task and content prompt +task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding." +content_prompt = """Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. +The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. +Its plumage is a mix of dark brown and golden hues, with intricate feather details. +The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. +The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, +soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, +tranquil, majestic, wildlife photography.""" + +# Run the pipeline +image_result = pipe( + task_prompt=task_prompt, + content_prompt=content_prompt, + image=image_paths, + upsampling_width=1344, + upsampling_height=768, + upsampling_strength=0.4, + guidance_scale=30, + num_inference_steps=30, + max_sequence_length=512, + generator=torch.Generator("cpu").manual_seed(0) +).images[0][0] + +# Save the resulting image +image_result.save("visualcloze.png") +``` + +#### Example for edge-detection + +```python +import torch +from diffusers import VisualClozePipeline +from diffusers.utils import load_image + +pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16) +pipe.to("cuda") + +# Load in-context images (make sure the paths are correct and accessible) +image_paths = [ + # in-context examples + [ + load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_image.jpg'), + load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_edge.jpg'), + ], + [ + load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_image.jpg'), + load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_edge.jpg'), + ], + # query with the target image + [ + 
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_query_image.jpg'),
+        None,  # No image needed for the target image
+    ],
+]
+
+# Task and content prompt
+task_prompt = "Each row illustrates a pathway from [IMAGE1] a sharp and beautifully composed photograph to [IMAGE2] edge map with natural well-connected outlines using a clear logical task."
+content_prompt = ""
+
+# Run the pipeline
+image_result = pipe(
+    task_prompt=task_prompt,
+    content_prompt=content_prompt,
+    image=image_paths,
+    upsampling_width=864,
+    upsampling_height=1152,
+    upsampling_strength=0.4,
+    guidance_scale=30,
+    num_inference_steps=30,
+    max_sequence_length=512,
+    generator=torch.Generator("cpu").manual_seed(0)
+).images[0][0]
+
+# Save the resulting image
+image_result.save("visualcloze.png")
+```
+
+#### Example for subject-driven generation
+
+```python
+import torch
+from diffusers import VisualClozePipeline
+from diffusers.utils import load_image
+
+pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
+pipe.to("cuda")
+
+# Load in-context images (make sure the paths are correct and accessible)
+image_paths = [
+    # in-context examples
+    [
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_reference.jpg'),
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_depth.jpg'),
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_image.jpg'),
+    ],
+    [
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_reference.jpg'),
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_depth.jpg'),
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_image.jpg'),
+    ],
+    # query with the target image
+    [
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_reference.jpg'),
+        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_depth.jpg'),
+        None,  # No image needed for the target image
+    ],
+]
+
+# Task and content prompt
+task_prompt = """Each row describes a process that begins with [IMAGE1] an image containing the key object,
+[IMAGE2] depth map revealing gray-toned spatial layers and results in
+[IMAGE3] an image with artistic quality, a high-quality image with exceptional detail."""
+content_prompt = """A vintage porcelain collector's item.
Beneath a blossoming cherry tree in early spring, +this treasure is photographed up close, with soft pink petals drifting through the air and vibrant blossoms framing the scene.""" + +# Run the pipeline +image_result = pipe( + task_prompt=task_prompt, + content_prompt=content_prompt, + image=image_paths, + upsampling_width=1024, + upsampling_height=1024, + upsampling_strength=0.2, + guidance_scale=30, + num_inference_steps=30, + max_sequence_length=512, + generator=torch.Generator("cpu").manual_seed(0) +).images[0][0] + +# Save the resulting image +image_result.save("visualcloze.png") +``` + +#### Utilize each pipeline independently + +```python +import torch +from diffusers import VisualClozeGenerationPipeline, FluxFillPipeline as VisualClozeUpsamplingPipeline +from diffusers.utils import load_image +from PIL import Image + +pipe = VisualClozeGenerationPipeline.from_pretrained( + "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16 +) +pipe.to("cuda") + +image_paths = [ + # in-context examples + [ + load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg" + ), + load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg" + ), + ], + # query with the target image + [ + load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg" + ), + None, # No image needed for the target image + ], +] +task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding." +content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography." 
+ +# Stage 1: Generate initial image +image = pipe( + task_prompt=task_prompt, + content_prompt=content_prompt, + image=image_paths, + guidance_scale=30, + num_inference_steps=30, + max_sequence_length=512, + generator=torch.Generator("cpu").manual_seed(0), +).images[0][0] + +# Stage 2 (optional): Upsample the generated image +pipe_upsample = VisualClozeUpsamplingPipeline.from_pipe(pipe) +pipe_upsample.to("cuda") + +mask_image = Image.new("RGB", image.size, (255, 255, 255)) + +image = pipe_upsample( + image=image, + mask_image=mask_image, + prompt=content_prompt, + width=1344, + height=768, + strength=0.4, + guidance_scale=30, + num_inference_steps=30, + max_sequence_length=512, + generator=torch.Generator("cpu").manual_seed(0), +).images[0] + +image.save("visualcloze.png") +``` + +## VisualClozePipeline + +[[autodoc]] VisualClozePipeline + - all + - __call__ + +## VisualClozeGenerationPipeline + +[[autodoc]] VisualClozeGenerationPipeline + - all + - __call__ diff --git a/docs/source/en/api/pipelines/wan.md b/docs/source/en/api/pipelines/wan.md new file mode 100644 index 000000000000..3289a840e2b1 --- /dev/null +++ b/docs/source/en/api/pipelines/wan.md @@ -0,0 +1,364 @@ + + +
+
+ + LoRA + +
+
+ +# Wan + +[Wan-2.1](https://huggingface.co/papers/2503.20314) by the Wan Team. + +*This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at [this https URL](https://github.com/Wan-Video/Wan2.1).* + +You can find all the original Wan2.1 checkpoints under the [Wan-AI](https://huggingface.co/Wan-AI) organization. + +The following Wan models are supported in Diffusers: + +- [Wan 2.1 T2V 1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers) +- [Wan 2.1 T2V 14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers) +- [Wan 2.1 I2V 14B - 480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P-Diffusers) +- [Wan 2.1 I2V 14B - 720P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P-Diffusers) +- [Wan 2.1 FLF2V 14B - 720P](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers) +- [Wan 2.1 VACE 1.3B](https://huggingface.co/Wan-AI/Wan2.1-VACE-1.3B-diffusers) +- [Wan 2.1 VACE 14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B-diffusers) +- [Wan 2.2 T2V 14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) +- [Wan 2.2 I2V 14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers) +- [Wan 2.2 TI2V 5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers) + +> [!TIP] +> Click on the Wan models in the right sidebar for more examples of video generation. + +### Text-to-Video Generation + +The example below demonstrates how to generate a video from text optimized for memory or inference speed. + + + + +Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. + +The Wan2.1 text-to-video model below requires ~13GB of VRAM. 
+ +```py +# pip install ftfy +import torch +import numpy as np +from diffusers import AutoModel, WanPipeline +from diffusers.quantizers import PipelineQuantizationConfig +from diffusers.hooks.group_offloading import apply_group_offloading +from diffusers.utils import export_to_video, load_image +from transformers import UMT5EncoderModel + +text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16) +vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32) +transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) + +# group-offloading +onload_device = torch.device("cuda") +offload_device = torch.device("cpu") +apply_group_offloading(text_encoder, + onload_device=onload_device, + offload_device=offload_device, + offload_type="block_level", + num_blocks_per_group=4 +) +transformer.enable_group_offload( + onload_device=onload_device, + offload_device=offload_device, + offload_type="leaf_level", + use_stream=True +) + +pipeline = WanPipeline.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + vae=vae, + transformer=transformer, + text_encoder=text_encoder, + torch_dtype=torch.bfloat16 +) +pipeline.to("cuda") + +prompt = """ +The camera rushes from far to near in a low-angle shot, +revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in +for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. +Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic +shadows and warm highlights. Medium composition, front view, low angle, with depth of field. +""" +negative_prompt = """ +Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, +low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, +misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards +""" + +output = pipeline( + prompt=prompt, + negative_prompt=negative_prompt, + num_frames=81, + guidance_scale=5.0, +).frames[0] +export_to_video(output, "output.mp4", fps=16) +``` + + + + +[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. 
+ +```py +# pip install ftfy +import torch +import numpy as np +from diffusers import AutoModel, WanPipeline +from diffusers.hooks.group_offloading import apply_group_offloading +from diffusers.utils import export_to_video, load_image +from transformers import UMT5EncoderModel + +text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16) +vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32) +transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) + +pipeline = WanPipeline.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + vae=vae, + transformer=transformer, + text_encoder=text_encoder, + torch_dtype=torch.bfloat16 +) +pipeline.to("cuda") + +# torch.compile +pipeline.transformer.to(memory_format=torch.channels_last) +pipeline.transformer = torch.compile( + pipeline.transformer, mode="max-autotune", fullgraph=True +) + +prompt = """ +The camera rushes from far to near in a low-angle shot, +revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in +for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. +Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic +shadows and warm highlights. Medium composition, front view, low angle, with depth of field. +""" +negative_prompt = """ +Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, +low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, +misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards +""" + +output = pipeline( + prompt=prompt, + negative_prompt=negative_prompt, + num_frames=81, + guidance_scale=5.0, +).frames[0] +export_to_video(output, "output.mp4", fps=16) +``` + + + + +### First-Last-Frame-to-Video Generation + +The example below demonstrates how to use the image-to-video pipeline to generate a video using a text description, a starting frame, and an ending frame. 
+
+
+
+```python
+import numpy as np
+import torch
+import torchvision.transforms.functional as TF
+from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
+from diffusers.utils import export_to_video, load_image
+from transformers import CLIPVisionModel
+
+
+model_id = "Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers"
+image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
+vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
+pipe = WanImageToVideoPipeline.from_pretrained(
+    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
+)
+pipe.to("cuda")
+
+first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
+last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")
+
+def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
+    aspect_ratio = image.height / image.width
+    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
+    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
+    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
+    image = image.resize((width, height))
+    return image, height, width
+
+def center_crop_resize(image, height, width):
+    # Scale the image up just enough to cover the target size
+    resize_ratio = max(width / image.width, height / image.height)
+    image = image.resize((round(image.width * resize_ratio), round(image.height * resize_ratio)))
+
+    # Center-crop to the target size; torchvision expects (height, width)
+    image = TF.center_crop(image, [height, width])
+
+    return image, height, width
+
+first_frame, height, width = aspect_ratio_resize(first_frame, pipe)
+if last_frame.size != first_frame.size:
+    last_frame, _, _ = center_crop_resize(last_frame, height, width)
+
+prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
+
+output = pipe(
+    image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.5
+).frames[0]
+export_to_video(output, "output.mp4", fps=16)
+```
+
+
+
+### Any-to-Video Controllable Generation
+
+Wan VACE supports various generation techniques that achieve controllable video generation. Some of the capabilities include:
+- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Boundary Box, etc.). Recommended library for preprocessing videos to obtain control videos: [huggingface/controlnet_aux](https://github.com/huggingface/controlnet_aux)
+- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
+- Inpainting and Outpainting
+- Subject to Video (faces, objects, characters, etc.)
+- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)
+
+The code snippets available in [this](https://github.com/huggingface/diffusers/pull/11582) pull request demonstrate some examples of how videos can be generated with controllability signals.
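+The sketch below illustrates the masking convention explained in the next paragraph, using first-frame conditioning as a concrete case. It is a hedged sketch: the argument names follow [`WanVACEPipeline`], but the prompt, resolution, and gray placeholder frames are illustrative rather than taken from the original docs:
+
+```python
+# A minimal, illustrative sketch of WanVACE first-frame conditioning.
+# Black mask = keep the frame as conditioning; white mask = generate new content.
+import torch
+import PIL.Image
+from diffusers import WanVACEPipeline
+from diffusers.utils import export_to_video, load_image
+
+pipe = WanVACEPipeline.from_pretrained("Wan-AI/Wan2.1-VACE-1.3B-diffusers", torch_dtype=torch.bfloat16)
+pipe.to("cuda")
+
+num_frames, height, width = 81, 480, 832
+first_frame = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
+).resize((width, height))
+
+# Frame 0 is the conditioning frame; the remaining frames are gray placeholders.
+video = [first_frame] + [PIL.Image.new("RGB", (width, height), (128, 128, 128))] * (num_frames - 1)
+# Black mask for the kept frame, white masks for the frames to generate.
+mask = [PIL.Image.new("L", (width, height), 0)] + [PIL.Image.new("L", (width, height), 255)] * (num_frames - 1)
+
+output = pipe(
+    prompt="A small blue bird takes off from the ground, flapping its wings.",
+    video=video,
+    mask=mask,
+    height=height,
+    width=width,
+    num_frames=num_frames,
+    guidance_scale=5.0,
+).frames[0]
+export_to_video(output, "output.mp4", fps=16)
+```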
+
+The general rule of thumb when preparing inputs for the VACE pipeline: every input image, or every video frame you want to use for conditioning, should be paired with a black mask. A black mask tells the model not to generate new content for that area and to use it only to condition the generation process. Parts or frames that the model should generate get a white mask.
+
+## Notes
+
+- Wan2.1 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`].
+
+ Show example code + + ```py + # pip install ftfy + import torch + from diffusers import AutoModel, WanPipeline + from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler + from diffusers.utils import export_to_video + + vae = AutoModel.from_pretrained( + "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32 + ) + pipeline = WanPipeline.from_pretrained( + "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", vae=vae, torch_dtype=torch.bfloat16 + ) + pipeline.scheduler = UniPCMultistepScheduler.from_config( + pipeline.scheduler.config, flow_shift=5.0 + ) + pipeline.to("cuda") + + pipeline.load_lora_weights("benjamin-paine/steamboat-willie-1.3b", adapter_name="steamboat-willie") + pipeline.set_adapters("steamboat-willie") + + pipeline.enable_model_cpu_offload() + + # use "steamboat willie style" to trigger the LoRA + prompt = """ + steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot, + revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in + for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. + Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic + shadows and warm highlights. Medium composition, front view, low angle, with depth of field. + """ + + output = pipeline( + prompt=prompt, + num_frames=81, + guidance_scale=5.0, + ).frames[0] + export_to_video(output, "output.mp4", fps=16) + ``` + +
+
+- [`WanTransformer3DModel`] and [`AutoencoderKLWan`] support loading from single files with [`~loaders.FromSingleFileMixin.from_single_file`].
+
+ Show example code + + ```py + # pip install ftfy + import torch + from diffusers import WanPipeline, WanTransformer3DModel, AutoencoderKLWan + + vae = AutoencoderKLWan.from_single_file( + "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/vae/wan_2.1_vae.safetensors" + ) + transformer = WanTransformer3DModel.from_single_file( + "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/diffusion_models/wan2.1_t2v_1.3B_bf16.safetensors", + torch_dtype=torch.bfloat16 + ) + pipeline = WanPipeline.from_pretrained( + "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", + vae=vae, + transformer=transformer, + torch_dtype=torch.bfloat16 + ) + ``` + +
+
+- Set the [`AutoencoderKLWan`] dtype to `torch.float32` for better decoding quality.
+
+- The number of frames to generate should be `4 * k + 1` for some integer `k` (for example, `num_frames=81` corresponds to `k=20`).
+
+- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution videos.
+
+- Wan 2.1 and 2.2 support using [LightX2V LoRAs](https://huggingface.co/Kijai/WanVideo_comfy/tree/main/Lightx2v) to speed up inference. Using them on Wan 2.2 is slightly more involved. Refer to [this code snippet](https://github.com/huggingface/diffusers/pull/12040#issuecomment-3144185272) to learn more.
+
+- Wan 2.2 has two denoisers. By default, LoRAs are only loaded into the first denoiser. Set `load_into_transformer_2=True` to load LoRAs into the second denoiser. Refer to [this example](https://github.com/huggingface/diffusers/pull/12074#issue-3292620048) and [this one](https://github.com/huggingface/diffusers/pull/12074#issuecomment-3155896144) to learn more.
+
+## WanPipeline
+
+[[autodoc]] WanPipeline
+  - all
+  - __call__
+
+## WanImageToVideoPipeline
+
+[[autodoc]] WanImageToVideoPipeline
+  - all
+  - __call__
+
+## WanVACEPipeline
+
+[[autodoc]] WanVACEPipeline
+  - all
+  - __call__
+
+## WanVideoToVideoPipeline
+
+[[autodoc]] WanVideoToVideoPipeline
+  - all
+  - __call__
+
+## WanPipelineOutput
+
+[[autodoc]] pipelines.wan.pipeline_output.WanPipelineOutput
\ No newline at end of file
diff --git a/docs/source/en/api/pipelines/wuerstchen.md b/docs/source/en/api/pipelines/wuerstchen.md
index 4d90ad46dc64..2be3631d8456 100644
--- a/docs/source/en/api/pipelines/wuerstchen.md
+++ b/docs/source/en/api/pipelines/wuerstchen.md
@@ -1,4 +1,4 @@
-
+
+# Quantization
+
+Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeds up inference.
+
+> [!TIP]
+> Learn how to quantize models in the [Quantization](../quantization/overview) guide.
+
+## PipelineQuantizationConfig
+
+[[autodoc]] quantizers.PipelineQuantizationConfig
+
+## BitsAndBytesConfig
+
+[[autodoc]] quantizers.quantization_config.BitsAndBytesConfig
+
+## GGUFQuantizationConfig
+
+[[autodoc]] quantizers.quantization_config.GGUFQuantizationConfig
+
+## QuantoConfig
+
+[[autodoc]] quantizers.quantization_config.QuantoConfig
+
+## TorchAoConfig
+
+[[autodoc]] quantizers.quantization_config.TorchAoConfig
+
+## DiffusersQuantizer
+
+[[autodoc]] quantizers.base.DiffusersQuantizer
diff --git a/docs/source/en/api/schedulers/cm_stochastic_iterative.md b/docs/source/en/api/schedulers/cm_stochastic_iterative.md
index 89e50b5d6b61..edc10529db2c 100644
--- a/docs/source/en/api/schedulers/cm_stochastic_iterative.md
+++ b/docs/source/en/api/schedulers/cm_stochastic_iterative.md
@@ -1,4 +1,4 @@
-
+
+# CosineDPMSolverMultistepScheduler
+
+The [`CosineDPMSolverMultistepScheduler`] is a variant of [`DPMSolverMultistepScheduler`] with a cosine schedule, proposed by Nichol and Dhariwal (2021).
+It is used in the [Stable Audio Open](https://huggingface.co/papers/2407.14358) paper and the [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) codebase.
+
+This scheduler was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe).
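+A minimal sketch of swapping it in with the standard diffusers `from_config` pattern (the checkpoint name is an assumption, and Stable Audio already defaults to this scheduler, so the swap is purely illustrative):
+
+```python
+# An illustrative sketch: the standard pattern for setting this scheduler on a
+# pipeline. The checkpoint id is an assumption.
+import torch
+from diffusers import CosineDPMSolverMultistepScheduler, StableAudioPipeline
+
+pipe = StableAudioPipeline.from_pretrained(
+    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
+).to("cuda")
+pipe.scheduler = CosineDPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+
+audio = pipe(
+    prompt="The sound of a hammer hitting a wooden surface",
+    num_inference_steps=100,
+    audio_end_in_s=10.0,
+).audios[0]
+```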
+
+## CosineDPMSolverMultistepScheduler
+[[autodoc]] CosineDPMSolverMultistepScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
diff --git a/docs/source/en/api/schedulers/ddim.md b/docs/source/en/api/schedulers/ddim.md
index 952855dbd2ac..61ef30c786f9 100644
--- a/docs/source/en/api/schedulers/ddim.md
+++ b/docs/source/en/api/schedulers/ddim.md
@@ -1,4 +1,4 @@
-
+
+# CogVideoXDDIMScheduler
+
+`CogVideoXDDIMScheduler` is based on [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502), specifically for CogVideoX models.
+
+## CogVideoXDDIMScheduler
+
+[[autodoc]] CogVideoXDDIMScheduler
diff --git a/docs/source/en/api/schedulers/ddim_inverse.md b/docs/source/en/api/schedulers/ddim_inverse.md
index 82069cce4c53..fdae50c2ab27 100644
--- a/docs/source/en/api/schedulers/ddim_inverse.md
+++ b/docs/source/en/api/schedulers/ddim_inverse.md
@@ -1,4 +1,4 @@
-
+
+# FlowMatchEulerDiscreteScheduler
+
+`FlowMatchEulerDiscreteScheduler` is based on the flow-matching sampling introduced in [Stable Diffusion 3](https://huggingface.co/papers/2403.03206).
+
+## FlowMatchEulerDiscreteScheduler
+[[autodoc]] FlowMatchEulerDiscreteScheduler
diff --git a/docs/source/en/api/schedulers/flow_match_heun_discrete.md b/docs/source/en/api/schedulers/flow_match_heun_discrete.md
new file mode 100644
index 000000000000..c6359ed843d0
--- /dev/null
+++ b/docs/source/en/api/schedulers/flow_match_heun_discrete.md
@@ -0,0 +1,18 @@
+
+
+# FlowMatchHeunDiscreteScheduler
+
+`FlowMatchHeunDiscreteScheduler` is based on the flow-matching sampling introduced in [Stable Diffusion 3](https://huggingface.co/papers/2403.03206).
+
+## FlowMatchHeunDiscreteScheduler
+[[autodoc]] FlowMatchHeunDiscreteScheduler
diff --git a/docs/source/en/api/schedulers/heun.md b/docs/source/en/api/schedulers/heun.md
index bca5cf743d05..efe6ad23a373 100644
--- a/docs/source/en/api/schedulers/heun.md
+++ b/docs/source/en/api/schedulers/heun.md
@@ -1,4 +1,4 @@
-
+
+# CogVideoXDPMScheduler
+
+`CogVideoXDPMScheduler` is based on [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095), specifically for CogVideoX models.
+
+## CogVideoXDPMScheduler
+
+[[autodoc]] CogVideoXDPMScheduler
diff --git a/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md b/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md
index b77a5cf14079..80b8cb182395 100644
--- a/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md
+++ b/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md
@@ -1,4 +1,4 @@
-
-# TCDScheduler 
+# TCDScheduler

[Trajectory Consistency Distillation](https://huggingface.co/papers/2402.19159) by Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao and Tat-Jen Cham introduced a Strategic Stochastic Sampling (Algorithm 4) that is capable of generating good samples in a small number of steps. As an advanced iteration of the multistep scheduler (Algorithm 1) in [Consistency Models](https://huggingface.co/papers/2303.01469), Strategic Stochastic Sampling is specifically tailored to the trajectory consistency function.
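A minimal sketch of pairing the scheduler with the authors' TCD-LoRA distillation weights (the checkpoint ids are assumptions based on the public releases; `eta` is the stochastic parameter, referred to as *gamma* in the paper):

```python
# An illustrative sketch of TCD sampling with SDXL. Checkpoint ids are assumptions.
import torch
from diffusers import StableDiffusionXLPipeline, TCDScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

# Assumed repo id for the authors' TCD-SDXL LoRA
pipe.load_lora_weights("h1t/TCD-SDXL-LoRA")
pipe.fuse_lora()

image = pipe(
    prompt="a photo of a cat wearing a tiny wizard hat",
    num_inference_steps=4,
    guidance_scale=0.0,
    eta=0.3,  # Strategic Stochastic Sampling noise parameter
).images[0]
```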
diff --git a/docs/source/en/api/schedulers/unipc.md b/docs/source/en/api/schedulers/unipc.md index d82345996fba..54c32222ba4f 100644 --- a/docs/source/en/api/schedulers/unipc.md +++ b/docs/source/en/api/schedulers/unipc.md @@ -1,4 +1,4 @@ - + +# Video Processor + +The [`VideoProcessor`] provides a unified API for video pipelines to prepare inputs for VAE encoding and post-processing outputs once they're decoded. The class inherits [`VaeImageProcessor`] so it includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays. + +## VideoProcessor + +[[autodoc]] video_processor.VideoProcessor.preprocess_video + +[[autodoc]] video_processor.VideoProcessor.postprocess_video diff --git a/docs/source/en/community_projects.md b/docs/source/en/community_projects.md new file mode 100644 index 000000000000..339e538da9e2 --- /dev/null +++ b/docs/source/en/community_projects.md @@ -0,0 +1,90 @@ + + +# Community Projects + +Welcome to Community Projects. This space is dedicated to showcasing the incredible work and innovative applications created by our vibrant community using the `diffusers` library. + +This section aims to: + +- Highlight diverse and inspiring projects built with `diffusers` +- Foster knowledge sharing within our community +- Provide real-world examples of how `diffusers` can be leveraged + +Happy exploring, and thank you for being part of the Diffusers community! + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Project Name | Description |
|---|---|
| dream-textures | Stable Diffusion built into Blender |
| HiDiffusion | Increases the resolution and speed of your diffusion model by only adding a single line of code |
| IC-Light | IC-Light is a project to manipulate the illumination of images |
| InstantID | InstantID: Zero-shot Identity-Preserving Generation in Seconds |
| IOPaint | Image inpainting tool powered by SOTA AI models. Remove any unwanted objects, defects, or people from your pictures, or erase and replace (powered by Stable Diffusion) anything in them |
| Kohya | Gradio GUI for Kohya's Stable Diffusion trainers |
| MagicAnimate | MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model |
| OOTDiffusion | Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on |
| SD.Next | SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models |
| stable-dreamfusion | Text-to-3D, Image-to-3D, and Mesh Exportation with NeRF + Diffusion |
| StoryDiffusion | StoryDiffusion can create a magic story by generating consistent images and videos |
| StreamDiffusion | A Pipeline-Level Solution for Real-Time Interactive Generation |
| Stable Diffusion Server | A server configured for Inpainting/Generation/img2img with one Stable Diffusion model |
| Model Search | Search models on Civitai and Hugging Face |
| Skrample | Fully modular scheduler functions with first-class diffusers integration |
diff --git a/docs/source/en/conceptual/contribution.md b/docs/source/en/conceptual/contribution.md
index 24ac52ba19c9..e39a6434f095 100644
--- a/docs/source/en/conceptual/contribution.md
+++ b/docs/source/en/conceptual/contribution.md
@@ -1,4 +1,4 @@
-
+
+# Hybrid Inference
+
+**Empowering local AI builders with Hybrid Inference**
+
+
+> [!TIP]
+> Hybrid Inference is an [experimental feature](https://huggingface.co/blog/remote_vae).
+> Feedback can be provided [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml).
+
+
+
+## Why use Hybrid Inference?
+
+Hybrid Inference offers a fast and simple way to offload local generation requirements.
+
+- 🚀 **Reduced Requirements:** Access powerful models without expensive hardware.
+- 💎 **Without Compromise:** Achieve the highest quality without sacrificing performance.
+- 💰 **Cost Effective:** It's free! 🤑
+- 🎯 **Diverse Use Cases:** Fully compatible with Diffusers 🧨 and the wider community.
+- 🔧 **Developer-Friendly:** Simple requests, fast responses.
+
+---
+
+## Available Models
+
+* **VAE Decode 🖼️:** Quickly decode latent representations into high-quality images without compromising performance or workflow speed.
+* **VAE Encode 🔢:** Efficiently encode images into latent representations for generation and training.
+* **Text Encoders 📃 (coming soon):** Compute text embeddings for your prompts quickly and accurately, ensuring a smooth and high-quality workflow.
+
+---
+
+## Integrations
+
+* **[SD.Next](https://github.com/vladmandic/sdnext):** All-in-one UI with direct support for Hybrid Inference.
+* **[ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae):** ComfyUI node for Hybrid Inference.
+
+## Changelog
+
+- March 10 2025: Added VAE encode
+- March 2 2025: Initial release with VAE decoding
+
+## Contents
+
+The documentation is organized into three sections:
+
+* **VAE Decode** Learn the basics of how to use VAE Decode with Hybrid Inference.
+* **VAE Encode** Learn the basics of how to use VAE Encode with Hybrid Inference.
+* **API Reference** Dive into task-specific settings and parameters.
diff --git a/docs/source/en/hybrid_inference/vae_decode.md b/docs/source/en/hybrid_inference/vae_decode.md
new file mode 100644
index 000000000000..1457090550c7
--- /dev/null
+++ b/docs/source/en/hybrid_inference/vae_decode.md
@@ -0,0 +1,345 @@
+# Getting Started: VAE Decode with Hybrid Inference
+
+VAE decode is an essential component of diffusion models - turning latent representations into images or videos.
+
+## Memory
+
+These tables demonstrate the VRAM requirements for VAE decode with SD v1 and SD XL on different GPUs.
+
+For the majority of these GPUs, the memory usage % means the other models (text encoders, UNet/Transformer) must be offloaded, or tiled decoding must be used, which increases the time taken and impacts quality.
SD v1.5 + +| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (secs) | Tiled Memory (%) | +| --- | --- | --- | --- | --- | --- | +| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% | +| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.148 | 20.00% | 0.301 (+103%) | 5.60% | +| NVIDIA GeForce RTX 4080 | 512x512 | 0.05 | 8.40% | 0.050 (0%) | 8.40% | +| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.224 | 30.00% | 0.356 (+59%) | 8.40% | +| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.066 | 11.30% | 0.066 (0%) | 11.30% | +| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.284 | 40.50% | 0.454 (+60%) | 11.40% | +| NVIDIA GeForce RTX 3090 | 512x512 | 0.062 | 5.20% | 0.062 (0%) | 5.20% | +| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.253 | 18.50% | 0.464 (+83%) | 5.20% | +| NVIDIA GeForce RTX 3080 | 512x512 | 0.07 | 12.80% | 0.070 (0%) | 12.80% | +| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.286 | 45.30% | 0.466 (+63%) | 12.90% | +| NVIDIA GeForce RTX 3070 | 512x512 | 0.102 | 15.90% | 0.102 (0%) | 15.90% | +| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.421 | 56.30% | 0.746 (+77%) | 16.00% | + +
+ +
SDXL + +| GPU | Resolution | Time (seconds) | Memory Consumed (%) | Tiled Time (seconds) | Tiled Memory (%) | +| --- | --- | --- | --- | --- | --- | +| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% | +| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.256 | 35.50% | 0.257 (+0.4%) | 35.50% | +| NVIDIA GeForce RTX 4080 | 512x512 | 0.092 | 15.00% | 0.092 (0%) | 15.00% | +| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.406 | 53.30% | 0.406 (0%) | 53.30% | +| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.121 | 20.20% | 0.120 (-0.8%) | 20.20% | +| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.519 | 72.00% | 0.519 (0%) | 72.00% | +| NVIDIA GeForce RTX 3090 | 512x512 | 0.107 | 10.50% | 0.107 (0%) | 10.50% | +| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.459 | 38.00% | 0.460 (+0.2%) | 38.00% | +| NVIDIA GeForce RTX 3080 | 512x512 | 0.121 | 25.60% | 0.121 (0%) | 25.60% | +| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.524 | 93.00% | 0.524 (0%) | 93.00% | +| NVIDIA GeForce RTX 3070 | 512x512 | 0.183 | 31.80% | 0.183 (0%) | 31.80% | +| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.794 | 96.40% | 0.794 (0%) | 96.40% | + +
+ +## Available VAEs + +| | **Endpoint** | **Model** | +|:-:|:-----------:|:--------:| +| **Stable Diffusion v1** | [https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud](https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) | +| **Stable Diffusion XL** | [https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud](https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) | +| **Flux** | [https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud](https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) | +| **HunyuanVideo** | [https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud](https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud) | [`hunyuanvideo-community/HunyuanVideo`](https://hf.co/hunyuanvideo-community/HunyuanVideo) | + + +> [!TIP] +> Model support can be requested [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml). + + +## Code + +> [!TIP] +> Install `diffusers` from `main` to run the code: `pip install git+https://github.com/huggingface/diffusers@main` + + +A helper method simplifies interacting with Hybrid Inference. + +```python +from diffusers.utils.remote_utils import remote_decode +``` + +### Basic example + +Here, we show how to use the remote VAE on random tensors. + +
Code
+
+```python
+import torch
+
+image = remote_decode(
+    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
+    scaling_factor=0.18215,
+)
+```
+
+ +
+ +
+
+Usage for Flux is slightly different. Flux latents are packed into a sequence of 2x2 patches, so the endpoint also needs the target `height` and `width` to unpack them.
+
Code
+
+```python
+import torch
+
+image = remote_decode(
+    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
+    height=1024,
+    width=1024,
+    scaling_factor=0.3611,
+    shift_factor=0.1159,
+)
+```
+
+ +
+ +
+ +Finally, an example for HunyuanVideo. + +
Code
+
+```python
+import torch
+
+video = remote_decode(
+    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
+    output_type="mp4",
+)
+with open("video.mp4", "wb") as f:
+    f.write(video)
+```
+
+ +
+ +
+ + +### Generation + +But we want to use the VAE on an actual pipeline to get an actual image, not random noise. The example below shows how to do it with SD v1.5. + +
Code
+
+```python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "stable-diffusion-v1-5/stable-diffusion-v1-5",
+    torch_dtype=torch.float16,
+    variant="fp16",
+    vae=None,
+).to("cuda")
+
+prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"
+
+latent = pipe(
+    prompt=prompt,
+    output_type="latent",
+).images
+image = remote_decode(
+    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    scaling_factor=0.18215,
+)
+image.save("test.jpg")
+```
+
+ +
+ +
+ +Here’s another example with Flux. + +
Code
+
+```python
+import torch
+from diffusers import FluxPipeline
+
+pipe = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-schnell",
+    torch_dtype=torch.bfloat16,
+    vae=None,
+).to("cuda")
+
+prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"
+
+latent = pipe(
+    prompt=prompt,
+    guidance_scale=0.0,
+    num_inference_steps=4,
+    output_type="latent",
+).images
+image = remote_decode(
+    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    height=1024,
+    width=1024,
+    scaling_factor=0.3611,
+    shift_factor=0.1159,
+)
+image.save("test.jpg")
+```
+
+ +
+ +
+ +Here’s an example with HunyuanVideo. + +
Code
+
+```python
+import torch
+from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
+
+model_id = "hunyuanvideo-community/HunyuanVideo"
+transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
+)
+pipe = HunyuanVideoPipeline.from_pretrained(
+    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16
+).to("cuda")
+
+latent = pipe(
+    prompt="A cat walks on the grass, realistic",
+    height=320,
+    width=512,
+    num_frames=61,
+    num_inference_steps=30,
+    output_type="latent",
+).frames
+
+video = remote_decode(
+    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    output_type="mp4",
+)
+
+if isinstance(video, bytes):
+    with open("video.mp4", "wb") as f:
+        f.write(video)
+```
+
+ +
+ +
+ + +### Queueing + +One of the great benefits of using a remote VAE is that we can queue multiple generation requests. While the current latent is being processed for decoding, we can already queue another one. This helps improve concurrency. + + +
Code
+
+```python
+import queue
+import threading
+import torch
+from IPython.display import display
+from diffusers import StableDiffusionPipeline
+
+def decode_worker(q: queue.Queue):
+    while True:
+        item = q.get()
+        if item is None:
+            break
+        image = remote_decode(
+            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
+            tensor=item,
+            scaling_factor=0.18215,
+        )
+        display(image)
+        q.task_done()
+
+q = queue.Queue()
+thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
+thread.start()
+
+def decode(latent: torch.Tensor):
+    q.put(latent)
+
+prompts = [
+    "Blueberry ice cream, in a stylish modern glass , ice cubes, nuts, mint leaves, splashing milk cream, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious",
+    "Lemonade in a glass, mint leaves, in an aqua and white background, flowers, ice cubes, halo, fluid motion, dynamic movement, soft lighting, digital painting, rule of thirds composition, Art by Greg rutkowski, Coby whitmore",
+    "Comic book art, beautiful, vintage, pastel neon colors, extremely detailed pupils, delicate features, light on face, slight smile, Artgerm, Mary Blair, Edmund Dulac, long dark locks, bangs, glowing, fashionable style, fairytale ambience, hot pink.",
+    "Masterpiece, vanilla cone ice cream garnished with chocolate syrup, crushed nuts, choco flakes, in a brown background, gold, cinematic lighting, Art by WLOP",
+    "A bowl of milk, falling cornflakes, berries, blueberries, in a white background, soft lighting, intricate details, rule of thirds, octane render, volumetric lighting",
+    "Cold Coffee with cream, crushed almonds, in a glass, choco flakes, ice cubes, wet, in a wooden background, cinematic lighting, hyper realistic painting, art by Carne Griffiths, octane render, volumetric lighting, fluid motion, dynamic movement, muted colors,",
+]
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "Lykon/dreamshaper-8",
+    torch_dtype=torch.float16,
+    vae=None,
+).to("cuda")
+
+pipe.unet = pipe.unet.to(memory_format=torch.channels_last)
+pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+_ = pipe(
+    prompt=prompts[0],
+    output_type="latent",
+)
+
+for prompt in prompts:
+    latent = pipe(
+        prompt=prompt,
+        output_type="latent",
+    ).images
+    decode(latent)
+
+q.put(None)
+thread.join()
+```
+
+ + +
+ +
+
+## Integrations
+
+* **[SD.Next](https://github.com/vladmandic/sdnext):** All-in-one UI with direct support for Hybrid Inference.
+* **[ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae):** ComfyUI node for Hybrid Inference.
diff --git a/docs/source/en/hybrid_inference/vae_encode.md b/docs/source/en/hybrid_inference/vae_encode.md
new file mode 100644
index 000000000000..dd285fa25c03
--- /dev/null
+++ b/docs/source/en/hybrid_inference/vae_encode.md
@@ -0,0 +1,183 @@
+# Getting Started: VAE Encode with Hybrid Inference
+
+VAE encode is used for training, image-to-image, and image-to-video - turning images or videos into latent representations.
+
+## Memory
+
+These tables demonstrate the VRAM requirements for VAE encode with SD v1 and SD XL on different GPUs.
+
+For the majority of these GPUs, the memory usage % means the other models (text encoders, UNet/Transformer) must be offloaded, or tiled encoding must be used, which increases the time taken and impacts quality.
SD v1.5 + +| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (secs) | Tiled Memory (%) | +|:------------------------------|:-------------|-----------------:|-------------:|--------------------:|-------------------:| +| NVIDIA GeForce RTX 4090 | 512x512 | 0.015 | 3.51901 | 0.015 | 3.51901 | +| NVIDIA GeForce RTX 4090 | 256x256 | 0.004 | 1.3154 | 0.005 | 1.3154 | +| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.402 | 47.1852 | 0.496 | 3.51901 | +| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.078 | 12.2658 | 0.094 | 3.51901 | +| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.023 | 5.30105 | 0.023 | 5.30105 | +| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.006 | 1.98152 | 0.006 | 1.98152 | +| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 0.574 | 71.08 | 0.656 | 5.30105 | +| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.111 | 18.4772 | 0.14 | 5.30105 | +| NVIDIA GeForce RTX 3090 | 512x512 | 0.032 | 3.52782 | 0.032 | 3.52782 | +| NVIDIA GeForce RTX 3090 | 256x256 | 0.01 | 1.31869 | 0.009 | 1.31869 | +| NVIDIA GeForce RTX 3090 | 2048x2048 | 0.742 | 47.3033 | 0.954 | 3.52782 | +| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.136 | 12.2965 | 0.207 | 3.52782 | +| NVIDIA GeForce RTX 3080 | 512x512 | 0.036 | 8.51761 | 0.036 | 8.51761 | +| NVIDIA GeForce RTX 3080 | 256x256 | 0.01 | 3.18387 | 0.01 | 3.18387 | +| NVIDIA GeForce RTX 3080 | 2048x2048 | 0.863 | 86.7424 | 1.191 | 8.51761 | +| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.157 | 29.6888 | 0.227 | 8.51761 | +| NVIDIA GeForce RTX 3070 | 512x512 | 0.051 | 10.6941 | 0.051 | 10.6941 | +| NVIDIA GeForce RTX 3070 | 256x256 | 0.015 | 3.99743 | 0.015 | 3.99743 | +| NVIDIA GeForce RTX 3070 | 2048x2048 | 1.217 | 96.054 | 1.482 | 10.6941 | +| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.223 | 37.2751 | 0.327 | 10.6941 | + + +
+ +
SDXL + +| GPU | Resolution | Time (seconds) | Memory Consumed (%) | Tiled Time (seconds) | Tiled Memory (%) | +|:------------------------------|:-------------|-----------------:|----------------------:|-----------------------:|-------------------:| +| NVIDIA GeForce RTX 4090 | 512x512 | 0.029 | 4.95707 | 0.029 | 4.95707 | +| NVIDIA GeForce RTX 4090 | 256x256 | 0.007 | 2.29666 | 0.007 | 2.29666 | +| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.873 | 66.3452 | 0.863 | 15.5649 | +| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.142 | 15.5479 | 0.143 | 15.5479 | +| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.044 | 7.46735 | 0.044 | 7.46735 | +| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.01 | 3.4597 | 0.01 | 3.4597 | +| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 1.317 | 87.1615 | 1.291 | 23.447 | +| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.213 | 23.4215 | 0.214 | 23.4215 | +| NVIDIA GeForce RTX 3090 | 512x512 | 0.058 | 5.65638 | 0.058 | 5.65638 | +| NVIDIA GeForce RTX 3090 | 256x256 | 0.016 | 2.45081 | 0.016 | 2.45081 | +| NVIDIA GeForce RTX 3090 | 2048x2048 | 1.755 | 77.8239 | 1.614 | 18.4193 | +| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.265 | 18.4023 | 0.265 | 18.4023 | +| NVIDIA GeForce RTX 3080 | 512x512 | 0.064 | 13.6568 | 0.064 | 13.6568 | +| NVIDIA GeForce RTX 3080 | 256x256 | 0.018 | 5.91728 | 0.018 | 5.91728 | +| NVIDIA GeForce RTX 3080 | 2048x2048 | OOM | OOM | 1.866 | 44.4717 | +| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.302 | 44.4308 | 0.302 | 44.4308 | +| NVIDIA GeForce RTX 3070 | 512x512 | 0.093 | 17.1465 | 0.093 | 17.1465 | +| NVIDIA GeForce RTX 3070 | 256x256 | 0.025 | 7.42931 | 0.026 | 7.42931 | +| NVIDIA GeForce RTX 3070 | 2048x2048 | OOM | OOM | 2.674 | 55.8355 | +| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.443 | 55.7841 | 0.443 | 55.7841 | + +
+ +## Available VAEs + +| | **Endpoint** | **Model** | +|:-:|:-----------:|:--------:| +| **Stable Diffusion v1** | [https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud](https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) | +| **Stable Diffusion XL** | [https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud](https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) | +| **Flux** | [https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud](https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) | + + +> [!TIP] +> Model support can be requested [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml). + + +## Code + +> [!TIP] +> Install `diffusers` from `main` to run the code: `pip install git+https://github.com/huggingface/diffusers@main` + + +A helper method simplifies interacting with Hybrid Inference. + +```python +from diffusers.utils.remote_utils import remote_encode +``` + +### Basic example + +Let's encode an image, then decode it to demonstrate. + +
+ +
+ +
Code
+
+```python
+from diffusers.utils import load_image
+from diffusers.utils.remote_utils import remote_decode, remote_encode
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg?download=true")
+
+latent = remote_encode(
+    endpoint="https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud/",
+    image=image,
+    scaling_factor=0.3611,
+    shift_factor=0.1159,
+)
+
+decoded = remote_decode(
+    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    scaling_factor=0.3611,
+    shift_factor=0.1159,
+)
+```
+
+
+ +
+ +
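+As a quick check (continuing from the example above), the decoded output behaves like any other `PIL.Image`, so you can inspect or save it locally. The filename below is only an example.
+
+```python
+# `image` and `decoded` come from the previous snippet
+print(image.size, decoded.size)
+decoded.save("astronaut_roundtrip.jpg")
+```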
+
+
+### Generation
+
+Now let's look at a generation example: we'll remotely encode an input image, generate new latents, and then remotely decode them too!
+
+
Code + +```python +import torch +from diffusers import StableDiffusionImg2ImgPipeline +from diffusers.utils import load_image +from diffusers.utils.remote_utils import remote_decode, remote_encode + +pipe = StableDiffusionImg2ImgPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + torch_dtype=torch.float16, + variant="fp16", + vae=None, +).to("cuda") + +init_image = load_image( + "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" +) +init_image = init_image.resize((768, 512)) + +init_latent = remote_encode( + endpoint="https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud/", + image=init_image, + scaling_factor=0.18215, +) + +prompt = "A fantasy landscape, trending on artstation" +latent = pipe( + prompt=prompt, + image=init_latent, + strength=0.75, + output_type="latent", +).images + +image = remote_decode( + endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/", + tensor=latent, + scaling_factor=0.18215, +) +image.save("fantasy_landscape.jpg") +``` + +
+ +
+ +
+ +## Integrations + +* **[SD.Next](https://github.com/vladmandic/sdnext):** All-in-one UI with direct supports Hybrid Inference. +* **[ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae):** ComfyUI node for Hybrid Inference. diff --git a/docs/source/en/index.md b/docs/source/en/index.md index 957d90786dd7..0aca1d22c142 100644 --- a/docs/source/en/index.md +++ b/docs/source/en/index.md @@ -1,4 +1,4 @@ - + +# AutoPipelineBlocks + +[`~modular_pipelines.AutoPipelineBlocks`] are a multi-block type containing blocks that support different workflows. It automatically selects which sub-blocks to run based on the input provided at runtime. This is typically used to package multiple workflows - text-to-image, image-to-image, inpaint - into a single pipeline for convenience. + +This guide shows how to create [`~modular_pipelines.AutoPipelineBlocks`]. + +Create three [`~modular_pipelines.ModularPipelineBlocks`] for text-to-image, image-to-image, and inpainting. These represent the different workflows available in the pipeline. + + + + +```py +import torch +from diffusers.modular_pipelines import ModularPipelineBlocks, InputParam, OutputParam + +class TextToImageBlock(ModularPipelineBlocks): + model_name = "text2img" + + @property + def inputs(self): + return [InputParam(name="prompt")] + + @property + def intermediate_outputs(self): + return [] + + @property + def description(self): + return "I'm a text-to-image workflow!" + + def __call__(self, components, state): + block_state = self.get_block_state(state) + print("running the text-to-image workflow") + # Add your text-to-image logic here + # For example: generate image from prompt + self.set_block_state(state, block_state) + return components, state +``` + + + + + +```py +class ImageToImageBlock(ModularPipelineBlocks): + model_name = "img2img" + + @property + def inputs(self): + return [InputParam(name="prompt"), InputParam(name="image")] + + @property + def intermediate_outputs(self): + return [] + + @property + def description(self): + return "I'm an image-to-image workflow!" + + def __call__(self, components, state): + block_state = self.get_block_state(state) + print("running the image-to-image workflow") + # Add your image-to-image logic here + # For example: transform input image based on prompt + self.set_block_state(state, block_state) + return components, state +``` + + + + + +```py +class InpaintBlock(ModularPipelineBlocks): + model_name = "inpaint" + + @property + def inputs(self): + return [InputParam(name="prompt"), InputParam(name="image"), InputParam(name="mask")] + + @property + def intermediate_outputs(self): + return [] + + @property + def description(self): + return "I'm an inpaint workflow!" + + def __call__(self, components, state): + block_state = self.get_block_state(state) + print("running the inpaint workflow") + # Add your inpainting logic here + # For example: fill masked areas based on prompt + self.set_block_state(state, block_state) + return components, state +``` + + + + +Create an [`~modular_pipelines.AutoPipelineBlocks`] class that includes a list of the sub-block classes and their corresponding block names. + +You also need to include `block_trigger_inputs`, a list of input names that trigger the corresponding block. If a trigger input is provided at runtime, then that block is selected to run. Use `None` to specify the default block to run if no trigger inputs are detected. 
+
+Lastly, it is important to include a `description` that clearly explains which inputs trigger which workflow. This helps users understand how to run specific workflows.
+
+```py
+from diffusers.modular_pipelines import AutoPipelineBlocks
+
+class AutoImageBlocks(AutoPipelineBlocks):
+    # List of sub-block classes to choose from
+    block_classes = [InpaintBlock, ImageToImageBlock, TextToImageBlock]
+    # Names for each block in the same order
+    block_names = ["inpaint", "img2img", "text2img"]
+    # Trigger inputs that determine which block to run
+    # - "mask" triggers inpaint workflow
+    # - "image" triggers img2img workflow (but only if mask is not provided)
+    # - if none of above, runs the text2img workflow (default)
+    block_trigger_inputs = ["mask", "image", None]
+    # Description is extremely important for AutoPipelineBlocks
+    @property
+    def description(self):
+        return (
+            "Pipeline generates images given different types of conditions!\n"
+            + "This is an auto pipeline block that works for text2img, img2img and inpainting tasks.\n"
+            + " - inpaint workflow is run when `mask` is provided.\n"
+            + " - img2img workflow is run when `image` is provided (but only when `mask` is not provided).\n"
+            + " - text2img workflow is run when neither `image` nor `mask` is provided.\n"
+        )
+```
+
+It is **very** important to include a `description` to avoid any confusion over how to run a block and what inputs are required. While [`~modular_pipelines.AutoPipelineBlocks`] are convenient, its conditional logic may be difficult to figure out if it isn't properly explained.
+
+Create an instance of `AutoImageBlocks`.
+
+```py
+auto_blocks = AutoImageBlocks()
+```
+
+For more complex compositions, such as nested [`~modular_pipelines.AutoPipelineBlocks`] blocks when they're used as sub-blocks in larger pipelines, use the [`~modular_pipelines.SequentialPipelineBlocks.get_execution_blocks`] method to extract the block that is actually run based on your input.
+
+```py
+auto_blocks.get_execution_blocks("mask")
+```
\ No newline at end of file
diff --git a/docs/source/en/modular_diffusers/components_manager.md b/docs/source/en/modular_diffusers/components_manager.md
new file mode 100644
index 000000000000..af53411b9533
--- /dev/null
+++ b/docs/source/en/modular_diffusers/components_manager.md
@@ -0,0 +1,190 @@
+
+
+# ComponentsManager
+
+The [`ComponentsManager`] is a model registry and management system for Modular Diffusers. It adds and tracks models, stores useful metadata (model size, device placement, adapters), prevents duplicate model instances, and supports offloading.
+
+This guide will show you how to use [`ComponentsManager`] to manage components and device memory.
+
+## Add a component
+
+The [`ComponentsManager`] should be created alongside a [`ModularPipeline`] in either [`~ModularPipeline.from_pretrained`] or [`~ModularPipelineBlocks.init_pipeline`].
+
+> [!TIP]
+> The `collection` parameter is optional but makes it easier to organize and manage components.
+ + + + +```py +from diffusers import ModularPipeline, ComponentsManager + +comp = ComponentsManager() +pipe = ModularPipeline.from_pretrained("YiYiXu/modular-demo-auto", components_manager=comp, collection="test1") +``` + + + + +```py +from diffusers import ComponentsManager +from diffusers.modular_pipelines import SequentialPipelineBlocks +from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS + +t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS) + +modular_repo_id = "YiYiXu/modular-loader-t2i-0704" +components = ComponentsManager() +t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id, components_manager=components) +``` + + + + +Components are only loaded and registered when using [`~ModularPipeline.load_components`] or [`~ModularPipeline.load_components`]. The example below uses [`~ModularPipeline.load_components`] to create a second pipeline that reuses all the components from the first one, and assigns it to a different collection + +```py +pipe.load_components() +pipe2 = ModularPipeline.from_pretrained("YiYiXu/modular-demo-auto", components_manager=comp, collection="test2") +``` + +Use the [`~ModularPipeline.null_component_names`] property to identify any components that need to be loaded, retrieve them with [`~ComponentsManager.get_components_by_names`], and then call [`~ModularPipeline.update_components`] to add the missing components. + +```py +pipe2.null_component_names +['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'image_encoder', 'unet', 'vae', 'scheduler', 'controlnet'] + +comp_dict = comp.get_components_by_names(names=pipe2.null_component_names) +pipe2.update_components(**comp_dict) +``` + +To add individual components, use the [`~ComponentsManager.add`] method. This registers a component with a unique id. + +```py +from diffusers import AutoModel + +text_encoder = AutoModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder") +component_id = comp.add("text_encoder", text_encoder) +comp +``` + +Use [`~ComponentsManager.remove`] to remove a component using their id. + +```py +comp.remove("text_encoder_139917733042864") +``` + +## Retrieve a component + +The [`ComponentsManager`] provides several methods to retrieve registered components. + +### get_one + +The [`~ComponentsManager.get_one`] method returns a single component and supports pattern matching for the `name` parameter. If multiple components match, [`~ComponentsManager.get_one`] returns an error. + +| Pattern | Example | Description | +|-------------|----------------------------------|-------------------------------------------| +| exact | `comp.get_one(name="unet")` | exact name match | +| wildcard | `comp.get_one(name="unet*")` | names starting with "unet" | +| exclusion | `comp.get_one(name="!unet")` | exclude components named "unet" | +| or | `comp.get_one(name="unet|vae")` | name is "unet" or "vae" | + +[`~ComponentsManager.get_one`] also filters components by the `collection` argument or `load_id` argument. + +```py +comp.get_one(name="unet", collection="sdxl") +``` + +### get_components_by_names + +The [`~ComponentsManager.get_components_by_names`] method accepts a list of names and returns a dictionary mapping names to components. This is especially useful with [`ModularPipeline`] since they provide lists of required component names and the returned dictionary can be passed directly to [`~ModularPipeline.update_components`]. 
+
+```py
+component_dict = comp.get_components_by_names(names=["text_encoder", "unet", "vae"])
+{"text_encoder": component1, "unet": component2, "vae": component3}
+```
+
+## Duplicate detection
+
+It is recommended to load model components with [`ComponentSpec`] to assign components with a unique id that encodes their loading parameters. This allows [`ComponentsManager`] to automatically detect and prevent duplicate model instances even when different objects represent the same underlying checkpoint.
+
+```py
+from diffusers import AutoModel, ComponentSpec, ComponentsManager
+from transformers import CLIPTextModel
+
+comp = ComponentsManager()
+
+# Create ComponentSpec for the first text encoder
+spec = ComponentSpec(name="text_encoder", repo="stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder", type_hint=AutoModel)
+# Create ComponentSpec for a duplicate text encoder (it's the same checkpoint, from the same repo/subfolder)
+spec_duplicated = ComponentSpec(name="text_encoder_duplicated", repo="stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder", type_hint=CLIPTextModel)
+
+# Load and add both components - the manager will detect they're the same model
+comp.add("text_encoder", spec.load())
+comp.add("text_encoder_duplicated", spec_duplicated.load())
+```
+
+This returns a warning with instructions for removing the duplicate.
+
+```py
+ComponentsManager: adding component 'text_encoder_duplicated_139917580682672', but it has duplicate load_id 'stabilityai/stable-diffusion-xl-base-1.0|text_encoder|null|null' with existing components: text_encoder_139918506246832. To remove a duplicate, call `components_manager.remove('')`.
+'text_encoder_duplicated_139917580682672'
+```
+
+You could also add a component without using [`ComponentSpec`] and duplicate detection still works in most cases even if you're adding the same component under a different name.
+
+However, [`ComponentsManager`] can't detect duplicates when you load the same component into different objects. In this case, you should load a model with [`ComponentSpec`].
+
+```py
+text_encoder_2 = AutoModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder")
+comp.add("text_encoder", text_encoder_2)
+'text_encoder_139917732983664'
+```
+
+## Collections
+
+Collections are labels assigned to components for better organization and management. Add a component to a collection with the `collection` argument in [`~ComponentsManager.add`].
+
+Only one component per name is allowed in each collection. Adding a second component with the same name automatically removes the first component.
+
+```py
+from diffusers import AutoModel, ComponentSpec, ComponentsManager
+
+comp = ComponentsManager()
+# Create ComponentSpec for the first UNet
+spec = ComponentSpec(name="unet", repo="stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", type_hint=AutoModel)
+# Create ComponentSpec for a different UNet
+spec2 = ComponentSpec(name="unet", repo="RunDiffusion/Juggernaut-XL-v9", subfolder="unet", type_hint=AutoModel, variant="fp16")
+
+# Add both UNets to the same collection - the second one will replace the first
+comp.add("unet", spec.load(), collection="sdxl")
+comp.add("unet", spec2.load(), collection="sdxl")
+```
+
+This makes it convenient to work with node-based systems (see the sketch after this list) because you can:
+
+- Mark all models as loaded from one node with the `collection` label.
+- Automatically replace models when new checkpoints are loaded under the same name.
+- Batch delete all models in a collection when a node is removed.
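+The sketch below ties these ideas together with a hypothetical UI node backed by its own collection. It only uses the `add`, `get_one`, and `remove` methods shown above; the collection name and checkpoints are examples, and `add` is assumed to return the component id as in the earlier snippet.
+
+```python
+from diffusers import AutoModel, ComponentsManager
+
+comp = ComponentsManager()
+
+# Register the UNet loaded by a node under that node's collection
+unet_id = comp.add(
+    "unet",
+    AutoModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"),
+    collection="node1",
+)
+
+# Loading a different checkpoint under the same name in the same collection
+# automatically replaces the previous UNet
+unet_id = comp.add(
+    "unet",
+    AutoModel.from_pretrained("RunDiffusion/Juggernaut-XL-v9", subfolder="unet", variant="fp16"),
+    collection="node1",
+)
+
+# Retrieve the UNet that belongs to this node
+node1_unet = comp.get_one(name="unet", collection="node1")
+
+# When the node is deleted, remove its component by id
+comp.remove(unet_id)
+```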
+ +## Offloading + +The [`~ComponentsManager.enable_auto_cpu_offload`] method is a global offloading strategy that works across all models regardless of which pipeline is using them. Once enabled, you don't need to worry about device placement if you add or remove components. + +```py +comp.enable_auto_cpu_offload(device="cuda") +``` + +All models begin on the CPU and [`ComponentsManager`] moves them to the appropriate device right before they're needed, and moves other models back to the CPU when GPU memory is low. + +You can set your own rules for which models to offload first. diff --git a/docs/source/en/modular_diffusers/guiders.md b/docs/source/en/modular_diffusers/guiders.md new file mode 100644 index 000000000000..fd0d27844205 --- /dev/null +++ b/docs/source/en/modular_diffusers/guiders.md @@ -0,0 +1,175 @@ + + +# Guiders + +[Classifier-free guidance](https://huggingface.co/papers/2207.12598) steers model generation that better match a prompt and is commonly used to improve generation quality, control, and adherence to prompts. There are different types of guidance methods, and in Diffusers, they are known as *guiders*. Like blocks, it is easy to switch and use different guiders for different use cases without rewriting the pipeline. + +This guide will show you how to switch guiders, adjust guider parameters, and load and share them to the Hub. + +## Switching guiders + +[`ClassifierFreeGuidance`] is the default guider and created when a pipeline is initialized with [`~ModularPipelineBlocks.init_pipeline`]. It is created by `from_config` which means it doesn't require loading specifications from a modular repository. A guider won't be listed in `modular_model_index.json`. + +Use [`~ModularPipeline.get_component_spec`] to inspect a guider. + +```py +t2i_pipeline.get_component_spec("guider") +ComponentSpec(name='guider', type_hint=, description=None, config=FrozenDict([('guidance_scale', 7.5), ('guidance_rescale', 0.0), ('use_original_formulation', False), ('start', 0.0), ('stop', 1.0), ('_use_default_values', ['start', 'guidance_rescale', 'stop', 'use_original_formulation'])]), repo=None, subfolder=None, variant=None, revision=None, default_creation_method='from_config') +``` + +Switch to a different guider by passing the new guider to [`~ModularPipeline.update_components`]. + +> [!TIP] +> Changing guiders will return text letting you know you're changing the guider type. +> ```bash +> ModularPipeline.update_components: adding guider with new type: PerturbedAttentionGuidance, previous type: ClassifierFreeGuidance +> ``` + +```py +from diffusers import LayerSkipConfig, PerturbedAttentionGuidance + +config = LayerSkipConfig(indices=[2, 9], fqn="mid_block.attentions.0.transformer_blocks", skip_attention=False, skip_attention_scores=True, skip_ff=False) +guider = PerturbedAttentionGuidance( + guidance_scale=5.0, perturbed_guidance_scale=2.5, perturbed_guidance_config=config +) +t2i_pipeline.update_components(guider=guider) +``` + +Use [`~ModularPipeline.get_component_spec`] again to verify the guider type is different. 
+ +```py +t2i_pipeline.get_component_spec("guider") +ComponentSpec(name='guider', type_hint=, description=None, config=FrozenDict([('guidance_scale', 5.0), ('perturbed_guidance_scale', 2.5), ('perturbed_guidance_start', 0.01), ('perturbed_guidance_stop', 0.2), ('perturbed_guidance_layers', None), ('perturbed_guidance_config', LayerSkipConfig(indices=[2, 9], fqn='mid_block.attentions.0.transformer_blocks', skip_attention=False, skip_attention_scores=True, skip_ff=False, dropout=1.0)), ('guidance_rescale', 0.0), ('use_original_formulation', False), ('start', 0.0), ('stop', 1.0), ('_use_default_values', ['perturbed_guidance_start', 'use_original_formulation', 'perturbed_guidance_layers', 'stop', 'start', 'guidance_rescale', 'perturbed_guidance_stop']), ('_class_name', 'PerturbedAttentionGuidance'), ('_diffusers_version', '0.35.0.dev0')]), repo=None, subfolder=None, variant=None, revision=None, default_creation_method='from_config') +``` + +## Loading custom guiders + +Guiders that are already saved on the Hub with a `modular_model_index.json` file are considered a `from_pretrained` component now instead of a `from_config` component. + +```json +{ + "guider": [ + null, + null, + { + "repo": "YiYiXu/modular-loader-t2i-guider", + "revision": null, + "subfolder": "pag_guider", + "type_hint": [ + "diffusers", + "PerturbedAttentionGuidance" + ], + "variant": null + } + ] +} +``` + +The guider is only created after calling [`~ModularPipeline.load_components`] based on the loading specification in `modular_model_index.json`. + +```py +t2i_pipeline = t2i_blocks.init_pipeline("YiYiXu/modular-doc-guider") +# not created during init +assert t2i_pipeline.guider is None +t2i_pipeline.load_components() +# loaded as PAG guider +t2i_pipeline.guider +``` + + +## Changing guider parameters + +The guider parameters can be adjusted with either the [`~ComponentSpec.create`] method or with [`~ModularPipeline.update_components`]. The example below changes the `guidance_scale` value. + + + + +```py +guider_spec = t2i_pipeline.get_component_spec("guider") +guider = guider_spec.create(guidance_scale=10) +t2i_pipeline.update_components(guider=guider) +``` + + + + +```py +guider_spec = t2i_pipeline.get_component_spec("guider") +guider_spec.config["guidance_scale"] = 10 +t2i_pipeline.update_components(guider=guider_spec) +``` + + + + +## Uploading custom guiders + +Call the [`~utils.PushToHubMixin.push_to_hub`] method on a custom guider to share it to the Hub. + +```py +guider.push_to_hub("YiYiXu/modular-loader-t2i-guider", subfolder="pag_guider") +``` + +To make this guider available to the pipeline, either modify the `modular_model_index.json` file or use the [`~ModularPipeline.update_components`] method. + + + + +Edit the `modular_model_index.json` file and add a loading specification for the guider by pointing to a folder containing the guider config. + +```json +{ + "guider": [ + "diffusers", + "PerturbedAttentionGuidance", + { + "repo": "YiYiXu/modular-loader-t2i-guider", + "revision": null, + "subfolder": "pag_guider", + "type_hint": [ + "diffusers", + "PerturbedAttentionGuidance" + ], + "variant": null + } + ], +``` + + + + +Change the [`~ComponentSpec.default_creation_method`] to `from_pretrained` and use [`~ModularPipeline.update_components`] to update the guider and component specifications as well as the pipeline config. + +> [!TIP] +> Changing the creation method will return text letting you know you're changing the creation type to `from_pretrained`. 
+> ```bash +> ModularPipeline.update_components: changing the default_creation_method of guider from from_config to from_pretrained. +> ``` + +```py +guider_spec = t2i_pipeline.get_component_spec("guider") +guider_spec.default_creation_method="from_pretrained" +guider_spec.repo="YiYiXu/modular-loader-t2i-guider" +guider_spec.subfolder="pag_guider" +pag_guider = guider_spec.load() +t2i_pipeline.update_components(guider=pag_guider) +``` + +To make it the default guider for a pipeline, call [`~utils.PushToHubMixin.push_to_hub`]. This is an optional step and not necessary if you are only experimenting locally. + +```py +t2i_pipeline.push_to_hub("YiYiXu/modular-doc-guider") +``` + + + diff --git a/docs/source/en/modular_diffusers/loop_sequential_pipeline_blocks.md b/docs/source/en/modular_diffusers/loop_sequential_pipeline_blocks.md new file mode 100644 index 000000000000..86c82b5145d3 --- /dev/null +++ b/docs/source/en/modular_diffusers/loop_sequential_pipeline_blocks.md @@ -0,0 +1,93 @@ + + +# LoopSequentialPipelineBlocks + +[`~modular_pipelines.LoopSequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a loop. Data flows circularly, using `intermediate_inputs` and `intermediate_outputs`, and each block is run iteratively. This is typically used to create a denoising loop which is iterative by default. + +This guide shows you how to create [`~modular_pipelines.LoopSequentialPipelineBlocks`]. + +## Loop wrapper + +[`~modular_pipelines.LoopSequentialPipelineBlocks`], is also known as the *loop wrapper* because it defines the loop structure, iteration variables, and configuration. Within the loop wrapper, you need the following variables. + +- `loop_inputs` are user provided values and equivalent to [`~modular_pipelines.ModularPipelineBlocks.inputs`]. +- `loop_intermediate_inputs` are intermediate variables from the [`~modular_pipelines.PipelineState`] and equivalent to [`~modular_pipelines.ModularPipelineBlocks.intermediate_inputs`]. +- `loop_intermediate_outputs` are new intermediate variables created by the block and added to the [`~modular_pipelines.PipelineState`]. It is equivalent to [`~modular_pipelines.ModularPipelineBlocks.intermediate_outputs`]. +- `__call__` method defines the loop structure and iteration logic. + +```py +import torch +from diffusers.modular_pipelines import LoopSequentialPipelineBlocks, ModularPipelineBlocks, InputParam, OutputParam + +class LoopWrapper(LoopSequentialPipelineBlocks): + model_name = "test" + @property + def description(self): + return "I'm a loop!!" + @property + def loop_inputs(self): + return [InputParam(name="num_steps")] + @torch.no_grad() + def __call__(self, components, state): + block_state = self.get_block_state(state) + # Loop structure - can be customized to your needs + for i in range(block_state.num_steps): + # loop_step executes all registered blocks in sequence + components, block_state = self.loop_step(components, block_state, i=i) + self.set_block_state(state, block_state) + return components, state +``` + +The loop wrapper can pass additional arguments, like current iteration index, to the loop blocks. + +## Loop blocks + +A loop block is a [`~modular_pipelines.ModularPipelineBlocks`], but the `__call__` method behaves differently. + +- It recieves the iteration variable from the loop wrapper. +- It works directly with the [`~modular_pipelines.BlockState`] instead of the [`~modular_pipelines.PipelineState`]. 
+- It doesn't require retrieving or updating the [`~modular_pipelines.BlockState`]. + +Loop blocks share the same [`~modular_pipelines.BlockState`] to allow values to accumulate and change for each iteration in the loop. + +```py +class LoopBlock(ModularPipelineBlocks): + model_name = "test" + @property + def inputs(self): + return [InputParam(name="x")] + @property + def intermediate_outputs(self): + # outputs produced by this block + return [OutputParam(name="x")] + @property + def description(self): + return "I'm a block used inside the `LoopWrapper` class" + def __call__(self, components, block_state, i: int): + block_state.x += 1 + return components, block_state +``` + +## LoopSequentialPipelineBlocks + +Use the [`~modular_pipelines.LoopSequentialPipelineBlocks.from_blocks_dict`] method to add the loop block to the loop wrapper to create [`~modular_pipelines.LoopSequentialPipelineBlocks`]. + +```py +loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock}) +``` + +Add more loop blocks to run within each iteration with [`~modular_pipelines.LoopSequentialPipelineBlocks.from_blocks_dict`]. This allows you to modify the blocks without changing the loop logic itself. + +```py +loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock(), "block2": LoopBlock}) +``` \ No newline at end of file diff --git a/docs/source/en/modular_diffusers/modular_diffusers_states.md b/docs/source/en/modular_diffusers/modular_diffusers_states.md new file mode 100644 index 000000000000..eb55b524e491 --- /dev/null +++ b/docs/source/en/modular_diffusers/modular_diffusers_states.md @@ -0,0 +1,75 @@ + + +# States + +Blocks rely on the [`~modular_pipelines.PipelineState`] and [`~modular_pipelines.BlockState`] data structures for communicating and sharing data. + +| State | Description | +|-------|-------------| +| [`~modular_pipelines.PipelineState`] | Maintains the overall data required for a pipeline's execution and allows blocks to read and update its data. | +| [`~modular_pipelines.BlockState`] | Allows each block to perform its computation with the necessary data from `inputs`| + +This guide explains how states work and how they connect blocks. + +## PipelineState + +The [`~modular_pipelines.PipelineState`] is a global state container for all blocks. It maintains the complete runtime state of the pipeline and provides a structured way for blocks to read from and write to shared data. + +There are two dict's in [`~modular_pipelines.PipelineState`] for structuring data. + +- The `values` dict is a **mutable** state containing a copy of user provided input values and intermediate output values generated by blocks. If a block modifies an `input`, it will be reflected in the `values` dict after calling `set_block_state`. + +```py +PipelineState( + values={ + 'prompt': 'a cat' + 'guidance_scale': 7.0 + 'num_inference_steps': 25 + 'prompt_embeds': Tensor(dtype=torch.float32, shape=torch.Size([1, 1, 1, 1])) + 'negative_prompt_embeds': None + }, +) +``` + +## BlockState + +The [`~modular_pipelines.BlockState`] is a local view of the relevant variables an individual block needs from [`~modular_pipelines.PipelineState`] for performing it's computations. + +Access these variables directly as attributes like `block_state.image`. + +```py +BlockState( + image: +) +``` + +When a block's `__call__` method is executed, it retrieves the [`BlockState`] with `self.get_block_state(state)`, performs it's operations, and updates [`~modular_pipelines.PipelineState`] with `self.set_block_state(state, block_state)`. 
+ +```py +def __call__(self, components, state): + # retrieve BlockState + block_state = self.get_block_state(state) + + # computation logic on inputs + + # update PipelineState + self.set_block_state(state, block_state) + return components, state +``` + +## State interaction + +[`~modular_pipelines.PipelineState`] and [`~modular_pipelines.BlockState`] interaction is defined by a block's `inputs`, and `intermediate_outputs`. + +- `inputs`, a block can modify an input - like `block_state.image` - and this change can be propagated globally to [`~modular_pipelines.PipelineState`] by calling `set_block_state`. +- `intermediate_outputs`, is a new variable that a block creates. It is added to the [`~modular_pipelines.PipelineState`]'s `values` dict and is available as for subsequent blocks or accessed by users as a final output from the pipeline. diff --git a/docs/source/en/modular_diffusers/modular_pipeline.md b/docs/source/en/modular_diffusers/modular_pipeline.md new file mode 100644 index 000000000000..0e0a7bd75d51 --- /dev/null +++ b/docs/source/en/modular_diffusers/modular_pipeline.md @@ -0,0 +1,358 @@ + + +# ModularPipeline + +[`ModularPipeline`] converts [`~modular_pipelines.ModularPipelineBlocks`]'s into an executable pipeline that loads models and performs the computation steps defined in the block. It is the main interface for running a pipeline and it is very similar to the [`DiffusionPipeline`] API. + +The main difference is to include an expected `output` argument in the pipeline. + + + + +```py +import torch +from diffusers.modular_pipelines import SequentialPipelineBlocks +from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS + +blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS) + +modular_repo_id = "YiYiXu/modular-loader-t2i-0704" +pipeline = blocks.init_pipeline(modular_repo_id) + +pipeline.load_components(torch_dtype=torch.float16) +pipeline.to("cuda") + +image = pipeline(prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", output="images")[0] +image.save("modular_t2i_out.png") +``` + + + + +```py +import torch +from diffusers.modular_pipelines import SequentialPipelineBlocks +from diffusers.modular_pipelines.stable_diffusion_xl import IMAGE2IMAGE_BLOCKS + +blocks = SequentialPipelineBlocks.from_blocks_dict(IMAGE2IMAGE_BLOCKS) + +modular_repo_id = "YiYiXu/modular-loader-t2i-0704" +pipeline = blocks.init_pipeline(modular_repo_id) + +pipeline.load_components(torch_dtype=torch.float16) +pipeline.to("cuda") + +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +init_image = load_image(url) +prompt = "a dog catching a frisbee in the jungle" +image = pipeline(prompt=prompt, image=init_image, strength=0.8, output="images")[0] +image.save("modular_i2i_out.png") +``` + + + + +```py +import torch +from diffusers.modular_pipelines import SequentialPipelineBlocks +from diffusers.modular_pipelines.stable_diffusion_xl import INPAINT_BLOCKS +from diffusers.utils import load_image + +blocks = SequentialPipelineBlocks.from_blocks_dict(INPAINT_BLOCKS) + +modular_repo_id = "YiYiXu/modular-loader-t2i-0704" +pipeline = blocks.init_pipeline(modular_repo_id) + +pipeline.load_components(torch_dtype=torch.float16) +pipeline.to("cuda") + +img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +mask_url = 
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" + +init_image = load_image(img_url) +mask_image = load_image(mask_url) + +prompt = "A deep sea diver floating" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, output="images")[0] +image.save("moduar_inpaint_out.png") +``` + + + + +This guide will show you how to create a [`ModularPipeline`] and manage the components in it. + +## Adding blocks + +Blocks are [`InsertableDict`] objects that can be inserted at specific positions, providing a flexible way to mix-and-match blocks. + +Use [`~modular_pipelines.modular_pipeline_utils.InsertableDict.insert`] on either the block class or `sub_blocks` attribute to add a block. + +```py +# BLOCKS is dict of block classes, you need to add class to it +BLOCKS.insert("block_name", BlockClass, index) +# sub_blocks attribute contains instance, add a block instance to the attribute +t2i_blocks.sub_blocks.insert("block_name", block_instance, index) +``` + +Use [`~modular_pipelines.modular_pipeline_utils.InsertableDict.pop`] on either the block class or `sub_blocks` attribute to remove a block. + +```py +# remove a block class from preset +BLOCKS.pop("text_encoder") +# split out a block instance on its own +text_encoder_block = t2i_blocks.sub_blocks.pop("text_encoder") +``` + +Swap blocks by setting the existing block to the new block. + +```py +# Replace block class in preset +BLOCKS["prepare_latents"] = CustomPrepareLatents +# Replace in sub_blocks attribute using an block instance +t2i_blocks.sub_blocks["prepare_latents"] = CustomPrepareLatents() +``` + +## Creating a pipeline + +There are two ways to create a [`ModularPipeline`]. Assemble and create a pipeline from [`ModularPipelineBlocks`] or load an existing pipeline with [`~ModularPipeline.from_pretrained`]. + +You should also initialize a [`ComponentsManager`] to handle device placement and memory and component management. + +> [!TIP] +> Refer to the [ComponentsManager](./components_manager) doc for more details about how it can help manage components across different workflows. + + + + +Use the [`~ModularPipelineBlocks.init_pipeline`] method to create a [`ModularPipeline`] from the component and configuration specifications. This method loads the *specifications* from a `modular_model_index.json` file, but it doesn't load the *models* yet. + +```py +from diffusers import ComponentsManager +from diffusers.modular_pipelines import SequentialPipelineBlocks +from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS + +t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS) + +modular_repo_id = "YiYiXu/modular-loader-t2i-0704" +components = ComponentsManager() +t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id, components_manager=components) +``` + + + + +The [`~ModularPipeline.from_pretrained`] method creates a [`ModularPipeline`] from a modular repository on the Hub. + +```py +from diffusers import ModularPipeline, ComponentsManager + +components = ComponentsManager() +pipeline = ModularPipeline.from_pretrained("YiYiXu/modular-loader-t2i-0704", components_manager=components) +``` + +Add the `trust_remote_code` argument to load a custom [`ModularPipeline`]. 
+ +```py +from diffusers import ModularPipeline, ComponentsManager + +components = ComponentsManager() +modular_repo_id = "YiYiXu/modular-diffdiff-0704" +diffdiff_pipeline = ModularPipeline.from_pretrained(modular_repo_id, trust_remote_code=True, components_manager=components) +``` + + + + +## Loading components + +A [`ModularPipeline`] doesn't automatically instantiate with components. It only loads the configuration and component specifications. You can load all components with [`~ModularPipeline.load_components`] or only load specific components with [`~ModularPipeline.load_components`]. + + + + +```py +import torch + +t2i_pipeline.load_components(torch_dtype=torch.float16) +t2i_pipeline.to("cuda") +``` + + + + +The example below only loads the UNet and VAE. + +```py +import torch + +t2i_pipeline.load_components(names=["unet", "vae"], torch_dtype=torch.float16) +``` + + + + +Print the pipeline to inspect the loaded pretrained components. + +```py +t2i_pipeline +``` + +This should match the `modular_model_index.json` file from the modular repository a pipeline is initialized from. If a pipeline doesn't need a component, it won't be included even if it exists in the modular repository. + +To modify where components are loaded from, edit the `modular_model_index.json` file in the repository and change it to your desired loading path. The example below loads a UNet from a different repository. + +```json +# original +"unet": [ + null, null, + { + "repo": "stabilityai/stable-diffusion-xl-base-1.0", + "subfolder": "unet", + "variant": "fp16" + } +] + +# modified +"unet": [ + null, null, + { + "repo": "RunDiffusion/Juggernaut-XL-v9", + "subfolder": "unet", + "variant": "fp16" + } +] +``` + +### Component loading status + +The pipeline properties below provide more information about which components are loaded. + +Use `component_names` to return all expected components. + +```py +t2i_pipeline.component_names +['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'guider', 'scheduler', 'unet', 'vae', 'image_processor'] +``` + +Use `null_component_names` to return components that aren't loaded yet. Load these components with [`~ModularPipeline.from_pretrained`]. + +```py +t2i_pipeline.null_component_names +['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'scheduler'] +``` + +Use `pretrained_component_names` to return components that will be loaded from pretrained models. + +```py +t2i_pipeline.pretrained_component_names +['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'scheduler', 'unet', 'vae'] +``` + +Use `config_component_names` to return components that are created with the default config (not loaded from a modular repository). Components from a config aren't included because they are already initialized during pipeline creation. This is why they aren't listed in `null_component_names`. + +```py +t2i_pipeline.config_component_names +['guider', 'image_processor'] +``` + +## Updating components + +Components may be updated depending on whether it is a *pretrained component* or a *config component*. + +> [!WARNING] +> A component may change from pretrained to config when updating a component. The component type is initially defined in a block's `expected_components` field. + +A pretrained component is updated with [`ComponentSpec`] whereas a config component is updated by eihter passing the object directly or with [`ComponentSpec`]. 
+
+The [`ComponentSpec`] shows `default_creation_method="from_pretrained"` for a pretrained component and `default_creation_method="from_config"` for a config component.
+
+To update a pretrained component, create a [`ComponentSpec`] with the name of the component and where to load it from. Use the [`~ComponentSpec.load`] method to load the component.
+
+```py
+import torch
+from diffusers import ComponentSpec, UNet2DConditionModel
+
+unet_spec = ComponentSpec(name="unet", type_hint=UNet2DConditionModel, repo="stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", variant="fp16")
+unet2 = unet_spec.load(torch_dtype=torch.float16)
+```
+
+The [`~ModularPipeline.update_components`] method replaces the component with a new one.
+
+```py
+t2i_pipeline.update_components(unet=unet2)
+```
+
+When a component is updated, the loading specifications are also updated in the pipeline config.
+
+### Component extraction and modification
+
+When you use [`~ComponentSpec.load`], the new component maintains its loading specifications. This makes it possible to extract the specification and recreate the component.
+
+```py
+spec = ComponentSpec.from_component("unet", unet2)
+spec
+ComponentSpec(name='unet', type_hint=, description=None, config=None, repo='stabilityai/stable-diffusion-xl-base-1.0', subfolder='unet', variant='fp16', revision=None, default_creation_method='from_pretrained')
+unet2_recreated = spec.load(torch_dtype=torch.float16)
+```
+
+The [`~ModularPipeline.get_component_spec`] method gets a copy of the current component specification to modify or update.
+
+```py
+unet_spec = t2i_pipeline.get_component_spec("unet")
+unet_spec
+ComponentSpec(
+    name='unet',
+    type_hint=,
+    repo='RunDiffusion/Juggernaut-XL-v9',
+    subfolder='unet',
+    variant='fp16',
+    default_creation_method='from_pretrained'
+)
+
+# modify to load from a different repository
+unet_spec.repo = "stabilityai/stable-diffusion-xl-base-1.0"
+
+# load component with modified spec
+unet = unet_spec.load(torch_dtype=torch.float16)
+```
+
+## Modular repository
+
+A repository is required if the pipeline blocks use *pretrained components*. The repository supplies loading specifications and metadata.
+
+[`ModularPipeline`] specifically requires *modular repositories* (see [example repository](https://huggingface.co/YiYiXu/modular-diffdiff)) which are more flexible than a typical repository. It contains a `modular_model_index.json` file containing the following 3 elements.
+
+- `library` and `class` show which library the component was loaded from and its class. If `null`, the component hasn't been loaded yet.
+- `loading_specs_dict` contains the information required to load the component such as the repository and subfolder it is loaded from.
+
+Unlike standard repositories, a modular repository can fetch components from different repositories based on the `loading_specs_dict`. Components don't need to exist in the same repository.
+
+A modular repository may contain custom code for loading a [`ModularPipeline`]. This allows you to use specialized blocks that aren't native to Diffusers.
+
+```
+modular-diffdiff-0704/
+├── block.py                    # Custom pipeline blocks implementation
+├── config.json                 # Pipeline configuration and auto_map
+└── modular_model_index.json    # Component loading specifications
+```
+
+The [config.json](https://huggingface.co/YiYiXu/modular-diffdiff-0704/blob/main/config.json) file contains an `auto_map` key that points to where a custom block is defined in `block.py`.
+ +```json +{ + "_class_name": "DiffDiffBlocks", + "auto_map": { + "ModularPipelineBlocks": "block.DiffDiffBlocks" + } +} +``` diff --git a/docs/source/en/modular_diffusers/overview.md b/docs/source/en/modular_diffusers/overview.md new file mode 100644 index 000000000000..7d07c4b73434 --- /dev/null +++ b/docs/source/en/modular_diffusers/overview.md @@ -0,0 +1,41 @@ + + +# Overview + +> [!WARNING] +> Modular Diffusers is under active development and it's API may change. + +Modular Diffusers is a unified pipeline system that simplifies your workflow with *pipeline blocks*. + +- Blocks are reusable and you only need to create new blocks that are unique to your pipeline. +- Blocks can be mixed and matched to adapt to or create a pipeline for a specific workflow or multiple workflows. + +The Modular Diffusers docs are organized as shown below. + +## Quickstart + +- A [quickstart](./quickstart) demonstrating how to implement an example workflow with Modular Diffusers. + +## ModularPipelineBlocks + +- [States](./modular_diffusers_states) explains how data is shared and communicated between blocks and [`ModularPipeline`]. +- [ModularPipelineBlocks](./pipeline_block) is the most basic unit of a [`ModularPipeline`] and this guide shows you how to create one. +- [SequentialPipelineBlocks](./sequential_pipeline_blocks) is a type of block that chains multiple blocks so they run one after another, passing data along the chain. This guide shows you how to create [`~modular_pipelines.SequentialPipelineBlocks`] and how they connect and work together. +- [LoopSequentialPipelineBlocks](./loop_sequential_pipeline_blocks) is a type of block that runs a series of blocks in a loop. This guide shows you how to create [`~modular_pipelines.LoopSequentialPipelineBlocks`]. +- [AutoPipelineBlocks](./auto_pipeline_blocks) is a type of block that automatically chooses which blocks to run based on the input. This guide shows you how to create [`~modular_pipelines.AutoPipelineBlocks`]. + +## ModularPipeline + +- [ModularPipeline](./modular_pipeline) shows you how to create and convert pipeline blocks into an executable [`ModularPipeline`]. +- [ComponentsManager](./components_manager) shows you how to manage and reuse components across multiple pipelines. +- [Guiders](./guiders) shows you how to use different guidance methods in the pipeline. \ No newline at end of file diff --git a/docs/source/en/modular_diffusers/pipeline_block.md b/docs/source/en/modular_diffusers/pipeline_block.md new file mode 100644 index 000000000000..66d26b021456 --- /dev/null +++ b/docs/source/en/modular_diffusers/pipeline_block.md @@ -0,0 +1,115 @@ + + +# ModularPipelineBlocks + +[`~modular_pipelines.ModularPipelineBlocks`] is the basic block for building a [`ModularPipeline`]. It defines what components, inputs/outputs, and computation a block should perform for a specific step in a pipeline. A [`~modular_pipelines.ModularPipelineBlocks`] connects with other blocks, using [state](./modular_diffusers_states), to enable the modular construction of workflows. + +A [`~modular_pipelines.ModularPipelineBlocks`] on it's own can't be executed. It is a blueprint for what a step should do in a pipeline. To actually run and execute a pipeline, the [`~modular_pipelines.ModularPipelineBlocks`] needs to be converted into a [`ModularPipeline`]. + +This guide will show you how to create a [`~modular_pipelines.ModularPipelineBlocks`]. 
+ +## Inputs and outputs + +> [!TIP] +> Refer to the [States](./modular_diffusers_states) guide if you aren't familiar with how state works in Modular Diffusers. + +A [`~modular_pipelines.ModularPipelineBlocks`] requires `inputs`, and `intermediate_outputs`. + +- `inputs` are values provided by a user and retrieved from the [`~modular_pipelines.PipelineState`]. This is useful because some workflows resize an image, but the original image is still required. The [`~modular_pipelines.PipelineState`] maintains the original image. + + Use `InputParam` to define `inputs`. + + ```py + from diffusers.modular_pipelines import InputParam + + user_inputs = [ + InputParam(name="image", type_hint="PIL.Image", description="raw input image to process") + ] + ``` + +- `intermediate_inputs` are values typically created from a previous block but it can also be directly provided if no preceding block generates them. Unlike `inputs`, `intermediate_inputs` can be modified. + + Use `InputParam` to define `intermediate_inputs`. + + ```py + user_intermediate_inputs = [ + InputParam(name="processed_image", type_hint="torch.Tensor", description="image that has been preprocessed and normalized"), + ] + ``` + +- `intermediate_outputs` are new values created by a block and added to the [`~modular_pipelines.PipelineState`]. The `intermediate_outputs` are available as `intermediate_inputs` for subsequent blocks or available as the final output from running the pipeline. + + Use `OutputParam` to define `intermediate_outputs`. + + ```py + from diffusers.modular_pipelines import OutputParam + + user_intermediate_outputs = [ + OutputParam(name="image_latents", description="latents representing the image") + ] + ``` + +The intermediate inputs and outputs share data to connect blocks. They are accessible at any point, allowing you to track the workflow's progress. + +## Computation logic + +The computation a block performs is defined in the `__call__` method and it follows a specific structure. + +1. Retrieve the [`~modular_pipelines.BlockState`] to get a local view of the `inputs` and `intermediate_inputs`. +2. Implement the computation logic on the `inputs` and `intermediate_inputs`. +3. Update [`~modular_pipelines.PipelineState`] to push changes from the local [`~modular_pipelines.BlockState`] back to the global [`~modular_pipelines.PipelineState`]. +4. Return the components and state which becomes available to the next block. + +```py +def __call__(self, components, state): + # Get a local view of the state variables this block needs + block_state = self.get_block_state(state) + + # Your computation logic here + # block_state contains all your inputs and intermediate_inputs + # Access them like: block_state.image, block_state.processed_image + + # Update the pipeline state with your updated block_states + self.set_block_state(state, block_state) + return components, state +``` + +### Components and configs + +The components and pipeline-level configs a block needs are specified in [`ComponentSpec`] and [`~modular_pipelines.ConfigSpec`]. + +- [`ComponentSpec`] contains the expected components used by a block. You need the `name` of the component and ideally a `type_hint` that specifies exactly what the component is. +- [`~modular_pipelines.ConfigSpec`] contains pipeline-level settings that control behavior across all blocks. 
+ +```py +from diffusers import ComponentSpec, ConfigSpec + +expected_components = [ + ComponentSpec(name="unet", type_hint=UNet2DConditionModel), + ComponentSpec(name="scheduler", type_hint=EulerDiscreteScheduler) +] + +expected_config = [ + ConfigSpec("force_zeros_for_empty_prompt", True) +] +``` + +When the blocks are converted into a pipeline, the components become available to the block as the first argument in `__call__`. + +```py +def __call__(self, components, state): + # Access components using dot notation + unet = components.unet + vae = components.vae + scheduler = components.scheduler +``` \ No newline at end of file diff --git a/docs/source/en/modular_diffusers/quickstart.md b/docs/source/en/modular_diffusers/quickstart.md new file mode 100644 index 000000000000..9d4eaa0c0c3d --- /dev/null +++ b/docs/source/en/modular_diffusers/quickstart.md @@ -0,0 +1,344 @@ + + +# Quickstart + +Modular Diffusers is a framework for quickly building flexible and customizable pipelines. At the core of Modular Diffusers are [`ModularPipelineBlocks`] that can be combined with other blocks to adapt to new workflows. The blocks are converted into a [`ModularPipeline`], a friendly user-facing interface developers can use. + +This doc will show you how to implement a [Differential Diffusion](https://differential-diffusion.github.io/) pipeline with the modular framework. + +## ModularPipelineBlocks + +[`ModularPipelineBlocks`] are *definitions* that specify the components, inputs, outputs, and computation logic for a single step in a pipeline. There are four types of blocks. + +- [`ModularPipelineBlocks`] is the most basic block for a single step. +- [`SequentialPipelineBlocks`] is a multi-block that composes other blocks linearly. The outputs of one block are the inputs to the next block. +- [`LoopSequentialPipelineBlocks`] is a multi-block that runs iteratively and is designed for iterative workflows. +- [`AutoPipelineBlocks`] is a collection of blocks for different workflows and it selects which block to run based on the input. It is designed to conveniently package multiple workflows into a single pipeline. + +[Differential Diffusion](https://differential-diffusion.github.io/) is an image-to-image workflow. Start with the `IMAGE2IMAGE_BLOCKS` preset, a collection of `ModularPipelineBlocks` for image-to-image generation. + +```py +from diffusers.modular_pipelines.stable_diffusion_xl import IMAGE2IMAGE_BLOCKS +IMAGE2IMAGE_BLOCKS = InsertableDict([ + ("text_encoder", StableDiffusionXLTextEncoderStep), + ("image_encoder", StableDiffusionXLVaeEncoderStep), + ("input", StableDiffusionXLInputStep), + ("set_timesteps", StableDiffusionXLImg2ImgSetTimestepsStep), + ("prepare_latents", StableDiffusionXLImg2ImgPrepareLatentsStep), + ("prepare_add_cond", StableDiffusionXLImg2ImgPrepareAdditionalConditioningStep), + ("denoise", StableDiffusionXLDenoiseStep), + ("decode", StableDiffusionXLDecodeStep) +]) +``` + +## Pipeline and block states + +Modular Diffusers uses *state* to communicate data between blocks. There are two types of states. + +- [`PipelineState`] is a global state that can be used to track all inputs and outputs across all blocks. +- [`BlockState`] is a local view of relevant variables from [`PipelineState`] for an individual block. + +## Customizing blocks + +[Differential Diffusion](https://differential-diffusion.github.io/) differs from standard image-to-image in its `prepare_latents` and `denoise` blocks. All the other blocks can be reused, but you'll need to modify these two. 
+
+Create placeholder `ModularPipelineBlocks` for `prepare_latents` and `denoise` by copying and modifying the existing ones.
+
+Print the `denoise` block to see that it is composed of [`LoopSequentialPipelineBlocks`] with three sub-blocks, `before_denoiser`, `denoiser`, and `after_denoiser`. Only the `before_denoiser` sub-block needs to be modified to prepare the latent input for the denoiser based on the change map.
+
+```py
+denoise_blocks = IMAGE2IMAGE_BLOCKS["denoise"]()
+print(denoise_blocks)
+```
+
+Replace the `StableDiffusionXLLoopBeforeDenoiser` sub-block with the new `SDXLDiffDiffLoopBeforeDenoiser` block.
+
+```py
+# Copy existing blocks as placeholders
+class SDXLDiffDiffPrepareLatentsStep(ModularPipelineBlocks):
+    """Copied from StableDiffusionXLImg2ImgPrepareLatentsStep - will modify later"""
+    # ... same implementation as StableDiffusionXLImg2ImgPrepareLatentsStep
+
+class SDXLDiffDiffDenoiseStep(StableDiffusionXLDenoiseLoopWrapper):
+    block_classes = [SDXLDiffDiffLoopBeforeDenoiser, StableDiffusionXLLoopDenoiser, StableDiffusionXLLoopAfterDenoiser]
+    block_names = ["before_denoiser", "denoiser", "after_denoiser"]
+```
+
+### prepare_latents
+
+The `prepare_latents` block requires the following changes.
+
+- a processor to process the change map
+- new `inputs` to accept the user-provided change map, `timesteps` for precomputing all the latents, and `num_inference_steps` to create the masks for updating the image regions
+- update the computation in the `__call__` method to process the change map, create the masks, and store them in the [`BlockState`]
+
+```diff
+class SDXLDiffDiffPrepareLatentsStep(ModularPipelineBlocks):
+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec("vae", AutoencoderKL),
+            ComponentSpec("scheduler", EulerDiscreteScheduler),
++           ComponentSpec("mask_processor", VaeImageProcessor, config=FrozenDict({"do_normalize": False, "do_convert_grayscale": True}))
+        ]
+    @property
+    def inputs(self) -> List[Tuple[str, Any]]:
+        return [
+            InputParam("generator"),
++           InputParam("diffdiff_map", required=True),
+-           InputParam("latent_timestep", required=True, type_hint=torch.Tensor),
++           InputParam("timesteps", type_hint=torch.Tensor),
++           InputParam("num_inference_steps", type_hint=int),
+        ]
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        return [
++           OutputParam("original_latents", type_hint=torch.Tensor),
++           OutputParam("diffdiff_masks", type_hint=torch.Tensor),
+        ]
+    def __call__(self, components, state: PipelineState):
+        # ... existing logic ...
++       # Process change map and create masks
++       diffdiff_map = components.mask_processor.preprocess(block_state.diffdiff_map, height=latent_height, width=latent_width)
++       thresholds = torch.arange(block_state.num_inference_steps, dtype=diffdiff_map.dtype) / block_state.num_inference_steps
++       block_state.diffdiff_masks = diffdiff_map > (thresholds + (block_state.denoising_start or 0))
++       block_state.original_latents = block_state.latents
+```
+
+### denoise
+
+The `before_denoiser` sub-block requires the following changes.
+
+- new `inputs` to accept a `denoising_start` parameter, and `original_latents` and `diffdiff_masks` from the `prepare_latents` block
+- update the computation in the `__call__` method to apply Differential Diffusion
+
+```diff
+class SDXLDiffDiffLoopBeforeDenoiser(ModularPipelineBlocks):
+    @property
+    def description(self) -> str:
+        return (
+            "Step within the denoising loop for differential diffusion that prepares the latent input for the denoiser"
+        )
+
+    @property
+    def inputs(self) -> List[str]:
+        return [
+            InputParam("latents", required=True, type_hint=torch.Tensor),
++           InputParam("denoising_start"),
++           InputParam("original_latents", type_hint=torch.Tensor),
++           InputParam("diffdiff_masks", type_hint=torch.Tensor),
+        ]
+
+    def __call__(self, components, block_state, i, t):
++       # Apply differential diffusion logic
++       if i == 0 and block_state.denoising_start is None:
++           block_state.latents = block_state.original_latents[:1]
++       else:
++           block_state.mask = block_state.diffdiff_masks[i].unsqueeze(0).unsqueeze(1)
++           block_state.latents = block_state.original_latents[i] * block_state.mask + block_state.latents * (1 - block_state.mask)
+
+        # ... rest of existing logic ...
+```
+
+## Assembling the blocks
+
+You should have all the blocks you need at this point to create a [`ModularPipeline`].
+
+Copy the existing `IMAGE2IMAGE_BLOCKS` preset, and for the `set_timesteps` block, use the one from `TEXT2IMAGE_BLOCKS` because Differential Diffusion doesn't require a `strength` parameter.
+
+Set the `prepare_latents` and `denoise` blocks to the `SDXLDiffDiffPrepareLatentsStep` and `SDXLDiffDiffDenoiseStep` blocks you just modified.
+
+Call [`SequentialPipelineBlocks.from_blocks_dict`] on the blocks to create a `SequentialPipelineBlocks`.
+
+```py
+DIFFDIFF_BLOCKS = IMAGE2IMAGE_BLOCKS.copy()
+DIFFDIFF_BLOCKS["set_timesteps"] = TEXT2IMAGE_BLOCKS["set_timesteps"]
+DIFFDIFF_BLOCKS["prepare_latents"] = SDXLDiffDiffPrepareLatentsStep
+DIFFDIFF_BLOCKS["denoise"] = SDXLDiffDiffDenoiseStep
+
+dd_blocks = SequentialPipelineBlocks.from_blocks_dict(DIFFDIFF_BLOCKS)
+print(dd_blocks)
+```
+
+## ModularPipeline
+
+Convert the [`SequentialPipelineBlocks`] into a [`ModularPipeline`] with the [`ModularPipeline.init_pipeline`] method. This initializes the expected components to load from a `modular_model_index.json` file. Explicitly load the components by calling [`ModularPipeline.load_components`].
+
+It is a good idea to initialize the [`ComponentsManager`] with the pipeline to help manage the different components. Once you call [`~ModularPipeline.load_components`], the components are registered to the [`ComponentsManager`] and can be shared between workflows. The example below uses the `collection` argument to assign the components a `"diffdiff"` label for better organization.
+
+```py
+import torch
+from diffusers.modular_pipelines import ComponentsManager
+
+components = ComponentsManager()
+
+dd_pipeline = dd_blocks.init_pipeline("YiYiXu/modular-demo-auto", components_manager=components, collection="diffdiff")
+dd_pipeline.load_components(torch_dtype=torch.float16)
+dd_pipeline.to("cuda")
+```
+
+## Adding workflows
+
+Other workflows can be added to the [`ModularPipeline`] to support additional features without rewriting the entire pipeline from scratch.
+
+This section demonstrates how to add an IP-Adapter or ControlNet.
+
+### IP-Adapter
+
+Stable Diffusion XL already has a preset IP-Adapter block that you can use and doesn't require any changes to the existing Differential Diffusion pipeline.
+
+```py
+from diffusers.modular_pipelines.stable_diffusion_xl.encoders import StableDiffusionXLAutoIPAdapterStep
+
+ip_adapter_block = StableDiffusionXLAutoIPAdapterStep()
+```
+
+Use the [`sub_blocks.insert`] method to insert it into the [`SequentialPipelineBlocks`]. The example below inserts the `ip_adapter_block` at position `0`. Print the blocks to see that the `ip_adapter_block` is added and that it requires an `ip_adapter_image`. This also adds two components to the pipeline, the `image_encoder` and `feature_extractor`.
+
+```py
+dd_blocks.sub_blocks.insert("ip_adapter", ip_adapter_block, 0)
+```
+
+Call [`~ModularPipeline.init_pipeline`] to initialize a [`ModularPipeline`] and use [`~ModularPipeline.load_components`] to load the model components. Load and set the IP-Adapter to run the pipeline.
+
+```py
+import torch
+from diffusers.utils import load_image
+
+device = "cuda"
+
+dd_pipeline = dd_blocks.init_pipeline("YiYiXu/modular-demo-auto", collection="diffdiff")
+dd_pipeline.load_components(torch_dtype=torch.float16)
+dd_pipeline.loader.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
+dd_pipeline.loader.set_ip_adapter_scale(0.6)
+dd_pipeline = dd_pipeline.to(device)
+
+ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/diffdiff_orange.jpeg")
+image = load_image("https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/differential/20240329211129_4024911930.png?download=true")
+mask = load_image("https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/differential/gradient_mask.png?download=true")
+
+prompt = "a green pear"
+negative_prompt = "blurry"
+generator = torch.Generator(device=device).manual_seed(42)
+
+image = dd_pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    num_inference_steps=25,
+    generator=generator,
+    ip_adapter_image=ip_adapter_image,
+    diffdiff_map=mask,
+    image=image,
+    output="images"
+)[0]
+```
+
+### ControlNet
+
+Stable Diffusion XL already has a preset ControlNet block that can readily be used.
+
+```py
+from diffusers.modular_pipelines.stable_diffusion_xl.modular_blocks import StableDiffusionXLAutoControlNetInputStep
+
+control_input_block = StableDiffusionXLAutoControlNetInputStep()
+```
+
+However, it requires modifying the `denoise` block because that's where the ControlNet injects the control information into the UNet.
+
+Modify the `denoise` block by replacing the `StableDiffusionXLLoopDenoiser` sub-block with the `StableDiffusionXLControlNetLoopDenoiser`.
+
+```py
+class SDXLDiffDiffControlNetDenoiseStep(StableDiffusionXLDenoiseLoopWrapper):
+    block_classes = [SDXLDiffDiffLoopBeforeDenoiser, StableDiffusionXLControlNetLoopDenoiser, StableDiffusionXLLoopAfterDenoiser]
+    block_names = ["before_denoiser", "denoiser", "after_denoiser"]
+
+controlnet_denoise_block = SDXLDiffDiffControlNetDenoiseStep()
+```
+
+Insert the `controlnet_input` block and replace the `denoise` block with the new `controlnet_denoise_block`. Initialize a [`ModularPipeline`] and [`~ModularPipeline.load_components`] into it.
+
+```py
+import torch
+from diffusers.utils import load_image
+
+device = "cuda"
+
+dd_blocks.sub_blocks.insert("controlnet_input", control_input_block, 7)
+dd_blocks.sub_blocks["denoise"] = controlnet_denoise_block
+
+dd_pipeline = dd_blocks.init_pipeline("YiYiXu/modular-demo-auto", collection="diffdiff")
+dd_pipeline.load_components(torch_dtype=torch.float16)
+dd_pipeline = dd_pipeline.to(device)
+
+control_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/diffdiff_tomato_canny.jpeg")
+image = load_image("https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/differential/20240329211129_4024911930.png?download=true")
+mask = load_image("https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/differential/gradient_mask.png?download=true")
+
+prompt = "a green pear"
+negative_prompt = "blurry"
+generator = torch.Generator(device=device).manual_seed(42)
+
+image = dd_pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    num_inference_steps=25,
+    generator=generator,
+    control_image=control_image,
+    controlnet_conditioning_scale=0.5,
+    diffdiff_map=mask,
+    image=image,
+    output="images"
+)[0]
+```
+
+### AutoPipelineBlocks
+
+The Differential Diffusion, IP-Adapter, and ControlNet workflows can be bundled into a single [`ModularPipeline`] by using [`AutoPipelineBlocks`]. This allows automatically selecting which sub-blocks to run based on inputs like `control_image` or `ip_adapter_image`. If none of these inputs are passed, then it defaults to the Differential Diffusion workflow.
+
+Use `block_trigger_inputs` to only run the `SDXLDiffDiffControlNetDenoiseStep` block if a `controlnet_cond` input is provided. Otherwise, the `SDXLDiffDiffDenoiseStep` is used.
+
+```py
+class SDXLDiffDiffAutoDenoiseStep(AutoPipelineBlocks):
+    block_classes = [SDXLDiffDiffControlNetDenoiseStep, SDXLDiffDiffDenoiseStep]
+    block_names = ["controlnet_denoise", "denoise"]
+    block_trigger_inputs = ["controlnet_cond", None]
+```
+
+Add the `ip_adapter` and `controlnet_input` blocks.
+
+```py
+DIFFDIFF_AUTO_BLOCKS = IMAGE2IMAGE_BLOCKS.copy()
+DIFFDIFF_AUTO_BLOCKS["prepare_latents"] = SDXLDiffDiffPrepareLatentsStep
+DIFFDIFF_AUTO_BLOCKS["set_timesteps"] = TEXT2IMAGE_BLOCKS["set_timesteps"]
+DIFFDIFF_AUTO_BLOCKS["denoise"] = SDXLDiffDiffAutoDenoiseStep
+DIFFDIFF_AUTO_BLOCKS.insert("ip_adapter", StableDiffusionXLAutoIPAdapterStep, 0)
+DIFFDIFF_AUTO_BLOCKS.insert("controlnet_input", StableDiffusionXLAutoControlNetInputStep, 7)
+```
+
+Call [`SequentialPipelineBlocks.from_blocks_dict`] to create a [`SequentialPipelineBlocks`], then create a [`ModularPipeline`] and load the model components to run it.
+
+```py
+dd_auto_blocks = SequentialPipelineBlocks.from_blocks_dict(DIFFDIFF_AUTO_BLOCKS)
+dd_pipeline = dd_auto_blocks.init_pipeline("YiYiXu/modular-demo-auto", collection="diffdiff")
+dd_pipeline.load_components(torch_dtype=torch.float16)
+```
+
+## Share
+
+Add your [`ModularPipeline`] to the Hub with [`~ModularPipeline.save_pretrained`] and set the `push_to_hub` argument to `True`.
+
+```py
+dd_pipeline.save_pretrained("YiYiXu/test_modular_doc", push_to_hub=True)
+```
+
+Other users can load the [`ModularPipeline`] with [`~ModularPipeline.from_pretrained`].
+
+```py
+import torch
+from diffusers.modular_pipelines import ModularPipeline, ComponentsManager
+
+components = ComponentsManager()
+
+diffdiff_pipeline = ModularPipeline.from_pretrained("YiYiXu/modular-diffdiff-0704", trust_remote_code=True, components_manager=components, collection="diffdiff")
+diffdiff_pipeline.load_components(torch_dtype=torch.float16)
+```
diff --git a/docs/source/en/modular_diffusers/sequential_pipeline_blocks.md b/docs/source/en/modular_diffusers/sequential_pipeline_blocks.md
new file mode 100644
index 000000000000..bbeb28aae5a4
--- /dev/null
+++ b/docs/source/en/modular_diffusers/sequential_pipeline_blocks.md
@@ -0,0 +1,113 @@
+
+
+# SequentialPipelineBlocks
+
+[`~modular_pipelines.SequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a sequence. Data flows linearly from one block to the next using `intermediate_inputs` and `intermediate_outputs`. Each block in [`~modular_pipelines.SequentialPipelineBlocks`] usually represents a step in the pipeline, and by combining them, you gradually build a pipeline.
+
+This guide shows you how to connect two blocks into a [`~modular_pipelines.SequentialPipelineBlocks`].
+
+Create two [`~modular_pipelines.ModularPipelineBlocks`]. The first block, `InputBlock`, outputs a `batch_size` value, and the second block, `ImageEncoderBlock`, uses `batch_size` as `intermediate_inputs`.
+
+
+
+```py
+from diffusers.modular_pipelines import ModularPipelineBlocks, InputParam, OutputParam
+
+class InputBlock(ModularPipelineBlocks):
+
+    @property
+    def inputs(self):
+        return [
+            InputParam(name="prompt", type_hint=list, description="list of text prompts"),
+            InputParam(name="num_images_per_prompt", type_hint=int, description="number of images per prompt"),
+        ]
+
+    @property
+    def intermediate_outputs(self):
+        return [
+            OutputParam(name="batch_size", description="calculated batch size"),
+        ]
+
+    @property
+    def description(self):
+        return "A block that determines batch_size based on the number of prompts and num_images_per_prompt argument."
+
+    def __call__(self, components, state):
+        block_state = self.get_block_state(state)
+        batch_size = len(block_state.prompt)
+        block_state.batch_size = batch_size * block_state.num_images_per_prompt
+        self.set_block_state(state, block_state)
+        return components, state
+```
+
+
+
+```py
+import torch
+from diffusers.modular_pipelines import ModularPipelineBlocks, InputParam, OutputParam
+
+class ImageEncoderBlock(ModularPipelineBlocks):
+
+    @property
+    def inputs(self):
+        return [
+            InputParam(name="image", type_hint="PIL.Image", description="raw input image to process"),
+            InputParam(name="batch_size", type_hint=int),
+        ]
+
+    @property
+    def intermediate_outputs(self):
+        return [
+            OutputParam(name="image_latents", description="latents representing the image"),
+        ]
+
+    @property
+    def description(self):
+        return "Encode raw image into its latent representation"
+
+    def __call__(self, components, state):
+        block_state = self.get_block_state(state)
+        # Simulate processing the image
+        # This will change the state of the image from a PIL image to a tensor for all blocks
+        block_state.image = torch.randn(1, 3, 512, 512)
+        block_state.batch_size = block_state.batch_size * 2
+        block_state.image_latents = torch.randn(1, 4, 64, 64)
+        self.set_block_state(state, block_state)
+        return components, state
+```
+
+
+
+Connect the two blocks by defining an [`InsertableDict`] to map the block names to the block instances. Blocks are executed in the order they're registered in `blocks_dict`.
+
+Use [`~modular_pipelines.SequentialPipelineBlocks.from_blocks_dict`] to create a [`~modular_pipelines.SequentialPipelineBlocks`].
+
+```py
+from diffusers.modular_pipelines import SequentialPipelineBlocks, InsertableDict
+
+# instantiate the two blocks defined above
+input_block = InputBlock()
+image_encoder_block = ImageEncoderBlock()
+
+blocks_dict = InsertableDict()
+blocks_dict["input"] = input_block
+blocks_dict["image_encoder"] = image_encoder_block
+
+blocks = SequentialPipelineBlocks.from_blocks_dict(blocks_dict)
+```
+
+Inspect the sub-blocks in [`~modular_pipelines.SequentialPipelineBlocks`] by calling `blocks`, and for more details about the inputs and outputs, access the `doc` attribute.
+
+```py
+print(blocks)
+print(blocks.doc)
+```
\ No newline at end of file
diff --git a/docs/source/en/optimization/attention_backends.md b/docs/source/en/optimization/attention_backends.md
new file mode 100644
index 000000000000..e603878a6383
--- /dev/null
+++ b/docs/source/en/optimization/attention_backends.md
@@ -0,0 +1,114 @@
+
+
+# Attention backends
+
+> [!NOTE]
+> The attention dispatcher is an experimental feature. Please open an issue if you have any feedback or encounter any problems.
+
+Diffusers provides several optimized attention algorithms that are more memory- and compute-efficient through its *attention dispatcher*. The dispatcher acts as a router for managing and switching between different attention implementations and provides a unified interface for interacting with them.
+
+Refer to the table below for an overview of the available attention families and to the [Available backends](#available-backends) section for a more complete list.
+
+| attention family | main feature |
+|---|---|
+| FlashAttention | minimizes memory reads/writes through tiling and recomputation |
+| SageAttention | quantizes attention to int8 |
+| PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention) |
+| xFormers | memory-efficient attention with support for various attention kernels |
+
+This guide will show you how to set and use the different attention backends.
+
+## set_attention_backend
+
+The [`~ModelMixin.set_attention_backend`] method iterates through all the modules in the model and sets the appropriate attention backend to use. The attention backend setting persists until [`~ModelMixin.reset_attention_backend`] is called.
+
+The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [kernels](https://github.com/huggingface/kernels) library, which allows you to instantly use optimized compute kernels from the Hub without requiring any setup.
+
+> [!NOTE]
+> FlashAttention-3 is only supported on Hopper architectures; on other GPUs, use FlashAttention with `set_attention_backend("flash")`.
+
+```py
+import torch
+from diffusers import QwenImagePipeline
+
+pipeline = QwenImagePipeline.from_pretrained(
+    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
+)
+pipeline.transformer.set_attention_backend("_flash_3_hub")
+
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+pipeline(prompt).images[0]
+```
+
+To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`].
+
+```py
+pipeline.transformer.reset_attention_backend()
+```
+
+## attention_backend context manager
+
+The [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager temporarily sets an attention backend for a model within the context. Outside the context, the default attention (PyTorch's native scaled dot product attention) is used. This is useful if you want to use different backends for different parts of a pipeline or if you want to test the different backends.
+
+```py
+import torch
+from diffusers import QwenImagePipeline
+from diffusers.models.attention_dispatch import attention_backend
+
+pipeline = QwenImagePipeline.from_pretrained(
+    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
+)
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+
+with attention_backend("_flash_3_hub"):
+    image = pipeline(prompt).images[0]
+```
+
+> [!TIP]
+> Most attention backends support `torch.compile` without graph breaks and can be used to further speed up inference.
+
+## Available backends
+
+Refer to the table below for a complete list of available attention backends and their variants.
+
+<details>
+<summary>Expand</summary>
+
+| Backend Name | Family | Description |
+|--------------|--------|-------------|
+| `native` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Default backend using PyTorch's scaled_dot_product_attention |
+| `flex` | [FlexAttention](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention) | PyTorch FlexAttention implementation |
+| `_native_cudnn` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | CuDNN-optimized attention |
+| `_native_efficient` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Memory-efficient attention |
+| `_native_flash` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | PyTorch's FlashAttention |
+| `_native_math` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Math-based attention (fallback) |
+| `_native_npu` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | NPU-optimized attention |
+| `_native_xla` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | XLA-optimized attention |
+| `flash` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-2 |
+| `flash_varlen` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention |
+| `_flash_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 |
+| `_flash_varlen_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention-3 |
+| `_flash_3_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 from kernels |
+| `sage` | [SageAttention](https://github.com/thu-ml/SageAttention) | Quantized attention (INT8 QK) |
+| `sage_varlen` | [SageAttention](https://github.com/thu-ml/SageAttention) | Variable length SageAttention |
+| `_sage_qk_int8_pv_fp8_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP8 PV (CUDA) |
+| `_sage_qk_int8_pv_fp8_cuda_sm90` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP8 PV (SM90) |
+| `_sage_qk_int8_pv_fp16_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (CUDA) |
+| `_sage_qk_int8_pv_fp16_triton` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (Triton) |
+| `xformers` | [xFormers](https://github.com/facebookresearch/xformers) | Memory-efficient attention |
+
+</details>
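+
+Any backend in the table can be passed to [`~ModelMixin.set_attention_backend`] or the `attention_backend` context manager shown above. For example, the sketch below (reusing the `pipeline` and `prompt` from the earlier examples) switches to the cuDNN-optimized native backend, which requires no additional packages.
+
+```py
+pipeline.transformer.set_attention_backend("_native_cudnn")
+image = pipeline(prompt).images[0]
+
+# switch back to the default backend
+pipeline.transformer.reset_attention_backend()
+```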
\ No newline at end of file
diff --git a/docs/source/en/optimization/cache.md b/docs/source/en/optimization/cache.md
new file mode 100644
index 000000000000..881529b27ff1
--- /dev/null
+++ b/docs/source/en/optimization/cache.md
@@ -0,0 +1,69 @@
+
+
+# Caching
+
+Caching accelerates inference by storing and reusing intermediate outputs of different layers, such as attention and feedforward layers, instead of performing the entire computation at each inference step. It significantly improves generation speed at the expense of more memory and doesn't require additional training.
+
+This guide shows you how to use the caching methods supported in Diffusers.
+
+## Pyramid Attention Broadcast
+
+[Pyramid Attention Broadcast (PAB)](https://huggingface.co/papers/2408.12588) is based on the observation that attention outputs aren't that different between successive timesteps of the generation process. The attention differences are smallest in the cross attention layers and are generally cached over a longer timestep range. This is followed by temporal attention and spatial attention layers.
+
+> [!TIP]
+> Not all video models have three types of attention (cross, temporal, and spatial)!
+
+PAB can be combined with other techniques like sequence parallelism and classifier-free guidance parallelism (data parallelism) for near real-time video generation.
+
+Set up and pass a [`PyramidAttentionBroadcastConfig`] to a pipeline's transformer to enable it. The `spatial_attention_block_skip_range` controls how often to skip attention calculations in the spatial attention blocks and the `spatial_attention_timestep_skip_range` is the range of timesteps to skip. Take care to choose an appropriate range because a smaller interval can lead to slower inference speeds and a larger interval can result in lower generation quality.
+
+```python
+import torch
+from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig
+
+pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
+pipeline.to("cuda")
+
+config = PyramidAttentionBroadcastConfig(
+    spatial_attention_block_skip_range=2,
+    spatial_attention_timestep_skip_range=(100, 800),
+    current_timestep_callback=lambda: pipeline.current_timestep,
+)
+pipeline.transformer.enable_cache(config)
+```
+
+## FasterCache
+
+[FasterCache](https://huggingface.co/papers/2410.19355) caches and reuses attention features similar to [PAB](#pyramid-attention-broadcast) since output differences are small for each successive timestep.
+
+When classifier-free guidance is used for sampling (common in most base models), this method can also skip the unconditional branch prediction and estimate it from the conditional branch prediction if there is significant redundancy between the predicted latent outputs of successive timesteps.
+
+Set up and pass a [`FasterCacheConfig`] to a pipeline's transformer to enable it.
+
+```python
+import torch
+from diffusers import CogVideoXPipeline, FasterCacheConfig
+
+pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
+pipeline.to("cuda")
+
+config = FasterCacheConfig(
+    spatial_attention_block_skip_range=2,
+    spatial_attention_timestep_skip_range=(-1, 681),
+    current_timestep_callback=lambda: pipeline.current_timestep,
+    attention_weight_callback=lambda _: 0.3,
+    unconditional_batch_skip_range=5,
+    unconditional_batch_timestep_skip_range=(-1, 781),
+    tensor_format="BFCHW",
+)
+pipeline.transformer.enable_cache(config)
+```
\ No newline at end of file
diff --git a/docs/source/en/optimization/cache_dit.md b/docs/source/en/optimization/cache_dit.md
new file mode 100644
index 000000000000..126142321249
--- /dev/null
+++ b/docs/source/en/optimization/cache_dit.md
@@ -0,0 +1,270 @@
+# CacheDiT
+
+CacheDiT is a unified, flexible, and training-free cache acceleration framework designed to support nearly all Diffusers' DiT-based pipelines. It provides a unified cache API that supports automatic block adapter, DBCache, and more.
+
+To learn more, refer to the [CacheDiT](https://github.com/vipshop/cache-dit) repository.
+
+Install a stable release of CacheDiT from PyPI, or install the latest version from GitHub.
+
+
+
+```bash
+pip3 install -U cache-dit
+```
+
+
+
+```bash
+pip3 install git+https://github.com/vipshop/cache-dit.git
+```
+
+
+
+Run the command below to view supported DiT pipelines.
+
+```python
+>>> import cache_dit
+>>> cache_dit.supported_pipelines()
+(30, ['Flux*', 'Mochi*', 'CogVideoX*', 'Wan*', 'HunyuanVideo*', 'QwenImage*', 'LTX*', 'Allegro*',
+'CogView3Plus*', 'CogView4*', 'Cosmos*', 'EasyAnimate*', 'SkyReelsV2*', 'StableDiffusion3*',
+'ConsisID*', 'DiT*', 'Amused*', 'Bria*', 'Lumina*', 'OmniGen*', 'PixArt*', 'Sana*', 'StableAudio*',
+'VisualCloze*', 'AuraFlow*', 'Chroma*', 'ShapE*', 'HiDream*', 'HunyuanDiT*', 'HunyuanDiTPAG*'])
+```
+
+For a complete benchmark, please refer to [Benchmarks](https://github.com/vipshop/cache-dit/blob/main/bench/).
+
+
+## Unified Cache API
+
+CacheDiT works by matching specific input/output patterns as shown below.
+
+![](https://github.com/vipshop/cache-dit/raw/main/assets/patterns-v1.png)
+
+Call the `enable_cache()` function on a pipeline to enable cache acceleration. This function is the entry point to many of CacheDiT's features.
+
+```python
+import cache_dit
+from diffusers import DiffusionPipeline
+
+# Can be any diffusion pipeline
+pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image")
+
+# One-line code with default cache options.
+cache_dit.enable_cache(pipe)
+
+# Just call the pipe as normal.
+output = pipe(...)
+
+# Disable cache and run original pipe.
+cache_dit.disable_cache(pipe)
+```
+
+## Automatic Block Adapter
+
+For custom or modified pipelines or transformers not included in Diffusers, use the `BlockAdapter` in `auto` mode or via manual configuration. Please check the [BlockAdapter](https://github.com/vipshop/cache-dit/blob/main/docs/User_Guide.md#automatic-block-adapter) docs for more details. Refer to [Qwen-Image w/ BlockAdapter](https://github.com/vipshop/cache-dit/blob/main/examples/adapter/run_qwen_image_adapter.py) as an example.
+
+
+```python
+from cache_dit import ForwardPattern, BlockAdapter
+
+# Use 🔥BlockAdapter with `auto` mode.
+cache_dit.enable_cache(
+    BlockAdapter(
+        # Any DiffusionPipeline, Qwen-Image, etc.
+        pipe=pipe, auto=True,
+        # Check the `📚Forward Pattern Matching` documentation and the code
+        # of Qwen-Image; you will find that it satisfies `FORWARD_PATTERN_1`.
+        forward_pattern=ForwardPattern.Pattern_1,
+    ),
+)
+
+# Or, manually setup transformer configurations.
+cache_dit.enable_cache(
+    BlockAdapter(
+        pipe=pipe, # Qwen-Image, etc.
+        transformer=pipe.transformer,
+        blocks=pipe.transformer.transformer_blocks,
+        forward_pattern=ForwardPattern.Pattern_1,
+    ),
+)
+```
+
+Sometimes, a transformer class contains more than one set of transformer `blocks`. For example, FLUX.1 (HiDream, Chroma, etc.) contains `transformer_blocks` and `single_transformer_blocks` (with different forward patterns). The BlockAdapter is able to detect this hybrid pattern type as well.
+Refer to [FLUX.1](https://github.com/vipshop/cache-dit/blob/main/examples/adapter/run_flux_adapter.py) as an example.
+
+```python
+# For diffusers <= 0.34.0, FLUX.1 transformer_blocks and
+# single_transformer_blocks have different forward patterns.
+cache_dit.enable_cache(
+    BlockAdapter(
+        pipe=pipe, # FLUX.1, etc.
+        transformer=pipe.transformer,
+        blocks=[
+            pipe.transformer.transformer_blocks,
+            pipe.transformer.single_transformer_blocks,
+        ],
+        forward_pattern=[
+            ForwardPattern.Pattern_1,
+            ForwardPattern.Pattern_3,
+        ],
+    ),
+)
+```
+
+This also works if there is more than one transformer (namely `transformer` and `transformer_2`) in its structure. Refer to [Wan 2.2 MoE](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline/run_wan_2.2.py) as an example.
+
+## Patch Functor
+
+For any pattern not included in CacheDiT, use the Patch Functor to convert the pattern into a known pattern. You need to subclass the Patch Functor, and may also need to fuse the operations within the blocks' for-loop into the block's `forward` method. After implementing a Patch Functor, set the `patch_functor` property in `BlockAdapter`.
+
+![](https://github.com/vipshop/cache-dit/raw/main/assets/patch-functor.png)
+
+Some Patch Functors are already provided in CacheDiT, [HiDreamPatchFunctor](https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/cache_factory/patch_functors/functor_hidream.py), [ChromaPatchFunctor](https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/cache_factory/patch_functors/functor_chroma.py), etc.
+
+```python
+@BlockAdapterRegistry.register("HiDream")
+def hidream_adapter(pipe, **kwargs) -> BlockAdapter:
+    from diffusers import HiDreamImageTransformer2DModel
+    from cache_dit.cache_factory.patch_functors import HiDreamPatchFunctor
+
+    assert isinstance(pipe.transformer, HiDreamImageTransformer2DModel)
+    return BlockAdapter(
+        pipe=pipe,
+        transformer=pipe.transformer,
+        blocks=[
+            pipe.transformer.double_stream_blocks,
+            pipe.transformer.single_stream_blocks,
+        ],
+        forward_pattern=[
+            ForwardPattern.Pattern_0,
+            ForwardPattern.Pattern_3,
+        ],
+        # NOTE: Setup your custom patch functor here.
+        patch_functor=HiDreamPatchFunctor(),
+        **kwargs,
+    )
+```
+
+Finally, you can call the `cache_dit.summary()` function on a pipeline after it has completed inference to get the cache acceleration details.
+
+```python
+stats = cache_dit.summary(pipe)
+```
+
+```python
+⚡️Cache Steps and Residual Diffs Statistics: QwenImagePipeline
+
+| Cache Steps | Diffs Min | Diffs P25 | Diffs P50 | Diffs P75 | Diffs P95 | Diffs Max |
+|-------------|-----------|-----------|-----------|-----------|-----------|-----------|
+| 23          | 0.045     | 0.084     | 0.114     | 0.147     | 0.241     | 0.297     |
+```
+
+## DBCache: Dual Block Cache
+
+![](https://github.com/vipshop/cache-dit/raw/main/assets/dbcache-v1.png)
+
+DBCache (Dual Block Caching) supports different configurations of compute blocks (F8B12, etc.) to enable a balanced trade-off between performance and precision.
+- Fn_compute_blocks: Specifies that DBCache uses the **first n** Transformer blocks to fit the information at time step t, enabling the calculation of a more stable L1 diff and delivering more accurate information to subsequent blocks.
+- Bn_compute_blocks: Further fuses approximate information in the **last n** Transformer blocks to enhance prediction accuracy. These blocks act as an auto-scaler for approximate hidden states that use residual cache.
+
+
+```python
+import torch
+import cache_dit
+from diffusers import FluxPipeline
+
+pipe_or_adapter = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    torch_dtype=torch.bfloat16,
+).to("cuda")
+
+# Default options, F8B0, 8 warmup steps, and unlimited cached
+# steps for good balance between performance and precision
+cache_dit.enable_cache(pipe_or_adapter)
+
+# Custom options, F8B8, higher precision
+from cache_dit import BasicCacheConfig
+
+cache_dit.enable_cache(
+    pipe_or_adapter,
+    cache_config=BasicCacheConfig(
+        max_warmup_steps=8,   # steps do not cache
+        max_cached_steps=-1,  # -1 means no limit
+        Fn_compute_blocks=8,  # Fn, F8, etc.
+        Bn_compute_blocks=8,  # Bn, B8, etc.
+        residual_diff_threshold=0.12,
+    ),
+)
+```
+Check the [DBCache](https://github.com/vipshop/cache-dit/blob/main/docs/DBCache.md) and [User Guide](https://github.com/vipshop/cache-dit/blob/main/docs/User_Guide.md#dbcache) docs for more design details.
+
+## TaylorSeer Calibrator
+
+The [TaylorSeers](https://huggingface.co/papers/2503.06923) algorithm further improves the precision of DBCache in cases where the cached steps are large (Hybrid TaylorSeer + DBCache). At timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, significantly harming the generation quality.
+
+TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. The TaylorSeer implemented in CacheDiT supports both hidden states and residual cache types. F_pred can be a residual cache or a hidden-state cache.
+
+```python
+from cache_dit import BasicCacheConfig, TaylorSeerCalibratorConfig
+
+cache_dit.enable_cache(
+    pipe_or_adapter,
+    # Basic DBCache w/ FnBn configurations
+    cache_config=BasicCacheConfig(
+        max_warmup_steps=8,   # steps do not cache
+        max_cached_steps=-1,  # -1 means no limit
+        Fn_compute_blocks=8,  # Fn, F8, etc.
+        Bn_compute_blocks=8,  # Bn, B8, etc.
+        residual_diff_threshold=0.12,
+    ),
+    # Then, you can use the TaylorSeer Calibrator to approximate
+    # the values in cached steps, taylorseer_order default is 1.
+    calibrator_config=TaylorSeerCalibratorConfig(
+        taylorseer_order=1,
+    ),
+)
+```
+
+> [!TIP]
+> The `Bn_compute_blocks` parameter of DBCache can be set to `0` if you use TaylorSeer as the calibrator for approximate hidden states. DBCache's `Bn_compute_blocks` also acts as a calibrator, so you can choose either `Bn_compute_blocks` > 0 or TaylorSeer. We recommend using the configuration scheme of TaylorSeer + DBCache FnB0.
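+
+As a sketch, the recommended TaylorSeer + DBCache F8B0 scheme combines the two configs shown above (same `pipe_or_adapter` as in the earlier examples).
+
+```python
+from cache_dit import BasicCacheConfig, TaylorSeerCalibratorConfig
+
+cache_dit.enable_cache(
+    pipe_or_adapter,
+    # F8B0: Bn_compute_blocks=0 because TaylorSeer handles the calibration
+    cache_config=BasicCacheConfig(
+        max_warmup_steps=8,
+        max_cached_steps=-1,
+        Fn_compute_blocks=8,
+        Bn_compute_blocks=0,
+        residual_diff_threshold=0.12,
+    ),
+    calibrator_config=TaylorSeerCalibratorConfig(taylorseer_order=1),
+)
+```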
+
+## Hybrid Cache CFG
+
+CacheDiT supports caching for CFG (classifier-free guidance). For models that fuse CFG and non-CFG into a single forward step, or models that do not include CFG in the forward step, leave the `enable_separate_cfg` parameter as `False` (the default, `None`). Otherwise, set it to `True`.
+
+```python
+from cache_dit import BasicCacheConfig
+
+cache_dit.enable_cache(
+    pipe_or_adapter,
+    cache_config=BasicCacheConfig(
+        ...,
+        # For example, set it as True for Wan 2.1, Qwen-Image
+        # and set it as False for FLUX.1, HunyuanVideo, etc.
+        enable_separate_cfg=True,
+    ),
+)
+```
+
+## torch.compile
+
+CacheDiT is designed to work with torch.compile for even better performance. Call `torch.compile` after enabling the cache.
+
+
+```python
+import torch
+
+cache_dit.enable_cache(pipe)
+
+# Compile the Transformer module
+pipe.transformer = torch.compile(pipe.transformer)
+```
+
+If you're using CacheDiT with dynamic input shapes, consider increasing the `recompile_limit` of `torch._dynamo`. Otherwise, the `recompile_limit` error may be triggered, causing the module to fall back to eager mode.
+
+```python
+torch._dynamo.config.recompile_limit = 96  # default is 8
+torch._dynamo.config.accumulated_recompile_limit = 2048  # default is 256
+```
+
+Please check [perf.py](https://github.com/vipshop/cache-dit/blob/main/bench/perf.py) for more details.
diff --git a/docs/source/en/optimization/coreml.md b/docs/source/en/optimization/coreml.md
index ee6af9d87c64..71da1e3dc1fe 100644
--- a/docs/source/en/optimization/coreml.md
+++ b/docs/source/en/optimization/coreml.md
@@ -1,4 +1,4 @@
-
-# Speed up inference
+# Accelerate inference
-There are several ways to optimize 🤗 Diffusers for inference speed. As a general rule of thumb, we recommend using either [xFormers](xformers) or `torch.nn.functional.scaled_dot_product_attention` in PyTorch 2.0 for their memory-efficient attention.
+Diffusion models are slow at inference because generation is an iterative process where noise is gradually refined into an image or video over a certain number of "steps". To speed up this process, you can try experimenting with different [schedulers](../api/schedulers/overview), reduce the precision of the model weights for faster computations, use more memory-efficient attention mechanisms, and more.
-
+Combine and use these techniques together to make inference faster than using any single technique on its own.
-In many cases, optimizing for speed or memory leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about preserving memory in the [Reduce memory usage](memory) guide.
+This guide will go over how to accelerate inference.
-
+## Model data type
-The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect.
+The precision and data type of the model weights affect inference speed because a higher precision requires more memory to load and more time to perform the computations. PyTorch loads model weights in float32 or full precision by default, so changing the data type is a simple way to quickly get faster inference.
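+
+For a rough sense of the savings, compare the memory needed just to hold the weights at each precision (a back-of-the-envelope sketch; the parameter count is approximate).
+
+```py
+params = 2.6e9  # the SDXL UNet has roughly 2.6B parameters
+
+print(f"float32: {params * 4 / 1e9:.1f} GB")   # 4 bytes per parameter -> ~10.4 GB
+print(f"bfloat16: {params * 2 / 1e9:.1f} GB")  # 2 bytes per parameter -> ~5.2 GB
+```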
-| | latency | speed-up | -| ---------------- | ------- | ------- | -| original | 9.50s | x1 | -| fp16 | 3.61s | x2.63 | -| channels last | 3.30s | x2.88 | -| traced UNet | 3.21s | x2.96 | -| memory efficient attention | 2.63s | x3.61 | + + -## Use TensorFloat-32 +bfloat16 is similar to float16 but it is more robust to numerical errors. Hardware support for bfloat16 varies, but most modern GPUs are capable of supporting bfloat16. -On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speeds up computations with typically negligible loss in numerical accuracy. +```py +import torch +from diffusers import StableDiffusionXLPipeline + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16 +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +pipeline(prompt, num_inference_steps=30).images[0] +``` + + + + +float16 is similar to bfloat16 but may be more prone to numerical errors. + +```py +import torch +from diffusers import StableDiffusionXLPipeline + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +pipeline(prompt, num_inference_steps=30).images[0] +``` + + + -```python +[TensorFloat-32 (tf32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode is supported on NVIDIA Ampere GPUs and it computes the convolution and matrix multiplication operations in tf32. Storage and other operations are kept in float32. This enables significantly faster computations when combined with bfloat16 or float16. + +PyTorch only enables tf32 mode for convolutions by default and you'll need to explicitly enable it for matrix multiplications. + +```py import torch +from diffusers import StableDiffusionXLPipeline torch.backends.cuda.matmul.allow_tf32 = True + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16 +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +pipeline(prompt, num_inference_steps=30).images[0] ``` -You can learn more about TF32 in the [Mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32) guide. +Refer to the [mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#mixed-precision) docs for more details. + + + + +## Scaled dot product attention -## Half-precision weights +> [!TIP] +> Memory-efficient attention optimizes for inference speed *and* [memory usage](./memory#memory-efficient-attention)! 
-To save GPU memory and get more speed, try loading and running the model weights directly in half-precision or float16:
+[Scaled dot product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) implements several attention backends, [FlashAttention](https://github.com/Dao-AILab/flash-attention), [xFormers](https://github.com/facebookresearch/xformers), and a native C++ implementation. It automatically selects the optimal backend for your hardware.
-```Python
+SDPA is enabled by default if you're using PyTorch >= 2.0 and no additional changes are required to your code. You can still experiment with other attention backends if you'd like to choose your own. The example below uses the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager to enable efficient attention.
+
+```py
+from torch.nn.attention import SDPBackend, sdpa_kernel
 import torch
-from diffusers import DiffusionPipeline
+from diffusers import StableDiffusionXLPipeline
 
-pipe = DiffusionPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5",
-    torch_dtype=torch.float16,
-    use_safetensors=True,
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
+    image = pipeline(prompt, num_inference_steps=30).images[0]
+```
+
+## torch.compile
+
+[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) accelerates inference by compiling PyTorch code and operations into optimized kernels. Diffusers typically compiles the more compute-intensive models like the UNet, transformer, or VAE.
+
+Enable the following compiler settings for maximum speed (refer to the [full list](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py) for more options).
+
+```py
+import torch
+from diffusers import StableDiffusionXLPipeline
+
+torch._inductor.config.conv_1x1_as_mm = True
+torch._inductor.config.coordinate_descent_tuning = True
+torch._inductor.config.epilogue_fusion = False
+torch._inductor.config.coordinate_descent_check_all_directions = True
+```
+
+Load and compile the UNet and VAE. There are several different modes you can choose from, but `"max-autotune"` optimizes for the fastest speed by compiling to a CUDA graph. CUDA graphs effectively reduce the overhead by launching multiple GPU operations through a single CPU operation.
+
+> [!TIP]
+> With PyTorch 2.3.1, you can control the caching behavior of torch.compile. This is particularly beneficial for compilation modes like `"max-autotune"` which performs a grid-search over several compilation flags to find the optimal configuration. Learn more in the [Compile Time Caching in torch.compile](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html) tutorial.
+
+Changing the memory layout to [channels_last](./memory#torchchannels_last) also optimizes memory and inference speed.
+
+```py
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+).to("cuda")
+pipeline.unet.to(memory_format=torch.channels_last)
+pipeline.vae.to(memory_format=torch.channels_last)
+pipeline.unet = torch.compile(
+    pipeline.unet, mode="max-autotune", fullgraph=True
+)
+pipeline.vae.decode = torch.compile(
+    pipeline.vae.decode,
+    mode="max-autotune",
+    fullgraph=True
 )
-pipe = pipe.to("cuda")
 
-prompt = "a photo of an astronaut riding a horse on mars"
-image = pipe(prompt).images[0]
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+pipeline(prompt, num_inference_steps=30).images[0]
 ```
 
-
+Compilation is slow the first time, but once compiled, it is significantly faster. Try to only use the compiled pipeline on the same type of inference operations. Calling the compiled pipeline on a different image size retriggers compilation, which is slow and inefficient.
+
+### Dynamic shape compilation
+
+> [!TIP]
+> Make sure to always use the nightly version of PyTorch for better support.
+
+`torch.compile` keeps track of input shapes and conditions, and if these are different, it recompiles the model. For example, if a model is compiled on a 1024x1024 resolution image and used on an image with a different resolution, it triggers recompilation.
+
+To avoid recompilation, add `dynamic=True` to try and generate a more dynamic kernel that doesn't recompile when conditions change.
+
+```diff
++ torch.fx.experimental._config.use_duck_shape = False
++ pipeline.unet = torch.compile(
+    pipeline.unet, fullgraph=True, dynamic=True
+)
+```
+
+Setting `use_duck_shape=False` instructs the compiler not to reuse the same symbolic variable for distinct input sizes that happen to be equal. For more details, check out this [comment](https://github.com/huggingface/diffusers/pull/11327#discussion_r2047659790).
+
+Not all models may benefit from dynamic compilation out of the box and may require changes. Refer to this [PR](https://github.com/huggingface/diffusers/pull/11297/) that improved the [`AuraFlowPipeline`] implementation to benefit from dynamic compilation.
+
+Feel free to open an issue if dynamic compilation doesn't work as expected for a Diffusers model.
+
+### Regional compilation
+
+[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) trims cold-start latency by only compiling the *small and frequently-repeated block(s)* of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence.
+For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x.
-Don't use [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than pure float16 precision.
+Use the [`~ModelMixin.compile_repeated_blocks`] method, a helper that wraps `torch.compile`, on any component such as the transformer model as shown below.
- +```py +# pip install -U diffusers +import torch +from diffusers import StableDiffusionXLPipeline + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, +).to("cuda") + +# compile only the repeated transformer layers inside the UNet +pipeline.unet.compile_repeated_blocks(fullgraph=True) +``` + +To enable regional compilation for a new model, add a `_repeated_blocks` attribute to a model class containing the class names (as strings) of the blocks you want to compile. + +```py +class MyUNet(ModelMixin): + _repeated_blocks = ("Transformer2DModel",) # ← compiled by default +``` + +> [!TIP] +> For more regional compilation examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705). + +There is also a [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method in [Accelerate](https://huggingface.co/docs/accelerate/index) that automatically selects candidate blocks in a model to compile. The remaining graph is compiled separately. This is useful for quick experiments because there aren't as many options for you to set which blocks to compile or adjust compilation flags. + +```py +# pip install -U accelerate +import torch +from diffusers import StableDiffusionXLPipeline +from accelerate.utils import compile_regions + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 +).to("cuda") +pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph=True) +``` + +[`~ModelMixin.compile_repeated_blocks`] is intentionally explicit. List the blocks to repeat in `_repeated_blocks` and the helper only compiles those blocks. It offers predictable behavior and easy reasoning about cache reuse in one line of code. + +### Graph breaks + +It is important to specify `fullgraph=True` in torch.compile to ensure there are no graph breaks in the underlying model. This allows you to take advantage of torch.compile without any performance degradation. For the UNet and VAE, this changes how you access the return variables. + +```diff +- latents = unet( +- latents, timestep=timestep, encoder_hidden_states=prompt_embeds +-).sample + ++ latents = unet( ++ latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False ++)[0] +``` + +### GPU sync + +The `step()` function is [called](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1228) on the scheduler each time after the denoiser makes a prediction, and the `sigmas` variable is [indexed](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/schedulers/scheduling_euler_discrete.py#L476). When placed on the GPU, it introduces latency because of the communication sync between the CPU and GPU. It becomes more evident when the denoiser has already been compiled. + +In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240) to avoid the communication sync and latency. 
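+
+To see why, here is a small, self-contained illustration (toy code, not Diffusers internals) of the sync cost: reading a value out of a CUDA tensor blocks the CPU until the GPU catches up, while reading from a CPU tensor does not.
+
+```py
+import torch
+
+sigmas_gpu = torch.linspace(1.0, 0.0, 1000, device="cuda")
+sigmas_cpu = sigmas_gpu.cpu()
+
+# indexing a CUDA tensor and converting it to a Python float is a sync point;
+# the CPU waits for all queued GPU work to finish
+value = sigmas_gpu[10].item()
+
+# the CPU copy returns immediately with no device synchronization
+value = sigmas_cpu[10].item()
+```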
+
+> [!TIP]
+> Refer to the [torch.compile and Diffusers: A Hands-On Guide to Peak Performance](https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/) blog post for maximizing performance with `torch.compile` for diffusion models.
+
+### Benchmarks
+
+Refer to the [diffusers/benchmarks](https://huggingface.co/datasets/diffusers/benchmarks) dataset to see inference latency and memory usage data for compiled pipelines.
+
+The [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao#benchmarking-results) repository also contains benchmarking results for compiled versions of Flux and CogVideoX.
+
+## Dynamic quantization
+
+[Dynamic quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) improves inference speed by reducing precision to enable faster math operations. This particular type of quantization determines how to scale the activations based on the data at runtime rather than using a fixed scaling factor. As a result, the scaling factor is more accurately aligned with the data.
+
+The example below applies [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to the UNet and VAE with the [torchao](../quantization/torchao) library.
+
+> [!TIP]
+> Refer to our [torchao](../quantization/torchao) docs to learn more about how to use the Diffusers torchao integration.
+
+Configure the compiler flags for maximum speed.
+
+```py
+import torch
+from torchao import apply_dynamic_quant
+from diffusers import StableDiffusionXLPipeline
+
+torch._inductor.config.conv_1x1_as_mm = True
+torch._inductor.config.coordinate_descent_tuning = True
+torch._inductor.config.epilogue_fusion = False
+torch._inductor.config.coordinate_descent_check_all_directions = True
+torch._inductor.config.force_fuse_int_mm_with_mul = True
+torch._inductor.config.use_mixed_mm = True
+```
+
+Filter out some linear layers in the UNet and VAE which don't benefit from dynamic quantization with the [dynamic_quant_filter_fn](https://github.com/huggingface/diffusion-fast/blob/0f169640b1db106fe6a479f78c1ed3bfaeba3386/utils/pipeline_utils.py#L16).
+
+```py
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
+).to("cuda")
+
+apply_dynamic_quant(pipeline.unet, dynamic_quant_filter_fn)
+apply_dynamic_quant(pipeline.vae, dynamic_quant_filter_fn)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+pipeline(prompt, num_inference_steps=30).images[0]
+```
+
+## Fused projection matrices
+
+> [!WARNING]
+> The [fuse_qkv_projections](https://github.com/huggingface/diffusers/blob/58431f102cf39c3c8a569f32d71b2ea8caa461e1/src/diffusers/pipelines/pipeline_utils.py#L2034) method is experimental and support is limited to mostly Stable Diffusion pipelines. Take a look at this [PR](https://github.com/huggingface/diffusers/pull/6179) to learn more about how to enable it for other pipelines.
+
+An input is projected into three subspaces, represented by the projection matrices Q, K, and V, in an attention block. These projections are typically calculated separately, but you can horizontally combine these into a single matrix and perform the projection in a single step. It increases the size of the matrix multiplications of the input projections and also improves the impact of quantization.
+ +```py +pipeline.fuse_qkv_projections() +``` -## Distilled model +## Resources -You could also use a distilled Stable Diffusion model and autoencoder to speed up inference. During distillation, many of the UNet's residual and attention blocks are shed to reduce the model size. The distilled model is faster and uses less memory while generating images of comparable quality to the full Stable Diffusion model. +- Read the [Presenting Flux Fast: Making Flux go brrr on H100s](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/) blog post to learn more about how you can combine all of these optimizations with [TorchInductor](https://docs.pytorch.org/docs/stable/torch.compiler.html) and [AOTInductor](https://docs.pytorch.org/docs/stable/torch.compiler_aot_inductor.html) for a ~2.5x speedup using recipes from [flux-fast](https://github.com/huggingface/flux-fast). -Learn more about in the [Distilled Stable Diffusion inference](../using-diffusers/distilled_sd) guide! + These recipes support AMD hardware and [Flux.1 Kontext Dev](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev). +- Read the [torch.compile and Diffusers: A Hands-On Guide to Peak Performance](https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/) blog post +to maximize performance when using `torch.compile`. \ No newline at end of file diff --git a/docs/source/en/optimization/habana.md b/docs/source/en/optimization/habana.md index a1123d980361..1e5563ae101e 100644 --- a/docs/source/en/optimization/habana.md +++ b/docs/source/en/optimization/habana.md @@ -1,4 +1,4 @@ - -# Habana Gaudi +# Intel Gaudi -🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). Follow the [installation](https://docs.habana.ai/en/latest/Installation_Guide/index.html) guide to install the SynapseAI and Gaudi drivers, and then install Optimum Habana: +The Intel Gaudi AI accelerator family includes [Intel Gaudi 1](https://habana.ai/products/gaudi/), [Intel Gaudi 2](https://habana.ai/products/gaudi2/), and [Intel Gaudi 3](https://habana.ai/products/gaudi3/). Each server is equipped with 8 devices, known as Habana Processing Units (HPUs), providing 128GB of memory on Gaudi 3, 96GB on Gaudi 2, and 32GB on the first-gen Gaudi. For more details on the underlying hardware architecture, check out the [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html) overview. -```bash -python -m pip install --upgrade-strategy eager optimum[habana] -``` - -To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances: - -- [`~optimum.habana.diffusers.GaudiStableDiffusionPipeline`], a pipeline for text-to-image generation. -- [`~optimum.habana.diffusers.GaudiDDIMScheduler`], a Gaudi-optimized scheduler. - -When you initialize the pipeline, you have to specify `use_habana=True` to deploy it on HPUs and to get the fastest possible generation, you should enable **HPU graphs** with `use_hpu_graphs=True`. +Diffusers pipelines can take advantage of HPU acceleration, even if a pipeline hasn't been added to [Optimum for Intel Gaudi](https://huggingface.co/docs/optimum/main/en/habana/index) yet, with the [GPU Migration Toolkit](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html). 
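+
+As a rough sketch, the GPU Migration Toolkit lets CUDA-style scripts run on HPUs by importing its migration module before the rest of your code. The import path below comes from the Intel Gaudi software stack (not Diffusers) and the call pattern is an assumption to verify against the GPU Migration Toolkit documentation.
+
+```py
+import habana_frameworks.torch.gpu_migration  # import first so "cuda" calls are redirected to the HPU
+import torch
+from diffusers import DiffusionPipeline
+
+pipeline = DiffusionPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
+)
+pipeline.to("cuda")  # mapped to the HPU by the migration toolkit
+
+image = pipeline("An image of a squirrel in Picasso style").images[0]
+```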
-Finally, specify a [`~optimum.habana.GaudiConfig`] which can be downloaded from the [Habana](https://huggingface.co/Habana) organization on the Hub. - -```python -from optimum.habana import GaudiConfig -from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline - -model_name = "stabilityai/stable-diffusion-2-base" -scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler") -pipeline = GaudiStableDiffusionPipeline.from_pretrained( - model_name, - scheduler=scheduler, - use_habana=True, - use_hpu_graphs=True, - gaudi_config="Habana/stable-diffusion-2", -) -``` +Call `.to("hpu")` on your pipeline to move it to a HPU device as shown below for Flux: +```py +import torch +from diffusers import DiffusionPipeline -Now you can call the pipeline to generate images by batches from one or several prompts: +pipeline = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) +pipeline.to("hpu") -```python -outputs = pipeline( - prompt=[ - "High quality photo of an astronaut riding a horse in space", - "Face of a yellow cat, high resolution, sitting on a park bench", - ], - num_images_per_prompt=10, - batch_size=4, -) +image = pipeline("An image of a squirrel in Picasso style").images[0] ``` -For more information, check out 🤗 Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official GitHub repository. - -## Benchmark - -We benchmarked Habana's first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32) to demonstrate their performance. - -For [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) on 512x512 images: - -| | Latency (batch size = 1) | Throughput | -| ---------------------- |:------------------------:|:---------------------------:| -| first-generation Gaudi | 3.80s | 0.308 images/s (batch size = 8) | -| Gaudi2 | 1.33s | 1.081 images/s (batch size = 8) | - -For [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on 768x768 images: - -| | Latency (batch size = 1) | Throughput | -| ---------------------- |:------------------------:|:-------------------------------:| -| first-generation Gaudi | 10.2s | 0.108 images/s (batch size = 4) | -| Gaudi2 | 3.17s | 0.379 images/s (batch size = 8) | +> [!TIP] +> For Gaudi-optimized diffusion pipeline implementations, we recommend using [Optimum for Intel Gaudi](https://huggingface.co/docs/optimum/main/en/habana/index). diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md index 6b2a22b315c8..611e07ec7655 100644 --- a/docs/source/en/optimization/memory.md +++ b/docs/source/en/optimization/memory.md @@ -1,4 +1,4 @@ - + +# AWS Neuron + +Diffusers functionalities are available on [AWS Inf2 instances](https://aws.amazon.com/ec2/instance-types/inf2/), which are EC2 instances powered by [Neuron machine learning accelerators](https://aws.amazon.com/machine-learning/inferentia/). These instances aim to provide better compute performance (higher throughput, lower latency) with good cost-efficiency, making them good candidates for AWS users to deploy diffusion models to production. 
+ +[Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/index) is the interface between Hugging Face libraries and AWS Accelerators, including AWS [Trainium](https://aws.amazon.com/machine-learning/trainium/) and AWS [Inferentia](https://aws.amazon.com/machine-learning/inferentia/). It supports many of the features in Diffusers with similar APIs, so it is easier to learn if you're already familiar with Diffusers. Once you have created an AWS Inf2 instance, install Optimum Neuron. + +```bash +python -m pip install --upgrade-strategy eager optimum[neuronx] +``` + +> [!TIP] +> We provide pre-built [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) (DLAMI) and Optimum Neuron containers for Amazon SageMaker. It's recommended to correctly set up your environment. + +The example below demonstrates how to generate images with the Stable Diffusion XL model on an inf2.8xlarge instance (you can switch to cheaper inf2.xlarge instances once the model is compiled). To generate some images, use the [`~optimum.neuron.NeuronStableDiffusionXLPipeline`] class, which is similar to the [`StableDiffusionXLPipeline`] class in Diffusers. + +Unlike Diffusers, you need to compile models in the pipeline to the Neuron format, `.neuron`. Launch the following command to export the model to the `.neuron` format. + +```bash +optimum-cli export neuron --model stabilityai/stable-diffusion-xl-base-1.0 \ + --batch_size 1 \ + --height 1024 `# height in pixels of generated image, eg. 768, 1024` \ + --width 1024 `# width in pixels of generated image, eg. 768, 1024` \ + --num_images_per_prompt 1 `# number of images to generate per prompt, defaults to 1` \ + --auto_cast matmul `# cast only matrix multiplication operations` \ + --auto_cast_type bf16 `# cast operations from FP32 to BF16` \ + sd_neuron_xl/ +``` + +Now generate some images with the pre-compiled SDXL model. + +```python +>>> from optimum.neuron import NeuronStableDiffusionXLPipeline + +>>> stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/") +>>> prompt = "a pig with wings flying in floating US dollar banknotes in the air, skyscrapers behind, warm color palette, muted colors, detailed, 8k" +>>> image = stable_diffusion_xl(prompt).images[0] +``` + +peggy generated by sdxl on inf2 + +Feel free to check out more guides and examples on different use cases from the Optimum Neuron [documentation](https://huggingface.co/docs/optimum-neuron/en/inference_tutorials/stable_diffusion#generate-images-with-stable-diffusion-models-on-aws-inferentia)! diff --git a/docs/source/en/optimization/onnx.md b/docs/source/en/optimization/onnx.md index 486f450389b1..620f2af994b3 100644 --- a/docs/source/en/optimization/onnx.md +++ b/docs/source/en/optimization/onnx.md @@ -1,4 +1,4 @@ - - -# Overview - -Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of 🤗 Diffuser's goals is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware. - -This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory-consumption. 
You'll also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors. diff --git a/docs/source/en/optimization/para_attn.md b/docs/source/en/optimization/para_attn.md new file mode 100644 index 000000000000..94b0d5ce3af4 --- /dev/null +++ b/docs/source/en/optimization/para_attn.md @@ -0,0 +1,497 @@ +# ParaAttention + +
+ + +Large image and video generation models, such as [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) and [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo), can be an inference challenge for real-time applications and deployment because of their size. + +[ParaAttention](https://github.com/chengzeyi/ParaAttention) is a library that implements **context parallelism** and **first block cache**, and can be combined with other techniques (torch.compile, fp8 dynamic quantization), to accelerate inference. + +This guide will show you how to apply ParaAttention to FLUX.1-dev and HunyuanVideo on NVIDIA L20 GPUs. +No optimizations are applied for our baseline benchmark, except for HunyuanVideo to avoid out-of-memory errors. + +Our baseline benchmark shows that FLUX.1-dev is able to generate a 1024x1024 resolution image in 28 steps in 26.36 seconds, and HunyuanVideo is able to generate 129 frames at 720p resolution in 30 steps in 3675.71 seconds. + +> [!TIP] +> For even faster inference with context parallelism, try using NVIDIA A100 or H100 GPUs (if available) with NVLink support, especially when there is a large number of GPUs. + +## First Block Cache + +Caching the output of the transformers blocks in the model and reusing them in the next inference steps reduces the computation cost and makes inference faster. + +However, it is hard to decide when to reuse the cache to ensure quality generated images or videos. ParaAttention directly uses the **residual difference of the first transformer block output** to approximate the difference among model outputs. When the difference is small enough, the residual difference of previous inference steps is reused. In other words, the denoising step is skipped. + +This achieves a 2x speedup on FLUX.1-dev and HunyuanVideo inference with very good quality. + +
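+
+In pseudocode, the decision reduces to comparing the relative change of the first transformer block's residual against a threshold, as sketched below. This is an illustrative approximation only; the helper name and exact formula are assumptions, not ParaAttention's implementation.
+
+```python
+import torch
+
+def should_skip_step(curr_residual: torch.Tensor, prev_residual: torch.Tensor, threshold: float = 0.08) -> bool:
+    # curr_residual / prev_residual: first transformer block output minus its input,
+    # captured at the current and previous denoising steps
+    rel_diff = (curr_residual - prev_residual).abs().mean() / prev_residual.abs().mean()
+    # a small change means the cached residual from the previous step can be reused
+    return rel_diff.item() < threshold
+```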
+*Figure: Cache in Diffusion Transformer. How AdaCache works; First Block Cache is a variant of it.*
+ + + + +To apply first block cache on FLUX.1-dev, call `apply_cache_on_pipe` as shown below. 0.08 is the default residual difference value for FLUX models. + +```python +import time +import torch +from diffusers import FluxPipeline + +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16, +).to("cuda") + +from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe + +apply_cache_on_pipe(pipe, residual_diff_threshold=0.08) + +# Enable memory savings +# pipe.enable_model_cpu_offload() +# pipe.enable_sequential_cpu_offload() + +begin = time.time() +image = pipe( + "A cat holding a sign that says hello world", + num_inference_steps=28, +).images[0] +end = time.time() +print(f"Time: {end - begin:.2f}s") + +print("Saving image to flux.png") +image.save("flux.png") +``` + +| Optimizations | Original | FBCache rdt=0.06 | FBCache rdt=0.08 | FBCache rdt=0.10 | FBCache rdt=0.12 | +| - | - | - | - | - | - | +| Preview | ![Original](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-original.png) | ![FBCache rdt=0.06](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.06.png) | ![FBCache rdt=0.08](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.08.png) | ![FBCache rdt=0.10](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.10.png) | ![FBCache rdt=0.12](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.12.png) | +| Wall Time (s) | 26.36 | 21.83 | 17.01 | 16.00 | 13.78 | + +First Block Cache reduced the inference speed to 17.01 seconds compared to the baseline, or 1.55x faster, while maintaining nearly zero quality loss. + + + + +To apply First Block Cache on HunyuanVideo, `apply_cache_on_pipe` as shown below. 0.06 is the default residual difference value for HunyuanVideo models. + +```python +import time +import torch +from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel +from diffusers.utils import export_to_video + +model_id = "tencent/HunyuanVideo" +transformer = HunyuanVideoTransformer3DModel.from_pretrained( + model_id, + subfolder="transformer", + torch_dtype=torch.bfloat16, + revision="refs/pr/18", +) +pipe = HunyuanVideoPipeline.from_pretrained( + model_id, + transformer=transformer, + torch_dtype=torch.float16, + revision="refs/pr/18", +).to("cuda") + +from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe + +apply_cache_on_pipe(pipe, residual_diff_threshold=0.6) + +pipe.vae.enable_tiling() + +begin = time.time() +output = pipe( + prompt="A cat walks on the grass, realistic", + height=720, + width=1280, + num_frames=129, + num_inference_steps=30, +).frames[0] +end = time.time() +print(f"Time: {end - begin:.2f}s") + +print("Saving video to hunyuan_video.mp4") +export_to_video(output, "hunyuan_video.mp4", fps=15) +``` + + + + HunyuanVideo without FBCache + + + + HunyuanVideo with FBCache + +First Block Cache reduced the inference speed to 2271.06 seconds compared to the baseline, or 1.62x faster, while maintaining nearly zero quality loss. + + + + +## fp8 quantization + +fp8 with dynamic quantization further speeds up inference and reduces memory usage. 
Both the activations and weights must be quantized in order to use the 8-bit [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/). + +Use `float8_weight_only` and `float8_dynamic_activation_float8_weight` to quantize the text encoder and transformer model. + +The default quantization method is per tensor quantization, but if your GPU supports row-wise quantization, you can also try it for better accuracy. + +Install [torchao](https://github.com/pytorch/ao/tree/main) with the command below. + +```bash +pip3 install -U torch torchao +``` + +[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) with `mode="max-autotune-no-cudagraphs"` or `mode="max-autotune"` selects the best kernel for performance. Compilation can take a long time if it's the first time the model is called, but it is worth it once the model has been compiled. + +This example only quantizes the transformer model, but you can also quantize the text encoder to reduce memory usage even more. + +> [!TIP] +> Dynamic quantization can significantly change the distribution of the model output, so you need to change the `residual_diff_threshold` to a larger value for it to take effect. + + + + +```python +import time +import torch +from diffusers import FluxPipeline + +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16, +).to("cuda") + +from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe + +apply_cache_on_pipe( + pipe, + residual_diff_threshold=0.12, # Use a larger value to make the cache take effect +) + +from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only + +quantize_(pipe.text_encoder, float8_weight_only()) +quantize_(pipe.transformer, float8_dynamic_activation_float8_weight()) +pipe.transformer = torch.compile( + pipe.transformer, mode="max-autotune-no-cudagraphs", +) + +# Enable memory savings +# pipe.enable_model_cpu_offload() +# pipe.enable_sequential_cpu_offload() + +for i in range(2): + begin = time.time() + image = pipe( + "A cat holding a sign that says hello world", + num_inference_steps=28, + ).images[0] + end = time.time() + if i == 0: + print(f"Warm up time: {end - begin:.2f}s") + else: + print(f"Time: {end - begin:.2f}s") + +print("Saving image to flux.png") +image.save("flux.png") +``` + +fp8 dynamic quantization and torch.compile reduced the inference speed to 7.56 seconds compared to the baseline, or 3.48x faster. 
+ + + + +```python +import time +import torch +from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel +from diffusers.utils import export_to_video + +model_id = "tencent/HunyuanVideo" +transformer = HunyuanVideoTransformer3DModel.from_pretrained( + model_id, + subfolder="transformer", + torch_dtype=torch.bfloat16, + revision="refs/pr/18", +) +pipe = HunyuanVideoPipeline.from_pretrained( + model_id, + transformer=transformer, + torch_dtype=torch.float16, + revision="refs/pr/18", +).to("cuda") + +from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe + +apply_cache_on_pipe(pipe) + +from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only + +quantize_(pipe.text_encoder, float8_weight_only()) +quantize_(pipe.transformer, float8_dynamic_activation_float8_weight()) +pipe.transformer = torch.compile( + pipe.transformer, mode="max-autotune-no-cudagraphs", +) + +# Enable memory savings +pipe.vae.enable_tiling() +# pipe.enable_model_cpu_offload() +# pipe.enable_sequential_cpu_offload() + +for i in range(2): + begin = time.time() + output = pipe( + prompt="A cat walks on the grass, realistic", + height=720, + width=1280, + num_frames=129, + num_inference_steps=1 if i == 0 else 30, + ).frames[0] + end = time.time() + if i == 0: + print(f"Warm up time: {end - begin:.2f}s") + else: + print(f"Time: {end - begin:.2f}s") + +print("Saving video to hunyuan_video.mp4") +export_to_video(output, "hunyuan_video.mp4", fps=15) +``` + +A NVIDIA L20 GPU only has 48GB memory and could face out-of-memory (OOM) errors after compilation and if `enable_model_cpu_offload` isn't called because HunyuanVideo has very large activation tensors when running with high resolution and large number of frames. For GPUs with less than 80GB of memory, you can try reducing the resolution and number of frames to avoid OOM errors. + +Large video generation models are usually bottlenecked by the attention computations rather than the fully connected layers. These models don't significantly benefit from quantization and torch.compile. + + + + +## Context Parallelism + +Context Parallelism parallelizes inference and scales with multiple GPUs. The ParaAttention compositional design allows you to combine Context Parallelism with First Block Cache and dynamic quantization. + +> [!TIP] +> Refer to the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main) repository for detailed instructions and examples of how to scale inference with multiple GPUs. + +If the inference process needs to be persistent and serviceable, it is suggested to use [torch.multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html) to write your own inference processor. This can eliminate the overhead of launching the process and loading and recompiling the model. + + + + +The code sample below combines First Block Cache, fp8 dynamic quantization, torch.compile, and Context Parallelism for the fastest inference speed. 
+ +```python +import time +import torch +import torch.distributed as dist +from diffusers import FluxPipeline + +dist.init_process_group() + +torch.cuda.set_device(dist.get_rank()) + +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16, +).to("cuda") + +from para_attn.context_parallel import init_context_parallel_mesh +from para_attn.context_parallel.diffusers_adapters import parallelize_pipe +from para_attn.parallel_vae.diffusers_adapters import parallelize_vae + +mesh = init_context_parallel_mesh( + pipe.device.type, + max_ring_dim_size=2, +) +parallelize_pipe( + pipe, + mesh=mesh, +) +parallelize_vae(pipe.vae, mesh=mesh._flatten()) + +from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe + +apply_cache_on_pipe( + pipe, + residual_diff_threshold=0.12, # Use a larger value to make the cache take effect +) + +from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only + +quantize_(pipe.text_encoder, float8_weight_only()) +quantize_(pipe.transformer, float8_dynamic_activation_float8_weight()) +torch._inductor.config.reorder_for_compute_comm_overlap = True +pipe.transformer = torch.compile( + pipe.transformer, mode="max-autotune-no-cudagraphs", +) + +# Enable memory savings +# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank()) +# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank()) + +for i in range(2): + begin = time.time() + image = pipe( + "A cat holding a sign that says hello world", + num_inference_steps=28, + output_type="pil" if dist.get_rank() == 0 else "pt", + ).images[0] + end = time.time() + if dist.get_rank() == 0: + if i == 0: + print(f"Warm up time: {end - begin:.2f}s") + else: + print(f"Time: {end - begin:.2f}s") + +if dist.get_rank() == 0: + print("Saving image to flux.png") + image.save("flux.png") + +dist.destroy_process_group() +``` + +Save to `run_flux.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html). + +```bash +# Use --nproc_per_node to specify the number of GPUs +torchrun --nproc_per_node=2 run_flux.py +``` + +Inference speed is reduced to 8.20 seconds compared to the baseline, or 3.21x faster, with 2 NVIDIA L20 GPUs. On 4 L20s, inference speed is 3.90 seconds, or 6.75x faster. + + + + +The code sample below combines First Block Cache and Context Parallelism for the fastest inference speed. 
+ +```python +import time +import torch +import torch.distributed as dist +from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel +from diffusers.utils import export_to_video + +dist.init_process_group() + +torch.cuda.set_device(dist.get_rank()) + +model_id = "tencent/HunyuanVideo" +transformer = HunyuanVideoTransformer3DModel.from_pretrained( + model_id, + subfolder="transformer", + torch_dtype=torch.bfloat16, + revision="refs/pr/18", +) +pipe = HunyuanVideoPipeline.from_pretrained( + model_id, + transformer=transformer, + torch_dtype=torch.float16, + revision="refs/pr/18", +).to("cuda") + +from para_attn.context_parallel import init_context_parallel_mesh +from para_attn.context_parallel.diffusers_adapters import parallelize_pipe +from para_attn.parallel_vae.diffusers_adapters import parallelize_vae + +mesh = init_context_parallel_mesh( + pipe.device.type, +) +parallelize_pipe( + pipe, + mesh=mesh, +) +parallelize_vae(pipe.vae, mesh=mesh._flatten()) + +from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe + +apply_cache_on_pipe(pipe) + +# from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only +# +# torch._inductor.config.reorder_for_compute_comm_overlap = True +# +# quantize_(pipe.text_encoder, float8_weight_only()) +# quantize_(pipe.transformer, float8_dynamic_activation_float8_weight()) +# pipe.transformer = torch.compile( +# pipe.transformer, mode="max-autotune-no-cudagraphs", +# ) + +# Enable memory savings +pipe.vae.enable_tiling() +# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank()) +# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank()) + +for i in range(2): + begin = time.time() + output = pipe( + prompt="A cat walks on the grass, realistic", + height=720, + width=1280, + num_frames=129, + num_inference_steps=1 if i == 0 else 30, + output_type="pil" if dist.get_rank() == 0 else "pt", + ).frames[0] + end = time.time() + if dist.get_rank() == 0: + if i == 0: + print(f"Warm up time: {end - begin:.2f}s") + else: + print(f"Time: {end - begin:.2f}s") + +if dist.get_rank() == 0: + print("Saving video to hunyuan_video.mp4") + export_to_video(output, "hunyuan_video.mp4", fps=15) + +dist.destroy_process_group() +``` + +Save to `run_hunyuan_video.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html). + +```bash +# Use --nproc_per_node to specify the number of GPUs +torchrun --nproc_per_node=8 run_hunyuan_video.py +``` + +Inference speed is reduced to 649.23 seconds compared to the baseline, or 5.66x faster, with 8 NVIDIA L20 GPUs. 
+ + + + +## Benchmarks + + + + +| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup | +| - | - | - | - | - | +| NVIDIA L20 | 1 | Baseline | 26.36 | 1.00x | +| NVIDIA L20 | 1 | FBCache (rdt=0.08) | 17.01 | 1.55x | +| NVIDIA L20 | 1 | FP8 DQ | 13.40 | 1.96x | +| NVIDIA L20 | 1 | FBCache (rdt=0.12) + FP8 DQ | 7.56 | 3.48x | +| NVIDIA L20 | 2 | FBCache (rdt=0.12) + FP8 DQ + CP | 4.92 | 5.35x | +| NVIDIA L20 | 4 | FBCache (rdt=0.12) + FP8 DQ + CP | 3.90 | 6.75x | + + + + +| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup | +| - | - | - | - | - | +| NVIDIA L20 | 1 | Baseline | 3675.71 | 1.00x | +| NVIDIA L20 | 1 | FBCache | 2271.06 | 1.62x | +| NVIDIA L20 | 2 | FBCache + CP | 1132.90 | 3.24x | +| NVIDIA L20 | 4 | FBCache + CP | 718.15 | 5.12x | +| NVIDIA L20 | 8 | FBCache + CP | 649.23 | 5.66x | + + + diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md new file mode 100644 index 000000000000..56c1f3af5957 --- /dev/null +++ b/docs/source/en/optimization/pruna.md @@ -0,0 +1,187 @@ +# Pruna + +[Pruna](https://github.com/PrunaAI/pruna) is a model optimization framework that offers various optimization methods - quantization, pruning, caching, compilation - for accelerating inference and reducing memory usage. A general overview of the optimization methods are shown below. + + +| Technique | Description | Speed | Memory | Quality | +|--------------|-----------------------------------------------------------------------------------------------|:-----:|:------:|:-------:| +| `batcher` | Groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing processing time. | ✅ | ❌ | ➖ | +| `cacher` | Stores intermediate results of computations to speed up subsequent operations. | ✅ | ➖ | ➖ | +| `compiler` | Optimises the model with instructions for specific hardware. | ✅ | ➖ | ➖ | +| `distiller` | Trains a smaller, simpler model to mimic a larger, more complex model. | ✅ | ✅ | ❌ | +| `quantizer` | Reduces the precision of weights and activations, lowering memory requirements. | ✅ | ✅ | ❌ | +| `pruner` | Removes less important or redundant connections and neurons, resulting in a sparser, more efficient network. | ✅ | ✅ | ❌ | +| `recoverer` | Restores the performance of a model after compression. | ➖ | ➖ | ✅ | +| `factorizer` | Factorization batches several small matrix multiplications into one large fused operation. | ✅ | ➖ | ➖ | +| `enhancer` | Enhances the model output by applying post-processing algorithms such as denoising or upscaling. | ❌ | - | ✅ | + +✅ (improves), ➖ (approx. the same), ❌ (worsens) + +Explore the full range of optimization methods in the [Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms). + +## Installation + +Install Pruna with the following command. + +```bash +pip install pruna +``` + + +## Optimize Diffusers models + +A broad range of optimization algorithms are supported for Diffusers models as shown below. + +
+*Figure: Overview of the supported optimization algorithms for Diffusers models.*
+ +The example below optimizes [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) +with a combination of factorizer, compiler, and cacher algorithms. This combination accelerates inference by up to 4.2x and cuts peak GPU memory usage from 34.7GB to 28.0GB, all while maintaining virtually the same output quality. + +> [!TIP] +> Refer to the [Pruna optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html) docs to learn more about the optimization techniques used in this example. + +
+*Figure: Optimization techniques used for FLUX.1-dev, showing the combination of factorizer, compiler, and cacher algorithms.*
+ +Start by defining a `SmashConfig` with the optimization algorithms to use. To optimize the model, wrap the pipeline and the `SmashConfig` with `smash` and then use the pipeline as normal for inference. + +```python +import torch +from diffusers import FluxPipeline + +from pruna import PrunaModel, SmashConfig, smash + +# load the model +# Try segmind/Segmind-Vega or black-forest-labs/FLUX.1-schnell with a small GPU memory +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16 +).to("cuda") + +# define the configuration +smash_config = SmashConfig() +smash_config["factorizer"] = "qkv_diffusers" +smash_config["compiler"] = "torch_compile" +smash_config["torch_compile_target"] = "module_list" +smash_config["cacher"] = "fora" +smash_config["fora_interval"] = 2 + +# for the best results in terms of speed you can add these configs +# however they will increase your warmup time from 1.5 min to 10 min +# smash_config["torch_compile_mode"] = "max-autotune-no-cudagraphs" +# smash_config["quantizer"] = "torchao" +# smash_config["torchao_quant_type"] = "fp8dq" +# smash_config["torchao_excluded_modules"] = "norm+embedding" + +# optimize the model +smashed_pipe = smash(pipe, smash_config) + +# run the model +smashed_pipe("a knitted purple prune").images[0] +``` + +
+ +After optimization, we can share and load the optimized model using the Hugging Face Hub. + +```python +# save the model +smashed_pipe.save_to_hub("/FLUX.1-dev-smashed") + +# load the model +smashed_pipe = PrunaModel.from_hub("/FLUX.1-dev-smashed") +``` + +## Evaluate and benchmark Diffusers models + +Pruna provides the [EvaluationAgent](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) to evaluate the quality of your optimized models. + +We can metrics we care about, such as total time and throughput, and the dataset to evaluate on. We can define a model and pass it to the `EvaluationAgent`. + + + + +We can load and evaluate an optimized model by using the `EvaluationAgent` and pass it to the `Task`. + +```python +import torch +from diffusers import FluxPipeline + +from pruna import PrunaModel +from pruna.data.pruna_datamodule import PrunaDataModule +from pruna.evaluation.evaluation_agent import EvaluationAgent +from pruna.evaluation.metrics import ( + ThroughputMetric, + TorchMetricWrapper, + TotalTimeMetric, +) +from pruna.evaluation.task import Task + +# define the device +device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu" + +# load the model +# Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory +smashed_pipe = PrunaModel.from_hub("PrunaAI/FLUX.1-dev-smashed") + +# Define the metrics +metrics = [ + TotalTimeMetric(n_iterations=20, n_warmup_iterations=5), + ThroughputMetric(n_iterations=20, n_warmup_iterations=5), + TorchMetricWrapper("clip"), +] + +# Define the datamodule +datamodule = PrunaDataModule.from_string("LAION256") +datamodule.limit_datasets(10) + +# Define the task and evaluation agent +task = Task(metrics, datamodule=datamodule, device=device) +eval_agent = EvaluationAgent(task) + +# Evaluate smashed model and offload it to CPU +smashed_pipe.move_to_device(device) +smashed_pipe_results = eval_agent.evaluate(smashed_pipe) +smashed_pipe.move_to_device("cpu") +``` + + + + +Instead of comparing the optimized model to the base model, you can also evaluate the standalone `diffusers` model. This is useful if you want to evaluate the performance of the model without the optimization. We can do so by using the `PrunaModel` wrapper and run the `EvaluationAgent` on it. + +```python +import torch +from diffusers import FluxPipeline + +from pruna import PrunaModel + +# load the model +# Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16 +).to("cpu") +wrapped_pipe = PrunaModel(model=pipe) +``` + + + + +Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started. + +> [!TIP] +> For more details about benchmarking Flux, check out the [Announcing FLUX-Juiced: The Fastest Image Generation Endpoint (2.6 times faster)!](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) blog post and the [InferBench](https://huggingface.co/spaces/PrunaAI/InferBench) Space. 
+ +## Reference + +- [Pruna](https://github.com/pruna-ai/pruna) +- [Pruna optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms) +- [Pruna evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) +- [Pruna tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) + diff --git a/docs/source/en/optimization/speed-memory-optims.md b/docs/source/en/optimization/speed-memory-optims.md new file mode 100644 index 000000000000..80c6c79a3c83 --- /dev/null +++ b/docs/source/en/optimization/speed-memory-optims.md @@ -0,0 +1,203 @@ + + +# Compiling and offloading quantized models + +Optimizing models often involves trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it also increases memory consumption since it needs to store the outputs of intermediate attention layers. A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading). + +> [!TIP] +> Check the [torch.compile](./fp16#torchcompile) guide to learn more about compilation and how they can be applied here. For example, regional compilation can significantly reduce compilation time without giving up any speedups. + +For image generation, combining quantization and [model offloading](./memory#model-offloading) can often give the best trade-off between quality, speed, and memory. Group offloading is not as effective for image generation because it is usually not possible to *fully* overlap data transfer if the compute kernel finishes faster. This results in some communication overhead between the CPU and GPU. + +For video generation, combining quantization and [group-offloading](./memory#group-offloading) tends to be better because video models are more compute-bound. + +The table below provides a comparison of optimization strategy combinations and their impact on latency and memory-usage for Flux. + +| combination | latency (s) | memory-usage (GB) | +|---|---|---| +| quantization | 32.602 | 14.9453 | +| quantization, torch.compile | 25.847 | 14.9448 | +| quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 | + +These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the benchmarking script if you're interested in evaluating your own model. + +This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes. + +```bash +pip install -U bitsandbytes +``` + +## Quantization and torch.compile + +Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference. + +Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models. 
+ +```py +import torch +from diffusers import DiffusionPipeline +from diffusers.quantizers import PipelineQuantizationConfig + +torch._dynamo.config.capture_dynamic_output_shape_ops = True + +# quantize +pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, + components_to_quantize=["transformer", "text_encoder_2"], +) +pipeline = DiffusionPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, +).to("cuda") + +# compile +pipeline.transformer.to(memory_format=torch.channels_last) +pipeline.transformer.compile(mode="max-autotune", fullgraph=True) +pipeline(""" + cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California + highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain +""" +).images[0] +``` + +## Quantization, torch.compile, and offloading + +In addition to quantization and torch.compile, try offloading if you need to reduce memory-usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computations. + +Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models. + + + + +[Model CPU offloading](./memory#model-offloading) moves an individual pipeline component, like the transformer model, to the GPU when it is needed for computation. Otherwise, it is offloaded to the CPU. + +```py +import torch +from diffusers import DiffusionPipeline +from diffusers.quantizers import PipelineQuantizationConfig + +torch._dynamo.config.cache_size_limit = 1000 +torch._dynamo.config.capture_dynamic_output_shape_ops = True + +# quantize +pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, + components_to_quantize=["transformer", "text_encoder_2"], +) +pipeline = DiffusionPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, +).to("cuda") + +# model CPU offloading +pipeline.enable_model_cpu_offload() + +# compile +pipeline.transformer.compile() +pipeline( + "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain" +).images[0] +``` + + + + +[Group offloading](./memory#group-offloading) moves the internal layers of an individual pipeline component, like the transformer model, to the GPU for computation and offloads it when it's not required. At the same time, it uses the [CUDA stream](./memory#cuda-stream) feature to prefetch the next layer for execution. + +By overlapping computation and data transfer, it is faster than model CPU offloading while also saving memory. 
+ +```py +# pip install ftfy +import torch +from diffusers import AutoModel, DiffusionPipeline +from diffusers.hooks import apply_group_offloading +from diffusers.utils import export_to_video +from diffusers.quantizers import PipelineQuantizationConfig +from transformers import UMT5EncoderModel + +torch._dynamo.config.cache_size_limit = 1000 +torch._dynamo.config.capture_dynamic_output_shape_ops = True + +# quantize +pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, + components_to_quantize=["transformer", "text_encoder"], +) + +text_encoder = UMT5EncoderModel.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16 +) +pipeline = DiffusionPipeline.from_pretrained( + "Wan-AI/Wan2.1-T2V-14B-Diffusers", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, +).to("cuda") + +# group offloading +onload_device = torch.device("cuda") +offload_device = torch.device("cpu") + +pipeline.transformer.enable_group_offload( + onload_device=onload_device, + offload_device=offload_device, + offload_type="leaf_level", + use_stream=True, + non_blocking=True +) +pipeline.vae.enable_group_offload( + onload_device=onload_device, + offload_device=offload_device, + offload_type="leaf_level", + use_stream=True, + non_blocking=True +) +apply_group_offloading( + pipeline.text_encoder, + onload_device=onload_device, + offload_type="leaf_level", + use_stream=True, + non_blocking=True +) + +# compile +pipeline.transformer.compile() + +prompt = """ +The camera rushes from far to near in a low-angle shot, +revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in +for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. +Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic +shadows and warm highlights. Medium composition, front view, low angle, with depth of field. +""" +negative_prompt = """ +Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, +low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, +misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards +""" + +output = pipeline( + prompt=prompt, + negative_prompt=negative_prompt, + num_frames=81, + guidance_scale=5.0, +).frames[0] +export_to_video(output, "output.mp4", fps=16) +``` + + + \ No newline at end of file diff --git a/docs/source/en/optimization/tgate.md b/docs/source/en/optimization/tgate.md new file mode 100644 index 000000000000..90e0bc32f71b --- /dev/null +++ b/docs/source/en/optimization/tgate.md @@ -0,0 +1,182 @@ +# T-GATE + +[T-GATE](https://github.com/HaozheLiu-ST/T-GATE/tree/main) accelerates inference for [Stable Diffusion](../api/pipelines/stable_diffusion/overview), [PixArt](../api/pipelines/pixart), and [Latency Consistency Model](../api/pipelines/latent_consistency_models.md) pipelines by skipping the cross-attention calculation once it converges. This method doesn't require any additional training and it can speed up inference from 10-50%. T-GATE is also compatible with other optimization methods like [DeepCache](./deepcache). 
+ +Before you begin, make sure you install T-GATE. + +```bash +pip install tgate +pip install -U torch diffusers transformers accelerate DeepCache +``` + + +To use T-GATE with a pipeline, you need to use its corresponding loader. + +| Pipeline | T-GATE Loader | +|---|---| +| PixArt | TgatePixArtLoader | +| Stable Diffusion XL | TgateSDXLLoader | +| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader | +| Stable Diffusion | TgateSDLoader | +| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader | + +Next, create a `TgateLoader` with a pipeline, the gate step (the time step to stop calculating the cross attention), and the number of inference steps. Then call the `tgate` method on the pipeline with a prompt, gate step, and the number of inference steps. + +Let's see how to enable this for several different pipelines. + + + + +Accelerate `PixArtAlphaPipeline` with T-GATE: + +```py +import torch +from diffusers import PixArtAlphaPipeline +from tgate import TgatePixArtLoader + +pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16) + +gate_step = 8 +inference_step = 25 +pipe = TgatePixArtLoader( + pipe, + gate_step=gate_step, + num_inference_steps=inference_step, +).to("cuda") + +image = pipe.tgate( + "An alpaca made of colorful building blocks, cyberpunk.", + gate_step=gate_step, + num_inference_steps=inference_step, +).images[0] +``` + + + +Accelerate `StableDiffusionXLPipeline` with T-GATE: + +```py +import torch +from diffusers import StableDiffusionXLPipeline +from diffusers import DPMSolverMultistepScheduler +from tgate import TgateSDXLLoader + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + variant="fp16", + use_safetensors=True, +) +pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) + +gate_step = 10 +inference_step = 25 +pipe = TgateSDXLLoader( + pipe, + gate_step=gate_step, + num_inference_steps=inference_step, +).to("cuda") + +image = pipe.tgate( + "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", + gate_step=gate_step, + num_inference_steps=inference_step +).images[0] +``` + + + +Accelerate `StableDiffusionXLPipeline` with [DeepCache](https://github.com/horseee/DeepCache) and T-GATE: + +```py +import torch +from diffusers import StableDiffusionXLPipeline +from diffusers import DPMSolverMultistepScheduler +from tgate import TgateSDXLDeepCacheLoader + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + variant="fp16", + use_safetensors=True, +) +pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) + +gate_step = 10 +inference_step = 25 +pipe = TgateSDXLDeepCacheLoader( + pipe, + cache_interval=3, + cache_branch_id=0, +).to("cuda") + +image = pipe.tgate( + "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", + gate_step=gate_step, + num_inference_steps=inference_step +).images[0] +``` + + + +Accelerate `latent-consistency/lcm-sdxl` with T-GATE: + +```py +import torch +from diffusers import StableDiffusionXLPipeline +from diffusers import UNet2DConditionModel, LCMScheduler +from diffusers import DPMSolverMultistepScheduler +from tgate import TgateSDXLLoader + +unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLPipeline.from_pretrained( + 
"stabilityai/stable-diffusion-xl-base-1.0", + unet=unet, + torch_dtype=torch.float16, + variant="fp16", +) +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +gate_step = 1 +inference_step = 4 +pipe = TgateSDXLLoader( + pipe, + gate_step=gate_step, + num_inference_steps=inference_step, + lcm=True +).to("cuda") + +image = pipe.tgate( + "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", + gate_step=gate_step, + num_inference_steps=inference_step +).images[0] +``` + + + +T-GATE also supports [`StableDiffusionPipeline`] and [PixArt-alpha/PixArt-LCM-XL-2-1024-MS](https://hf.co/PixArt-alpha/PixArt-LCM-XL-2-1024-MS). + +## Benchmarks +| Model | MACs | Param | Latency | Zero-shot 10K-FID on MS-COCO | +|-----------------------|----------|-----------|---------|---------------------------| +| SD-1.5 | 16.938T | 859.520M | 7.032s | 23.927 | +| SD-1.5 w/ T-GATE | 9.875T | 815.557M | 4.313s | 20.789 | +| SD-2.1 | 38.041T | 865.785M | 16.121s | 22.609 | +| SD-2.1 w/ T-GATE | 22.208T | 815.433 M | 9.878s | 19.940 | +| SD-XL | 149.438T | 2.570B | 53.187s | 24.628 | +| SD-XL w/ T-GATE | 84.438T | 2.024B | 27.932s | 22.738 | +| Pixart-Alpha | 107.031T | 611.350M | 61.502s | 38.669 | +| Pixart-Alpha w/ T-GATE | 65.318T | 462.585M | 37.867s | 35.825 | +| DeepCache (SD-XL) | 57.888T | - | 19.931s | 23.755 | +| DeepCache w/ T-GATE | 43.868T | - | 14.666s | 23.999 | +| LCM (SD-XL) | 11.955T | 2.570B | 3.805s | 25.044 | +| LCM w/ T-GATE | 11.171T | 2.024B | 3.533s | 25.028 | +| LCM (Pixart-Alpha) | 8.563T | 611.350M | 4.733s | 36.086 | +| LCM w/ T-GATE | 7.623T | 462.585M | 4.543s | 37.048 | + +The latency is tested on an NVIDIA 1080TI, MACs and Params are calculated with [calflops](https://github.com/MrYxJ/calculate-flops.pytorch), and the FID is calculated with [PytorchFID](https://github.com/mseitzer/pytorch-fid). diff --git a/docs/source/en/optimization/tome.md b/docs/source/en/optimization/tome.md index 9f2208765a43..ab368c9ccbb9 100644 --- a/docs/source/en/optimization/tome.md +++ b/docs/source/en/optimization/tome.md @@ -1,4 +1,4 @@ - - -# PyTorch 2.0 - -🤗 Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include: - -1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers. -2. [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled. - -Both of these optimizations require PyTorch 2.0 or later and 🤗 Diffusers > 0.13.0. - -```bash -pip install --upgrade torch diffusers -``` - -## Scaled dot product attention - -[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of 🤗 Diffusers, so you don't need to add anything to your code. 
- -However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]: - -```diff - import torch - from diffusers import DiffusionPipeline -+ from diffusers.models.attention_processor import AttnProcessor2_0 - - pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -+ pipe.unet.set_attn_processor(AttnProcessor2_0()) - - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] -``` - -SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details. - -In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline: - -```diff - import torch - from diffusers import DiffusionPipeline - - pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -+ pipe.unet.set_default_attn_processor() - - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] -``` - -## torch.compile - -The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In 🤗 Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline. - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0] -``` - -Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300x** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs. - -Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive. - -For more information and different options about `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial. - -> [!TIP] -> Learn more about other ways PyTorch 2.0 can help optimize your model in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion) tutorial. - -## Benchmark - -We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code is benchmarked on 🤗 Diffusers v0.17.0.dev0 to optimize `torch.compile` usage (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details). - -Expand the dropdown below to find the code used to benchmark each pipeline: - -
- -### Stable Diffusion text-to-image - -```python -from diffusers import DiffusionPipeline -import torch - -path = "runwayml/stable-diffusion-v1-5" - -run_compile = True # Set True / False - -pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - images = pipe(prompt=prompt).images -``` - -### Stable Diffusion image-to-image - -```python -from diffusers import StableDiffusionImg2ImgPipeline -from diffusers.utils import load_image -import torch - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -init_image = load_image(url) -init_image = init_image.resize((512, 512)) - -path = "runwayml/stable-diffusion-v1-5" - -run_compile = True # Set True / False - -pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image).images[0] -``` - -### Stable Diffusion inpainting - -```python -from diffusers import StableDiffusionInpaintPipeline -from diffusers.utils import load_image -import torch - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).resize((512, 512)) -mask_image = load_image(mask_url).resize((512, 512)) - -path = "runwayml/stable-diffusion-inpainting" - -run_compile = True # Set True / False - -pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0] -``` - -### ControlNet - -```python -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel -from diffusers.utils import load_image -import torch - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -init_image = load_image(url) -init_image = init_image.resize((512, 512)) - -path = "runwayml/stable-diffusion-v1-5" - -run_compile = True # Set True / False -controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True) -pipe = StableDiffusionControlNetPipeline.from_pretrained( - path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True -) - -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) -pipe.controlnet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet 
= torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image).images[0] -``` - -### DeepFloyd IF text-to-image + upscaling - -```python -from diffusers import DiffusionPipeline -import torch - -run_compile = True # Set True / False - -pipe_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) -pipe_1.to("cuda") -pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) -pipe_2.to("cuda") -pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True) -pipe_3.to("cuda") - - -pipe_1.unet.to(memory_format=torch.channels_last) -pipe_2.unet.to(memory_format=torch.channels_last) -pipe_3.unet.to(memory_format=torch.channels_last) - -if run_compile: - pipe_1.unet = torch.compile(pipe_1.unet, mode="reduce-overhead", fullgraph=True) - pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True) - pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "the blue hulk" - -prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) -neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) - -for _ in range(3): - image_1 = pipe_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images - image_2 = pipe_2(image=image_1, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images - image_3 = pipe_3(prompt=prompt, image=image_1, noise_level=100).images -``` -
- -The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*. - -![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png) - -To give you an even better idea of how this speed-up holds for the other pipelines, consider the following -graph for an A100 with PyTorch 2.0 and `torch.compile`: - -![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png) - -In the following tables, we report our findings in terms of the *number of iterations/second*. - -### A100 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 | -| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 | -| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 | -| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 | -| IF | 20.21 /
13.84 /
24.00 | 20.12 /
13.70 /
24.03 | ❌ | 97.34 /
27.23 /
111.66 | -| SDXL - txt2img | 8.64 | 9.9 | - | - | - -### A100 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 | -| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 | -| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 | -| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 | -| IF | 25.02 | 18.04 | ❌ | 48.47 | -| SDXL - txt2img | 2.44 | 2.74 | - | - | - -### A100 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 | -| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 | -| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 | -| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 | -| IF | 8.78 | 9.82 | ❌ | 16.77 | -| SDXL - txt2img | 0.64 | 0.72 | - | - | - -### V100 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 | -| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 | -| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 | -| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 | -| IF | 20.01 /
9.08 /
23.34 | 19.79 /
8.98 /
24.10 | ❌ | 55.75 /
11.57 /
57.67 | - -### V100 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 | -| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 | -| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 | -| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 | -| IF | 15.41 | 14.76 | ❌ | 22.95 | - -### V100 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 | -| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 | -| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 | -| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 | -| IF | 5.43 | 5.29 | ❌ | 7.06 | - -### T4 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 | -| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 | -| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 | -| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 | -| IF | 17.42 /
2.47 /
18.52 | 16.96 /
2.45 /
18.69 | ❌ | 24.63 /
2.47 /
23.39 | -| SDXL - txt2img | 1.15 | 1.16 | - | - | - -### T4 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 | -| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 | -| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 | -| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 | -| IF | 5.79 | 5.61 | ❌ | 7.39 | -| SDXL - txt2img | 0.288 | 0.289 | - | - | - -### T4 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s | -| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s | -| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s | -| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup | -| IF * | 1.44 | 1.44 | ❌ | 1.94 | -| SDXL - txt2img | OOM | OOM | - | - | - -### RTX 3090 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 | -| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 | -| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 | -| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 | -| IF | 27.08 /
9.07 /
31.23 | 26.75 /
8.92 /
31.47 | ❌ | 68.08 /
11.16 /
65.29 | - -### RTX 3090 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 | -| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 | -| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 | -| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 | -| IF | 16.81 | 16.62 | ❌ | 21.57 | - -### RTX 3090 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 | -| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 | -| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 | -| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 | -| IF | 5.01 | 5.00 | ❌ | 6.33 | - -### RTX 4090 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 | -| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 | -| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 | -| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 | -| IF | 69.71 /
18.78 /
85.49 | 69.13 /
18.80 /
85.56 | ❌ | 124.60 /
26.37 /
138.79 | -| SDXL - txt2img | 6.8 | 8.18 | - | - | - -### RTX 4090 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 | -| SD - img2img | 12.61 | 12,.79 | 15.35 | 15.66 | -| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 | -| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 | -| IF | 31.88 | 31.14 | ❌ | 43.92 | -| SDXL - txt2img | 2.19 | 2.35 | - | - | - -### RTX 4090 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 | -| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 | -| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 | -| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 | -| IF | 9.26 | 9.2 | ❌ | 13.31 | -| SDXL - txt2img | 0.52 | 0.53 | - | - | - -## Notes - -* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks. -* For the DeepFloyd IF pipeline where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1. - -*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.* diff --git a/docs/source/en/optimization/xdit.md b/docs/source/en/optimization/xdit.md new file mode 100644 index 000000000000..ecf45635684a --- /dev/null +++ b/docs/source/en/optimization/xdit.md @@ -0,0 +1,121 @@ +# xDiT + +[xDiT](https://github.com/xdit-project/xDiT) is an inference engine designed for the large scale parallel deployment of Diffusion Transformers (DiTs). xDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as GPU kernel accelerations. + +There are four parallel methods supported in xDiT, including [Unified Sequence Parallelism](https://huggingface.co/papers/2405.07719), [PipeFusion](https://huggingface.co/papers/2405.14430), CFG parallelism and data parallelism. The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware. + +Optimization orthogonal to parallelization focuses on accelerating single GPU performance. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as torch.compile and onediff. + +The overview of xDiT is shown as follows. + +
+ +
+You can install xDiT using the following command: + + +```bash +pip install xfuser +``` + +Here's an example of using xDiT to accelerate inference of a Diffusers model. + +```diff + import torch + from diffusers import StableDiffusion3Pipeline + + from xfuser import xFuserArgs, xDiTParallel + from xfuser.config import FlexibleArgumentParser + from xfuser.core.distributed import get_world_group + + def main(): ++ parser = FlexibleArgumentParser(description="xFuser Arguments") ++ args = xFuserArgs.add_cli_args(parser).parse_args() ++ engine_args = xFuserArgs.from_cli_args(args) ++ engine_config, input_config = engine_args.create_config() + + local_rank = get_world_group().local_rank + pipe = StableDiffusion3Pipeline.from_pretrained( + pretrained_model_name_or_path=engine_config.model_config.model, + torch_dtype=torch.float16, + ).to(f"cuda:{local_rank}") + +# do anything you want with pipeline here + ++ pipe = xDiTParallel(pipe, engine_config, input_config) + + pipe( + height=input_config.height, + width=input_config.height, + prompt=input_config.prompt, + num_inference_steps=input_config.num_inference_steps, + output_type=input_config.output_type, + generator=torch.Generator(device="cuda").manual_seed(input_config.seed), + ) + ++ if input_config.output_type == "pil": ++ pipe.save("results", "stable_diffusion_3") + +if __name__ == "__main__": + main() + +``` + +As you can see, we only need to use xFuserArgs from xDiT to get configuration parameters, and pass these parameters along with the pipeline object from the Diffusers library into xDiTParallel to complete the parallelization of a specific pipeline in Diffusers. + +xDiT runtime parameters can be viewed in the command line using `-h`, and you can refer to this [usage](https://github.com/xdit-project/xDiT?tab=readme-ov-file#2-usage) example for more details. + +xDiT needs to be launched using torchrun to support its multi-node, multi-GPU parallel capabilities. For example, the following command can be used for 8-GPU parallel inference: + +```bash +torchrun --nproc_per_node=8 ./inference.py --model models/FLUX.1-dev --data_parallel_degree 2 --ulysses_degree 2 --ring_degree 2 --prompt "A snowy mountain" "A small dog" --num_inference_steps 50 +``` + +## Supported models + +A subset of Diffusers models are supported in xDiT, such as Flux.1, Stable Diffusion 3, etc. The latest supported models can be found [here](https://github.com/xdit-project/xDiT?tab=readme-ov-file#-supported-dits). + +## Benchmark +We tested different models on various machines, and here is some of the benchmark data. + +### Flux.1-schnell +
+ +
+ + +
+ +
+ +### Stable Diffusion 3 +
+ +
+ +
+ +
+ +### HunyuanDiT +
+ +
+ +
+ +
+ +
+ +
+ +More detailed performance metric can be found on our [github page](https://github.com/xdit-project/xDiT?tab=readme-ov-file#perf). + +## Reference + +[xDiT-project](https://github.com/xdit-project/xDiT) + +[USP: A Unified Sequence Parallelism Approach for Long Context Generative AI](https://huggingface.co/papers/2405.07719) + +[PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models](https://huggingface.co/papers/2405.14430) \ No newline at end of file diff --git a/docs/source/en/optimization/xformers.md b/docs/source/en/optimization/xformers.md index 4ef0da9e890d..523e81559547 100644 --- a/docs/source/en/optimization/xformers.md +++ b/docs/source/en/optimization/xformers.md @@ -1,4 +1,4 @@ - + +# bitsandbytes + +[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance. + +4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs. + +This guide demonstrates how quantization can enable running +[FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) +on less than 16GB of VRAM and even on a free Google +Colab instance. + +![comparison image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/comparison.png) + +To use bitsandbytes, make sure you have the following libraries installed: + +```bash +pip install diffusers transformers accelerate bitsandbytes -U +``` + +Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers. + + + + +Quantizing a model in 8-bit halves the memory-usage: + +bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the +[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`]. + +For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`. + +> [!TIP] +> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers. + +```py +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig +import torch +from diffusers import AutoModel +from transformers import T5EncoderModel + +quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True,) + +text_encoder_2_8bit = T5EncoderModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="text_encoder_2", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True,) + +transformer_8bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) +``` + +By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter. 
+ +```diff +transformer_8bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, ++ torch_dtype=torch.float32, +) +``` + +Let's generate an image using our quantized models. + +Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the +CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory. + +```py +from diffusers import FluxPipeline + +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + transformer=transformer_8bit, + text_encoder_2=text_encoder_2_8bit, + torch_dtype=torch.float16, + device_map="auto", +) + +pipe_kwargs = { + "prompt": "A cat holding a sign that says hello world", + "height": 1024, + "width": 1024, + "guidance_scale": 3.5, + "num_inference_steps": 50, + "max_sequence_length": 512, +} + +image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0] +``` + +
+ +
+
+When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")`, or apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.
+
+Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
+
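+For example, here's a minimal sketch of both options (the local path and repository id are placeholders, not real checkpoints):
+
+```py
+# save the quantized transformer locally; the quantization config is stored alongside the weights
+transformer_8bit.save_pretrained("flux.1-dev-transformer-8bit")
+
+# or push it to the Hub under a repository you control
+transformer_8bit.push_to_hub("your-username/flux.1-dev-transformer-8bit")
+```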
+ + +Quantizing a model in 4-bit reduces your memory-usage by 4x: + +bitsandbytes is supported in both Transformers and Diffusers, so you can can quantize both the +[`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`]. + +For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bfloat16`. + +> [!TIP] +> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers. + +```py +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig +import torch +from diffusers import AutoModel +from transformers import T5EncoderModel + +quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True,) + +text_encoder_2_4bit = T5EncoderModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="text_encoder_2", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,) + +transformer_4bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) +``` + +By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter. + +```diff +transformer_4bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, ++ torch_dtype=torch.float32, +) +``` + +Let's generate an image using our quantized models. + +Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory. + +```py +from diffusers import FluxPipeline + +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + transformer=transformer_4bit, + text_encoder_2=text_encoder_2_4bit, + torch_dtype=torch.float16, + device_map="auto", +) + +pipe_kwargs = { + "prompt": "A cat holding a sign that says hello world", + "height": 1024, + "width": 1024, + "guidance_scale": 3.5, + "num_inference_steps": 50, + "max_sequence_length": 512, +} + +image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0] +``` + +
+ +
+
+When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")`, or apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.
+
+Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
+
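+A parallel sketch for the 4-bit model (again, the path and repository id are placeholders):
+
+```py
+# save the quantized transformer locally
+transformer_4bit.save_pretrained("flux.1-dev-transformer-4bit")
+
+# or push it to the Hub under a repository you control
+transformer_4bit.push_to_hub("your-username/flux.1-dev-transformer-4bit")
+```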
+
+ +> [!WARNING] +> Training with 8-bit and 4-bit weights are only supported for training *extra* parameters. + +Check your memory footprint with the `get_memory_footprint` method: + +```py +print(model.get_memory_footprint()) +``` + +Note that this only tells you the memory footprint of the model params and does _not_ estimate the inference memory requirements. + +Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters: + +```py +from diffusers import AutoModel, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(load_in_4bit=True) + +model_4bit = AutoModel.from_pretrained( + "hf-internal-testing/flux.1-dev-nf4-pkg", subfolder="transformer" +) +``` + +## 8-bit (LLM.int8() algorithm) + +> [!TIP] +> Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)! + +This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion. + +### Outlier threshold + +An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning). + +To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]: + +```py +from diffusers import AutoModel, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig( + load_in_8bit=True, llm_int8_threshold=10, +) + +model_8bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quantization_config, +) +``` + +### Skip module conversion + +For some models, you don't need to quantize every module to 8-bit which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]: + +```py +from diffusers import SD3Transformer2DModel, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig( + load_in_8bit=True, llm_int8_skip_modules=["proj_out"], +) + +model_8bit = SD3Transformer2DModel.from_pretrained( + "stabilityai/stable-diffusion-3-medium-diffusers", + subfolder="transformer", + quantization_config=quantization_config, +) +``` + + +## 4-bit (QLoRA algorithm) + +> [!TIP] +> Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes). + +This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization. 
+ + +### Compute data type + +To speedup computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]: + +```py +import torch +from diffusers import BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) +``` + +### Normal Float 4 (NF4) + +NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]: + +```py +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig + +from diffusers import AutoModel +from transformers import T5EncoderModel + +quant_config = TransformersBitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", +) + +text_encoder_2_4bit = T5EncoderModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="text_encoder_2", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", +) + +transformer_4bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) +``` + +For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and `torch_dtype` values. + +### Nested quantization + +Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. + +```py +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig + +from diffusers import AutoModel +from transformers import T5EncoderModel + +quant_config = TransformersBitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, +) + +text_encoder_2_4bit = T5EncoderModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="text_encoder_2", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, +) + +transformer_4bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) +``` + +## Dequantizing `bitsandbytes` models + +Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model. 
+ +```python +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig + +from diffusers import AutoModel +from transformers import T5EncoderModel + +quant_config = TransformersBitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, +) + +text_encoder_2_4bit = T5EncoderModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="text_encoder_2", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +quant_config = DiffusersBitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, +) + +transformer_4bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) + +text_encoder_2_4bit.dequantize() +transformer_4bit.dequantize() +``` + +## torch.compile + +Speed up inference with `torch.compile`. Make sure you have the latest `bitsandbytes` installed and we also recommend installing [PyTorch nightly](https://pytorch.org/get-started/locally/). + + + +```py +torch._dynamo.config.capture_dynamic_output_shape_ops = True + +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) +transformer_4bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) +transformer_4bit.compile(fullgraph=True) +``` + + + + +```py +quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True) +transformer_4bit = AutoModel.from_pretrained( + "black-forest-labs/FLUX.1-dev", + subfolder="transformer", + quantization_config=quant_config, + torch_dtype=torch.float16, +) +transformer_4bit.compile(fullgraph=True) +``` + + + +On an RTX 4090 with compilation, 4-bit Flux generation completed in 25.809 seconds versus 32.570 seconds without. + +Check out the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) for more details. + +## Resources + +* [End-to-end notebook showing Flux.1 Dev inference in a free-tier Colab](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4) +* [Training](https://github.com/huggingface/diffusers/blob/8c661ea586bf11cb2440da740dd3c4cf84679b85/examples/dreambooth/README_hidream.md#using-quantization) \ No newline at end of file diff --git a/docs/source/en/quantization/gguf.md b/docs/source/en/quantization/gguf.md new file mode 100644 index 000000000000..47804c102da2 --- /dev/null +++ b/docs/source/en/quantization/gguf.md @@ -0,0 +1,120 @@ + + +# GGUF + +The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block wise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported. + +The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant. + +Before starting please install gguf in your environment + +```shell +pip install -U gguf +``` + +Since GGUF is a single file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`]. 
+ +When using GGUF checkpoints, the quantized weights remain in a low memory `dtype`(typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`. + +The functions used for dynamic dequantizatation are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the Pytorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade). + +```python +import torch + +from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig + +ckpt_path = ( + "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf" +) +transformer = FluxTransformer2DModel.from_single_file( + ckpt_path, + quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), + torch_dtype=torch.bfloat16, +) +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + transformer=transformer, + torch_dtype=torch.bfloat16, +) +pipe.enable_model_cpu_offload() +prompt = "A cat holding a sign that says hello world" +image = pipe(prompt, generator=torch.manual_seed(0)).images[0] +image.save("flux-gguf.png") +``` + +## Using Optimized CUDA Kernels with GGUF + +Optimized CUDA kernels can accelerate GGUF quantized model inference by approximately 10%. This functionality requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the kernels library: + +```shell +pip install -U kernels +``` + +Once installed, set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to use optimized kernels when available. Note that CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images. To disable CUDA kernel usage, set the environment variable `DIFFUSERS_GGUF_CUDA_KERNELS=false`. + +## Supported Quantization Types + +- BF16 +- Q4_0 +- Q4_1 +- Q5_0 +- Q5_1 +- Q8_0 +- Q2_K +- Q3_K +- Q4_K +- Q5_K +- Q6_K + +## Convert to GGUF + +Use the Space below to convert a Diffusers checkpoint into the GGUF format for inference. +run conversion: + + + + +```py +import torch + +from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig + +ckpt_path = ( + "https://huggingface.co/sayakpaul/different-lora-from-civitai/blob/main/flux_dev_diffusers-q4_0.gguf" +) +transformer = FluxTransformer2DModel.from_single_file( + ckpt_path, + quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), + config="black-forest-labs/FLUX.1-dev", + subfolder="transformer", + torch_dtype=torch.bfloat16, +) +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + transformer=transformer, + torch_dtype=torch.bfloat16, +) +pipe.enable_model_cpu_offload() +prompt = "A cat holding a sign that says hello world" +image = pipe(prompt, generator=torch.manual_seed(0)).images[0] +image.save("flux-gguf.png") +``` + +When using Diffusers format GGUF checkpoints, it's a must to provide the model `config` path. If the +model config resides in a `subfolder`, that needs to be specified, too. 
\ No newline at end of file diff --git a/docs/source/en/quantization/modelopt.md b/docs/source/en/quantization/modelopt.md new file mode 100644 index 000000000000..06933d47c221 --- /dev/null +++ b/docs/source/en/quantization/modelopt.md @@ -0,0 +1,141 @@ + + +# NVIDIA ModelOpt + +[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed. + +Before you begin, make sure you have nvidia_modelopt installed. + +```bash +pip install -U "nvidia_modelopt[hf]" +``` + +Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers. + +The example below only quantizes the weights to FP8. + +```python +import torch +from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig + +model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers" +dtype = torch.bfloat16 + +quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt") +transformer = AutoModel.from_pretrained( + model_id, + subfolder="transformer", + quantization_config=quantization_config, + torch_dtype=dtype, +) +pipe = SanaPipeline.from_pretrained( + model_id, + transformer=transformer, + torch_dtype=dtype, +) +pipe.to("cuda") + +print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB") + +prompt = "A cat holding a sign that says hello world" +image = pipe( + prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512 +).images[0] +image.save("output.png") +``` + +> **Note:** +> +> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration. +> +> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples). + +## NVIDIAModelOptConfig + +The `NVIDIAModelOptConfig` class accepts three parameters: +- `quant_type`: A string value mentioning one of the quantization types below. +- `modules_to_not_convert`: A list of module full/partial module names for which quantization should not be performed. For example, to not perform any quantization of the [`SD3Transformer2DModel`]'s pos_embed projection blocks, one would specify: `modules_to_not_convert=["pos_embed.proj.weight"]`. +- `disable_conv_quantization`: A boolean value which when set to `True` disables quantization for all convolutional layers in the model. This is useful as channel and block quantization generally don't work well with convolutional layers (used with INT4, NF4, NVFP4). If you want to disable quantization for specific convolutional layers, use `modules_to_not_convert` instead. +- `algorithm`: The algorithm to use for determining scale, defaults to `"max"`. You can check modelopt documentation for more algorithms and details. 
+- `forward_loop`: The forward loop function to use for calibrating activation during quantization. If not provided, it relies on static scale values computed using the weights only. +- `kwargs`: A dict of keyword arguments to pass to the underlying quantization method which will be invoked based on `quant_type`. + +## Supported quantization types + +ModelOpt supports weight-only, channel and block quantization int8, fp8, int4, nf4, and nvfp4. The quantization methods are designed to reduce the memory footprint of the model weights while maintaining the performance of the model during inference. + +Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation. + +The quantization methods supported are as follows: + +| **Quantization Type** | **Supported Schemes** | **Required Kwargs** | **Additional Notes** | +|-----------------------|-----------------------|---------------------|----------------------| +| **INT8** | `int8 weight only`, `int8 channel quantization`, `int8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | +| **FP8** | `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | +| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1 is only supported for now`| +| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize` + `scale_block_quantize` | `channel_quantize = -1 and scale_channel_quantize = -1 are only supported for now` | +| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1 is only supported for now`| + + +Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available. + +## Serializing and Deserializing quantized models + +To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method. + +```python +import torch +from diffusers import AutoModel, NVIDIAModelOptConfig +from modelopt.torch.opt import enable_huggingface_checkpointing + +enable_huggingface_checkpointing() + +model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers" +quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"} +quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8) +model = AutoModel.from_pretrained( + model_id, + subfolder="transformer", + quantization_config=quant_config_fp8, + torch_dtype=torch.bfloat16, +) +model.save_pretrained('path/to/sana_fp8', safe_serialization=False) +``` + +To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method. 
+ +```python +import torch +from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline +from modelopt.torch.opt import enable_huggingface_checkpointing + +enable_huggingface_checkpointing() + +quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt") +transformer = AutoModel.from_pretrained( + "path/to/sana_fp8", + subfolder="transformer", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, +) +pipe = SanaPipeline.from_pretrained( + "Efficient-Large-Model/Sana_600M_1024px_diffusers", + transformer=transformer, + torch_dtype=torch.bfloat16, +) +pipe.to("cuda") +prompt = "A cat holding a sign that says hello world" +image = pipe( + prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512 +).images[0] +image.save("output.png") +``` diff --git a/docs/source/en/quantization/overview.md b/docs/source/en/quantization/overview.md new file mode 100644 index 000000000000..38abeeac6d4d --- /dev/null +++ b/docs/source/en/quantization/overview.md @@ -0,0 +1,141 @@ + + +# Getting started + +Quantization focuses on representing data with fewer bits while also trying to preserve the precision of the original data. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits. + +Diffusers supports multiple quantization backends to make large diffusion models like [Flux](../api/pipelines/flux) more accessible. This guide shows how to use the [`~quantizers.PipelineQuantizationConfig`] class to quantize a pipeline during its initialization from a pretrained or non-quantized checkpoint. + +## Pipeline-level quantization + +There are two ways to use [`~quantizers.PipelineQuantizationConfig`] depending on how much customization you want to apply to the quantization configuration. + +- for basic use cases, define the `quant_backend`, `quant_kwargs`, and `components_to_quantize` arguments +- for granular quantization control, define a `quant_mapping` that provides the quantization configuration for individual model components + +### Basic quantization + +Initialize [`~quantizers.PipelineQuantizationConfig`] with the following parameters. + +- `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`. +- `quant_kwargs` specifies the quantization arguments to use. + +> [!TIP] +> These `quant_kwargs` arguments are different for each backend. Refer to the [Quantization API](../api/quantization) docs to view the arguments for each backend. + +- `components_to_quantize` specifies which component(s) of the pipeline to quantize. Typically, you should quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact. + + `components_to_quantize` accepts either a list for multiple models or a string for a single model. 
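+
+For instance, a single component can be passed as a plain string rather than a list. The sketch below only illustrates this; the backend and kwargs simply mirror the basic example that follows:
+
+```py
+from diffusers.quantizers import PipelineQuantizationConfig
+
+# quantize only the transformer; one component can be given as a string instead of a list
+pipeline_quant_config = PipelineQuantizationConfig(
+    quant_backend="bitsandbytes_4bit",
+    quant_kwargs={"load_in_4bit": True},
+    components_to_quantize="transformer",
+)
+```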
+ +The example below loads the bitsandbytes backend with the following arguments from [`~quantizers.quantization_config.BitsAndBytesConfig`], `load_in_4bit`, `bnb_4bit_quant_type`, and `bnb_4bit_compute_dtype`. + +```py +import torch +from diffusers import DiffusionPipeline +from diffusers.quantizers import PipelineQuantizationConfig + +pipeline_quant_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, + components_to_quantize=["transformer", "text_encoder_2"], +) +``` + +Pass the `pipeline_quant_config` to [`~DiffusionPipeline.from_pretrained`] to quantize the pipeline. + +```py +pipe = DiffusionPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, +).to("cuda") + +image = pipe("photo of a cute dog").images[0] +``` + + +### Advanced quantization + +The `quant_mapping` argument provides more options for how to quantize each individual component in a pipeline, like combining different quantization backends. + +Initialize [`~quantizers.PipelineQuantizationConfig`] and pass a `quant_mapping` to it. The `quant_mapping` allows you to specify the quantization options for each component in the pipeline such as the transformer and text encoder. + +The example below uses two quantization backends, [`~quantizers.quantization_config.QuantoConfig`] and [`transformers.BitsAndBytesConfig`], for the transformer and text encoder. + +```py +import torch +from diffusers import DiffusionPipeline +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig +from diffusers.quantizers.quantization_config import QuantoConfig +from diffusers.quantizers import PipelineQuantizationConfig +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig + +pipeline_quant_config = PipelineQuantizationConfig( + quant_mapping={ + "transformer": QuantoConfig(weights_dtype="int8"), + "text_encoder_2": TransformersBitsAndBytesConfig( + load_in_4bit=True, compute_dtype=torch.bfloat16 + ), + } +) +``` + +There is a separate bitsandbytes backend in [Transformers](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig). You need to import and use [`transformers.BitsAndBytesConfig`] for components that come from Transformers. For example, `text_encoder_2` in [`FluxPipeline`] is a [`~transformers.T5EncoderModel`] from Transformers so you need to use [`transformers.BitsAndBytesConfig`] instead of [`diffusers.BitsAndBytesConfig`]. + +> [!TIP] +> Use the [basic quantization](#basic-quantization) method above if you don't want to manage these distinct imports or aren't sure where each pipeline component comes from. + +```py +import torch +from diffusers import DiffusionPipeline +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig +from diffusers.quantizers import PipelineQuantizationConfig +from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig + +pipeline_quant_config = PipelineQuantizationConfig( + quant_mapping={ + "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16), + "text_encoder_2": TransformersBitsAndBytesConfig( + load_in_4bit=True, compute_dtype=torch.bfloat16 + ), + } +) +``` + +Pass the `pipeline_quant_config` to [`~DiffusionPipeline.from_pretrained`] to quantize the pipeline. 
+ +```py +pipe = DiffusionPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + quantization_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, +).to("cuda") + +image = pipe("photo of a cute dog").images[0] +``` + +## Resources + +Check out the resources below to learn more about quantization. + +- If you are new to quantization, we recommend checking out the following beginner-friendly courses in collaboration with DeepLearning.AI. + + - [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/) + - [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/) + +- Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) if you're interested in adding a new quantization method. + +- The Transformers quantization [Overview](https://huggingface.co/docs/transformers/quantization/overview#when-to-use-what) provides an overview of the pros and cons of different quantization backends. + +- Read the [Exploring Quantization Backends in Diffusers](https://huggingface.co/blog/diffusers-quantization) blog post for a brief introduction to each quantization backend, how to choose a backend, and combining quantization with other memory optimizations. diff --git a/docs/source/en/quantization/quanto.md b/docs/source/en/quantization/quanto.md new file mode 100644 index 000000000000..d322d76be267 --- /dev/null +++ b/docs/source/en/quantization/quanto.md @@ -0,0 +1,148 @@ + + +# Quanto + +[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/en/index). It has been designed with versatility and simplicity in mind: + +- All features are available in eager mode (works with non-traceable models) +- Supports quantization aware training +- Quantized models are compatible with `torch.compile` +- Quantized models are Device agnostic (e.g CUDA,XPU,MPS,CPU) + +In order to use the Quanto backend, you will first need to install `optimum-quanto>=0.2.6` and `accelerate` + +```shell +pip install optimum-quanto accelerate +``` + +Now you can quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method. Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model. The following snippet demonstrates how to apply `float8` quantization with Quanto. + +```python +import torch +from diffusers import FluxTransformer2DModel, QuantoConfig + +model_id = "black-forest-labs/FLUX.1-dev" +quantization_config = QuantoConfig(weights_dtype="float8") +transformer = FluxTransformer2DModel.from_pretrained( + model_id, + subfolder="transformer", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, +) + +pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch_dtype) +pipe.to("cuda") + +prompt = "A cat holding a sign that says hello world" +image = pipe( + prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512 +).images[0] +image.save("output.png") +``` + +## Skipping Quantization on specific modules + +It is possible to skip applying quantization on certain modules using the `modules_to_not_convert` argument in the `QuantoConfig`. 
Please ensure that the modules passed in to this argument match the keys of the modules in the `state_dict` + +```python +import torch +from diffusers import FluxTransformer2DModel, QuantoConfig + +model_id = "black-forest-labs/FLUX.1-dev" +quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"]) +transformer = FluxTransformer2DModel.from_pretrained( + model_id, + subfolder="transformer", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, +) +``` + +## Using `from_single_file` with the Quanto Backend + +`QuantoConfig` is compatible with `~FromOriginalModelMixin.from_single_file`. + +```python +import torch +from diffusers import FluxTransformer2DModel, QuantoConfig + +ckpt_path = "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors" +quantization_config = QuantoConfig(weights_dtype="float8") +transformer = FluxTransformer2DModel.from_single_file(ckpt_path, quantization_config=quantization_config, torch_dtype=torch.bfloat16) +``` + +## Saving Quantized models + +Diffusers supports serializing Quanto models using the `~ModelMixin.save_pretrained` method. + +The serialization and loading requirements are different for models quantized directly with the Quanto library and models quantized +with Diffusers using Quanto as the backend. It is currently not possible to load models quantized directly with Quanto into Diffusers using `~ModelMixin.from_pretrained` + +```python +import torch +from diffusers import FluxTransformer2DModel, QuantoConfig + +model_id = "black-forest-labs/FLUX.1-dev" +quantization_config = QuantoConfig(weights_dtype="float8") +transformer = FluxTransformer2DModel.from_pretrained( + model_id, + subfolder="transformer", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, +) +# save quantized model to reuse +transformer.save_pretrained("") + +# you can reload your quantized model with +model = FluxTransformer2DModel.from_pretrained("") +``` + +## Using `torch.compile` with Quanto + +Currently the Quanto backend supports `torch.compile` for the following quantization types: + +- `int8` weights + +```python +import torch +from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig + +model_id = "black-forest-labs/FLUX.1-dev" +quantization_config = QuantoConfig(weights_dtype="int8") +transformer = FluxTransformer2DModel.from_pretrained( + model_id, + subfolder="transformer", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, +) +transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True) + +pipe = FluxPipeline.from_pretrained( + model_id, transformer=transformer, torch_dtype=torch_dtype +) +pipe.to("cuda") +images = pipe("A cat holding a sign that says hello").images[0] +images.save("flux-quanto-compile.png") +``` + +## Supported Quantization Types + +### Weights + +- float8 +- int8 +- int4 +- int2 + + diff --git a/docs/source/en/quantization/torchao.md b/docs/source/en/quantization/torchao.md new file mode 100644 index 000000000000..18cc109e0785 --- /dev/null +++ b/docs/source/en/quantization/torchao.md @@ -0,0 +1,189 @@ + + +# torchao + +[torchao](https://github.com/pytorch/ao) provides high-performance dtypes and optimizations based on quantization and sparsity for inference and training PyTorch models. It is supported for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers. 
+ +Make sure Pytorch 2.5+ and torchao are installed with the command below. + +```bash +uv pip install -U torch torchao +``` + +Each quantization dtype is available as a separate instance of a [AOBaseConfig](https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize) class. This provides more flexible configuration options by exposing more available arguments. + +Pass the `AOBaseConfig` of a quantization dtype, like [Int4WeightOnlyConfig](https://docs.pytorch.org/ao/main/generated/torchao.quantization.Int4WeightOnlyConfig) to [`TorchAoConfig`] in [`~ModelMixin.from_pretrained`]. + +```py +import torch +from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig +from torchao.quantization import Int8WeightOnlyConfig + +pipeline_quant_config = PipelineQuantizationConfig( + quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128)))} +) +pipeline = DiffusionPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + quantzation_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, + device_map="cuda" +) +``` + +For simple use cases, you could also provide a string identifier in [`TorchAo`] as shown below. + +```py +import torch +from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig + +pipeline_quant_config = PipelineQuantizationConfig( + quant_mapping={"transformer": TorchAoConfig("int8wo")} +) +pipeline = DiffusionPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + quantzation_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, + device_map="cuda" +) +``` + +## torch.compile + +torchao supports [torch.compile](../optimization/fp16#torchcompile) which can speed up inference with one line of code. + +```python +import torch +from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig +from torchao.quantization import Int4WeightOnlyConfig + +pipeline_quant_config = PipelineQuantizationConfig( + quant_mapping={"transformer": TorchAoConfig(Int4WeightOnlyConfig(group_size=128)))} +) +pipeline = DiffusionPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + quantzation_config=pipeline_quant_config, + torch_dtype=torch.bfloat16, + device_map="cuda" +) + +pipeline.transformer.compile(transformer, mode="max-autotune", fullgraph=True) +``` + +Refer to this [table](https://github.com/huggingface/diffusers/pull/10009#issue-2688781450) for inference speed and memory usage benchmarks with Flux and CogVideoX. More benchmarks on various hardware are also available in the torchao [repository](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks). + +> [!TIP] +> The FP8 post-training quantization schemes in torchao are effective for GPUs with compute capability of at least 8.9 (RTX-4090, Hopper, etc.). FP8 often provides the best speed, memory, and quality trade-off when generating images and videos. We recommend combining FP8 and torch.compile if your GPU is compatible. + +## autoquant + +torchao provides [autoquant](https://docs.pytorch.org/ao/stable/generated/torchao.quantization.autoquant.html#torchao.quantization.autoquant) an automatic quantization API. Autoquantization chooses the best quantization strategy by comparing the performance of each strategy on chosen input types and shapes. This is only supported in Diffusers for individual models at the moment. 
+
+```py
+import torch
+from diffusers import DiffusionPipeline
+from torchao.quantization import autoquant
+
+# Load the pipeline
+pipeline = DiffusionPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-schnell",
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
+
+# Replace the transformer with its autoquantized version
+pipeline.transformer = autoquant(pipeline.transformer)
+```
+
+## Supported quantization types
+
+torchao supports weight-only quantization as well as combined weight and dynamic activation quantization for int8, float3-float8, and uint1-uint7.
+
+Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
+
+Dynamic activation quantization stores the model weights in a low-bit dtype, while also quantizing the activations on-the-fly to save additional memory. This lowers the memory requirements from model weights, while also lowering the memory overhead from activation computations. However, this may come at a quality tradeoff at times, so it is recommended to test different models thoroughly.
+
+The quantization methods supported are as follows:
+
+| **Category** | **Full Function Names** | **Shorthands** |
+|--------------|-------------------------|----------------|
+| **Integer quantization** | `int4_weight_only`, `int8_dynamic_activation_int4_weight`, `int8_weight_only`, `int8_dynamic_activation_int8_weight` | `int4wo`, `int4dq`, `int8wo`, `int8dq` |
+| **Floating point 8-bit quantization** | `float8_weight_only`, `float8_dynamic_activation_float8_weight`, `float8_static_activation_float8_weight` | `float8wo`, `float8wo_e5m2`, `float8wo_e4m3`, `float8dq`, `float8dq_e4m3`, `float8dq_e4m3_tensor`, `float8dq_e4m3_row` |
+| **Floating point X-bit quantization** | `fpx_weight_only` | `fpX_eAwB` where `X` is the number of bits (1-7), `A` is exponent bits, and `B` is mantissa bits. Constraint: `X == A + B + 1` |
+| **Unsigned Integer quantization** | `uintx_weight_only` | `uint1wo`, `uint2wo`, `uint3wo`, `uint4wo`, `uint5wo`, `uint6wo`, `uint7wo` |
+
+Some quantization methods are aliases (for example, `int8wo` is the commonly used shorthand for `int8_weight_only`). This allows using the quantization methods described in the torchao docs as-is, while also making it convenient to remember their shorthand notations.
+
+Refer to the [official torchao documentation](https://docs.pytorch.org/ao/stable/index.html) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
+
+## Serializing and Deserializing quantized models
+
+To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.
+
+```python
+import torch
+from diffusers import AutoModel, TorchAoConfig
+
+quantization_config = TorchAoConfig("int8wo")
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/Flux.1-Dev",
+    subfolder="transformer",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16,
+)
+transformer.save_pretrained("/path/to/flux_int8wo", safe_serialization=False)
+```
+
+To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.
+
+```python
+import torch
+from diffusers import FluxPipeline, AutoModel
+
+transformer = AutoModel.from_pretrained("/path/to/flux_int8wo", torch_dtype=torch.bfloat16, use_safetensors=False)
+pipe = FluxPipeline.from_pretrained("black-forest-labs/Flux.1-Dev", transformer=transformer, torch_dtype=torch.bfloat16)
+pipe.to("cuda")
+
+prompt = "A cat holding a sign that says hello world"
+image = pipe(prompt, num_inference_steps=30, guidance_scale=7.0).images[0]
+image.save("output.png")
+```
+
+If you are using `torch<=2.6.0`, some quantization methods, such as `uint4wo`, cannot be loaded directly and may result in an `UnpicklingError` when trying to load the models, but work as expected when saving them. In order to work around this, one can load the state dict manually into the model. Note, however, that this requires using `weights_only=False` in `torch.load`, so it should be run only if the weights were obtained from a trusted source.
+
+```python
+import torch
+from accelerate import init_empty_weights
+from diffusers import FluxPipeline, AutoModel, TorchAoConfig
+
+# Serialize the model
+transformer = AutoModel.from_pretrained(
+    "black-forest-labs/Flux.1-Dev",
+    subfolder="transformer",
+    quantization_config=TorchAoConfig("uint4wo"),
+    torch_dtype=torch.bfloat16,
+)
+transformer.save_pretrained("/path/to/flux_uint4wo", safe_serialization=False, max_shard_size="50GB")
+# ...
+
+# Load the model
+state_dict = torch.load("/path/to/flux_uint4wo/diffusion_pytorch_model.bin", weights_only=False, map_location="cpu")
+with init_empty_weights():
+    transformer = AutoModel.from_config("/path/to/flux_uint4wo/config.json")
+transformer.load_state_dict(state_dict, strict=True, assign=True)
+```
+
+> [!TIP]
+> The [`AutoModel`] API is supported for PyTorch >= 2.6 as shown in the examples above.
+
+## Resources
+
+- [TorchAO Quantization API](https://docs.pytorch.org/ao/stable/index.html)
+- [Diffusers-TorchAO examples](https://github.com/sayakpaul/diffusers-torchao)
diff --git a/docs/source/en/quicktour.md b/docs/source/en/quicktour.md
index 3cc8567cdad2..1ccc8eeadcc2 100644
--- a/docs/source/en/quicktour.md
+++ b/docs/source/en/quicktour.md
@@ -1,4 +1,4 @@
-
-[[open-in-colab]]
+# Quickstart
 
-# Quicktour
+Diffusers is a library for developers and researchers that provides an easy inference API for generating images, videos and audio, as well as the building blocks for implementing new workflows.
 
-Diffusion models are trained to denoise random Gaussian noise step-by-step to generate a sample of interest, such as an image or audio. This has sparked a tremendous amount of interest in generative AI, and you have probably seen examples of diffusion generated images on the internet. 🧨 Diffusers is a library aimed at making diffusion models widely accessible to everyone.
+Diffusers provides many optimizations out-of-the-box that make it possible to load and run large models on setups with limited memory or to accelerate inference.
 
-Whether you're a developer or an everyday user, this quicktour will introduce you to 🧨 Diffusers and help you get up and generating quickly! There are three main components of the library to know about:
+This Quickstart will give you an overview of Diffusers and get you up and generating quickly.
 
-* The [`DiffusionPipeline`] is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference.
-* Popular pretrained [model](./api/models) architectures and modules that can be used as building blocks for creating diffusion systems. -* Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference. +> [!TIP] +> Before you begin, make sure you have a Hugging Face [account](https://huggingface.co/join) in order to use gated models like [Flux](https://huggingface.co/black-forest-labs/FLUX.1-dev). -The quicktour will show you how to use the [`DiffusionPipeline`] for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the [`DiffusionPipeline`]. - - - -The quicktour is a simplified version of the introductory 🧨 Diffusers [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) to help you get started quickly. If you want to learn more about 🧨 Diffusers' goal, design philosophy, and additional details about its core API, check out the notebook! - - - -Before you begin, make sure you have all the necessary libraries installed: - -```py -# uncomment to install the necessary libraries in Colab -#!pip install --upgrade diffusers accelerate transformers -``` - -- [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training. -- [🤗 Transformers](https://huggingface.co/docs/transformers/index) is required to run the most popular diffusion models, such as [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview). +Follow the [Installation](./installation) guide to install Diffusers if it's not already installed. ## DiffusionPipeline -The [`DiffusionPipeline`] is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the [`DiffusionPipeline`] out-of-the-box for many tasks. Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [🧨 Diffusers Summary](./api/pipelines/overview#diffusers-summary) table. +A diffusion model combines multiple components to generate outputs in any modality based on an input, such as a text description, image or both. -| **Task** | **Description** | **Pipeline** -|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| -| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | -| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) | -| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) | -| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) | -| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) | +For a standard text-to-image model: -Start by creating an instance of a [`DiffusionPipeline`] and specify which pipeline checkpoint you would like to download. 
-You can use the [`DiffusionPipeline`] for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub. -In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image generation. +1. A text encoder turns a prompt into embeddings that guide the denoising process. Some models have more than one text encoder. +2. A scheduler contains the algorithmic specifics for gradually denoising initial random noise into clean outputs. Different schedulers affect generation speed and quality. +3. A UNet or diffusion transformer (DiT) is the workhorse of a diffusion model. - + At each step, it performs the denoising predictions, such as how much noise to remove or the general direction in which to steer the noise to generate better quality outputs. -For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 🧨 Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content. + The UNet or DiT repeats this loop for a set amount of steps to generate the final output. + +4. A variational autoencoder (VAE) encodes and decodes pixels to a spatially compressed latent-space. *Latents* are compressed representations of an image and are more efficient to work with. The UNet or DiT operates on latents, and the clean latents at the end are decoded back into images. - +The [`DiffusionPipeline`] packages all these components into a single class for inference. There are several arguments in [`~DiffusionPipeline.__call__`] you can change, such as `num_inference_steps`, that affect the diffusion process. Try different values and arguments to see how they change generation quality or speed. -Load the model with the [`~DiffusionPipeline.from_pretrained`] method: +Load a model with [`~DiffusionPipeline.from_pretrained`] and describe what you'd like to generate. The example below uses the default argument values. -```python ->>> from diffusers import DiffusionPipeline + + ->>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) -``` - -The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`] among other things: +Use `.images[0]` to access the generated image output. ```py ->>> pipeline -StableDiffusionPipeline { - "_class_name": "StableDiffusionPipeline", - "_diffusers_version": "0.21.4", - ..., - "scheduler": [ - "diffusers", - "PNDMScheduler" - ], - ..., - "unet": [ - "diffusers", - "UNet2DConditionModel" - ], - "vae": [ - "diffusers", - "AutoencoderKL" - ] -} -``` - -We strongly recommend running the pipeline on a GPU because the model consists of roughly 1.4 billion parameters. -You can move the generator object to a GPU, just like you would in PyTorch: +import torch +from diffusers import DiffusionPipeline -```python ->>> pipeline.to("cuda") -``` - -Now you can pass a text prompt to the `pipeline` to generate an image, and then access the denoised image. 
By default, the image output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object. +pipeline = DiffusionPipeline.from_pretrained( + "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda" +) -```python ->>> image = pipeline("An image of a squirrel in Picasso style").images[0] ->>> image +prompt = """ +cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California +highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain +""" +pipeline(prompt).images[0] ``` -
+
-Save the image by calling `save`:
-
-```python
->>> image.save("image_of_squirrel_painting.png")
-```
-
-### Local pipeline
-
-You can also use the pipeline locally. The only difference is you need to download the weights first:
-
-```bash
-!git lfs install
-!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
-```
-
-Then load the saved weights into the pipeline:
-
-```python
->>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True)
-```
-
-Now, you can run the pipeline as you would in the section above.
-
-### Swapping schedulers
-
-Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of 🧨 Diffusers is to allow you to easily switch between schedulers. For example, to replace the default [`PNDMScheduler`] with the [`EulerDiscreteScheduler`], load it with the [`~diffusers.ConfigMixin.from_config`] method:
+Use `.frames[0]` to access the generated video output and [`~utils.export_to_video`] to save the video.
 
 ```py
->>> from diffusers import EulerDiscreteScheduler
-
->>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
-```
-
-Try generating an image with the new scheduler and see if you notice a difference!
-
-In the next section, you'll take a closer look at the components - the model and scheduler - that make up the [`DiffusionPipeline`] and learn how to use these components to generate an image of a cat.
-
-## Models
-
-Most models take a noisy sample, and at each timestep it predicts the *noise residual* (other models learn to predict the previous sample directly or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems.
-
-Models are initiated with the [`~ModelMixin.from_pretrained`] method which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the [`UNet2DModel`], a basic unconditional image generation model with a checkpoint trained on cat images:
+import torch
+from diffusers import AutoencoderKLWan, DiffusionPipeline
+from diffusers.utils import export_to_video
+
+vae = AutoencoderKLWan.from_pretrained(
+    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+    subfolder="vae",
+    torch_dtype=torch.float32
+)
+pipeline = DiffusionPipeline.from_pretrained(
+    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+    vae=vae,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
+
+prompt = """
+Cinematic video of a sleek cat lounging on a colorful inflatable in a crystal-clear turquoise pool in Palm Springs,
+sipping a salt-rimmed margarita through a straw. Golden-hour sunlight glows over mid-century modern homes and swaying palms.
+Shot on a Sony a7S III with moody, glamorous color grading, subtle lens flares, and soft vintage film grain.
+Ripples shimmer as a warm desert breeze stirs the water, blending luxury and playful charm in an epic, gorgeously composed frame.
+"""
+video = pipeline(prompt=prompt, num_frames=81, num_inference_steps=40).frames[0]
+export_to_video(video, "output.mp4", fps=16)
+```
+
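+Diffusion is stochastic, so each call produces a different output. To make a run repeatable, you can pass a seeded [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to the pipeline call. A minimal sketch, reusing the Qwen-Image pipeline and `prompt` from the image example above (the seed value is arbitrary):
+
+```py
+import torch
+
+# a fixed seed makes the same call return the same output every time
+generator = torch.Generator("cuda").manual_seed(0)
+pipeline(prompt, generator=generator).images[0]
+```
+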
+
+## LoRA
+
+Adapters insert a small number of trainable parameters to the original base model. Only the inserted parameters are fine-tuned while the rest of the model weights remain frozen. This makes it fast and cheap to fine-tune a model on a new style. Among adapters, [LoRAs](./tutorials/using_peft_for_inference) are the most popular.
+
+Add a LoRA to a pipeline with the [`~loaders.QwenImageLoraLoaderMixin.load_lora_weights`] method. Some LoRAs require a special word to trigger them, such as `Realism` in the example below. Check a LoRA's model card to see if it requires a trigger word.
 
 ```py
->>> from diffusers import UNet2DModel
-
->>> repo_id = "google/ddpm-cat-256"
->>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
-```
+import torch
+from diffusers import DiffusionPipeline
 
-To access the model parameters, call `model.config`:
+pipeline = DiffusionPipeline.from_pretrained(
+    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
+)
+pipeline.load_lora_weights(
+    "flymy-ai/qwen-image-realism-lora",
+)
 
-```py
->>> model.config
+prompt = """
+super Realism cinematic film still of a cat sipping a margarita in a pool in Palm Springs in the style of umempart, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+pipeline(prompt).images[0]
 ```
 
-The model configuration is a 🧊 frozen 🧊 dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference.
-
-Some of the most important parameters are:
+Check out the [LoRA](./tutorials/using_peft_for_inference) docs or Adapters section to learn more.
 
-* `sample_size`: the height and width dimension of the input sample.
-* `in_channels`: the number of input channels of the input sample.
-* `down_block_types` and `up_block_types`: the type of down- and upsampling blocks used to create the UNet architecture.
-* `block_out_channels`: the number of output channels of the downsampling blocks; also used in reverse order for the number of input channels of the upsampling blocks.
-* `layers_per_block`: the number of ResNet blocks present in each UNet block.
+## Quantization
 
-To use the model for inference, create the image shape with random Gaussian noise. It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image:
+[Quantization](./quantization/overview) stores data in fewer bits to reduce memory usage. It may also speed up inference because it takes less time to perform calculations with fewer bits.
 
-```py
->>> import torch
+Diffusers provides several quantization backends and picking one depends on your use case. For example, [bitsandbytes](./quantization/bitsandbytes) and [torchao](./quantization/torchao) are both simple and easy to use for inference, but torchao supports more [quantization types](./quantization/torchao#supported-quantization-types) like fp8.
 
->>> torch.manual_seed(0)
-
->>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
->>> noisy_sample.shape
-torch.Size([1, 3, 256, 256])
-```
-
-For inference, pass the noisy image and a `timestep` to the model.
The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Use the `sample` method to get the model output:
+Configure [`PipelineQuantizationConfig`] with the backend to use, the specific arguments (refer to the [API](./api/quantization) reference for available arguments) for that backend, and which components to quantize. The example below quantizes the model to 4-bit and only uses 14.93GB of memory.
 
 ```py
->>> with torch.no_grad():
-...     noisy_residual = model(sample=noisy_sample, timestep=2).sample
-```
-
-To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler.
-
-## Schedulers
-
-Schedulers manage going from a noisy sample to a less noisy sample given the model output - in this case, it is the `noisy_residual`.
-
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.quantizers import PipelineQuantizationConfig
 
-🧨 Diffusers is a toolbox for building diffusion systems. While the [`DiffusionPipeline`] is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system.
+quant_config = PipelineQuantizationConfig(
+    quant_backend="bitsandbytes_4bit",
+    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
+    components_to_quantize=["transformer", "text_encoder"],
+)
+pipeline = DiffusionPipeline.from_pretrained(
+    "Qwen/Qwen-Image",
+    torch_dtype=torch.bfloat16,
+    quantization_config=quant_config,
+    device_map="cuda"
+)
 
-For the quicktour, you'll instantiate the [`DDPMScheduler`] with its [`~diffusers.ConfigMixin.from_config`] method:
-
-```py
->>> from diffusers import DDPMScheduler
-
->>> scheduler = DDPMScheduler.from_pretrained(repo_id)
->>> scheduler
-DDPMScheduler {
-  "_class_name": "DDPMScheduler",
-  "_diffusers_version": "0.21.4",
-  "beta_end": 0.02,
-  "beta_schedule": "linear",
-  "beta_start": 0.0001,
-  "clip_sample": true,
-  "clip_sample_range": 1.0,
-  "dynamic_thresholding_ratio": 0.995,
-  "num_train_timesteps": 1000,
-  "prediction_type": "epsilon",
-  "sample_max_value": 1.0,
-  "steps_offset": 0,
-  "thresholding": false,
-  "timestep_spacing": "leading",
-  "trained_betas": null,
-  "variance_type": "fixed_small"
-}
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+pipeline(prompt).images[0]
+print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
 ```
 
-💡 Unlike a model, a scheduler does not have trainable weights and is parameter-free!
+Take a look at the [Quantization](./quantization/overview) section for more details.
 
+## Optimizations
 
-Some of the most important parameters are:
+> [!TIP]
+> Optimization is dependent on hardware specs such as memory. Use this [Space](https://huggingface.co/spaces/diffusers/optimized-diffusers-code) to generate code examples that include all of Diffusers' available memory and speed optimization techniques for any model you're using.
 
-* `num_train_timesteps`: the length of the denoising process or, in other words, the number of timesteps required to process random Gaussian noise into a data sample.
-* `beta_schedule`: the type of noise schedule to use for inference and training.
-* `beta_start` and `beta_end`: the start and end noise values for the noise schedule.
+Modern diffusion models are very large and have billions of parameters. The iterative denoising process is also computationally intensive and slow. Diffusers provides techniques for reducing memory usage and boosting inference speed. These techniques can be combined with quantization to optimize for both memory usage and inference speed.
 
-To predict a slightly less noisy image, pass the following to the scheduler's [`~diffusers.DDPMScheduler.step`] method: model output, `timestep`, and current `sample`.
+### Memory usage
 
-```py
->>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample
->>> less_noisy_sample.shape
-torch.Size([1, 3, 256, 256])
-```
+The text encoders and UNet or DiT can use up as much as ~30GB of memory, exceeding the amount available on many free-tier or consumer GPUs.
 
-The `less_noisy_sample` can be passed to the next `timestep` where it'll get even less noisy! Let's bring it all together now and visualize the entire denoising process.
+Offloading stores weights that aren't currently used on the CPU and only moves them to the GPU when they're needed. There are a few offloading types and the example below uses [model offloading](./optimization/memory#model-offloading). This moves an entire model, like a text encoder or transformer, to the CPU when it isn't actively being used.
 
-First, create a function that postprocesses and displays the denoised image as a `PIL.Image`:
+Call [`~DiffusionPipeline.enable_model_cpu_offload`] to activate it. By combining quantization and offloading, the following example only requires ~12.54GB of memory.
 
 ```py
->>> import PIL.Image
->>> import numpy as np
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.quantizers import PipelineQuantizationConfig
+
+quant_config = PipelineQuantizationConfig(
+    quant_backend="bitsandbytes_4bit",
+    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
+    components_to_quantize=["transformer", "text_encoder"],
+)
+pipeline = DiffusionPipeline.from_pretrained(
+    "Qwen/Qwen-Image",
+    torch_dtype=torch.bfloat16,
+    quantization_config=quant_config,
+)
+pipeline.enable_model_cpu_offload()
 
->>> def display_sample(sample, i):
-...     image_processed = sample.cpu().permute(0, 2, 3, 1)
-...     image_processed = (image_processed + 1.0) * 127.5
-...     image_processed = image_processed.numpy().astype(np.uint8)
-
-...     image_pil = PIL.Image.fromarray(image_processed[0])
-...     display(f"Image at step {i}")
-...     display(image_pil)
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+pipeline(prompt).images[0]
+print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
 ```
 
-To speed up the denoising process, move the input and model to a GPU:
+Refer to the [Reduce memory usage](./optimization/memory) docs to learn more about other memory reducing techniques.
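+
+Another technique you could try is sequential CPU offloading, which offloads at the submodule level instead of whole models. It saves even more memory but is much slower. A minimal sketch, assuming the same pipeline as above:
+
+```py
+# call instead of enable_model_cpu_offload(); weights stream to the GPU submodule-by-submodule
+pipeline.enable_sequential_cpu_offload()
+```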
-```py
->>> model.to("cuda")
->>> noisy_sample = noisy_sample.to("cuda")
-```
+### Inference speed
 
-Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler:
+The denoising loop performs a lot of computations and can be slow. Methods like [torch.compile](./optimization/fp16#torchcompile) increase inference speed by compiling the computations into an optimized kernel. Compilation is slow for the first generation but successive generations should be much faster.
 
-```py
->>> import tqdm
+The example below uses [regional compilation](./optimization/fp16#regional-compilation) to only compile small regions of a model. It reduces cold-start latency while also providing a runtime speedup.
 
->>> sample = noisy_sample
+Call [`~ModelMixin.compile_repeated_blocks`] on the model to activate it.
 
->>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
-...     # 1. predict noise residual
-...     with torch.no_grad():
-...         residual = model(sample, t).sample
+```py
+import torch
+from diffusers import DiffusionPipeline
 
-...     # 2. compute less noisy image and set x_t -> x_t-1
-...     sample = scheduler.step(residual, t, sample).prev_sample
+pipeline = DiffusionPipeline.from_pretrained(
+    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
+)
 
-...     # 3. optionally look at image
-...     if (i + 1) % 50 == 0:
-...         display_sample(sample, i + 1)
+pipeline.transformer.compile_repeated_blocks(
+    fullgraph=True,
+)
 
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+pipeline(prompt).images[0]
 ```
 
-Sit back and watch as a cat is generated from nothing but noise! 😻
-
-
-## Next steps
-
-Hopefully, you generated some cool images with 🧨 Diffusers in this quicktour! For your next steps, you can:
-
-* Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial.
-* See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases.
-* Learn more about loading, accessing, changing, and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide.
-* Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher-quality images with the [Stable Diffusion](./stable_diffusion) guide.
-* Dive deeper into speeding up 🧨 Diffusers with guides on [optimized PyTorch on a GPU](./optimization/fp16), and inference guides for running [Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps) and [ONNX Runtime](./optimization/onnx).
+Check out the [Accelerate inference](./optimization/fp16) or [Caching](./optimization/cache) docs for more methods that speed up inference.
\ No newline at end of file
diff --git a/docs/source/en/stable_diffusion.md b/docs/source/en/stable_diffusion.md
index 877a4ac70823..93e399d3db88 100644
--- a/docs/source/en/stable_diffusion.md
+++ b/docs/source/en/stable_diffusion.md
@@ -1,4 +1,4 @@
-
-# Effective and efficient diffusion
-
 [[open-in-colab]]
 
-Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again.
-
-This is why it's important to get the most *computational* (speed) and *memory* (GPU vRAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster.
-
-This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`].
-
-Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model:
-
-```python
-from diffusers import DiffusionPipeline
-
-model_id = "runwayml/stable-diffusion-v1-5"
-pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
-```
-
-The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt:
-
-```python
-prompt = "portrait photo of a old warrior chief"
-```
+# Basic performance
 
-## Speed
+Diffusion is a random process that is computationally demanding. You may need to run the [`DiffusionPipeline`] several times before getting a desired output. That's why it's important to carefully balance generation speed and memory usage in order to iterate faster.
 
+This guide recommends some basic performance tips for using the [`DiffusionPipeline`]. Refer to the docs in the Inference Optimization section, such as [Accelerate inference](./optimization/fp16) or [Reduce memory usage](./optimization/memory), for more detailed performance guides.
 
-💡 If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)!
+## Memory usage
 
-
-
-One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module:
-
-```python
-pipeline = pipeline.to("cuda")
-```
+Reducing the amount of memory used indirectly speeds up generation and can help a model fit on device.
 
-To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility):
+The [`~DiffusionPipeline.enable_model_cpu_offload`] method moves a model to the CPU when it is not in use to save GPU memory.
 
-```python
+```py
 import torch
+from diffusers import DiffusionPipeline
 
-generator = torch.Generator("cuda").manual_seed(0)
-```
-
-Now you can generate an image:
+pipeline = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.bfloat16,
+)
+pipeline.enable_model_cpu_offload()
 
-```python
-image = pipeline(prompt, generator=generator).images[0]
-image
+prompt = """
+cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
+highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
+"""
+pipeline(prompt).images[0]
+print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
 ```
+
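+You can stack other memory levers on top of offloading. For example, VAE tiling decodes latents in tiles so the decode step doesn't spike memory at large resolutions. A minimal sketch, assuming the SDXL pipeline above:
+
+```py
+# decode latents tile-by-tile to cap the peak memory of the VAE decode step
+pipeline.vae.enable_tiling()
+```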
+## Inference speed
 
-This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps.
+Denoising is the most computationally demanding process during diffusion. Methods that optimize this process accelerate inference speed. Try the following methods for a speedup.
 
-Let's start by loading the model in `float16` and generate an image:
+- Add `device_map="cuda"` to place the pipeline on a GPU. Placing a model on an accelerator, like a GPU, increases speed because it performs computations in parallel.
+- Set `torch_dtype=torch.bfloat16` to execute the pipeline in half-precision. Reducing the data type precision increases speed because it takes less time to perform computations in a lower precision.
 
-```python
+```py
 import torch
+import time
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
 
-pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)
-pipeline = pipeline.to("cuda")
-generator = torch.Generator("cuda").manual_seed(0)
-image = pipeline(prompt, generator=generator).images[0]
-image
+pipeline = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
 ```
- -This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before! - - - -💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality. - - - -Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method: - -```python -pipeline.scheduler.compatibles -[ - diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, - diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, - diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, - diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, - diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, - diffusers.schedulers.scheduling_ddpm.DDPMScheduler, - diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, - diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler, - diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, - diffusers.schedulers.scheduling_pndm.PNDMScheduler, - diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, - diffusers.schedulers.scheduling_ddim.DDIMScheduler, -] -``` - -The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`~ConfigMixin.from_config`] method to load a new scheduler: - -```python -from diffusers import DPMSolverMultistepScheduler +- Use a faster scheduler, such as [`DPMSolverMultistepScheduler`], which only requires ~20-25 steps. +- Set `num_inference_steps` to a lower value. Reducing the number of inference steps reduces the overall number of computations. However, this can result in lower generation quality. +```py pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) -``` - -Now set the `num_inference_steps` to 20: - -```python -generator = torch.Generator("cuda").manual_seed(0) -image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] -image -``` - -
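+
+For a further speedup, you could compile the UNet with [torch.compile](./optimization/fp16#torchcompile). The first call is slow while the kernel compiles, but later generations are faster. A minimal sketch, assuming the SDXL pipeline above and PyTorch 2+:
+
+```py
+# first generation after this is slow (compilation); subsequent ones are faster
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```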
- -Great, you've managed to cut the inference time to just 4 seconds! ⚡️ - -## Memory - -The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM). -Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result. +prompt = """ +cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California +highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain +""" -```python -def get_inputs(batch_size=1): - generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] - prompts = batch_size * [prompt] - num_inference_steps = 20 +start_time = time.perf_counter() +image = pipeline(prompt).images[0] +end_time = time.perf_counter() - return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} +print(f"Image generation took {end_time - start_time:.3f} seconds") ``` -Start with `batch_size=4` and see how much memory you've consumed: +## Generation quality -```python -from diffusers.utils import make_image_grid +Many modern diffusion models deliver high-quality images out-of-the-box. However, you can still improve generation quality by trying the following. -images = pipeline(**get_inputs(batch_size=4)).images -make_image_grid(images, 2, 2) -``` - -Unless you have a GPU with more vRAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function: - -```python -pipeline.enable_attention_slicing() -``` - -Now try increasing the `batch_size` to 8! - -```python -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
- -Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality. - -## Quality - -In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images. - -### Better checkpoints - -The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. - -As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in! +- Try a more detailed and descriptive prompt. Include details such as the image medium, subject, style, and aesthetic. A negative prompt may also help by guiding a model away from undesirable features by using words like low quality or blurry. -### Better pipeline components + ```py + import torch + from diffusers import DiffusionPipeline -You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autoencoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: + pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.bfloat16, + device_map="cuda" + ) -```python -from diffusers import AutoencoderKL + prompt = """ + cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California + highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain + """ + negative_prompt = "low quality, blurry, ugly, poor details" + pipeline(prompt, negative_prompt=negative_prompt).images[0] + ``` -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") -pipeline.vae = vae -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
- -### Better prompt engineering - -The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep during prompt engineering are: - -- How is the image or similar images of the one I want to generate stored on the internet? -- What additional detail can I give that steers the model towards the style I want? - -With this in mind, let's improve the prompt to include color and higher quality details: - -```python -prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" -prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" -``` - -Generate a batch of images with the new prompt: + For more details about creating better prompts, take a look at the [Prompt techniques](./using-diffusers/weighted_prompts) doc. -```python -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` +- Try a different scheduler, like [`HeunDiscreteScheduler`] or [`LMSDiscreteScheduler`], that gives up generation speed for quality. -
+ ```py + import torch + from diffusers import DiffusionPipeline, HeunDiscreteScheduler -Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject: + pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.bfloat16, + device_map="cuda" + ) + pipeline.scheduler = HeunDiscreteScheduler.from_config(pipeline.scheduler.config) -```python -prompts = [ - "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", -] - -generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] -images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images -make_image_grid(images, 2, 2) -``` - -
+ prompt = """ + cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California + highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain + """ + negative_prompt = "low quality, blurry, ugly, poor details" + pipeline(prompt, negative_prompt=negative_prompt).images[0] + ``` ## Next steps -In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources: - -- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster! -- If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption. -- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16). +Diffusers offers more advanced and powerful optimizations such as [group-offloading](./optimization/memory#group-offloading) and [regional compilation](./optimization/fp16#regional-compilation). To learn more about how to maximize performance, take a look at the Inference Optimization section. \ No newline at end of file diff --git a/docs/source/en/training/adapt_a_model.md b/docs/source/en/training/adapt_a_model.md index 57bc1a37e05b..f528c8bfb656 100644 --- a/docs/source/en/training/adapt_a_model.md +++ b/docs/source/en/training/adapt_a_model.md @@ -6,12 +6,12 @@ This guide will show you how to adapt a pretrained text-to-image model for inpai ## Configure UNet2DConditionModel parameters -A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](https://huggingface.co/docs/diffusers/v0.16.0/en/api/models#diffusers.UNet2DConditionModel.in_channels). For example, load a pretrained text-to-image model like [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) and take a look at the number of `in_channels`: +A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](https://huggingface.co/docs/diffusers/v0.16.0/en/api/models#diffusers.UNet2DConditionModel.in_channels). For example, load a pretrained text-to-image model like [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) and take a look at the number of `in_channels`: ```py from diffusers import StableDiffusionPipeline -pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) pipeline.unet.config["in_channels"] 4 ``` @@ -26,15 +26,15 @@ pipeline.unet.config["in_channels"] 9 ``` -To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9. +To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9. Initialize a [`UNet2DConditionModel`] with the pretrained text-to-image model weights, and change `in_channels` to 9. 
Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now.
 
 ```py
-from diffusers import UNet2DConditionModel
+from diffusers import AutoModel
 
-model_id = "runwayml/stable-diffusion-v1-5"
-unet = UNet2DConditionModel.from_pretrained(
+model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
+unet = AutoModel.from_pretrained(
     model_id,
     subfolder="unet",
     in_channels=9,
diff --git a/docs/source/en/training/cogvideox.md b/docs/source/en/training/cogvideox.md
new file mode 100644
index 000000000000..d0700c4da763
--- /dev/null
+++ b/docs/source/en/training/cogvideox.md
@@ -0,0 +1,291 @@
+
+# CogVideoX
+
+CogVideoX is a text-to-video generation model focused on creating more coherent videos aligned with a prompt. It achieves this using several methods.
+
+- a 3D variational autoencoder that compresses videos spatially and temporally, improving compression rate and video accuracy.
+
+- an expert transformer block to help align text and video, and a 3D full attention module for capturing and creating spatially and temporally accurate videos.
+
+Evaluations across video-instruction dimensions show that CogVideoX performs well on consistent theme, dynamic information, consistent background, object information, smooth motion, color, scene, appearance style, and temporal style, but it struggles with human action, spatial relationships, and multiple objects.
+
+Finetuning with Diffusers can help make up for these weaknesses.
+
+## Data Preparation
+
+The training scripts accept data in two formats.
+
+The first format is suited for small-scale training, and the second uses a CSV format, which is more appropriate for streaming data for large-scale training. In the future, Diffusers will support the `