Improve hf dense model throughput (no need actually) #5

3outeille · 2025-12-09T10:22:33Z

This PR depends on #7. Without it, torch.compile combined with tensor parallelism will not work.

Expecting

There is no performance drop

Testing methodology

373M llama model (config is created in debug_local.sh)
The SDPA backend is pinned in torchtitan to a single backend:

    self.sdpa_backends = [
        # SDPBackend.CUDNN_ATTENTION,
        # SDPBackend.FLASH_ATTENTION,
        SDPBackend.EFFICIENT_ATTENTION,
        # SDPBackend.MATH,
    ]

 ./tooling_dev/debug_local.sh debugperf_large
 ./tooling_dev/debug_local.sh debugperf_large --compile

Results

v4.55.4
- python ./tooling_dev/test_hf_integration.py compare_throughput --torchtitan_dir v4.55.4/llama3/debugperf_large --hf_dir v4.55.4/meta-llama/Llama-3.2-1B/debugperf_large --hide_status
- with torch.compile: python ./tooling_dev/test_hf_integration.py compare_throughput --torchtitan_dir v4.55.4_compile/llama3/debugperf_large --hf_dir v4.55.4_compile/meta-llama/Llama-3.2-1B/debugperf_large --hide_status
v4.57.1
- python ./tooling_dev/test_hf_integration.py compare_throughput --torchtitan_dir v4.57.1/llama3/debugperf_large --hf_dir v4.57.1/meta-llama/Llama-3.2-1B/debugperf_large --hide_status
- with torch.compile: python ./tooling_dev/test_hf_integration.py compare_throughput --torchtitan_dir v.4.57.1_compile/llama3/debugperf_large --hf_dir v.4.57.1_compile/meta-llama/Llama-3.2-1B/debugperf_large --hide_status
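
To make the "no performance drop" expectation concrete, here is a minimal sketch of how two throughput logs could be compared. It is not the code in test_hf_integration.py: the function names, the 2% tolerance, and the sample numbers are all hypothetical and only illustrate the check.

```python
from statistics import mean


def relative_drop(baseline_tps, candidate_tps):
    """Fractional throughput drop of candidate vs. baseline (positive means slower)."""
    base = mean(baseline_tps)
    cand = mean(candidate_tps)
    return (base - cand) / base


def no_performance_drop(baseline_tps, candidate_tps, tolerance=0.02):
    """True if the candidate is at most `tolerance` slower than the baseline.

    The 2% default is an assumed threshold, not one taken from this PR.
    """
    return relative_drop(baseline_tps, candidate_tps) <= tolerance


# Made-up tokens-per-second samples, for illustration only (not measured results).
torchtitan_tps = [1000.0, 1010.0, 990.0]
hf_tps = [995.0, 1005.0, 985.0]

drop = relative_drop(torchtitan_tps, hf_tps)
print(f"relative drop: {drop:.2%}, ok: {no_performance_drop(torchtitan_tps, hf_tps)}")
```

Averaging over several steps rather than comparing single steps keeps warm-up and step-to-step noise from dominating the comparison, which matters most for the torch.compile runs.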

3outeille added 6 commits November 24, 2025 10:03

add tooling

4f36924

add check checkpoint correctness

5a63932

add compare_throughput feature + able to run torchtitan and hf backend jobs now

6b4400b

make tie weight embedding configurable

08fb8c7

add debugperf model for fair comparison

66998c2

Add debugperf_large model configuration and profiling support

d78b0e6

3outeille mentioned this pull request Dec 9, 2025

improve throughput of HF dense model (no need actually) pytorch/torchtitan#2122

Draft

3outeille closed this Dec 9, 2025

improve debug_local.sh (switch between flavor selection + additional flags for compilation and selective checkpointing)

51417d4

3outeille reopened this Dec 9, 2025

3outeille mentioned this pull request Dec 15, 2025

[transformers_modeling_backend] Upgrade transformers from 4.57.1 to 5.0.0rc0 pytorch/torchtitan#2154

Open
