
Use joint trace in transform_for_execution #2102

Merged
t-vi merged 22 commits into main from reautograd2 on Jun 6, 2025

Conversation

@beverlylytle
Collaborator

@beverlylytle beverlylytle commented May 20, 2025

This PR aims to use a joint forward-backward trace in transform_for_execution while jitting, instead of separately processing a forward trace and a backward trace. This change is behind the compile option flag delay_trace_split, which currently defaults to True. Provided no performance or memory issues appear, this will allow for a follow-up PR which can remove the flag and delete ~300 lines from torch_autograd.py and ~300 lines from rematerialization.py along with the relevant tests.

Collaborator

@IvanYashchuk IvanYashchuk left a comment


In thunder/tests/distributed/test_ddp.py it's okay to skip the test for grad bucketing in Thunder's DDP. I only get two failures with TransformerEngine (the same error as in #2060):

FAILED thunder/tests/distributed/test_ddp.py::test_ddp_transformer_engine_torch_cuda_thunder.dtypes.float32 - ValueError: not enough values to unpack (expected 5, got 4)
FAILED thunder/tests/distributed/test_ddp.py::test_ddp_transformer_engine_llama_sanity_torch_cuda_thunder.dtypes.float32 - ValueError: not enough values to unpack (expected 5, got 4)

Similarly, in thunder/tests/distributed/test_fsdp.py it's okay to skip failing tests for Thunder's FSDP bucketing and no_sync.

@IvanYashchuk IvanYashchuk requested a review from Copilot May 27, 2025 12:30
Contributor

Copilot AI left a comment


Pull Request Overview

This pull request updates several components of the autodiff and distributed transforms along with modifications to the executor implementations and corresponding tests to support new behaviors in the CI. Key changes include:

  • Introducing a new utility function (_group_get_grad_bsyms) in the autodiff transform.
  • Adjusting test expectations and xfail markers for distributed traces.
  • Refining conditionals in passes and executors to appropriately handle get_grad operations.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| thunder/transforms/autodiff.py | Added _group_get_grad_bsyms and updated gradient grouping logic. |
| thunder/tests/test_examine_memory.py | Updated test expectations for memory estimates. |
| thunder/tests/distributed/test_fsdp.py | Updated unshard parameter names and corrected trace index usage. |
| thunder/tests/distributed/test_ddp.py | Added xfail marker for grad bucketing test. |
| thunder/executors/torchex.py | Minor whitespace addition before shallow_copy registration. |
| thunder/executors/torch_compile.py | Excluded GET_GRAD from implementation mapping in executor. |
| thunder/executors/torch_autograd.py | Added early return if bw_trace is None. |
| thunder/executors/passes.py | Extended condition to pass through GET_GRAD symbols. |
| thunder/core/transform_common.py | Skipped further processing for GET_GRAD symbols. |
| thunder/core/rematerialization.py | Enhanced filtering of parameter names during rematerialization. |
| thunder/__init__.py | Revised trace-split logic under the delay_trace_split branch. |

@lantiga
Contributor

lantiga commented May 27, 2025

@IvanYashchuk about skipping bucketing with ddp and fsdp, we are actually counting on that to work properly for our distributed work. Do you think this can be tackled on your end?

@lantiga
Contributor

lantiga commented May 27, 2025

Thanks for the clarification @IvanYashchuk. I agree we can forego bucketing for now and eventually circle back to it at a later stage.

@beverlylytle beverlylytle changed the title [WIP2] Use joint trace in transform_for_execution May 28, 2025
@beverlylytle beverlylytle marked this pull request as ready for review May 28, 2025 07:40
@beverlylytle beverlylytle mentioned this pull request May 28, 2025
@beverlylytle
Collaborator Author

beverlylytle commented Jun 3, 2025

I've benchmarked this change with the following baseline and result:

@main
Model name: Llama-3-8B
Seq Length: 8192
Micro BS: 1
Global BS: 8
Number of Layers: 32
Number of parameters: 1.00B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: block
Compiler: dynamo_thunder
Low Precision Mode: none
Average iter time: 782.73 ms
Memory used: 72.61 GB
Tokens/s: 83707.48
Tokens/s/GPU: 10463.43
TFLOP/s: 4847.74

@reautograd2
Model name: Llama-3-8B
Seq Length: 8192
Micro BS: 1
Global BS: 8
Number of Layers: 32
Number of parameters: 1.00B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: block
Compiler: dynamo_thunder
Low Precision Mode: none
Average iter time: 781.49 ms
Memory used: 72.61 GB
Tokens/s: 83855.93
Tokens/s/GPU: 10481.99
TFLOP/s: 4856.34

@t-vi
Collaborator

t-vi commented Jun 3, 2025

Does it work with the thunder.jit? Could we also benchmark with "thunder" as compiler?

@beverlylytle
Collaborator Author

beverlylytle commented Jun 3, 2025

@main
Model name: Llama-3-8B
Seq Length: 8192
Micro BS: 1
Global BS: 8
Number of Layers: 32
Number of parameters: 1.00B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: none
Compiler: thunder
Low Precision Mode: none
Average iter time: 799.59 ms
Memory used: 75.75 GB
Saved for backward size: 58448.60 MiB
Saved for backward number of tensors: 775
Tokens/s: 81988.63
Tokens/s/GPU: 10248.58
TFLOP/s: 4748.20

@reautograd2
Model name: Llama-3-8B
Seq Length: 8192
Micro BS: 1
Global BS: 8
Number of Layers: 32
Number of parameters: 1.00B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: none
Compiler: thunder
Low Precision Mode: none
Average iter time: 808.52 ms
Memory used: 71.53 GB
Saved for backward size: 62988.09 MiB
Saved for backward number of tensors: 712
Tokens/s: 81032.54
Tokens/s/GPU: 10129.07
TFLOP/s: 4692.83

@beverlylytle
Collaborator Author

beverlylytle commented Jun 4, 2025

@riccardofelluga has made an excellent point. The bound symbols which create tensors that are to be recomputed in the backward pass are only inserted once the joint trace is split. This means that those operations do not get fused optimally.

@t-vi
Collaborator

t-vi commented Jun 4, 2025

@beverlylytle Thank you for running the benchmark.

But isn't it then the fusion or part of it that gets duplicated for recomputation?
At least that should be our target.

The other question is whether everything between the autodiff and the split propagates the tag properly. We would probably need to take a good look.
For Transform for Operator Execution, the trace processor would (want to) do it.
For the fusion passes, this may be more tricky currently.

What we need to do is likely look for recompute tags in the fusion subsymbols and then duplicate the fusion and then dce the bits away that we don't need. WDYT?
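The duplicate-then-DCE idea could look roughly like this toy sketch. It is not Thunder's actual trace machinery: `Op`, its `recompute` flag, and the helper names are all hypothetical stand-ins for bound symbols and their tags.

```python
# Toy sketch of "look for recompute tags in the fusion subsymbols, duplicate
# the fusion, then DCE away the bits we don't need". Op, its recompute flag,
# and the helper names are illustrative, not Thunder's actual API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    out: str
    name: str
    args: tuple
    recompute: bool = False  # hypothetical tag set by the autodiff transform


def dce(ops, needed_outputs):
    """Keep only ops whose outputs are (transitively) needed."""
    needed = set(needed_outputs)
    kept = []
    for op in reversed(ops):  # walk backwards so consumers mark their producers
        if op.out in needed:
            kept.append(op)
            needed.update(op.args)
    return list(reversed(kept))


def duplicate_for_backward(fusion_ops):
    # Duplicate the whole fusion, then slice the copy down to the tagged outputs.
    tagged = [op.out for op in fusion_ops if op.recompute]
    return dce(list(fusion_ops), tagged)


fusion = [
    Op("t0", "exp", ("x",), recompute=True),  # activation to recompute in backward
    Op("t1", "add", ("t0", "y")),             # forward-only consumer
    Op("t2", "mul", ("t1", "z")),             # forward-only output
]
bw_copy = duplicate_for_backward(fusion)  # only the recompute slice survives
```

In this sketch the backward copy keeps only `t0` and its (absent) producers; the forward-only `t1` and `t2` are eliminated, after which the codegen could be rerun on the sliced copy.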

@beverlylytle
Collaborator Author

Yes, it is the fusion or part of it that gets duplicated, but a fusion region may compute more than what is intended to be recomputed in backward, right? Moreover, the fusion regions for the basic backward together with the fusion regions for the stuff to be recomputed in the backward may not be maximally fused.

@t-vi
Collaborator

t-vi commented Jun 4, 2025

  • a fusion region may compute more than what is intended to be recomputed in backward, right?

Yes, but this could be DCEd and the codegen rerun?

  • Moreover, the fusion regions for the basic backward together with the fusion regions for the stuff to be recomputed in the backward may not be maximally fused.

Indeed, and it is tricky. But our fusion algorithms all seem wild currently. The one used by the JIT by default fuses little (only adjacent regions), and the one used by thunderfx reorders operations, which can wreck memory consumption (we saw this for checkpointing in particular).
I think if we can rerun the codegen, we could at least re-fuse adjacent regions.

But I seem to remember that the interaction of fusions with recomputation was one of the key things you wanted to get out of the joint trace logic, so it might be trickier yet.

The other thing on my list of larger projects that might impact our choices is to have something like

def fn(inp, target):
    out = model(inp)
    loss = lossfn(out, target)
    loss.backward()

jfn = jit(fn, models=(model,))

as one trace. This would let us not split the trace and likely is an important next step (followed by adding optimizer.step()) for optimizing training.

@beverlylytle
Collaborator Author

Not only is it important that the duplicates of the bsyms that should be recomputed exist during the fusion pass, it is also important that they exist during rematerialization. I am going to try an approach where the duplication logic is moved from the splitting up to the creation of the joint trace. I will modify CSE so that no commonality will be found across the forward-backward divide.
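The CSE change described above could be pictured with this toy sketch. It is not Thunder's implementation: `BSym` and the `region` tag are hypothetical stand-ins for bound symbols and a forward/backward marker. The point is that putting the region into the seen-expression key keeps identical expressions from being merged across the divide.

```python
# Toy sketch of a CSE that finds no commonality across the forward-backward
# divide: the region tag is part of the subexpression key, so identical
# expressions in fw and bw are deliberately not merged. BSym and the region
# tag are stand-ins, not Thunder's actual bound-symbol API.
from dataclasses import dataclass


@dataclass(frozen=True)
class BSym:
    out: str
    op: str
    args: tuple
    region: str  # "fw" or "bw" -- hypothetical marker of the divide


def cse_respecting_divide(bsyms):
    seen = {}    # (op, args, region) -> canonical output name
    rename = {}  # eliminated output -> canonical output
    result = []
    for b in bsyms:
        args = tuple(rename.get(a, a) for a in b.args)
        key = (b.op, args, b.region)  # region in the key blocks fw/bw merging
        if key in seen:
            rename[b.out] = seen[key]  # fold duplicate into earlier result
        else:
            seen[key] = b.out
            result.append(BSym(b.out, b.op, args, b.region))
    return result


joint = [
    BSym("t0", "mul", ("a", "b"), "fw"),
    BSym("t1", "mul", ("a", "b"), "fw"),  # duplicate within forward: eliminated
    BSym("t2", "mul", ("a", "b"), "bw"),  # same expr in backward: kept for recompute
]
deduped = cse_respecting_divide(joint)  # t1 folds into t0; t2 survives
```

With an ordinary CSE all three multiplies would collapse into one; here the backward copy survives, so the duplicated bsyms are still present for the fusion pass and rematerialization to work with.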

@t-vi
Collaborator

t-vi commented Jun 6, 2025

To my mind, we could merge this PR as is and do what is needed in a follow up. WDYT? (also @IvanYashchuk )

Collaborator

@t-vi t-vi left a comment


Exciting to have this go in!
Thank you @beverlylytle @riccardofelluga @IvanYashchuk @lantiga

@t-vi t-vi enabled auto-merge (squash) June 6, 2025 13:27
@t-vi t-vi merged commit 290a52e into main Jun 6, 2025
49 checks passed
@t-vi t-vi deleted the reautograd2 branch June 6, 2025 13:28
