[RFC] Consolidate simple_fsdp and compiler_toolkit experiments #2360
Closed
yiming0416 wants to merge 1 commit into main
Contributor
Can we use
tianyu-l (Contributor) reviewed on Feb 12, 2026 and left a comment:
Sorry, I'm doing a massive refactoring of the torchtitan config system. I'm almost done (with the first version), and I'd prefer we rebase after I'm done. (pray)
Contributor (Author)
moved to #2457
This PR merges the `simple_fsdp` and `compiler_toolkit` experiments into a new unified experiment called `graph_based_training` (name to be discussed later).

The two experiments shared the same DTensor-based SimpleFSDP model authoring but had separate compilation paths: `simple_fsdp` used JIT compilation (`torch.compile`) and `compiler_toolkit` used AOT joint graph capture. The new experiment unifies them under a single `compile.mode` config field (`"jit"` or `"aot"`), with a shared pass registry that validates pass/mode compatibility.

This PR creates a separate new folder. No existing files in `simple_fsdp/` or `compiler_toolkit/` are modified.
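To make the "shared pass registry that validates pass/mode compatibility" idea concrete for discussion, here is a minimal sketch in plain Python. It is not the code in this PR: `register_pass`, `resolve_passes`, `PassEntry`, and the two example pass names are hypothetical, and the passes are no-op placeholders rather than real graph transforms.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List

# The two modes named in the PR description.
VALID_MODES: FrozenSet[str] = frozenset({"jit", "aot"})


@dataclass(frozen=True)
class PassEntry:
    """Hypothetical registry record: the pass and the modes it supports."""
    fn: Callable
    modes: FrozenSet[str]


_PASS_REGISTRY: Dict[str, PassEntry] = {}


def register_pass(name: str, modes: Iterable[str]) -> Callable:
    """Register a pass under `name`, declaring which compile modes accept it."""
    declared = frozenset(modes)
    unknown = declared - VALID_MODES
    if unknown:
        raise ValueError(f"unknown compile mode(s): {sorted(unknown)}")

    def decorator(fn: Callable) -> Callable:
        _PASS_REGISTRY[name] = PassEntry(fn=fn, modes=declared)
        return fn

    return decorator


def resolve_passes(mode: str, names: Iterable[str]) -> List[Callable]:
    """Look up passes by name, rejecting any that do not support `mode`."""
    if mode not in VALID_MODES:
        raise ValueError(f"compile.mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    resolved = []
    for name in names:
        entry = _PASS_REGISTRY.get(name)
        if entry is None:
            raise KeyError(f"pass {name!r} is not registered")
        if mode not in entry.modes:
            raise ValueError(f"pass {name!r} is not compatible with mode {mode!r}")
        resolved.append(entry.fn)
    return resolved


# Placeholder passes (assumptions for illustration only).
@register_pass("example_shared_pass", modes=("jit", "aot"))
def example_shared_pass(graph):
    return graph


@register_pass("example_aot_only_pass", modes=("aot",))
def example_aot_only_pass(graph):
    return graph
```

With this shape, requesting `example_aot_only_pass` under `"jit"` fails at config-resolution time instead of deep inside compilation, which is presumably the point of validating compatibility in the registry.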
File change breakdown
Files copied without changes:
- `simple_fsdp.py`: Copied from `simple_fsdp/simple_fsdp.py`.
- `reshard_after_forward.py`: Copied from `simple_fsdp/reshard_after_forward.py`.
- `cudagraph.py`: Copied from `compiler_toolkit/cudagraph.py`.

Files copied with import path changes only:
- `common_utils.py`: Adapted from `compiler_toolkit/common_utils.py`.
- `graph_utils.py`: Adapted from `compiler_toolkit/graph_utils.py`.
- `jit_backend.py`: Adapted from `simple_fsdp/backend.py`.
- `train.py`: Adapted from `compiler_toolkit/train.py`.
- `llama3/__init__.py`: Adapted from `simple_fsdp/llama3/__init__.py`.
- `llama3/model.py`: Adapted from `simple_fsdp/llama3/model.py`.
- `deepseek_v3/__init__.py`: Adapted from `simple_fsdp/deepseek_v3/__init__.py`.
- `deepseek_v3/model.py`: Adapted from `simple_fsdp/deepseek_v3/model.py`.

Files adapted with non-trivial changes:
- `passes.py`: Unified pass registry.
- `compilation.py`: Unified compilation dispatcher routing to `_apply_jit()` and `_apply_aot()`.
- `job_config.py`: Merged config with mode `aot`, `jit`.
- `llama3/parallelize.py`: Unified parallelize function merging logic from `simple_fsdp` and `compiler_toolkit`.
- `deepseek_v3/parallelize.py`: Same as above, but for DSv3.
- `tests/integration_tests.py`: Merged integration tests from both experiments.
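For reviewers who want a picture of the dispatcher in `compilation.py`, the routing on `compile.mode` could look roughly like the sketch below. Only the names `_apply_jit` and `_apply_aot` come from the description above; `CompileConfig`, `apply_compilation`, and the function bodies are assumptions, with tuples standing in for the real `torch.compile` and AOT joint graph capture paths.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class CompileConfig:
    """Hypothetical stand-in for the merged config's compile section."""
    mode: str = "jit"  # "jit" or "aot"


def _apply_jit(model: Any, config: CompileConfig) -> Tuple[str, Any]:
    # Placeholder: the real path would wrap the model with torch.compile.
    return ("jit", model)


def _apply_aot(model: Any, config: CompileConfig) -> Tuple[str, Any]:
    # Placeholder: the real path would capture and compile an AOT joint graph.
    return ("aot", model)


_DISPATCH: Dict[str, Callable[[Any, CompileConfig], Tuple[str, Any]]] = {
    "jit": _apply_jit,
    "aot": _apply_aot,
}


def apply_compilation(model: Any, config: CompileConfig) -> Tuple[str, Any]:
    """Route to the JIT or AOT path based on config.mode."""
    handler = _DISPATCH.get(config.mode)
    if handler is None:
        raise ValueError(
            f"compile.mode must be one of {sorted(_DISPATCH)}, got {config.mode!r}"
        )
    return handler(model, config)
```

A table-driven dispatch keeps the set of valid modes in one place, so adding a third mode later would mean one new handler and one new dictionary entry.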