
Conversation

@SherlockNoMad (Contributor) commented Oct 31, 2025

Needs to run with the fix in pytorch/pytorch#166702:

NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

Current output: P2016557983

Observations

  • Each TransformerBlock becomes its own subgraph (see subgraph_0, subgraph_2, ...). This is not what we want: there should be a single subgraph_0, with multiple invoke_subgraph nodes all calling that same subgraph_0 but fed different layer weights (a minimal sketch of the intended structure follows this list).
  • Due to activation checkpointing (AC), we also get hop.tag_activation_checkpoint(subgraph_1), where subgraph_1 internally calls invoke_subgraph for the TransformerBlock, so we end up in a nested HOP/subgraph region.
  • dynamo_graph_capture passes; we currently fail on aot_export_joint. Looks like a rough edge in the DTensor x Dynamo interaction.
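
For reference, here is a minimal hedged sketch (not the torchtitan/compiler_toolkit code) of the structure we want Dynamo to capture: a toy model with repeated blocks whose forward is marked with torch.compiler.nested_compile_region, printed via a trivial print-only torch.compile backend. ToyBlock, ToyModel, and print_backend are hypothetical names for illustration, and this assumes a recent PyTorch build where nested_compile_region lowers to invoke_subgraph.

```python
# Hypothetical sketch only: a toy stand-in for repeated TransformerBlocks,
# assuming torch.compiler.nested_compile_region is available and lowers to
# invoke_subgraph HOPs in the Dynamo-captured graph.
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a TransformerBlock; each instance owns its own weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    # Marking the block body as a nested compile region is what should produce a
    # single shared subgraph_0, reused by one invoke_subgraph node per layer.
    @torch.compiler.nested_compile_region
    def forward(self, x):
        return torch.relu(self.linear(x))


class ToyModel(nn.Module):
    def __init__(self, dim: int = 16, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(ToyBlock(dim) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def print_backend(gm, example_inputs):
    # Print the Dynamo-captured graph so the subgraph/invoke_subgraph structure
    # is visible, then fall back to running the graph module eagerly.
    gm.print_readable()
    return gm.forward


model = ToyModel()
x = torch.randn(2, 16)
compiled = torch.compile(model, backend=print_backend, fullgraph=True)
compiled(x)

# Desired output shape: one subgraph_0 attribute plus N invoke_subgraph calls,
# each passing a different layer's weights as inputs, rather than a fresh
# subgraph_N per TransformerBlock.
```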

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 31, 2025
@miladm commented Nov 5, 2025

cc @williamwen42
