transformers backend (Dense model only) #2048
tianyu-l merged 132 commits into pytorch:main
Conversation
… gradnorm and less tps with HF model
wwwjn left a comment:
Thanks for the great work again, left some comments.
    setattr(model, module_name, None)
    # Replace with Identity or None based on configuration
    replacement = (
        nn.Identity() if use_identity_for_missing_modules else None
Could you quickly remind me why we need to use Identity() here?
I think it's because HF defines their models without guards like `if tok_embeddings is None`.
I still worry that such identities break DCP and could be the source of the PP numerics issue. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
> The concrete question is, when loading from seed checkpoint, are all the PP ranks restored perfectly?

Seems like PP ranks are restored perfectly, because we get a perfect match with Qwen but not with Llama, for example (cf. the screenshot at huggingface#4).
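The replacement discussed in this thread can be sketched in plain Python. This is a toy illustration of the setattr-based pattern quoted above, not torchtitan's actual helper; `Identity`, `replace_missing_modules`, and the attribute names are stand-ins (the real code uses `torch.nn.Identity` and HF module names):

```python
class Identity:
    """Minimal stand-in for torch.nn.Identity: returns its input unchanged."""
    def __call__(self, x):
        return x


def replace_missing_modules(model, missing_names, use_identity=True):
    """Replace modules that don't belong to this pipeline stage.

    Each missing module becomes an Identity (so HF forward code that calls
    it unconditionally still works) or None, mirroring the quoted diff.
    """
    for name in missing_names:
        replacement = Identity() if use_identity else None
        setattr(model, name, replacement)
    return model


# Usage on a toy object standing in for a truncated PP-stage model:
class Stage:
    pass

stage = Stage()
replace_missing_modules(stage, ["embed_tokens", "lm_head"])
print(stage.embed_tokens("hidden"))  # Identity passes the activation through: "hidden"
```

The trade-off raised above is visible here: with `use_identity=False` the forward code must guard every call with `if module is not None`, which HF model definitions do not do.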
    )

    def apply_fsdp(
Reading this function, it is the same as the `apply_fsdp` function in llama4/parallelize (I know we will keep MoE capability for the next PR). Can we reuse the `apply_fsdp` function from llama4 and avoid keeping multiple copies?
Oh, I see the difference. The only change is `moe_block = transformer_block.mlp` at line 337: in transformers models, the MoE module is named `mlp` instead of `moe`. Can we use the same getter/setter approach to rename it in model.py, so we can reuse the `apply_fsdp` function from llama4?
I don't have a strong opinion on this, but I'm a little concerned that if we keep several copies, they will diverge easily in the future.
Valid concern. I'll reuse FSDP from llama3 for now, as this PR handles only dense models. It will make more sense to handle the getter/setter in the MoE PR.
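The getter/setter rename floated in this thread could look like the following minimal sketch. `HFTransformerBlock` is a toy class, not the actual transformers or torchtitan code; it only shows how a `moe` property aliasing `mlp` would let shared code written against torchtitan's `moe` name work unchanged:

```python
class HFTransformerBlock:
    """Toy stand-in for a HF transformer block whose MoE module is named `mlp`."""

    def __init__(self, mlp):
        self.mlp = mlp

    # Hypothetical alias so code written against torchtitan's `moe` attribute
    # (e.g. llama4's apply_fsdp) can read and write this block unchanged.
    @property
    def moe(self):
        return self.mlp

    @moe.setter
    def moe(self, value):
        self.mlp = value


block = HFTransformerBlock(mlp="moe_module")
print(block.moe)       # reads through to block.mlp: "moe_module"
block.moe = "sharded"  # writes through as well
print(block.mlp)       # "sharded"
```

Since `apply_fsdp` replaces submodules in place, the setter is what makes this work: assigning a sharded module to `block.moe` must actually land on `block.mlp`.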
torchtitan/experiments/transformers_backend/tests/integration_tests.py
tianyu-l left a comment:
Please address final comments.
It sounds like the changes are caused by the specific ways transformers defines models. Then let's fork the changed functions into experiments/transformers_backend/. I apologize for the back & forth.
But isn't the compromise good enough? Copy-pasting means not noticing later changes in pipeline parallel.
For rotary_emb, torchtitan doesn't own the model definition, so it has no visibility into this module and no guarantee of correctness. That's why I think it's better for the transformers_backend folder to own this function.
Regarding use_identity_for_missing_modules, I'm not convinced it would work with DCP. If it's a transformers-specific decision, we should also limit the scope of the change to the transformers_backend folder.
Thanks!
torchtitan/distributed/utils.py
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Otherwise, Huggingface modeling registers a buffer for RoPE (inv_freq) and it will by default be initialized to NaN
    torch.utils.deterministic.fill_uninitialized_memory = False
If you think this is hf specific and can be put in model.py, let's do it.
    OverrideDefinitions(
        [
            [
                "--model.name meta-llama/Llama-3.2-1B",
CI seems to be failing because of this. Should we change it to transformers_backend and specify --hf_transformers.model?
Force-pushed from bcf5355 to c0c273c (compare)
tianyu-l left a comment:
It seems CI is not running on this change; please see inline comments.
I also left some other remaining minor comments.
torchtitan/experiments/README.md
| [moe_symm_mem_kernels](./moe_symm_mem_kernels/) | TBA | [@kwen2501](https://github.com/kwen2501) |
| [gpt_oss](./gpt_oss/) | TBA | [@jianiw](https://github.com/jianiw) |
| [compiler_toolkit](./compiler_toolkit/) | [](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu_compiler_toolkit.yaml?query=branch%3Amain) | [@SherlockNoMad](https://github.com/SherlockNoMad) [@yiming0416](https://github.com/yiming0416) |
| [transformers_backend](./transformers_backend/) |  | [@3outeille](https://github.com/3outeille) |
This is not properly linked to the actual tests. Please refer to how others are done.
    same root config file.
    """
    integration_tests_flavors = [
        OverrideDefinitions(
This is missing `--model.name transformers_backend`, so the CI is actually running llama3 from the ./tests/integration_tests/base_config.toml file:
https://github.com/pytorch/torchtitan/actions/runs/19500238760/job/55877797776?pr=2048#step:16:392
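The fix amounts to making the test's effective config match the one in the PR description. A hypothetical sketch of the relevant fragment (the HF model id is simply the one used elsewhere in this PR; in the actual tests these would be passed as `--model.name` / `--hf_transformers.model` overrides):

```toml
# Hypothetical effective config for the integration test
[model]
name = "transformers_backend"
flavor = "debugmodel"

[hf_transformers]
model = "Qwen/Qwen3-4B-Instruct-2507"
```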
    [parallelism]
    data_parallel_replicate_degree = 1
    data_parallel_shard_degree = 2
Let's restore this to -1, and others to 1, so it's consistent with other debug tomls.
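Applied to the quoted fragment, the suggested defaults would read as follows (a sketch covering only the two fields shown in the diff; the comment on `-1` reflects the usual torchtitan convention and is my assumption):

```toml
[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1  # -1 infers the degree from the world size, as in other debug tomls
```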
    mixed_precision_param = "float32"  # force float32 for comparison
    mixed_precision_reduce = "float32"
Can we remove these two fields, so the default mixed_precision_param is bf16?
    [model]
    name = "transformers_backend"
    flavor = "debugmodel"
I think it doesn't hurt to create two tomls: one with debugmodel and the c4_test dataset, the other with full using the c4 dataset.
Let's still name this pipeline.py (just convention, no real reason).
Kudos everyone! Thanks
# Context

Reference PR: huggingface#1

This PR enables:
- Llama-like HF models to work with 4D parallelism: FSDP, CP, TP, PP (and the combinations between them). The following models were tested:
  - `meta-llama/Llama-3.2-1B`
  - `microsoft/phi-2`
  - `Qwen/Qwen2.5-7B`
  - `mistralai/Mistral-7B-v0.1`
  - `ByteDance-Seed/Seed-Coder-8B-Instruct`
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `arcee-ai/AFM-4.5B`
  - `ibm-granite/granite-3b-code-base-2k`
  - `baidu/ERNIE-4.5-0.3B-Base-PT`
  - `kyutai/helium-1-preview-2b`
  - `allenai/OLMo-7B-hf`
  - `mistralai/Ministral-8B-Instruct-2410`
- Patching HF model weight initialisation. Without this, the `loss` and `grad_norm` start very high.

# Usage

- Requirements: `transformers==4.57.1`
- Config: `torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml`

```diff
...
[model]
- name = "llama3"
+ name = "transformers_backend"
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"

+[hf_transformers]
+model = "Qwen/Qwen3-4B-Instruct-2507"
...
```

- Train: `LOG_RANK=7 CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml ./run_train.sh --job.custom_config_module=torchtitan.experiments.transformers_backend.job_config --compile.enable`

<img width="1334" height="453" alt="image" src="https://github.com/user-attachments/assets/da459448-027b-4af9-8176-6a3e433a272c" />

# Testing methodology

<img width="2672" height="2018" alt="image" src="https://github.com/user-attachments/assets/66d8689d-7ede-47e3-b389-d4fc1bdd70f7" />

- Following the [converging.md](https://github.com/pytorch/torchtitan/blob/main/docs/converging.md) guidelines, I am comparing the baseline `FSDP=2` vs `FSDP=2 & <other //-ism>`.
- More precisely, `test_hf_integration.py` is going to produce:

```bash
results/
|_ meta-llama
   |_ Llama-3.2-1B
      |_ debugmodel/
         |_ seed_checkpoint/
            |_ config.toml
            |_ seed.slurm
            |_ step-0/
            |_ ....
         |_ fsdp2_tp1_cp1_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
         |_ fsdp2_tp2_cp1_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp1_pp2/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp2_pp1/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
         |_ fsdp2_tp1_cp2_pp2/
            |_ config.toml
            |_ nd_parallelism.slurm
            |_ nd_parallelism.log
            |_ diff_baseline_vs_nd_parallelism.log
      |_ full/
         ...
```

- Here is the grid search used to test the HF modeling:

```shell
#!/usr/bin/bash
model_names=(
    "meta-llama/Llama-3.2-1B"
    "microsoft/phi-2"
    "Qwen/Qwen2.5-7B"
    "mistralai/Mistral-7B-v0.1"
    "ByteDance-Seed/Seed-Coder-8B-Instruct"
    "Qwen/Qwen3-4B-Instruct-2507"
    "arcee-ai/AFM-4.5B"
    "ibm-granite/granite-3b-code-base-2k"
    "baidu/ERNIE-4.5-0.3B-Base-PT"
    "kyutai/helium-1-preview-2b"
    "allenai/OLMo-7B-hf"
    "mistralai/Ministral-8B-Instruct-2410"
)

for model_name in "${model_names[@]}"; do
    rm -rf slurm_results/${model_name}
    python test_hf_integration.py create_configs --model_name "$model_name" --out_dir slurm_results --flavor debugmodel
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel/seed_checkpoint --qos high
    while [ ! -f slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt ] || [ "$(cat slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt)" != "completed" ]; do
        echo "Waiting for seed checkpoint from ${model_name} to complete ..."
        sleep 1
    done
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel --qos high
    echo "================"
done
```

# Further tasks

- MoE (handled in PR huggingface#3)
  - Missing `build_optimizers_with_moe_load_balancing` support for MoE
  - Missing TP/PP/EP support for MoE
- When using the HF modeling, in the `FSDP=2 vs FSDP=2 + PP=2` test, the `loss` and `grad_norm` are not bitwise matching (but converging), while they are with the Torchtitan modeling (issue tracked in huggingface#4)
- Add convergence tests to CI using a tiny model + gloo backend (once PP is bitwise matching)
- The HF modeling has lower MFU than the Torchtitan modeling
- NOTE: set `import torch._dynamo.config; torch._dynamo.config.cache_size_limit = 128` to avoid graph recompilation when using `torch.compile` and activation checkpointing