740beac
[torchtitan][replicate] experimenting new replicate integration with torchtitan
anshul-si Sep 15, 2025
Update on "[torchtitan][replicate] experimenting new replicate integration with torchtitan"

**Summary:** For this experiment integrating the new replicate API into torchtitan, I built on pytorch/pytorch#162021, which has not yet landed. However, since that PR is about making replicate more efficient rather than changing replicate's core code, pytorch/pytorch#160135, which has landed, should be sufficient. pytorch/pytorch#160133 is the most recent change to replicate_with_fsdp.py and its replicate API.

To enable the new replicate, which uses a 2-D device mesh (it is a specialized form of HSDP), I changed the parallelism code to include a dp_shard dimension of size 1 only when dp_replicate > 1, and created a device mesh that is passed down to apply_ddp.
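The mesh-shape logic described above can be sketched as a small helper (hypothetical names; the actual torchtitan code differs, and the resulting shape would be handed to something like `torch.distributed.device_mesh.init_device_mesh`):

```python
def build_dp_mesh_shape(dp_replicate: int, dp_shard: int):
    """Return (dim_names, dim_sizes) for the data-parallel device mesh.

    Sketch of the rule from this PR: the new replicate API is a
    specialized HSDP, so whenever dp_replicate > 1 it expects a 2-D
    ['dp_replicate', 'dp_shard'] mesh, with the shard dimension
    degenerating to size 1 in the pure-replicate case.
    """
    if dp_replicate > 1:
        # New replicate path: 2-D mesh; shard dim is at least 1.
        return (("dp_replicate", "dp_shard"), (dp_replicate, max(dp_shard, 1)))
    # Pure FSDP / single-degree path: 1-D mesh, no replicate dim.
    return (("dp_shard",), (dp_shard,))
```

With replicate set to 8 and no sharding, this yields the `['dp_replicate', 'dp_shard'], [8, 1]` mesh shown in the log below.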

The numeric tests for TP + replicate and PP + replicate are shown below. To validate them, I also compared against HSDP with mesh (n, 1) (replicate, shard).

<img width="950" height="485" alt="image" src="https://github.com/user-attachments/assets/a7bede55-54af-43f4-9fa0-4430f1992d73" />

https://fburl.com/mlhub/5k9v43w3

**Test Case**
1. `CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh` (set replicate to 8)
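For the "set replicate to 8" step, a hedged sketch of the config override (section and key names assume torchtitan's parallelism config and should be checked against the checkout):

```toml
# Hypothetical override in debug_model.toml; verify key names
# against the torchtitan version in use.
[parallelism]
data_parallel_replicate_degree = 8
data_parallel_shard_degree = 1
```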

Expected output of this experiment should be something like:
[rank0]:[titan] 2025-09-15 17:38:26,676 - root - INFO - Starting job: Llama 3 debug training
[rank0]:[titan] 2025-09-15 17:38:29,094 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
**[rank0]:[titan] 2025-09-15 17:38:29,097 - root - INFO - Building 2-D device mesh with ['dp_replicate', 'dp_shard'], [8, 1]**
[rank0]:[titan] 2025-09-15 17:38:29,104 - root - INFO - [GC] Initial GC collection 0.00 seconds
[rank0]:NCCL version 2.27.5+cuda12.6
[rank0]:[titan] 2025-09-15 17:38:35,439 - root - INFO - Loading tokenizer from tokenizer.json
[rank0]:[titan] 2025-09-15 17:38:35,441 - root - INFO - Preparing c4_test dataset from tests/assets/c4_test
[rank0]:[titan] 2025-09-15 17:38:35,894 - root - INFO - Building llama3 debugmodel with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=256, n_layers=6, n_heads=16, n_kv_heads=None, vocab_size=2000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, rope_theta=500000, max_seq_len=2048, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=0)
[rank0]:[titan] 2025-09-15 17:38:35,931 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-09-15 17:38:35,950 - root - INFO - Model llama3 debugmodel size: 6,139,136 total parameters
[rank0]:[titan] 2025-09-15 17:38:35,951 - root - INFO - Applied selective activation checkpointing to the model
**[rank0]:[titan] 2025-09-15 17:38:35,972 - root - INFO - Applied DDP to the model**
[rank0]:[titan] 2025-09-15 17:38:36,153 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-09-15 17:38:36,153 - root - INFO - CUDA memory usage for model: 0.04GiB(0.04%)
[rank0]:[titan] 2025-09-15 17:38:36,154 - root - WARNING - model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
[rank0]:[titan] 2025-09-15 17:38:36,154 - root - INFO - Mixed precision training is handled by AMP
[rank0]:[titan] 2025-09-15 17:38:36,154 - root - INFO - Trainer is initialized with local batch size 8, global batch size 64, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2)




[ghstack-poisoned]
anshul-si committed Feb 12, 2026
commit dd30603314c6d4610372408f6bf7fd4b959361ce
