TorchTitan e2e test on torchcomms device mesh #1847
Conversation
Summary: Composability testing with TorchComms and distributed training in TorchTitan.
- Training with `torchcomms.new_comm`
- Device mesh initialization with `torchcomms.init_device_mesh`
- Integration and testing with `fully_shard`

Differential Revision: D82171763
Force-pushed 36ab517 to 7c6435e, then 7c6435e to 8998f1c.
tianyu-l
left a comment
Thanks for the PR! The change looks interesting!
Is this PR for exploration, or is it ready to ship to the community? If it's the former, could we start with a branch / fork instead of experiments?
Sorry to put a hold on this before we get more context.
tianyu-l
left a comment
OK, got clarifications offline. I think it's OK to host this experiment. To land, we'll need to:
- simplify the code, as I believe a lot of existing components could be reused
- set up a PoC for this folder (let's work together on this)
Sorry for the late reply. It's planned to ship to the community as a use case for torchcomms.
I was cleaning up the code; the main change here is the device mesh initialization.
How can I set up the PoC?
Force-pushed 8998f1c to a6b3a47, a6b3a47 to 09e6610, then 09e6610 to 288f37f.
I'll do this with a PR shortly. Should we assign you as the PoC?
Yeah, please. There will be some further changes to enable other parallelisms and related tests.
fegin
left a comment
It is reasonable to duplicate `ParallelDims._build_mesh_without_ep`, but `Trainer.__init__` seems to be mostly the same, and `Trainer.__init__` is very long, so it is not easy to spot the difference. Can you point out what changes in `Trainer.__init__`? We can brainstorm how to further minimize the duplication.
| ---
| #### Example
| ```bash
| TEST_BACKEND=nccl ./run_train.sh --model.name torchcomms
This doesn't seem to be correct. You will at least need to specify CONFIG_FILE.
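For reference, a command including a config file might look like the sketch below. The path is an assumption, inferred from the llama3_8b.toml config mentioned in the test plan; adjust it to your checkout.

```bash
# Hedged example: CONFIG_FILE path assumed from the Llama 3 8B config
# referenced in this PR's test plan.
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" \
TEST_BACKEND=nccl \
TRAIN_FILE=torchtitan.experiments.torchcomms.train \
./run_train.sh --model.name torchcomms
```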
| - Training with `torchcomms.new_comm`
| - Device mesh initialization with `torchcomms.init_device_mesh`
| - **Composability Testing**
| - Integration and testing with `fully_shard` (FSDP)
Is this FSDP2 only? I thought you also verified it with TP. cc @fduwjj
We are working on N-D now; we will update the README later.
@fegin there are still some gaps on the N-D side, so we aim to merge this PR with 1D only first. This scales down the scope of this PR; we will have more PRs down the road.
| # init distributed and build meshes
| dist_utils.init_distributed(
|     job_config.comm,
|     enable_cpu_backend=job_config.training.enable_cpu_offload,
|     base_folder=job_config.job.dump_folder,
| )
| world_size = int(os.environ["WORLD_SIZE"])
| parallelism_config = job_config.parallelism
| self.parallel_dims = parallel_dims = ParallelDimsForComms(
|     dp_shard=parallelism_config.data_parallel_shard_degree,
|     dp_replicate=parallelism_config.data_parallel_replicate_degree,
|     cp=parallelism_config.context_parallel_degree,
|     tp=parallelism_config.tensor_parallel_degree,
|     pp=parallelism_config.pipeline_parallel_degree,
|     ep=parallelism_config.expert_parallel_degree,
|     etp=parallelism_config.expert_tensor_parallel_degree,
|     world_size=world_size,
| )
IIUC, only this part of the initialization is changed. Is this correct? Or can you point out other things you changed?
Yeah, I had some other changes before, but now that's the only change.
I will try to find a way to call `ParallelDimsForComms` here while avoiding copying `Trainer.__init__`.
You can give the original Trainer a class variable, e.g. `parallel_dims_cls`, and use that variable in `__init__` to construct `self.parallel_dims = parallel_dims`. Then you can just create a CommTrainer and replace that class variable.
Another approach is to factor the following code into a method, e.g. `def create_parallel_dims(self, config) -> None:`.
self.parallel_dims = parallel_dims = ParallelDimsForComms(
    dp_shard=parallelism_config.data_parallel_shard_degree,
    dp_replicate=parallelism_config.data_parallel_replicate_degree,
    cp=parallelism_config.context_parallel_degree,
    tp=parallelism_config.tensor_parallel_degree,
    pp=parallelism_config.pipeline_parallel_degree,
    ep=parallelism_config.expert_parallel_degree,
    etp=parallelism_config.expert_tensor_parallel_degree,
    world_size=world_size,
)
Sounds OK. I prefer the second option as it's a bit more straightforward. Maybe it should be called `_create_parallel_dims`, as it's not supposed to be called from outside.
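A minimal self-contained sketch of the agreed approach (toy field names; the real `Trainer` and `ParallelDims` live in torchtitan and carry many more fields):

```python
import os
from dataclasses import dataclass


@dataclass
class ParallelDims:
    dp_shard: int
    dp_replicate: int
    tp: int
    world_size: int


@dataclass
class ParallelDimsForComms(ParallelDims):
    # In the PR, this subclass builds its device mesh via torchcomms instead.
    pass


class Trainer:
    def __init__(self, parallelism_config: dict) -> None:
        # The long __init__ stays here unchanged; only the hook below differs
        # between the base trainer and the torchcomms experiment.
        world_size = int(os.environ.get("WORLD_SIZE", "1"))
        self.parallel_dims = self._create_parallel_dims(parallelism_config, world_size)

    def _create_parallel_dims(self, parallelism_config: dict, world_size: int) -> ParallelDims:
        return ParallelDims(world_size=world_size, **parallelism_config)


class CommsTrainer(Trainer):
    def _create_parallel_dims(self, parallelism_config: dict, world_size: int) -> ParallelDims:
        # The only override: swap in the torchcomms-aware ParallelDims.
        return ParallelDimsForComms(world_size=world_size, **parallelism_config)


if __name__ == "__main__":
    trainer = CommsTrainer({"dp_shard": 2, "dp_replicate": 1, "tp": 1})
    assert isinstance(trainer.parallel_dims, ParallelDimsForComms)
```

The alternative (a `parallel_dims_cls` class variable) works the same way, but the method hook keeps the construction logic in one overridable place.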
fduwjj
left a comment
Can you run a job and paste the loss curve from TensorBoard here?
Force-pushed 7de8101 to b270b5a.
Also, we will have more convergence and perf tests down the road as follow-up PRs.
| @@ -0,0 +1,20 @@
| # TorchTitan & TorchComms Composability Testing
|
| This repository provides a framework for composability testing with **TorchComms** and distributed training in **TorchTitan**. The goal is to enable flexible experimentation with distributed communication primitives and parallelism strategies in PyTorch.
This is currently in bold font and looks a bit obtrusive. Could you adjust it to use a plain font?
| @@ -0,0 +1,20 @@
| # TorchTitan & TorchComms Composability Testing
|
| This repository provides a framework for composability testing with **TorchComms** and distributed training in **TorchTitan**. The goal is to enable flexible experimentation with distributed communication primitives and parallelism strategies in PyTorch.
Suggested change:
| - This repository provides a framework for composability testing with **TorchComms** and distributed training in **TorchTitan**. The goal is to enable flexible experimentation with distributed communication primitives and parallelism strategies in PyTorch.
| + This folder provides a framework for composability testing with **TorchComms** and distributed training in **TorchTitan**. The goal is to enable flexible experimentation with distributed communication primitives and parallelism strategies in PyTorch.
| This repository provides a framework for composability testing with **TorchComms** and distributed training in **TorchTitan**. The goal is to enable flexible experimentation with distributed communication primitives and parallelism strategies in PyTorch.
| ---
| #### Example
Mention that the command below uses Llama 3 as an example, but it should work with all models.
| ---
| #### Example
| ```bash
| TEST_BACKEND={backend} TRAIN_FILE=torchtitan.experiments.torchcomms.train ./run_train.sh --model.name torchcomms
`TEST_BACKEND={backend}`
What should this be?
Users can input the backend they want to use, e.g. nccl or another backend.
It's a bit confusing here; I will change it to TEST_BACKEND=nccl.
Can we mention all the available backends? From the README it's hard to tell what people should put here.
Let's mention nccl, gloo, or any other user-defined custom backend for now. Also, let's mention that a custom backend needs to implement the TorchComm wrapper. (We just don't mention the backends that cannot be mentioned at this moment.)
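In README terms, that could look like the following sketch (commands mirror the example above; gloo shown as the CPU option):

```bash
# NCCL backend (GPU)
TEST_BACKEND=nccl TRAIN_FILE=torchtitan.experiments.torchcomms.train ./run_train.sh --model.name torchcomms

# Gloo backend (CPU), or any custom backend that implements the TorchComm wrapper
TEST_BACKEND=gloo TRAIN_FILE=torchtitan.experiments.torchcomms.train ./run_train.sh --model.name torchcomms
```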
| from torchtitan.models.llama3.infra.parallelize import parallelize_llama
| from torchtitan.protocols.train_spec import register_train_spec, TrainSpec
|
| register_train_spec(
Why do you need to register this TrainSpec?
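For context on what registration buys: `--model.name torchcomms` is resolved by name against the registered specs. A toy sketch of that mechanism (illustrative only; field names here are not torchtitan's actual `TrainSpec` fields):

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Toy registry: register_train_spec in torchtitan works on the same principle,
# mapping a name (used by --model.name) to the components to train with.

@dataclass
class ToyTrainSpec:
    name: str
    parallelize_fn: Callable  # e.g. parallelize_llama in this PR

_REGISTRY: Dict[str, ToyTrainSpec] = {}

def register_toy_spec(spec: ToyTrainSpec) -> None:
    if spec.name in _REGISTRY:
        raise ValueError(f"spec {spec.name!r} already registered")
    _REGISTRY[spec.name] = spec

def get_toy_spec(name: str) -> ToyTrainSpec:
    return _REGISTRY[name]

# Registering under "torchcomms" is what makes
# `--model.name torchcomms` resolve to this experiment.
register_toy_spec(ToyTrainSpec(name="torchcomms", parallelize_fn=lambda m: m))
assert get_toy_spec("torchcomms").parallelize_fn is not None
```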
| ---
| #### Example
| ```bash
| TEST_BACKEND={backend} TRAIN_FILE=torchtitan.experiments.torchcomms.train ./run_train.sh --model.name torchcomms
Let's set `CONFIG_FILE` here, too. You can refer to the examples in the main README.md.
torchtitan/train.py
Outdated
| f"(warmup {job_config.lr_scheduler.warmup_steps})" | ||
| ) | ||
|
|
||
| def create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims: |
Suggested change:
| - def create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims:
| + def _create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims:
| class CommsTrainer(Trainer):
|     parallel_dims: ParallelDimsForComms
|
|     def create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims:
Suggested change:
| - def create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims:
| + def _create_parallel_dims(self, parallelism_config, world_size) -> ParallelDims:
| from .parallel_dims import ParallelDimsForComms
|
|
| class CommsTrainer(Trainer):
Suggested change:
| - class CommsTrainer(Trainer):
| + class TorchCommsTrainer(Trainer):
| @dataclass
| class ParallelDimsForComms(ParallelDims):
Suggested change:
| - class ParallelDimsForComms(ParallelDims):
| + class TorchCommsParallelDims(ParallelDims):
Shall we remove this file?
I think we can remove this file for now.
| @@ -0,0 +1,22 @@
| # TorchTitan & TorchComms Composability Testing
|
| This folder provides a framework for composability testing with TorchComms and distributed training in TorchTitan. The goal is to enable flexible experimentation with distributed communication primitives and parallelism strategies in PyTorch.
Seems the font hasn't been fixed.
torchcomms
Outdated
For now, we cannot mention too many details. We will add more context when it goes public. We need to merge this PR first so that the titan integration can go out with the torchcomms release.
@mori360 let's add a TODO here to add more explanation once torchcomms goes public.
Looks like you have a lint error as well?
fduwjj
left a comment
Thanks for doing this, looks good to me now.
| - Integration and testing with `fully_shard` (FSDP)
| ---
| ### To Be Added
| - Integration and testing with additional parallelism strategies (e.g., tensor, pipeline, model parallelism) other than fully_shard
Can you remove model parallelism or replace it with context parallelism? Thanks.

Summary:
Composability testing with TorchComms and distributed training in TorchTitan.
torchcomms.new_commtorchcomms.init_device_meshfully_shardDifferential Revision: D82171763
Test plan:
TEST_BACKEND=nccl TRAIN_FILE=torchtitan.experiments.torchcomms.train ./run_train.sh --model.name torchcomms
Loss curve:
running 1000 steps on llama3_8b.toml