
[sglang] fix: Bug in megatron+sglang TP16 update_weights.#2336

Merged
zhaochenyang20 merged 4 commits into verl-project:main from SuperCB:megatron_sg_bug
Jul 9, 2025

Conversation

@SuperCB
Contributor

@SuperCB SuperCB commented Jul 3, 2025

What does this PR do?

We observe the following when using Megatron + Sglang + TP16:

[screenshot of the observed error]

After investigation, we found that this was caused by the CUDA IPC mechanism, which does not support cross-machine access. This PR fixes that bug.
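For context: a CUDA IPC handle is a pointer into device memory that is only valid on the machine that exported it, so a weight-sync path that hands one rank's handles to a process on another node will fail under TP16. The fix follows the FSDP path: each rank serializes its own local shard and only the serialized blobs are gathered. A minimal pure-Python sketch of that gather-of-serialized-blobs pattern (pickle stands in for SGLang's MultiprocessingSerializer, and all names here are illustrative, not from the PR):

```python
import pickle

def serialize_weight(name, payload):
    """Turn a (name, payload) pair into bytes that can cross process
    and machine boundaries, unlike a CUDA IPC handle."""
    return pickle.dumps((name, payload))

def gather_to_rank0(local_blob, rank, world_size, transport):
    """Mimic dist.gather_object: every rank contributes its serialized
    blob; only rank 0 ends up with the ordered list of all blobs."""
    transport[rank] = local_blob  # stand-in for the network collective
    if rank == 0:
        return [transport[r] for r in range(world_size)]
    return None

# Simulate two TP ranks on different machines holding shards of one weight.
transport = {}
gather_to_rank0(serialize_weight("layer0.weight", [3.0, 4.0]),
                rank=1, world_size=2, transport=transport)
gathered = gather_to_rank0(serialize_weight("layer0.weight", [1.0, 2.0]),
                           rank=0, world_size=2, transport=transport)
shards = [pickle.loads(blob) for blob in gathered]
```

The destination rank deserializes blobs rather than dereferencing remote device pointers, which is why this pattern survives node boundaries.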

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

High-Level Design

[high-level design diagram]

Specific Changes

List the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLAassistant commented Jul 3, 2025

CLA assistant check
All committers have signed the CLA.

@zhaochenyang20
Collaborator

Could you provide end-to-end docs to reproduce and verify your fix, like this:

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/tool_examples/debug.md

@SuperCB
Contributor Author

SuperCB commented Jul 3, 2025

Could you provide end-to-end docs to reproduce and verify your fix, like this:

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/tool_examples/debug.md

This might take some time because our code and configurations are for training internal, self-developed models. We need to prepare reproducible configurations for training on open-source models.

@ETOgaosion
Collaborator

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

@zhaochenyang20
Collaborator

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

Let me discuss with our team

@Hecate0821
Contributor

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

When passing a tensor across processes, it should be serialized first. You can find a similar implementation in fsdp_sglang.py.

@SuperCB
Contributor Author

SuperCB commented Jul 4, 2025

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

It's a good idea.

@zhaochenyang20
Collaborator

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.
@zhaochenyang20 What do you think of integrating this in SGLang?

It's a good idea.

I think it's better to serialize in verl and only pass the handle tuple to SGLang, like FSDP does:

import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor
from sglang.srt.model_executor.model_runner import LocalSerializedTensor
from sglang.srt.utils import MultiprocessingSerializer

def _preprocess_tensor_for_update_weights(tensor: torch.Tensor):
    # Materialize a DTensor into a full local tensor before serialization.
    if isinstance(tensor, DTensor):
        return tensor.full_tensor()
    return tensor

async def update_weights(self, params):
    named_tensors = [(k, v) for k, v in params.items()]
    load_format = None
    for tensor_index, (name, tensor) in enumerate(named_tensors):
        # Each rank serializes its own local tensor instead of sharing
        # raw CUDA IPC handles across machines.
        serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

        # Gather the serialized blobs from every infer-TP rank onto rank 0.
        if self.device_mesh["infer_tp"].get_local_rank() == 0:
            gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
        else:
            gathered_serialized_tensors = None
        dist.gather_object(
            obj=serialized_tensor,
            object_gather_list=gathered_serialized_tensors,
            dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
            group=self.device_mesh["infer_tp"].get_group(),
        )

        # Only rank 0 talks to the inference engine; it forwards all blobs,
        # flushing the cache after the last tensor.
        if self.device_mesh["infer_tp"].get_local_rank() == 0:
            await self.inference_engine.update_weights_from_tensor(
                named_tensors=[
                    (
                        name,
                        LocalSerializedTensor(values=gathered_serialized_tensors),
                    )
                ],
                load_format=load_format,
                flush_cache=tensor_index == len(named_tensors) - 1,
            )

@SuperCB
Contributor Author

SuperCB commented Jul 4, 2025

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

@Hecate0821
Contributor

Hecate0821 commented Jul 4, 2025

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

@zhaochenyang20
Collaborator

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

I mean, just leave this part in verl. If you can make it a utils function, that would be better; no need to change the SGLang side.

@Hecate0821
Contributor

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

I mean, just leave this part in verl. If you can make it a utils function, that would be better; no need to change the SGLang side.

You're right, that's a better approach. What do you think, @SuperCB?

@SuperCB
Contributor Author

SuperCB commented Jul 5, 2025

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

I mean, just leave this part in verl. If you can make it a utils function, that would be better; no need to change the SGLang side.

You're right, that's a better approach. What do you think, @SuperCB?

I agree
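A sketch of what the agreed-upon utils function might look like (the name `gather_serialized_tensor`, the explicit `group`/`dst` arguments, and the use of pickle in place of SGLang's MultiprocessingSerializer are all illustrative assumptions, so that both the FSDP and Megatron sharding managers could share it):

```python
import pickle

import torch
import torch.distributed as dist


def gather_serialized_tensor(tensor, group=None, dst=0):
    """Serialize this rank's weight shard and gather the serialized blobs
    onto rank `dst`. Non-dst ranks return None.

    pickle stands in here for SGLang's MultiprocessingSerializer.
    """
    blob = pickle.dumps(tensor)
    rank = dist.get_rank()
    world_size = dist.get_world_size(group=group)
    # Only the destination rank allocates the receive buffer.
    gathered = [None] * world_size if rank == dst else None
    dist.gather_object(blob, object_gather_list=gathered, dst=dst, group=group)
    return gathered
```

On the destination rank, the returned list could then be wrapped and handed to the inference engine, as in the FSDP snippet earlier in the thread.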

@SuperCB SuperCB changed the title [fix][rollout,sglang,megatron]Fixed the bug when calling update_weights in megatron+sglang TP16. [fix][rollout,megatron]Fixed the bug when calling update_weights in megatron+sglang TP16. Jul 7, 2025
@ETOgaosion
Collaborator

The key point is whether RL frameworks should provide training-side gathered tensors or serialized tensors. The current implementation is OK given SGLang's IPC weight-sync design.

I'll leave one question here:

For extensibility: when SGLang can use EP/PP in verl, how should the sharding process be designed?

@ETOgaosion
Collaborator

@SuperCB Could you sign the CLA and associate the email used in this commit with your GitHub account?

@SuperCB
Contributor Author

SuperCB commented Jul 7, 2025

The key point is whether RL frameworks should provide training-side gathered tensors or serialized tensors. The current implementation is OK given SGLang's IPC weight-sync design.

I'll leave one question here:

For extensibility: when SGLang can use EP/PP in verl, how should the sharding process be designed?

This is a great question. Our team is also thinking about this issue. If you have any ideas, we can collaborate.

@SuperCB SuperCB requested review from wuxibin89 and zw0610 as code owners July 7, 2025 06:53
@SuperCB
Contributor Author

SuperCB commented Jul 7, 2025

I accidentally messed up the commit history.

@SuperCB SuperCB force-pushed the megatron_sg_bug branch from bfd4fd9 to 35d8af8 Compare July 7, 2025 07:18
@SuperCB SuperCB changed the title [fix][rollout,megatron]Fixed the bug when calling update_weights in megatron+sglang TP16. [rollout,megatron]fix the bug when calling update_weights in megatron+sglang TP16. Jul 8, 2025
@SuperCB SuperCB changed the title [rollout,megatron]fix the bug when calling update_weights in megatron+sglang TP16. [rollout,megatron]fix:Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@SuperCB SuperCB changed the title [rollout,megatron]fix:Bug in megatron+sglang TP16 update_weights. [sglang]fix:Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@ETOgaosion ETOgaosion changed the title [sglang]fix:Bug in megatron+sglang TP16 update_weights. [sglang] fix:Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@ETOgaosion ETOgaosion changed the title [sglang] fix:Bug in megatron+sglang TP16 update_weights. [sglang] fix: Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@SuperCB
Contributor Author

SuperCB commented Jul 9, 2025

[screenshot of the failure] Why would my changes cause this problem? That seems hard to believe.

@zhaochenyang20
Collaborator

[screenshot of the failure] Why would my changes cause this problem? That seems hard to believe.

Could you ping me directly on WeChat?

@zhaochenyang20 zhaochenyang20 merged commit ad33564 into verl-project:main Jul 9, 2025
43 of 47 checks passed
lkc233 pushed a commit to lkc233/verl that referenced this pull request Jul 10, 2025
…ct#2336)

### What does this PR do?

> We observe the following when using Megatron + Sglang + TP16:
<img width="1236" alt="image"
src="https://github.com/user-attachments/assets/875d83e6-325a-41c4-b778-81b457b508a1"
/>

After investigation, we found that this was caused by the **cudaipc**
mechanism not supporting cross-machine access. We have resolved and
fixed this bug.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
ArronHZG pushed a commit to imh966/verl that referenced this pull request Jul 10, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…ct#2336)

oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jan 20, 2026
…ct#2336)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…ct#2336)
