
[sglang] fix: Bug in megatron+sglang TP16 update_weights.#2336

Merged
zhaochenyang20 merged 4 commits into verl-project:main from SuperCB:megatron_sg_bug
Jul 9, 2025

Conversation

@SuperCB
Contributor

@SuperCB SuperCB commented Jul 3, 2025

What does this PR do?

We observe the following when using Megatron + Sglang + TP16:

[screenshot of the observed error]

After investigation, we found that this was caused by the CUDA IPC mechanism, which does not support cross-machine access. This PR fixes that bug.
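For context: a CUDA IPC handle is a pointer into device memory that is only valid on the machine that exported it, so a weight-sync path that hands one rank's handles to a process on another node will fail under TP16. The fix follows the FSDP path: each rank serializes its own local shard and only the serialized blobs are gathered. A minimal pure-Python sketch of that gather-of-serialized-blobs pattern (pickle stands in for SGLang's MultiprocessingSerializer, and all names here are illustrative, not from the PR):

```python
import pickle

def serialize_weight(name, payload):
    """Turn a (name, payload) pair into bytes that can cross process
    and machine boundaries, unlike a CUDA IPC handle."""
    return pickle.dumps((name, payload))

def gather_to_rank0(local_blob, rank, world_size, transport):
    """Mimic dist.gather_object: every rank contributes its serialized
    blob; only rank 0 ends up with the ordered list of all blobs."""
    transport[rank] = local_blob  # stand-in for the network collective
    if rank == 0:
        return [transport[r] for r in range(world_size)]
    return None

# Simulate two TP ranks on different machines holding shards of one weight.
transport = {}
gather_to_rank0(serialize_weight("layer0.weight", [3.0, 4.0]),
                rank=1, world_size=2, transport=transport)
gathered = gather_to_rank0(serialize_weight("layer0.weight", [1.0, 2.0]),
                           rank=0, world_size=2, transport=transport)
shards = [pickle.loads(blob) for blob in gathered]
```

The destination rank deserializes blobs rather than dereferencing remote device pointers, which is why this pattern survives node boundaries.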

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

High-Level Design

[high-level design diagram]

Specific Changes

List the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLAassistant commented Jul 3, 2025

CLA assistant check
All committers have signed the CLA.

@zhaochenyang20
Collaborator

Could you provide end-to-end docs to reproduce and verify your fix, like this:

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/tool_examples/debug.md

@SuperCB
Contributor Author

SuperCB commented Jul 3, 2025

Could you provide end-to-end docs to reproduce and verify your fix, like this:

https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/multi-turn/tool_examples/debug.md

This might take some time because our code and configurations are for training internal, self-developed models. We need to prepare reproducible configurations for training on open-source models.

@ETOgaosion
Collaborator

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

@zhaochenyang20
Collaborator

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

Let me discuss with our team

@Hecate0821
Contributor

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

When passing a tensor across processes, it should be serialized first. You can find a similar implementation in fsdp_sglang.py.

@SuperCB
Contributor Author

SuperCB commented Jul 4, 2025

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.

@zhaochenyang20 What do you think of integrating this in SGLang?

It's a good idea.

@zhaochenyang20
Collaborator

I think this logic should be moved to SGLang's input-weight processing stage. In verl's sharding manager, we should only provide the gathered full training tensors and send them directly to the inference engine. For abstraction and generality, it's better to let the inference engine deal with these tensors.
@zhaochenyang20 What do you think of integrating this in SGLang?

It's a good idea.

I think it's better to serialize in verl and only pass the handle tuple to SGLang, like FSDP does:

import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor
from sglang.srt.model_executor.model_runner import LocalSerializedTensor
from sglang.srt.utils import MultiprocessingSerializer

def _preprocess_tensor_for_update_weights(tensor: torch.Tensor):
    # Materialize a DTensor into a full local tensor before serialization.
    if isinstance(tensor, DTensor):
        return tensor.full_tensor()
    return tensor

async def update_weights(self, params):
    named_tensors = [(k, v) for k, v in params.items()]
    load_format = None
    for tensor_index, (name, tensor) in enumerate(named_tensors):
        # Each rank serializes its own local tensor instead of sharing
        # raw CUDA IPC handles across machines.
        serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

        # Gather the serialized blobs from every infer-TP rank onto rank 0.
        if self.device_mesh["infer_tp"].get_local_rank() == 0:
            gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
        else:
            gathered_serialized_tensors = None
        dist.gather_object(
            obj=serialized_tensor,
            object_gather_list=gathered_serialized_tensors,
            dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
            group=self.device_mesh["infer_tp"].get_group(),
        )

        # Only rank 0 talks to the inference engine; it forwards all blobs,
        # flushing the cache after the last tensor.
        if self.device_mesh["infer_tp"].get_local_rank() == 0:
            await self.inference_engine.update_weights_from_tensor(
                named_tensors=[
                    (
                        name,
                        LocalSerializedTensor(values=gathered_serialized_tensors),
                    )
                ],
                load_format=load_format,
                flush_cache=tensor_index == len(named_tensors) - 1,
            )

@SuperCB
Contributor Author

SuperCB commented Jul 4, 2025

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

@Hecate0821
Contributor

Hecate0821 commented Jul 4, 2025

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

@zhaochenyang20
Collaborator

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

I mean, just leave this part in verl. If you can make it a utils function, that would be better; no need to change the SGLang side.

@Hecate0821
Contributor

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

I mean, just leave this part in verl. If you can make it a utils function, that would be better; no need to change the SGLang side.

You're right, that's a better approach. What do you think, @SuperCB?

@SuperCB
Contributor Author

SuperCB commented Jul 5, 2025

if self.device_mesh["infer_tp"].get_local_rank() == 0:
    gathered_serialized_tensors = [None for _ in range(self.device_mesh["infer_tp"].mesh.size()[0])]
else:
    gathered_serialized_tensors = None
dist.gather_object(
    obj=serialized_tensor,
    object_gather_list=gathered_serialized_tensors,
    dst=self.device_mesh["infer_tp"].mesh.tolist()[0],
    group=self.device_mesh["infer_tp"].get_group(),
)

This functionality could be integrated into the SGLangRollout class to avoid duplicating this code in the FSDP and Megatron sharding managers.

You mean add a helper function to SGLang, and call it like this?

for idx, (name, tensor) in enumerate(named_tensors):
    serialized_tensor = MultiprocessingSerializer.serialize(_preprocess_tensor_for_update_weights(tensor))

    gathered_serialized_tensors = SGLangRollout.gather_serialized_tensor(serialized_tensor, self.device_mesh)

    if self.device_mesh["infer_tp"].get_local_rank() == 0:
        await self.inference_engine.update_weights_from_tensor(
            named_tensors=[(name, LocalSerializedTensor(values=gathered_serialized_tensors))],
            ...
        )

I mean, just leave this part in verl. If you can make it a utils function, that would be better; no need to change the SGLang side.

You're right, that's a better approach. What do you think, @SuperCB?

I agree
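A sketch of what the agreed-upon utils function might look like (the name `gather_serialized_tensor`, the explicit `group`/`dst` arguments, and the use of pickle in place of SGLang's MultiprocessingSerializer are all illustrative assumptions, so that both the FSDP and Megatron sharding managers could share it):

```python
import pickle

import torch
import torch.distributed as dist


def gather_serialized_tensor(tensor, group=None, dst=0):
    """Serialize this rank's weight shard and gather the serialized blobs
    onto rank `dst`. Non-dst ranks return None.

    pickle stands in here for SGLang's MultiprocessingSerializer.
    """
    blob = pickle.dumps(tensor)
    rank = dist.get_rank()
    world_size = dist.get_world_size(group=group)
    # Only the destination rank allocates the receive buffer.
    gathered = [None] * world_size if rank == dst else None
    dist.gather_object(blob, object_gather_list=gathered, dst=dst, group=group)
    return gathered
```

On the destination rank, the returned list could then be wrapped and handed to the inference engine, as in the FSDP snippet earlier in the thread.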

@SuperCB SuperCB changed the title [fix][rollout,sglang,megatron]Fixed the bug when calling update_weights in megatron+sglang TP16. [fix][rollout,megatron]Fixed the bug when calling update_weights in megatron+sglang TP16. Jul 7, 2025
@ETOgaosion
Collaborator

The key point is whether RL frameworks should provide training-side gathered tensors or serialized tensors. The current implementation is OK given SGLang's IPC weight-sync design.

I'll leave one question here:

For extensibility: when SGLang can use EP/PP in verl, how should the sharding process be designed?

@ETOgaosion
Collaborator

@SuperCB Could you sign the CLA and associate the email used in this commit with your GitHub account?

@SuperCB
Contributor Author

SuperCB commented Jul 7, 2025

The key point is whether RL frameworks should provide training-side gathered tensors or serialized tensors. The current implementation is OK given SGLang's IPC weight-sync design.

I'll leave one question here:

For extensibility: when SGLang can use EP/PP in verl, how should the sharding process be designed?

This is a great question. Our team is also thinking about this issue. If you have any ideas, we can collaborate.

@SuperCB SuperCB requested review from wuxibin89 and zw0610 as code owners July 7, 2025 06:53
@SuperCB
Contributor Author

SuperCB commented Jul 7, 2025

I accidentally messed up the commit history.

@SuperCB SuperCB force-pushed the megatron_sg_bug branch from bfd4fd9 to 35d8af8 Compare July 7, 2025 07:18
@SuperCB SuperCB changed the title [fix][rollout,megatron]Fixed the bug when calling update_weights in megatron+sglang TP16. [rollout,megatron]fix the bug when calling update_weights in megatron+sglang TP16. Jul 8, 2025
@SuperCB SuperCB changed the title [rollout,megatron]fix the bug when calling update_weights in megatron+sglang TP16. [rollout,megatron]fix:Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@SuperCB SuperCB changed the title [rollout,megatron]fix:Bug in megatron+sglang TP16 update_weights. [sglang]fix:Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@ETOgaosion ETOgaosion changed the title [sglang]fix:Bug in megatron+sglang TP16 update_weights. [sglang] fix:Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@ETOgaosion ETOgaosion changed the title [sglang] fix:Bug in megatron+sglang TP16 update_weights. [sglang] fix: Bug in megatron+sglang TP16 update_weights. Jul 8, 2025
@SuperCB
Contributor Author

SuperCB commented Jul 9, 2025

[screenshot of the failure] Why would my changes cause this problem? That seems hard to believe.

@zhaochenyang20
Collaborator

[screenshot of the failure] Why would my changes cause this problem? That seems hard to believe.

Could you ping me directly on WeChat?

@zhaochenyang20 zhaochenyang20 merged commit ad33564 into verl-project:main Jul 9, 2025
43 of 47 checks passed
lkc233 pushed a commit to lkc233/verl that referenced this pull request Jul 10, 2025
…ct#2336)

### What does this PR do?

> We observe the following when using Megatron + Sglang + TP16:
<img width="1236" alt="image"
src="https://github.com/user-attachments/assets/875d83e6-325a-41c4-b778-81b457b508a1"
/>

After investigation, we found that this was caused by the **cudaipc**
mechanism not supporting cross-machine access. We have resolved and
fixed this bug.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
ArronHZG pushed a commit to imh966/verl that referenced this pull request Jul 10, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…ct#2336)

oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jan 20, 2026
…ct#2336)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…ct#2336)
