Commit 72219f7

[sglang, megatron, perf] feat: speed up megatron sglang weight update by 10x (verl-project#2418)

### What does this PR do?

Optimize the performance of the sglang + megatron weight update, following the bucketing implementation of [`THUDM/slime`](https://github.com/THUDM/slime/blob/fb7605cc5fb09af0f9369d37f7192f12bddee577/slime/ray/ppo_actor.py#L452).

| model | bucket size MB | boost |
| ---- | ----- | ---- |
| Moonlight16B @ 8xH20 | 512MB | 175s -> 18s |
| DeepseekV3 671B @ 512xH20 | 512MB | ONGOING |

Related to issues verl-project#2419, sgl-project/sglang#6762, zhaochenyang20/Awesome-ML-SYS-Tutorial#169. Similar fix for FSDP: verl-project#2499.

> We are from the Large Model Post-Training Team of 📕 Xiaohongshu's AI Platform Technology Department, dedicated to developing high-performance, easily-scalable distributed post-training engines.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
1 parent cd0f039 commit 72219f7

File tree

6 files changed (+165, −11 lines)


tests/special_e2e/run_ppo_trainer_megatron.sh

Lines changed: 1 addition & 0 deletions

```diff
@@ -175,6 +175,7 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
     actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
+    actor_rollout_ref.rollout.update_weights_bucket_megabytes=128 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${train_traj_micro_bsz_per_gpu} \
     actor_rollout_ref.ref.megatron.use_mbridge=${USE_MBRIDGE} \
```
Lines changed: 57 additions & 0 deletions (new test file)

```python
# Copyright 2023-2024 SGLang Team
# Copyright 2025 ModelBest Inc. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pytest
import torch

from verl.workers.rollout.sglang_rollout.utils import get_named_tensor_buckets

_TENSOR_1MB = torch.zeros(512, 512)  # 512 * 512 * 4 bytes (float32) = 1 MiB
_BYTES_1MB = 1 << 20


@pytest.mark.parametrize(
    "named_tensors, bucket_size_mb, gt_groups",
    [
        (
            [("a", _TENSOR_1MB), ("b", _TENSOR_1MB)],
            0.5 * _BYTES_1MB,
            [["a"], ["b"]],
        ),
        (
            [("a", _TENSOR_1MB), ("b", _TENSOR_1MB)],
            1 * _BYTES_1MB,
            [["a"], ["b"]],
        ),
        (
            [("a", _TENSOR_1MB), ("b", _TENSOR_1MB)],
            1.5 * _BYTES_1MB,
            [["a"], ["b"]],
        ),
        (
            [("a", _TENSOR_1MB), ("b", _TENSOR_1MB)],
            2 * _BYTES_1MB,
            [["a", "b"]],
        ),
    ],
)
def test_get_named_tensor_buckets(named_tensors, bucket_size_mb, gt_groups: list[list[str]]):
    named_tensors_iter = iter(named_tensors)
    groups = list(get_named_tensor_buckets(named_tensors_iter, bucket_size_mb))
    assert len(groups) == len(gt_groups)
    for group, gt_group in zip(groups, gt_groups, strict=True):
        assert len(group) == len(gt_group)
        for (name, _), gt_name in zip(group, gt_group, strict=True):
            assert name == gt_name
```

verl/trainer/config/_generated_ppo_trainer.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -130,6 +130,7 @@ actor_rollout_ref:
       custom_async_server:
         path: null
         name: null
+    update_weights_bucket_megabytes: 2048
     trace:
       backend: null
       token2text: false
```

verl/trainer/config/rollout/rollout.yaml

Lines changed: 15 additions & 0 deletions

```diff
@@ -179,6 +179,21 @@ agent:
   # Class name of the custom async server class (e.g. AsyncvLLMServer)
   name: null

+# Specifies the tensor bucket size (in megabytes) for batch weight updates during rollout operations.
+# This parameter controls the maximum payload size for a single weight update request.
+#
+# https://github.com/volcengine/verl/pull/2281
+#
+# Note:
+# - Currently only supported in SGLang rollout implementations
+# - Larger values may improve throughput but increase memory overhead
+# - Default value (2GB) is optimized for typical GPU memory configurations
+# - For the best performance of `rebuild_cuda_tensor`, it is recommended to:
+#   1. Enable `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES`.
+#   2. Manually set `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
+#      when using Tensor Parallelism (TP) >= 8.
+update_weights_bucket_megabytes: 2048
+
 # trace rollout data
 trace:
```
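As a unit sanity check: the sharding manager (see the megatron_sglang.py diff below in this commit) converts this setting to bytes with a 20-bit left shift, so "megabytes" here means MiB, and the default of 2048 corresponds to exactly 2 GiB. A minimal illustration:

```python
# Convert the config value to bytes the same way the sharding manager does:
# `megabytes << 20` is megabytes * 1024 * 1024 (MiB, not decimal MB).
update_weights_bucket_megabytes = 2048  # the default from rollout.yaml
update_weights_bucket_bytes = int(update_weights_bucket_megabytes) << 20

print(update_weights_bucket_bytes)           # 2147483648
print(update_weights_bucket_bytes == 2**31)  # True, i.e. 2 GiB
```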

verl/workers/rollout/sglang_rollout/utils.py

Lines changed: 41 additions & 1 deletion

```diff
@@ -14,7 +14,7 @@
 # limitations under the License.

 import pickle
-from typing import Any, Optional
+from typing import Any, Iterator, Optional

 import numpy as np
 import torch
@@ -66,3 +66,43 @@ def broadcast_pyobj(
     serialized_data = bytes(tensor_data.cpu().numpy())
     data = pickle.loads(serialized_data)
     return data
+
+
+def get_named_tensor_buckets(
+    iterable: Iterator[tuple[str, torch.Tensor]], bucket_bytes: int
+) -> Iterator[list[tuple[str, torch.Tensor]]]:
+    """
+    Group named tensors into buckets no larger than a specified size in bytes.
+
+    Args:
+        iterable: An iterator of tuples containing tensor names and tensors.
+        bucket_bytes: The maximum size of each bucket in bytes.
+
+    Yields:
+        Lists of tuples, where each tuple contains a tensor name and its corresponding tensor.
+
+    Example:
+        >>> tensors = [('tensor1', torch.randn(1000, 1000)), ('tensor2', torch.randn(2000, 2000))]
+        >>> for bucket in get_named_tensor_buckets(tensors, bucket_bytes=10 * (1 << 20)):
+        ...     print(bucket)
+        [('tensor1', tensor(...))]
+        [('tensor2', tensor(...))]
+
+    """
+    if bucket_bytes <= 0:
+        raise ValueError(f"bucket_bytes must be greater than 0, got {bucket_bytes}")
+
+    current_bucket = []
+    current_size = 0
+    for name, tensor in iterable:
+        tensor_size = tensor.element_size() * tensor.numel()
+        if current_size + tensor_size > bucket_bytes:
+            if current_bucket:
+                yield current_bucket
+            current_bucket = [(name, tensor)]
+            current_size = tensor_size
+        else:
+            current_bucket.append((name, tensor))
+            current_size += tensor_size
+
+    if current_bucket:
+        yield current_bucket
```
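To see the greedy bucketing behave end to end without a verl or torch install, here is a standalone sketch that mirrors the loop above; `FakeTensor` and `bucket_named_tensors` are illustrative stand-ins, not verl APIs:

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in for torch.Tensor, exposing only the two methods the bucketing loop calls."""

    n: int             # number of elements
    itemsize: int = 4  # bytes per element (float32)

    def element_size(self) -> int:
        return self.itemsize

    def numel(self) -> int:
        return self.n


def bucket_named_tensors(named_tensors, bucket_bytes):
    # Greedy bucketing, mirroring get_named_tensor_buckets in the diff above:
    # keep appending tensors until the next one would overflow the byte budget,
    # then yield the bucket and start a new one.
    current_bucket, current_size = [], 0
    for name, tensor in named_tensors:
        tensor_size = tensor.element_size() * tensor.numel()
        if current_size + tensor_size > bucket_bytes:
            if current_bucket:
                yield current_bucket
            current_bucket = [(name, tensor)]
            current_size = tensor_size
        else:
            current_bucket.append((name, tensor))
            current_size += tensor_size
    if current_bucket:
        yield current_bucket


# Three 1 MiB tensors with a 2 MiB budget: the first two share a bucket.
params = [(f"w{i}", FakeTensor(512 * 512)) for i in range(3)]
buckets = list(bucket_named_tensors(params, 2 * (1 << 20)))
print([[name for name, _ in b] for b in buckets])  # [['w0', 'w1'], ['w2']]
```

Note that a tensor larger than `bucket_bytes` still gets its own bucket: the check only splits before appending, so oversized tensors are sent alone rather than dropped.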

verl/workers/sharding_manager/megatron_sglang.py

Lines changed: 50 additions & 10 deletions

```diff
@@ -37,6 +37,7 @@
     per_tensor_generator,
 )
 from verl.utils.profiler import GPUMemoryLogger, log_gpu_memory_usage, simple_timer
+from verl.workers.rollout.sglang_rollout.utils import get_named_tensor_buckets

 from .base import BaseShardingManager

@@ -130,37 +131,76 @@ def __exit__(self, exc_type, exc_value, traceback):
         loop.run_until_complete(self.sleep())

     async def update_weights(self, params):
+        """
+        Update model weights using tensor buckets, similar to THUDM/slime's implementation.
+
+        Notes:
+            - For the best performance of `rebuild_cuda_tensor`, it is recommended to:
+                1. Enable `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES`.
+                2. Manually set `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
+                   when using Tensor Parallelism (TP >= 8).
+            - See reference implementations in SLIME:
+                - Main logic: https://github.com/THUDM/slime/blob/fb7605cc5fb09af0f9369d37f7192f12bddee577/slime/ray/ppo_actor.py#L452
+                - Runtime envs: https://github.com/THUDM/slime/blob/fb7605cc5fb09af0f9369d37f7192f12bddee577/slime/ray/ppo_actor.py#L39
+        """
         if self.device_mesh["tp"].get_local_rank() == 0 and self.rollout_config.free_cache_engine:
             await self.inference_engine.resume_memory_occupation()
         named_tensors = params
         load_format = None
-        for tensor_index, (name, tensor) in enumerate(named_tensors):
-            serialized_tensor = MultiprocessingSerializer.serialize(tensor.detach())
+
+        update_weights_bucket_bytes = int(self.rollout_config.update_weights_bucket_megabytes) << 20
+        for batch in get_named_tensor_buckets(named_tensors, update_weights_bucket_bytes):
+            # On each rank, serialize a batch of (name, tensor) tuples.
+            # named_tensors_batch will be a list like:
+            # [(name0, serialized_tensor0_tp0), (name1, serialized_tensor1_tp0), ...]
+            named_tensors_batch = [
+                (name, MultiprocessingSerializer.serialize(tensor.detach())) for name, tensor in batch
+            ]

             if self.device_mesh["tp"].get_local_rank() == 0:
-                gathered_serialized_tensors = [None for _ in range(self.device_mesh["tp"].mesh.size()[0])]
+                # On rank 0, prepare a list to hold the gathered batches from all ranks.
+                gathered_serialized_batches = [None for _ in range(self.device_mesh["tp"].mesh.size()[0])]
             else:
-                gathered_serialized_tensors = None
+                gathered_serialized_batches = None
+
+            # Gather the named_tensors_batch from all ranks to rank 0.
+            # After this, on rank 0, gathered_serialized_batches will be a list of lists:
+            # [ [ (name0, s_t0_tp0), (name1, s_t1_tp0), ... ],  # batch from TP rank 0
+            #   [ (name0, s_t0_tp1), (name1, s_t1_tp1), ... ],  # batch from TP rank 1
+            #   ... ]
+            # On other ranks, gathered_serialized_batches will be None.
             dist.gather_object(
-                obj=serialized_tensor,
-                object_gather_list=gathered_serialized_tensors,
+                obj=named_tensors_batch,
+                object_gather_list=gathered_serialized_batches,
                 dst=self.device_mesh["tp"].mesh.tolist()[0],
                 group=self.device_mesh["tp"].get_group(),
             )

             if self.device_mesh["tp"].get_local_rank() == 0:
+                # Use zip(*) to "transpose" the data structure.
+                # This groups the serialized parts for each individual tensor across all TP ranks.
+                # Example: from [[(n0, t0_tp0), (n1, t1_tp0)], [(n0, t0_tp1), (n1, t1_tp1)]]
+                # to [ ( (n0, t0_tp0), (n0, t0_tp1) ), ( (n1, t1_tp0), (n1, t1_tp1) ) ]
+                logical_tensors = zip(*gathered_serialized_batches, strict=False)
                 await self.inference_engine.update_weights_from_tensor(
                     named_tensors=[
+                        # 'tensor_group' represents a single logical tensor's data from all ranks.
                         (
-                            name,
-                            LocalSerializedTensor(values=gathered_serialized_tensors),
+                            tensor_group[0][0],  # Get the name from the first rank's data.
+                            LocalSerializedTensor(
+                                # 'rank_part' is the (name, serialized_tensor) tuple from one specific rank.
+                                values=[rank_part[1] for rank_part in tensor_group]
+                            ),
                         )
+                        for tensor_group in logical_tensors
+                        # each tensor_group is like ( (n0, t0_tp0), (n0, t0_tp1) )
                     ],
                     load_format=load_format,
                     flush_cache=False,
                 )
-            if self.device_mesh["tp"].get_local_rank() == 0:
-                await self.inference_engine.flush_cache()
+
+        if self.device_mesh["tp"].get_local_rank() == 0:
+            await self.inference_engine.flush_cache()

     async def release_memory(self):
         if self.device_mesh["tp"].get_local_rank() == 0 and self.rollout_config.free_cache_engine:
```
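The `zip(*...)` transpose that reassembles per-tensor shard groups out of per-rank batches can be illustrated with plain strings standing in for the serialized shards (the `rank0`/`rank1` names are illustrative, not verl identifiers):

```python
# Two TP ranks, each holding a serialized batch of the same two named tensors.
rank0 = [("n0", "t0_tp0"), ("n1", "t1_tp0")]
rank1 = [("n0", "t0_tp1"), ("n1", "t1_tp1")]
gathered = [rank0, rank1]  # what rank 0 holds after dist.gather_object

# zip(*...) transposes per-rank batches into per-tensor groups across ranks.
logical_tensors = list(zip(*gathered))
print(logical_tensors[0])  # (('n0', 't0_tp0'), ('n0', 't0_tp1'))

# Reassemble each logical tensor: take the name from the first rank's entry
# and collect every rank's shard, as the update_weights_from_tensor call does.
named = [(group[0][0], [part[1] for part in group]) for group in logical_tensors]
print(named)  # [('n0', ['t0_tp0', 't0_tp1']), ('n1', ['t1_tp0', 't1_tp1'])]
```

This is why batching the gather is safe: every rank serializes the same tensor names in the same order, so transposing the gathered lists lines up the TP shards of each logical tensor.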
