Commit 5f4cef8

[async-rl-training-landscape.md] use tips correctly (#3300)
Remove custom HTML in favor of:

```
> [!NOTE]
> ...
```
1 parent 95d1004 commit 5f4cef8

1 file changed

Lines changed: 11 additions & 14 deletions

File tree

async-rl-training-landscape.md

```diff
@@ -14,20 +14,17 @@ authors:
 
 # Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries
 
-<blockquote style="background-color: #f0f7ff; border-left: 4px solid #4a90d9; padding: 1em 1.5em; margin: 1.5em 0; border-radius: 4px;">
-
-**TL;DR** -- For those of you who don't have time to read 5,000 words about async RL plumbing (we get it, you have models to train):
-
-- **The problem:** In synchronous RL (reinforcement learning) training, data generation (model inference to create data samples) dominates wall-clock time -- a single batch of 32K-token rollouts on a 32B (32-billion parameter) model can take _hours,_ while the GPUs used for training remain idle.
-- **The solution everyone converged on:** Disaggregate (separate) inference and training onto different GPU pools, connect them with a rollout buffer (temporary storage for model outputs), and transfer weights asynchronously (without waiting), so neither side waits for the other.
-- **We surveyed 16 open-source libraries** that implement this pattern and compared them across 7 axes: orchestration primitives, buffer design, weight sync protocols, staleness management, partial rollout handling, LoRA support, and distributed training backends.
-- **Key findings:** Ray dominates orchestration (8/16 surveyed distributed computing libraries). The NCCL (NVIDIA Collective Communications Library) broadcast is the default method for transferring model weights. Staleness management refers to how outdated data samples are handled, ranging from simply dropping old samples to using advanced importance-sampling correction. LoRA (Low-Rank Adaptation) training is sparsely supported. Distributed MoE (Mixture of Experts) support is the emerging differentiator.
-
-If you'd rather skip straight to the good part, [here's the full comparison table](#4-global-overview-sixteen-libraries-at-a-glance) (no reading required, we won't judge).
-
-But seriously, if you stick around, you might learn a thing or two about why your GPUs are idle 60% of the time.
-
-</blockquote>
+> [!NOTE]
+> **TL;DR** -- For those of you who don't have time to read 5,000 words about async RL plumbing (we get it, you have models to train):
+>
+> - **The problem:** In synchronous RL (reinforcement learning) training, data generation (model inference to create data samples) dominates wall-clock time -- a single batch of 32K-token rollouts on a 32B (32-billion parameter) model can take _hours,_ while the GPUs used for training remain idle.
+> - **The solution everyone converged on:** Disaggregate (separate) inference and training onto different GPU pools, connect them with a rollout buffer (temporary storage for model outputs), and transfer weights asynchronously (without waiting), so neither side waits for the other.
+> - **We surveyed 16 open-source libraries** that implement this pattern and compared them across 7 axes: orchestration primitives, buffer design, weight sync protocols, staleness management, partial rollout handling, LoRA support, and distributed training backends.
+> - **Key findings:** Ray dominates orchestration (8/16 surveyed distributed computing libraries). The NCCL (NVIDIA Collective Communications Library) broadcast is the default method for transferring model weights. Staleness management refers to how outdated data samples are handled, ranging from simply dropping old samples to using advanced importance-sampling correction. LoRA (Low-Rank Adaptation) training is sparsely supported. Distributed MoE (Mixture of Experts) support is the emerging differentiator.
+>
+> If you'd rather skip straight to the good part, [here's the full comparison table](#4-global-overview-sixteen-libraries-at-a-glance) (no reading required, we won't judge).
+>
+> But seriously, if you stick around, you might learn a thing or two about why your GPUs are idle 60% of the time.
 
 ---
 
```
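The disaggregated pattern the TL;DR above describes -- inference and training on separate pools, connected by a rollout buffer so neither side waits on the other -- can be sketched roughly as follows. This is a minimal illustration, not code from any of the surveyed libraries; the names (`rollout_buffer`, `inference_worker`, `trainer`) and the use of threads in place of GPU pools are assumptions for the sake of the sketch.

```python
# Minimal sketch of the disaggregated rollout-buffer pattern:
# an "inference" producer fills a bounded buffer while a "trainer"
# consumer drains it, so each side only blocks when the buffer is
# full or empty, never directly on the other side's step time.
import threading
import queue

rollout_buffer = queue.Queue(maxsize=4)  # temporary storage for model outputs

def inference_worker(n_rollouts):
    # Stand-in for GPU inference: each "rollout" is just a dict here.
    for i in range(n_rollouts):
        rollout_buffer.put({"rollout_id": i, "tokens": 32_000})
    rollout_buffer.put(None)  # sentinel: generation finished

def trainer():
    # Stand-in for the training loop: consume rollouts as they arrive.
    consumed = []
    while (item := rollout_buffer.get()) is not None:
        consumed.append(item["rollout_id"])
    return consumed

producer = threading.Thread(target=inference_worker, args=(8,))
producer.start()
trained_on = trainer()
producer.join()
print(trained_on)  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

In the real systems surveyed, the buffer additionally tracks staleness (how many weight updates ago a rollout was generated), and weight sync runs as a third asynchronous path rather than through this queue.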

0 commit comments
