[rollout] feat: deprecate all rollout sharding manager #3285
vermouth1992 merged 11 commits into verl-project:main
Conversation
Code Review
This pull request refactors the rollout mechanism by deprecating the various sharding managers and introducing a unified trainer_mode and rollout_mode in the hybrid workers. This is a significant architectural improvement that simplifies the codebase. The changes are consistent across both FSDP and Megatron workers. I've found two critical issues: one is the use of os._exit(), which can cause unsafe process termination, and the other is an incorrect path in the rollout registry that would lead to an ImportError.
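For context on the `os._exit()` concern: it terminates the process immediately, skipping `finally` blocks, `atexit` hooks, and context-manager cleanup. A minimal sketch of the distinction; the `shutdown_worker` helper is hypothetical, not code from this PR:

```python
import os
import sys

def shutdown_worker(exit_code: int = 0, *, hard: bool = False) -> None:
    """Terminate the worker process (hypothetical helper).

    sys.exit() raises SystemExit, so finally blocks, atexit hooks, and
    context-manager __exit__ methods still run (freeing GPU memory,
    flushing logs). os._exit() skips all of that; it is generally only
    appropriate on post-fork error paths where cleanup must not run.
    """
    if hard:
        os._exit(exit_code)  # immediate termination, no cleanup
    sys.exit(exit_code)      # normal interpreter shutdown with cleanup
```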
Force-pushed from 63c8324 to 8d2922a
/gemini review
Code Review
This pull request introduces a significant and well-executed refactoring to deprecate the rollout sharding managers, replacing them with a cleaner trainer_mode and rollout_mode abstraction in the hybrid workers. The changes are consistently applied across the codebase, including updates to recipes and tests, which simplifies the architecture. My review focuses on two key areas for improvement: enhancing the robustness of the async worker's error handling to prevent service disruption, and addressing code duplication to improve maintainability.
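On the async error-handling point: the usual pattern is to contain per-request failures so one bad request cannot take down the serving loop. A minimal sketch, assuming a hypothetical `handle_request` coroutine rather than the PR's actual API:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def handle_request(payload: dict) -> dict:
    # Hypothetical stand-in for the async rollout worker's generate path.
    return {"ok": True, "echo": payload}

async def serve_one(payload: dict) -> dict | None:
    try:
        return await handle_request(payload)
    except asyncio.CancelledError:
        raise  # never swallow cancellation
    except Exception:
        # Log and keep serving instead of crashing the whole worker.
        logger.exception("rollout request failed; worker keeps serving")
        return None
```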
Force-pushed from 71f5ca1 to 31c2a64
… checkpoint (#3861)
### What does this PR do?
Fix a bug introduced in #3285. Currently the initialization steps are:
1. **build actor**: init model and optimizer, then offload to CPU
2. **build rollout**: init model and KV cache
3. **switch to `trainer_mode`**: discard rollout weights and KV cache, load actor model and optimizer to GPU
4. **load_checkpoint**: if a checkpoint exists, load the model and optimizer from it, then offload to CPU
5. **switch to `rollout_mode`**: load the actor model to GPU, resume rollout weights and sync weights, then offload the actor model to CPU and resume the KV cache
6. **generate_sequences**

The bug is in step 4: if no checkpoint exists, the step is skipped entirely, so the actor model and optimizer both remain on GPU (from step 3), which may cause a CUDA OOM in step 5 (see the sketch below).
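A minimal runnable sketch of the corrected initialization order implied by that description; every function here is a hypothetical stand-in for the actual worker internals:

```python
import os

# Hypothetical stand-ins for the worker internals named in the steps above.
def build_actor():        print("1. actor: model + optimizer built, offloaded to CPU")
def build_rollout():      print("2. rollout: model + KV cache built")
def trainer_mode():       print("3. trainer_mode: actor + optimizer on GPU")
def load_checkpoint(p):   print(f"4. loaded checkpoint from {p}")
def offload_to_cpu():     print("4. actor + optimizer offloaded to CPU")
def rollout_mode():       print("5. rollout_mode: weights synced, KV cache resumed")
def generate_sequences(): print("6. generating sequences")

def init_worker(ckpt_path: str | None) -> None:
    build_actor()
    build_rollout()
    trainer_mode()
    # The fix: offload unconditionally. Before #3861, offload_to_cpu() ran
    # only on the checkpoint path, so on a fresh run the actor and optimizer
    # stayed on GPU and rollout_mode() could hit a CUDA OOM.
    if ckpt_path and os.path.exists(ckpt_path):
        load_checkpoint(ckpt_path)
    offload_to_cpu()
    rollout_mode()
    generate_sequences()
```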
What does this PR do?
Deprecate all rollout sharding managers, replacing them with `trainer_mode` and `rollout_mode` in the hybrid worker.
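A rough sketch of the shape of this change; the deprecated sharding-manager context is shown for contrast, and all signatures here are illustrative rather than verl's exact API:

```python
class HybridWorker:
    # New abstraction: explicit mode switches instead of wrapping every
    # generate call in a rollout sharding-manager context.
    def rollout_mode(self) -> None:
        """Sync actor weights into the rollout engine, resume the KV cache."""

    def trainer_mode(self) -> None:
        """Free rollout weights and KV cache, load actor + optimizer to GPU."""

def step(worker: HybridWorker, prompts: list[str]) -> None:
    # Before (deprecated):
    #     with self.rollout_sharding_manager:
    #         output = self.rollout.generate_sequences(prompts)
    worker.rollout_mode()
    # ... generate rollouts ...
    worker.trainer_mode()
    # ... run the policy update ...
```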