
[rollout] feat: deprecate all rollout sharding manager #3285

Merged
vermouth1992 merged 11 commits into verl-project:main from wuxibin89:wuxibin/refactor_rollout
Sep 3, 2025

Conversation

@wuxibin89
Collaborator

What does this PR do?

Deprecate all rollout sharding managers and replace them with `trainer_mode` and `rollout_mode` in the hybrid worker.
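
The change can be pictured as two explicit mode switches on the hybrid worker instead of a sharding-manager context. The sketch below is only a hypothetical illustration of that idea; the class and method names are not the actual verl API introduced by this PR.

```python
# Hypothetical sketch of the trainer_mode / rollout_mode switch that
# replaces the sharding managers. All names and APIs are illustrative.
import torch

class HybridWorkerSketch:
    def __init__(self, actor_model, rollout_engine):
        self.actor_model = actor_model        # training-side model (e.g. FSDP/Megatron)
        self.rollout_engine = rollout_engine  # inference-side rollout engine

    def trainer_mode(self):
        # Free rollout-side GPU memory (weights, KV cache), then bring the
        # actor model back onto the GPU for training.
        self.rollout_engine.release_memory()
        self.actor_model.to("cuda")

    def rollout_mode(self):
        # Push the latest actor weights into the rollout engine, then offload
        # the actor so the KV cache can reuse the freed GPU memory.
        self.rollout_engine.resume()
        self.rollout_engine.load_weights(self.actor_model.state_dict())
        self.actor_model.to("cpu")
        torch.cuda.empty_cache()
```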

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request refactors the rollout mechanism by deprecating the various sharding managers and introducing a unified trainer_mode and rollout_mode in the hybrid workers. This is a significant architectural improvement that simplifies the codebase. The changes are consistent across both FSDP and Megatron workers. I've found two critical issues: one is the use of os._exit() which can cause unsafe process termination, and the other is an incorrect path in the rollout registry that would lead to an ImportError.
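
On the `os._exit()` point: `os._exit()` terminates the Python process immediately, skipping `atexit` handlers, `finally` blocks, and buffered I/O flushes, which is why the review flags it as unsafe. A minimal, self-contained illustration follows; the function name is hypothetical and not code from this PR.

```python
# os._exit() vs. sys.exit(): os._exit() kills the process immediately and
# skips atexit handlers and cleanup; sys.exit() raises SystemExit so normal
# interpreter shutdown (and cleanup) still runs.
import atexit
import os
import sys

atexit.register(lambda: print("cleanup ran"))

def handle_fatal_error(hard_kill: bool) -> None:
    try:
        raise RuntimeError("rollout worker failed")
    except RuntimeError:
        if hard_kill:
            os._exit(1)   # immediate termination: "cleanup ran" never prints
        sys.exit(1)       # raises SystemExit; atexit cleanup still runs

if __name__ == "__main__":
    handle_fatal_error(hard_kill=False)  # exits with code 1 after cleanup
```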

@wuxibin89
Collaborator Author

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-executed refactoring to deprecate the rollout sharding managers, replacing them with a cleaner trainer_mode and rollout_mode abstraction in the hybrid workers. The changes are consistently applied across the codebase, including updates to recipes and tests, which simplifies the architecture. My review focuses on two key areas for improvement: enhancing the robustness of the async worker's error handling to prevent service disruption, and addressing code duplication to improve maintainability.

wuxibin89 force-pushed the wuxibin/refactor_rollout branch from 71f5ca1 to 31c2a64 on September 3, 2025 02:36
vermouth1992 merged commit 19020f6 into verl-project:main on Sep 3, 2025
60 of 63 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Sep 4, 2025
…3285)

cczitong123 pushed a commit to cczitong123/verl that referenced this pull request Sep 5, 2025
…3285)

DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request Sep 5, 2025
…3285)

WncFht pushed a commit to WncFht/verl that referenced this pull request Oct 10, 2025
…3285)

masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
…3285)

wuxibin89 added a commit that referenced this pull request Oct 22, 2025
… checkpoint (#3861)

### What does this PR do?

Fix a bug introduced in #3285.
Currently the initialization steps are:
1. **build actor**: init model and optimizer, then offload to CPU
2. **build rollout**: init model and KV cache
3. **switch to `trainer_mode`**: discard rollout weights and KV cache, load actor model and optimizer to GPU
4. **load_checkpoint**: if a checkpoint exists, load model and optimizer from the checkpoint, then offload to CPU
5. **switch to `rollout_mode`**: load actor model to GPU, resume rollout weights and sync weights, then offload actor model to CPU and resume KV cache
6. **generate_sequences**

The bug is in step 4: if the checkpoint doesn't exist, this step is skipped, so the actor model and optimizer both remain on the GPU (from step 3), which may cause a CUDA OOM in step 5.
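
A minimal sketch of the fix implied above: offload unconditionally after the checkpoint step so the actor and optimizer never stay resident on the GPU when entering `rollout_mode`. The helper names are hypothetical, not the actual verl worker code.

```python
# Sketch of the corrected step 4: offload to CPU whether or not a
# checkpoint was found. All method names here are hypothetical.
def load_checkpoint_step(worker, checkpoint_path=None):
    if checkpoint_path is not None:
        worker.load_checkpoint(checkpoint_path)  # restore model/optimizer state
    # Offload on both branches; previously the "no checkpoint" branch skipped
    # this, leaving the actor and optimizer on the GPU from trainer_mode.
    worker.offload_model_to_cpu()
    worker.offload_optimizer_to_cpu()
```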
sunnweiwei pushed a commit to sunnweiwei/verl that referenced this pull request Oct 23, 2025
… checkpoint (verl-project#3861)

techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
…3285)

wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025
… checkpoint (verl-project#3861)

NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 3, 2025
… checkpoint (verl-project#3861)

AlexJJ009 pushed a commit to AlexJJ009/verl that referenced this pull request Nov 5, 2025
… checkpoint (verl-project#3861)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…3285)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
… checkpoint (verl-project#3861)

chenhaiq pushed a commit to The-Hierophant/verl-1 that referenced this pull request Nov 18, 2025
… checkpoint (verl-project#3861)

NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 26, 2025
… checkpoint (verl-project#3861)

paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
… checkpoint (#3861)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…3285)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
… checkpoint (verl-project#3861)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…3285)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
… checkpoint (verl-project#3861)
