[rollout] feat: deprecate all rollout sharding manager #3285
vermouth1992 merged 11 commits into verl-project:main
Conversation
Code Review
This pull request refactors the rollout mechanism by deprecating the various sharding managers and introducing a unified trainer_mode and rollout_mode in the hybrid workers. This is a significant architectural improvement that simplifies the codebase. The changes are consistent across both FSDP and Megatron workers. I've found two critical issues: one is the use of os._exit(), which can cause unsafe process termination, and the other is an incorrect path in the rollout registry that would lead to an ImportError.
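For context on the `os._exit()` concern: it terminates the process immediately, skipping `finally` blocks, `atexit` hooks, and context-manager cleanup. A minimal sketch of the distinction; the `shutdown_worker` helper is hypothetical, not code from this PR:

```python
import os
import sys

def shutdown_worker(exit_code: int = 0, *, hard: bool = False) -> None:
    """Terminate the worker process (hypothetical helper).

    sys.exit() raises SystemExit, so finally blocks, atexit hooks, and
    context-manager __exit__ methods still run (freeing GPU memory,
    flushing logs). os._exit() skips all of that; it is generally only
    appropriate on post-fork error paths where cleanup must not run.
    """
    if hard:
        os._exit(exit_code)  # immediate termination, no cleanup
    sys.exit(exit_code)      # normal interpreter shutdown with cleanup
```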
Force-pushed from 63c8324 to 8d2922a
/gemini review
Code Review
This pull request introduces a significant and well-executed refactoring to deprecate the rollout sharding managers, replacing them with a cleaner trainer_mode and rollout_mode abstraction in the hybrid workers. The changes are consistently applied across the codebase, including updates to recipes and tests, which simplifies the architecture. My review focuses on two key areas for improvement: enhancing the robustness of the async worker's error handling to prevent service disruption, and addressing code duplication to improve maintainability.
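On the async error-handling point: the usual pattern is to contain per-request failures so one bad request cannot take down the serving loop. A minimal sketch, assuming a hypothetical `handle_request` coroutine rather than the PR's actual API:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def handle_request(payload: dict) -> dict:
    # Hypothetical stand-in for the async rollout worker's generate path.
    return {"ok": True, "echo": payload}

async def serve_one(payload: dict) -> dict | None:
    try:
        return await handle_request(payload)
    except asyncio.CancelledError:
        raise  # never swallow cancellation
    except Exception:
        # Log and keep serving instead of crashing the whole worker.
        logger.exception("rollout request failed; worker keeps serving")
        return None
```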
Force-pushed from 71f5ca1 to 31c2a64
… checkpoint (#3861)
### What does this PR do?
Fix a bug introduced in #3285. Currently the initialization steps are:
1. **build actor**: init model and optimizer, then offload to CPU
2. **build rollout**: init model and KV cache
3. **switch to `trainer_mode`**: discard rollout weights and KV cache, load actor model and optimizer to GPU
4. **load_checkpoint**: if a checkpoint exists, load the model and optimizer from it, then offload to CPU
5. **switch to `rollout_mode`**: load the actor model to GPU, resume rollout weights and sync weights, then offload the actor model to CPU and resume the KV cache
6. **generate_sequences**

The bug is in step 4: if no checkpoint exists, the step is skipped entirely, so the actor model and optimizer both remain on GPU (from step 3), which may cause a CUDA OOM in step 5 (see the sketch below).
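A minimal runnable sketch of the corrected initialization order implied by that description; every function here is a hypothetical stand-in for the actual worker internals:

```python
import os

# Hypothetical stand-ins for the worker internals named in the steps above.
def build_actor():        print("1. actor: model + optimizer built, offloaded to CPU")
def build_rollout():      print("2. rollout: model + KV cache built")
def trainer_mode():       print("3. trainer_mode: actor + optimizer on GPU")
def load_checkpoint(p):   print(f"4. loaded checkpoint from {p}")
def offload_to_cpu():     print("4. actor + optimizer offloaded to CPU")
def rollout_mode():       print("5. rollout_mode: weights synced, KV cache resumed")
def generate_sequences(): print("6. generating sequences")

def init_worker(ckpt_path: str | None) -> None:
    build_actor()
    build_rollout()
    trainer_mode()
    # The fix: offload unconditionally. Before #3861, offload_to_cpu() ran
    # only on the checkpoint path, so on a fresh run the actor and optimizer
    # stayed on GPU and rollout_mode() could hit a CUDA OOM.
    if ckpt_path and os.path.exists(ckpt_path):
        load_checkpoint(ckpt_path)
    offload_to_cpu()
    rollout_mode()
    generate_sequences()
```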
What does this PR do?
Deprecate all rollout sharding managers, replacing them with `trainer_mode` and `rollout_mode` in the hybrid worker.
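A rough sketch of the shape of this change; the deprecated sharding-manager context is shown for contrast, and all signatures here are illustrative rather than verl's exact API:

```python
class HybridWorker:
    # New abstraction: explicit mode switches instead of wrapping every
    # generate call in a rollout sharding-manager context.
    def rollout_mode(self) -> None:
        """Sync actor weights into the rollout engine, resume the KV cache."""

    def trainer_mode(self) -> None:
        """Free rollout weights and KV cache, load actor + optimizer to GPU."""

def step(worker: HybridWorker, prompts: list[str]) -> None:
    # Before (deprecated):
    #     with self.rollout_sharding_manager:
    #         output = self.rollout.generate_sequences(prompts)
    worker.rollout_mode()
    # ... generate rollouts ...
    worker.trainer_mode()
    # ... run the policy update ...
```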