[rollout] feat: compute reward score in agent loop #3055


Merged

vermouth1992 merged 7 commits into volcengine:main from wuxibin89:wuxibin/agent_loop_reward

Aug 19, 2025

Collaborator

wuxibin89 commented Aug 14, 2025 (edited)

What does this PR do?

Compute the reward score for each prompt as soon as its agent loop finishes. Reward computation then overlaps with rollouts that are still in progress, which can hide a significant part of the reward computation time.
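A minimal sketch of the idea, not the PR's actual implementation: each agent loop scores its own output as soon as it finishes, so scoring for early finishers overlaps with rollouts that are still generating. The names `run_agent_loop`, `compute_reward`, and `generate_batch`, as well as the sleep-based timings, are hypothetical stand-ins.

```python
import asyncio
import time


async def compute_reward(response: str) -> float:
    """Hypothetical reward function; sleeps to simulate scoring latency."""
    await asyncio.sleep(0.05)
    return float(len(response))


async def run_agent_loop(prompt: str, rollout_seconds: float) -> dict:
    """Hypothetical agent loop: generate, then score immediately."""
    await asyncio.sleep(rollout_seconds)  # stand-in for multi-turn generation
    response = prompt.upper()
    # Score right here instead of waiting for the whole batch to finish.
    reward = await compute_reward(response)
    return {"prompt": prompt, "response": response, "reward_score": reward}


async def generate_batch(prompts: list) -> list:
    # Rollout durations differ per prompt, so early finishers are scored
    # while slower rollouts are still generating.
    durations = [0.02 * (i + 1) for i in range(len(prompts))]
    tasks = [run_agent_loop(p, d) for p, d in zip(prompts, durations)]
    # gather() returns results in the same order as the input prompts.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    start = time.perf_counter()
    outputs = asyncio.run(generate_batch(["a", "bb", "ccc", "dddd"]))
    elapsed = time.perf_counter() - start
    print([o["reward_score"] for o in outputs], f"{elapsed:.2f}s")
```

With this structure, only the reward of the slowest rollout sits on the critical path; the others are already scored by the time the batch completes.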

wuxibin89 requested review from chenhaiq and vermouth1992

August 14, 2025 11:02

gemini-code-assist bot reviewed


Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a feature to compute reward scores within the agent loop, which can improve performance by parallelizing reward calculation. The implementation adds a RewardManagerWorker to handle this asynchronously. The changes also update tests to accommodate and verify this new functionality.

My review has identified a critical issue with the new worker scheduling logic that could break support for GPU-based reward models. Additionally, a new test file contains a hardcoded path, which impacts test portability. Please see the detailed comments for suggestions on how to address these points.
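One common way to address a hardcoded path in a test is to read the path from an environment variable and fall back to a default. The sketch below is only an assumption about how that could look; the variable name `VERL_TEST_MODEL_PATH` and the default path are hypothetical and do not come from the PR.

```python
import os


def resolve_model_path(default: str = "~/models/example-model") -> str:
    """Return the model path used by tests.

    VERL_TEST_MODEL_PATH is a hypothetical variable name used only for
    illustration; the default is a placeholder, not a path from the PR.
    """
    raw = os.environ.get("VERL_TEST_MODEL_PATH", default)
    # Expand "~" so the same default works for any user account.
    return os.path.expanduser(raw)
```

A test would then call `resolve_model_path()` instead of embedding a machine-specific directory.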

verl/experimental/agent_loop/agent_loop.py (resolved)

tests/experimental/agent_loop/test_agent_loop_reward.py (resolved)

wuxibin89 force-pushed the wuxibin/agent_loop_reward branch from ab3e967 to cf19f70 Compare

August 14, 2025 11:12

wuxibin89 requested review from PeterSH6, eric-haibin-lin and tongyx361 as code owners

August 14, 2025 13:21

wuxibin89 mentioned this pull request

[agentic RL] multi-turn rollout and agent loop development tracking #2618

Open

7 tasks

stevewx mentioned this pull request

[rollout] feat: support reorder rollout for tackling long-tail generation problem #2200

Closed

wuxibin89 added 7 commits

August 15, 2025 14:06


          [rollout] feat: compute reward score in agent loop

7e38a8c


          fix unit test

bb30927


          fix unit test

d58321e


          fix trainer

16105e4


          fix unit test

3944d5a


          fix unit test

2c952ce


          fix unnit test

1bf2409

wuxibin89 force-pushed the wuxibin/agent_loop_reward branch from a6c715e to 1bf2409 Compare

August 15, 2025 06:32

Contributor

U-rara commented Aug 18, 2025

Great implementation! Building on this, could we integrate launch_reward_fn_async? In other words, we’d add a future_reward (a ray.ObjectRef) to the AgentLoopOutput, and only call .get() on it after compute_ref_log_prob but before the advantage calculation. I’m not sure if this is elegant.

Collaborator Author

wuxibin89 commented Aug 18, 2025 (edited)

> Great implementation! Building on this, could we integrate launch_reward_fn_async? In other words, we’d add a future_reward (a ray.ObjectRef) to the AgentLoopOutput, and only call .get() on it after compute_ref_log_prob but before the advantage calculation. I’m not sure if this is elegant.

Yeah, this can further reduce the latency introduced by reward computation for the last few samples in the batch.
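The proposal above can be sketched as follows. This is an illustration only: it uses `concurrent.futures.Future` from the standard library as a stand-in for a `ray.ObjectRef`, and `AgentLoopOutputSketch`, `slow_reward`, `compute_ref_log_prob_stub`, and `training_step` are hypothetical names, not verl APIs.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
import time


@dataclass
class AgentLoopOutputSketch:
    """Hypothetical output carrying a pending reward instead of a value."""
    response: str
    future_reward: Future  # stand-in for a ray.ObjectRef


def slow_reward(response: str) -> float:
    time.sleep(0.05)  # simulate an expensive reward computation
    return float(len(response))


def compute_ref_log_prob_stub(outputs: list) -> list:
    time.sleep(0.05)  # simulate work that does not need rewards
    return [0.0 for _ in outputs]


def training_step(responses: list, pool: ThreadPoolExecutor) -> list:
    # Launch reward computation without blocking.
    outputs = [
        AgentLoopOutputSketch(r, pool.submit(slow_reward, r)) for r in responses
    ]
    # Rewards keep computing in the background during this call.
    compute_ref_log_prob_stub(outputs)
    # Resolve the futures only now, right before advantages would need them.
    return [o.future_reward.result() for o in outputs]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(training_step(["ab", "abcd"], pool))
```

Deferring the blocking call means the rewards of the last few samples are computed while the reference log-probabilities are being evaluated, rather than stalling the step.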

vermouth1992 approved these changes


vermouth1992 merged commit c3c2f9a into volcengine:main

53 of 60 checks passed

yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request


          [rollout] feat: compute reward score in agent loop (volcengine#3055)

ce98933


PopSoda2002 pushed a commit to PopSoda2002/verl that referenced this pull request


          [rollout] feat: compute reward score in agent loop (volcengine#3055)

e3c877e


echo-rain mentioned this pull request

[BREAKING][rollout] feat: Added asynchronous reward model calculation in agent loop #3152

Merged

7 tasks

wuxibin89 mentioned this pull request

[rollout] fix: add missing extra_reward_info to AgentLoopOuput #3194

Merged

vermouth1992 pushed a commit that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (#3194)

cb5818c

### What does this PR do?

Fix #3055: add the missing
`extra_reward_info` field to `AgentLoopOutput`, which is needed by metrics
calculation.
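To illustrate why metrics calculation needs this field, here is a rough sketch that averages the per-sample entries of `extra_reward_info` across a batch. The `SampleOutput` class, the `aggregate_reward_metrics` helper, and the `reward/` metric prefix are hypothetical; only the field name `extra_reward_info` comes from the PR.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SampleOutput:
    """Hypothetical per-sample result; not verl's actual class."""
    reward_score: float
    extra_reward_info: dict = field(default_factory=dict)


def aggregate_reward_metrics(outputs: list) -> dict:
    """Average the score and every numeric extra field over the batch."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for out in outputs:
        sums["reward/score"] += out.reward_score
        counts["reward/score"] += 1
        for key, value in out.extra_reward_info.items():
            sums[f"reward/{key}"] += float(value)
            counts[f"reward/{key}"] += 1
    # Each metric is averaged over the samples that actually reported it.
    return {name: sums[name] / counts[name] for name in sums}
```

If the extra fields are dropped from the per-sample output, only the scalar score survives and the per-key metrics cannot be computed.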

PopSoda2002 pushed a commit to PopSoda2002/verl that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (volce…

cc045c7

…ngine#3194)


yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (volce…

16ae139

…ngine#3194)


wuxibin89 pushed a commit that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

844c929

… in agent loop (#3152)

### What does this PR do?

> This PR builds on
[PR#3055](#3055), which only supports asynchronous reward function
calculation in the agent loop, and extends it to support asynchronous
calculation of reward models as well.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
    `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
    `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
    `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like
    `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature,
    etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> To use this feature, add the following configuration overrides to the
startup script:


```bash
# Configuration overrides to append to the training launch command
reward_model.enable_resource_pool=True
reward_model.n_gpus_per_node=1
reward_model.nnodes=1
```
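As a rough illustration of what these three settings express, the sketch below parses dotted `key=value` overrides into a nested dictionary and derives the size of a dedicated reward-model pool as `n_gpus_per_node * nnodes`. It is a simplified stand-in, not the real configuration parser or verl's resource-pool code; `parse_overrides` and `reward_pool_gpus` are hypothetical helpers, and the pool-size interpretation is an assumption based on the option names.

```python
def parse_overrides(overrides: list) -> dict:
    """Turn ["a.b=1", ...] into {"a": {"b": 1}}; handles ints and booleans only."""
    config: dict = {}
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        if raw in ("True", "False"):
            value = raw == "True"
        elif raw.lstrip("-").isdigit():
            value = int(raw)
        else:
            value = raw  # leave anything else as a plain string
        node = config
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


def reward_pool_gpus(config: dict) -> int:
    """Assumed meaning: GPUs reserved for the reward model, 0 if disabled."""
    rm = config.get("reward_model", {})
    if not rm.get("enable_resource_pool", False):
        return 0
    return rm["n_gpus_per_node"] * rm["nnodes"]
```

Under that assumption, the configuration shown above would reserve a single GPU on a single node for the reward model.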

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

157a277

… in agent loop (volcengine#3152)


cczitong123 pushed a commit to cczitong123/verl that referenced this pull request


          [rollout] feat: compute reward score in agent loop (volcengine#3055)


cczitong123 pushed a commit to cczitong123/verl that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (volce…

1a1fa5c

…ngine#3194)


cczitong123 pushed a commit to cczitong123/verl that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

4b48f32

… in agent loop (volcengine#3152)


DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request


          [rollout] feat: compute reward score in agent loop (volcengine#3055)

98b295d


DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (volce…

f343c48

…ngine#3194)


DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

a273c29

… in agent loop (volcengine#3152)


wuxibin89 mentioned this pull request

math_verify reward return 0 when using async #3407

Closed

4 tasks

VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (#3194)

05a5dc6


VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

27f9ff7

… in agent loop (#3152)


WncFht pushed a commit to WncFht/verl that referenced this pull request


          [rollout] feat: compute reward score in agent loop (volcengine#3055)

4b24279


masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

29e898a

… in agent loop (volcengine#3152)


techkang pushed a commit to techkang/verl that referenced this pull request


          [rollout] feat: compute reward score in agent loop (volcengine#3055)

d1d1cd8


techkang pushed a commit to techkang/verl that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (volce…

92922bb

…ngine#3194)


techkang pushed a commit to techkang/verl that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

336ca24

… in agent loop (volcengine#3152)


chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request


          [rollout] feat: compute reward score in agent loop (volcengine#3055)

28baf5a


chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request


          [rollout] fix: add missing extra_reward_info to AgentLoopOuput (volce…

aa6a78d

…ngine#3194)


chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request


          [BREAKING][rollout] feat: Added asynchronous reward model calculation…

4c02085

… in agent loop (volcengine#3152)



Reviewers

vermouth1992: approved these changes

chenhaiq: awaiting requested review

eric-haibin-lin: awaiting requested review (code owner)

tongyx361: awaiting requested review (code owner)

PeterSH6: awaiting requested review (code owner)

+1 more reviewer

gemini-code-assist[bot]: left review comments

Labels

None yet