[ray, worker] feat: DAPO x LLM as a Judge (GenRM) #1953
Conversation
Force-pushed from 3fd62f9 to d51bdd2.
Could you please help review this GenRM PR, @mark-myzhao? Thanks!
I implemented a similar idea, also reusing the ActorRollout worker class. What I don't quite understand is why an extra resource pool is needed here to allocate resources: isn't placing the GRM on a separate node somewhat wasteful of GPUs?
I also tried to run the GRM and the actor in the same resource pool, but it hit CUDA OOM during training. In addition, vLLM rollout only supports one rollout worker with sleep_mode enabled per process. That's why I use a separate resource pool to set up the GRM. @Dada-Cloudzxy
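For context, a minimal sketch of what a separate pool for the GenRM could look like with verl's `ResourcePoolManager`; the pool names, GPU counts, and the role used for the judge are illustrative assumptions, not this PR's actual configuration.

```python
# Sketch only: give the generative reward model its own resource pool so its
# vLLM engine never shares GPU memory with the actor/rollout workers.
from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role

resource_pool_spec = {
    "actor_pool": [8],  # one node with 8 GPUs for actor + rollout (assumed sizes)
    "genrm_pool": [4],  # a separate node with 4 GPUs for the GenRM
}
mapping = {
    Role.ActorRollout: "actor_pool",
    Role.RewardModel: "genrm_pool",  # judge mapped to its own pool
}
resource_pool_manager = ResourcePoolManager(
    resource_pool_spec=resource_pool_spec, mapping=mapping
)
```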
I found a way to run an extra vLLM engine in the same process, see #1193 (comment).
It's so cool!!!!
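For readers unfamiliar with the sleep_mode mentioned above, here is a minimal sketch of the vLLM feature itself (not the specific workaround described in #1193); the model name and sleep level are illustrative assumptions.

```python
# Sketch only: a sleeping vLLM engine offloads its weights and frees its KV cache,
# so another model can use the GPU memory between generation phases.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

outputs = llm.generate(["What is 2 + 2?"], SamplingParams(max_tokens=32))

llm.sleep(level=1)   # release GPU memory while another workload runs
# ... run the other model / training step here ...
llm.wake_up()        # restore the engine before the next generation round
```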
My original idea was to wrap it in MCP, so that APIs for different tasks could be freely customized.
@llm-player-01 Hi, could I add you on WeChat to ask a few questions?
Happy to chat, feel free to DM me; just leave an email.
Does this approach actually overlap rollout and GenRM inference? If not, it is simpler to just call an inference API inside the reward function. After all, when the GenRM is large, GPU memory cannot hold the GenRM and the other parameters at the same time.
Yes, the effect is the same as calling an API. This scheme suits cases where you don't want to maintain a long-running inference service and prefer a judge that is spun up alongside training and used on demand.
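As a rough illustration of the "call an inference API inside the reward function" alternative discussed above, here is a hedged sketch against an OpenAI-compatible endpoint; the endpoint URL, model name, grading prompt, and score parsing are all illustrative assumptions.

```python
# Sketch only: score a rollout by querying an already-served judge model.
from openai import OpenAI

client = OpenAI(base_url="http://genrm-server:8000/v1", api_key="EMPTY")

def judge_reward(question: str, response: str) -> float:
    """Ask the served GenRM to grade one rollout and return a scalar reward."""
    prompt = (
        "Rate the following answer from 0 to 10.\n"
        f"Question: {question}\nAnswer: {response}\nScore:"
    )
    completion = client.chat.completions.create(
        model="genrm",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8,
        temperature=0.0,
    )
    text = completion.choices[0].message.content.strip()
    try:
        return float(text.split()[0]) / 10.0  # normalize to [0, 1]
    except ValueError:
        return 0.0  # fall back if the judge output is not parseable
```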
@wus2000 No, it doesn't. Are the actor and genrm parameters the same?
Checklist Before Starting
Format the PR title as `[modules] type: Title`, where modules is one or more of fsdp, megatron, sglang, vllm, rollout, trainer, tests, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt (separated by `,` or space) and type is one of feat, fix, doc, refactor, chore, like `[megatron, fsdp] feat: xxx`.
What does this PR do?
Specific Changes
Usage Example
Test
Additional Info.
Checklist Before Submitting
Add [BREAKING] to the PR title if it breaks any API.