
Update dapo_ray_trainer.py #4789

Closed
TsingZ0 wants to merge 1 commit into verl-project:main from TsingZ0:fix_for_continuous_rewards

Conversation

@TsingZ0 TsingZ0 commented Jan 5, 2026

fix bugs for #4786

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a bug by modifying the prompt filtering logic in dapo_ray_trainer.py. The change adds a condition to retain prompt groups where all generated responses have the same, non-extremal reward. However, my review indicates that this change may introduce a critical issue by providing incorrect training targets to the value function (critic) when using the GRPO advantage estimator. This could lead to training instability and an incorrectly trained model. I've provided a detailed comment on this potential issue.

Comment on lines 257 to 261
  kept_prompt_uids = [
      uid
      for uid, std in prompt_uid2metric_std.items()
-     if std > 0 or len(prompt_uid2metric_vals[uid]) == 1
+     if std > 0 or len(prompt_uid2metric_vals[uid]) == 1 or prompt_uid2within[uid]
  ]
Contributor


critical

The new condition or prompt_uid2within[uid] keeps prompt groups where all responses have the same non-extremal reward (a constant c where 0 < c < 1). While this might seem to add more data to the batch, it appears to introduce incorrect training signals for the critic when using the grpo advantage estimator.

Here's the breakdown of the issue:

  1. For these newly kept prompt groups, the standard deviation of rewards is 0.
  2. When using the grpo advantage estimator, compute_grpo_outcome_advantage will calculate an advantage of 0 for all responses in such a group.
  3. In dapo_ray_trainer.py, the returns for the critic are set to be equal to the advantages.
  4. Therefore, for these prompts, the critic will be trained with a target (returns) of 0.
  5. However, the value function should predict the expected reward. Since all responses for the prompt have a reward of c, the expected reward is c, not 0.

This mismatch trains the value function towards an incorrect target, which can destabilize the training process and harm model performance.
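To make the mismatch concrete, here is a minimal, self-contained sketch of what group-relative normalization does when every response in a group receives the same reward. This is not verl's actual compute_grpo_outcome_advantage; the function name, tensor values, and epsilon are illustrative:

    import torch

    def grpo_outcome_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Group-relative advantage: normalize each reward by the group mean and std.
        mean = rewards.mean()
        std = rewards.std()
        return (rewards - mean) / (std + eps)

    # Every response for this prompt receives the same non-extremal reward c = 0.7.
    rewards = torch.tensor([0.7, 0.7, 0.7, 0.7])
    advantages = grpo_outcome_advantage(rewards)  # tensor([0., 0., 0., 0.])
    returns = advantages                          # critic target is 0, not the expected reward 0.7

With a zero standard deviation the advantages collapse to zero, so setting returns equal to advantages trains the value function toward 0 for these prompts even though their expected reward is c.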

If the intention is to handle cases where rewards are continuous, it might be better to adjust the filtering threshold (e.g., std > 1e-6) rather than adding this condition, which seems to break the critic's training logic. Could you clarify the reasoning behind this change?
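A sketch of what such a threshold-based filter might look like, reusing the variable names from the diff above (STD_EPS is an illustrative constant, not an existing verl setting):

    # Hypothetical variant: require a minimum spread of rewards within the group
    # instead of keeping constant-reward groups via prompt_uid2within.
    STD_EPS = 1e-6

    kept_prompt_uids = [
        uid
        for uid, std in prompt_uid2metric_std.items()
        if std > STD_EPS or len(prompt_uid2metric_vals[uid]) == 1
    ]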

@wuxibin89
Collaborator

The recipe has been moved to verl-project/verl-recipe as a submodule, #4795. Please submit a PR to verl-recipe.

@wuxibin89 wuxibin89 closed this Jan 6, 2026
