
Update dapo_ray_trainer.py #4789

Closed
TsingZ0 wants to merge 1 commit into verl-project:main from TsingZ0:fix_for_continuous_rewards

Conversation

@TsingZ0 TsingZ0 commented Jan 5, 2026

fix bugs for #4786

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a bug by modifying the prompt filtering logic in dapo_ray_trainer.py. The change adds a condition to retain prompt groups where all generated responses have the same, non-extremal reward. However, my review indicates that this change may introduce a critical issue by providing incorrect training targets to the value function (critic) when using the GRPO advantage estimator. This could lead to training instability and an incorrectly trained model. I've provided a detailed comment on this potential issue.

Comment on lines 257 to 261
  kept_prompt_uids = [
      uid
      for uid, std in prompt_uid2metric_std.items()
-     if std > 0 or len(prompt_uid2metric_vals[uid]) == 1
+     if std > 0 or len(prompt_uid2metric_vals[uid]) == 1 or prompt_uid2within[uid]
  ]
Contributor


critical

The new condition or prompt_uid2within[uid] keeps prompt groups where all responses have the same non-extremal reward (a constant c where 0 < c < 1). While this might seem to add more data to the batch, it appears to introduce incorrect training signals for the critic when using the grpo advantage estimator.

Here's the breakdown of the issue:

  1. For these newly kept prompt groups, the standard deviation of rewards is 0.
  2. When using the grpo advantage estimator, compute_grpo_outcome_advantage will calculate an advantage of 0 for all responses in such a group.
  3. In dapo_ray_trainer.py, the returns for the critic are set to be equal to the advantages.
  4. Therefore, for these prompts, the critic will be trained with a target (returns) of 0.
  5. However, the value function should predict the expected reward. Since all responses for the prompt have a reward of c, the expected reward is c, not 0.

This mismatch trains the value function towards an incorrect target, which can destabilize the training process and harm model performance.
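To make the mismatch concrete, here is a minimal, self-contained sketch of what group-relative normalization does when every response in a group receives the same reward. This is not verl's actual compute_grpo_outcome_advantage; the function name, tensor values, and epsilon are illustrative:

    import torch

    def grpo_outcome_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Group-relative advantage: normalize each reward by the group mean and std.
        mean = rewards.mean()
        std = rewards.std()
        return (rewards - mean) / (std + eps)

    # Every response for this prompt receives the same non-extremal reward c = 0.7.
    rewards = torch.tensor([0.7, 0.7, 0.7, 0.7])
    advantages = grpo_outcome_advantage(rewards)  # tensor([0., 0., 0., 0.])
    returns = advantages                          # critic target is 0, not the expected reward 0.7

With a zero standard deviation the advantages collapse to zero, so setting returns equal to advantages trains the value function toward 0 for these prompts even though their expected reward is c.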

If the intention is to handle cases where rewards are continuous, it might be better to adjust the filtering threshold (e.g., std > 1e-6) rather than adding this condition, which seems to break the critic's training logic. Could you clarify the reasoning behind this change?
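A sketch of what such a threshold-based filter might look like, reusing the variable names from the diff above (STD_EPS is an illustrative constant, not an existing verl setting):

    # Hypothetical variant: require a minimum spread of rewards within the group
    # instead of keeping constant-reward groups via prompt_uid2within.
    STD_EPS = 1e-6

    kept_prompt_uids = [
        uid
        for uid, std in prompt_uid2metric_std.items()
        if std > STD_EPS or len(prompt_uid2metric_vals[uid]) == 1
    ]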

@wuxibin89
Collaborator

The recipe has been moved to verl-project/verl-recipe as a submodule, #4795. Please submit a PR to verl-recipe.

@wuxibin89 wuxibin89 closed this Jan 6, 2026
