fix: correctly handle the inconsistency in reward_extra_info and reward_tensor #1044
+32
−17
If we are using multiple datasources for training/test, some reward functions may return a float and some may return a dict (including extra_info), depending on the datasource.
In this case, the `reward_tensor` and `reward_extra_info` returned by `RewardManager` will have different shapes, which causes the error in #1031. This PR fixes the inconsistency by padding a placeholder 'unknown' for those reward functions that only return a float. This placeholder is not processed in the code below and has no effect on correctness:
verl/verl/trainer/ppo/metric_utils.py, lines 227 to 233 in dc1714a
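As a rough illustration of the padding idea (this is a minimal sketch, not the actual PR code; the helper name `pad_reward_extra_info` and the `placeholder` argument are hypothetical), each extra-info key is padded so that every list has exactly one entry per sample, keeping it aligned with the batch dimension of `reward_tensor`:

```python
def pad_reward_extra_info(per_sample_scores, placeholder="unknown"):
    """Build reward_extra_info from mixed float/dict reward outputs.

    per_sample_scores: one entry per sample, either a float score or a
    dict containing "score" plus extra-info fields. Samples whose reward
    function returned only a float get `placeholder` for every key, so
    all lists stay the same length as the batch.
    """
    # Collect every extra-info key seen across the batch.
    keys = set()
    for s in per_sample_scores:
        if isinstance(s, dict):
            keys.update(k for k in s if k != "score")

    # Pad with the placeholder wherever a sample has no value for a key.
    reward_extra_info = {k: [] for k in keys}
    for s in per_sample_scores:
        for k in keys:
            value = s.get(k, placeholder) if isinstance(s, dict) else placeholder
            reward_extra_info[k].append(value)
    return reward_extra_info
```

With this shape guarantee, downstream metric code can iterate over `reward_extra_info` without special-casing float-only reward functions; the 'unknown' entries are simply skipped when metrics are aggregated.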