
[reward] fix: preserve input non_tensor_batch in AgentLoopManager when reward_loop_worker_handles is None #5195

Open
none0663 wants to merge 2 commits into verl-project:main from none0663:agent-loop-preserve-non-tensor-batch

Conversation

none0663 (Contributor) commented Feb 4, 2026

What does this PR do?

Problem

After updating the codebase to commit 2cd9283 (the migration to the new asynchronous reward manager), using a colocate RM with async rollout (AgentLoopManager) causes validation to fail with KeyError: 'data_source'.

  • Where: verl/experimental/reward_loop/reward_manager/naive.py, line 42, in run_single — it accesses data_item.non_tensor_batch["data_source"].
  • Call path: _validate_compute_reward_colocate(test_output_gen_batch_padded) → reward_loop_manager.compute_rm_score(batch) → RewardLoopWorker.compute_score_batch → compute_score → run_single.
  • Cause: When reward_loop_worker_handles is None (e.g. colocate RM), AgentLoopManager.generate_sequences returns a DataProto whose non_tensor_batch is built only from agent outputs (__num_turns__, multi_modal_inputs, raw_prompt). Input metadata such as data_source is never forwarded, so the batch passed to the reward manager is missing data_source and the naive reward manager raises KeyError: 'data_source'.
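A minimal sketch of the failure mode (the batch contents below are illustrative assumptions based on the keys listed above, not the exact runtime values):

import numpy as np

# non_tensor_batch as built by AgentLoopManager._postprocess from agent
# outputs only; none of the input keys are forwarded.
non_tensor_batch = {
    "__num_turns__": np.array([2], dtype=np.int32),
    "raw_prompt": np.array(["..."], dtype=object),
}

# The naive reward manager then effectively does:
data_source = non_tensor_batch["data_source"]  # raises KeyError: 'data_source'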

Solution

  • Pass the input batch’s non_tensor_batch into _postprocess as **kwargs.
  • When reward_loop_worker_handles is None, merge these kwargs into the output non_tensor_batch so data_source and other input keys are preserved.
  • Colocate RM / validation then receives a batch that includes data_source, and the KeyError is fixed.
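For concreteness, a sketch of the caller side of this change (the exact call site inside generate_sequences is an assumption; the _postprocess side is shown in the review diff below):

# Sketch: inside AgentLoopManager.generate_sequences, forward every entry of
# the input DataProto's non_tensor_batch into _postprocess as **kwargs, so
# keys such as "data_source" can be merged back into the returned batch.
output = self._postprocess(outputs, **dict(prompts.non_tensor_batch))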

none0663 changed the title from "[rollout] fix: preserve input non_tensor_batch in AgentLoopManager when reward_loop_worker_handles is None" to "[reward] fix: preserve input non_tensor_batch in AgentLoopManager when reward_loop_worker_handles is None" on Feb 4, 2026
wuxibin89 requested a review from yyDing1 on February 4, 2026 12:07
vermouth1992 (Collaborator):

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request effectively addresses the KeyError: 'data_source' that occurred when using colocate Reward Manager with asynchronous rollout. By passing the input batch's non_tensor_batch as keyword arguments to _postprocess and conditionally merging these arguments into the output non_tensor_batch when reward_loop_worker_handles is None, the necessary input metadata is preserved. This is a correct and targeted fix for the identified problem.

yyDing1 (Collaborator) commented Feb 4, 2026

This is indeed a potential bug when using genrm for validation.
I think placing this change inside the agent loop would feel somewhat confusing for users.
We could first merge the batches using batch.union(gen_batch) (in the validation function) and then compute the reward, similar to what is done in the training process.
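A sketch of that alternative, assuming the validation code still holds the original test batch and that DataProto.union merges tensor and non-tensor fields as it does in the training path:

# In the validation function (sketch): merge the generated batch back into the
# original batch before scoring, so "data_source" and other input keys exist.
test_batch = test_batch.union(test_output_gen_batch_padded)
reward_result = self.reward_loop_manager.compute_rm_score(test_batch)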

none0663 (Contributor, Author) commented Feb 5, 2026

> This is indeed a potential bug when using genrm for validation.
> I think placing this change inside the agent loop would feel somewhat confusing for users.
> We could first merge the batches using batch.union(gen_batch) (in the validation function) and then compute the reward, similar to what is done in the training process.

I’d prefer to keep the fix in the agent loop rather than in ray_trainer.py. The missing data_source comes from agent_loop’s generate_sequences: its return value doesn’t carry the input’s non_tensor_batch when using a colocate RM, so fixing it in the trainer would mean adding merge logic in several places (validation, training, REMAX) and would be more cumbersome. Fixing it inside the agent loop addresses the cause in one place, and all callers (validation and training) get the correct non_tensor_batch without touching ray_trainer.py.

enable_async_reward = self.reward_loop_worker_handles is not None
if output.reward_score is None and enable_async_reward:
    batch = TensorDict(
        {
            "prompts": prompts,  # [1, prompt_length]
            "responses": responses,  # [1, response_length]
            "attention_mask": attention_mask,  # [1, prompt_length + response_length]
            "input_ids": input_ids,  # [1, prompt_length + response_length]
            "position_ids": position_ids,
        },
        batch_size=1,
    )
    non_tensor_batch = {
        **{k: np.array([v]) for k, v in kwargs.items()},
        "__num_turns__": np.array([output.num_turns]),
        "tool_extra_fields": np.array([output.extra_fields], dtype=object),
    }

output.extra_fields["reward_extra_info"] = result["reward_extra_info"]

- def _postprocess(self, inputs: list[_InternalAgentLoopOutput]) -> DataProto:
+ def _postprocess(self, inputs: list[_InternalAgentLoopOutput], **kwargs) -> DataProto:
Collaborator:

"kwargs" parameter is a bit odd here, explicit maybe better.

Contributor Author:

Yes, the name "kwargs" can be fixed.

Comment on lines 757 to 761
non_tensor_batch = {
    "__num_turns__": np.array([input.num_turns for input in inputs], dtype=np.int32),
}
if self.reward_loop_worker_handles is None and kwargs:
    non_tensor_batch.update(kwargs)
Collaborator:

Can we explicitly define what goes into non_tensor_batch? For example, specifying fields like "data_sources" might make things clearer.

none0663 (Contributor, Author) commented Feb 5, 2026

> Can we explicitly define what goes into non_tensor_batch? For example, specifying fields like "data_sources" might make things clearer.

We could mirror _compute_score and pass through the whole input non_tensor_batch (all keys) into the output, instead of whitelisting specific fields like "data_sources". That would keep behavior consistent and avoid maintaining an explicit list of keys. Should we do it that way?
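A sketch of that pass-through variant (hypothetical; input_non_tensor_batch stands for the forwarded input dict, however it ends up being named):

non_tensor_batch = {
    "__num_turns__": np.array([input.num_turns for input in inputs], dtype=np.int32),
}
if self.reward_loop_worker_handles is None:
    # Preserve every input key (e.g. "data_source") rather than a whitelist,
    # mirroring the behavior described above for _compute_score.
    non_tensor_batch.update(input_non_tensor_batch)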

none0663 (Contributor, Author) commented Feb 6, 2026

The two failing checks (NPU unit tests and pre-commit) do not appear to be related to the changes in this PR.

yyDing1 (Collaborator) commented Feb 6, 2026

Please run pre-commit run --all-files to format the modified files.
