
[reward] fix: preserve input non_tensor_batch in AgentLoopManager when reward_loop_worker_handles is None #5195

Open
none0663 wants to merge 2 commits into verl-project:main from none0663:agent-loop-preserve-non-tensor-batch

Conversation

none0663 (Contributor) commented Feb 4, 2026

What does this PR do?

Problem

After updating the codebase to commit 2cd9283 (the migration to the new asynchronous reward manager), using a colocate RM with async rollout (AgentLoopManager) causes validation to fail with KeyError: 'data_source'.

  • Where: verl/experimental/reward_loop/reward_manager/naive.py, line 42, in run_single — it accesses data_item.non_tensor_batch["data_source"].
  • Call path: _validate_compute_reward_colocate(test_output_gen_batch_padded) → reward_loop_manager.compute_rm_score(batch) → RewardLoopWorker.compute_score_batch → compute_score → run_single.
  • Cause: When reward_loop_worker_handles is None (e.g. colocate RM), AgentLoopManager.generate_sequences returns a DataProto whose non_tensor_batch is built only from agent outputs (__num_turns__, multi_modal_inputs, raw_prompt). Input metadata such as data_source is never forwarded, so the batch passed to the reward manager is missing data_source and the naive reward manager raises KeyError: 'data_source'.
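A minimal sketch of the failure mode (the batch contents below are illustrative assumptions based on the keys listed above, not the exact runtime values):

import numpy as np

# non_tensor_batch as built by AgentLoopManager._postprocess from agent
# outputs only; none of the input keys are forwarded.
non_tensor_batch = {
    "__num_turns__": np.array([2], dtype=np.int32),
    "raw_prompt": np.array(["..."], dtype=object),
}

# The naive reward manager then effectively does:
data_source = non_tensor_batch["data_source"]  # raises KeyError: 'data_source'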

Solution

  • Pass the input batch’s non_tensor_batch into _postprocess as **kwargs.
  • When reward_loop_worker_handles is None, merge these kwargs into the output non_tensor_batch so data_source and other input keys are preserved.
  • Colocate RM / validation then receives a batch that includes data_source, and the KeyError is fixed.
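For concreteness, a sketch of the caller side of this change (the exact call site inside generate_sequences is an assumption; the _postprocess side is shown in the review diff below):

# Sketch: inside AgentLoopManager.generate_sequences, forward every entry of
# the input DataProto's non_tensor_batch into _postprocess as **kwargs, so
# keys such as "data_source" can be merged back into the returned batch.
output = self._postprocess(outputs, **dict(prompts.non_tensor_batch))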

none0663 changed the title from "[rollout] fix: preserve input non_tensor_batch in AgentLoopManager when reward_loop_worker_handles is None" to "[reward] fix: preserve input non_tensor_batch in AgentLoopManager when reward_loop_worker_handles is None" on Feb 4, 2026
wuxibin89 requested a review from yyDing1 on February 4, 2026 12:07
vermouth1992 (Collaborator):

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request effectively addresses the KeyError: 'data_source' that occurred when using colocate Reward Manager with asynchronous rollout. By passing the input batch's non_tensor_batch as keyword arguments to _postprocess and conditionally merging these arguments into the output non_tensor_batch when reward_loop_worker_handles is None, the necessary input metadata is preserved. This is a correct and targeted fix for the identified problem.

yyDing1 (Collaborator) commented Feb 4, 2026

This is indeed a potential bug when using genrm for validation.
I think placing this change inside the agent loop would feel somewhat confusing for users.
We could first merge the batches using batch.union(gen_batch) (in the validation function) and then compute the reward, similar to what is done in the training process.
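A sketch of that alternative, assuming the validation code still holds the original test batch and that DataProto.union merges tensor and non-tensor fields as it does in the training path:

# In the validation function (sketch): merge the generated batch back into the
# original batch before scoring, so "data_source" and other input keys exist.
test_batch = test_batch.union(test_output_gen_batch_padded)
reward_result = self.reward_loop_manager.compute_rm_score(test_batch)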

none0663 (Contributor, Author) commented Feb 5, 2026

> This is indeed a potential bug when using genrm for validation.
> I think placing this change inside the agent loop would feel somewhat confusing for users.
> We could first merge the batches using batch.union(gen_batch) (in the validation function) and then compute the reward, similar to what is done in the training process.

I’d prefer to keep the fix in the agent loop rather than in ray_trainer.py. The missing data_source comes from agent_loop’s generate_sequences: its return value doesn’t carry the input’s non_tensor_batch when using a colocate RM, so fixing it in the trainer would mean adding merge logic in several places (validation, training, REMAX) and would be more cumbersome. Fixing it inside the agent loop addresses the cause in one place, and all callers (validation and training) get the correct non_tensor_batch without touching ray_trainer.py.

enable_async_reward = self.reward_loop_worker_handles is not None
if output.reward_score is None and enable_async_reward:
    batch = TensorDict(
        {
            "prompts": prompts,  # [1, prompt_length]
            "responses": responses,  # [1, response_length]
            "attention_mask": attention_mask,  # [1, prompt_length + response_length]
            "input_ids": input_ids,  # [1, prompt_length + response_length]
            "position_ids": position_ids,
        },
        batch_size=1,
    )
    non_tensor_batch = {
        **{k: np.array([v]) for k, v in kwargs.items()},
        "__num_turns__": np.array([output.num_turns]),
        "tool_extra_fields": np.array([output.extra_fields], dtype=object),
    }

output.extra_fields["reward_extra_info"] = result["reward_extra_info"]

- def _postprocess(self, inputs: list[_InternalAgentLoopOutput]) -> DataProto:
+ def _postprocess(self, inputs: list[_InternalAgentLoopOutput], **kwargs) -> DataProto:
Collaborator:

"kwargs" parameter is a bit odd here, explicit maybe better.

Contributor Author:

Yes, the name "kwargs" can be fixed.

Comment on lines 757 to 761
non_tensor_batch = {
    "__num_turns__": np.array([input.num_turns for input in inputs], dtype=np.int32),
}
if self.reward_loop_worker_handles is None and kwargs:
    non_tensor_batch.update(kwargs)
Collaborator:

Can we explicitly define what goes into non_tensor_batch? For example, specifying fields like "data_sources" might make things clearer.

none0663 (Contributor, Author) commented Feb 5, 2026

> Can we explicitly define what goes into non_tensor_batch? For example, specifying fields like "data_sources" might make things clearer.

We could mirror _compute_score and pass through the whole input non_tensor_batch (all keys) into the output, instead of whitelisting specific fields like "data_sources". That would keep behavior consistent and avoid maintaining an explicit list of keys. Should we do it that way?
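A sketch of that pass-through variant (hypothetical; input_non_tensor_batch stands for the forwarded input dict, however it ends up being named):

non_tensor_batch = {
    "__num_turns__": np.array([input.num_turns for input in inputs], dtype=np.int32),
}
if self.reward_loop_worker_handles is None:
    # Preserve every input key (e.g. "data_source") rather than a whitelist,
    # mirroring the behavior described above for _compute_score.
    non_tensor_batch.update(input_non_tensor_batch)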

none0663 (Contributor, Author) commented Feb 6, 2026

The two failing checks (NPU unit tests and pre-commit) do not appear to be related to the changes in this PR.

yyDing1 (Collaborator) commented Feb 6, 2026

Please run pre-commit run --all-files to format the modified files.
