Skip to content

[data] chore: add warning for multimodal length filter#5217

Open
Silas-11 wants to merge 1 commit intoverl-project:mainfrom
Silas-11:release
Open

[data] chore: add warning for multimodal length filter#5217
Silas-11 wants to merge 1 commit intoverl-project:mainfrom
Silas-11:release

Conversation

@Silas-11
Copy link
Contributor

@Silas-11 Silas-11 commented Feb 6, 2026

What does this PR do?

Add a warning log for multimodal prompt length filtering in the single-controller module, clarify the actual current implementation status (has implemented multimodal length calculation including image token counts) and the subsequent iteration plan (migrate the accurate multimodal filtering logic to AgentLoopWorker). The warning aims to remind developers of the high memory pressure risk caused by full loading of large-scale multimodal data when executing filtering in single-controller, without modifying any core filtering logic and ensuring the consistency between warning information and code behavior.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: N/A (No similar PR for multimodal filtering warning log in single-controller)
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • PR Title: [data, single_controller] chore: add warning log for multimodal filtering in single-controller
    • 说明:模块选择data(数据过滤)+single_controller(执行载体),类型选择chore(工程优化/日志补充,无核心功能修改,无API变更)

Test

This PR only adds a warning log without changing any core multimodal filtering logic — the original function (multimodal length calculation including image token counts, single-modal text filtering) remains fully effective. The warning log can be correctly triggered when filter_overlong_prompts=True and in multimodal scenes (processor is not None), with no abnormal output/process interruption. No additional experiment verification is required for pure log addition changes, and the original CI test cases can fully cover the code behavior.

API and Usage Example

This PR does not involve any API changes (no modification of CLI arguments, config files, function signatures, etc.), and the original calling method of maybe_filter_out_long_prompts remains unchanged. The warning log is automatically output during the execution of the filtering function without manual invocation, and the core logic of multimodal length calculation remains consistent.

Design & Code Changes

High-level Design: For the multimodal prompt length filtering logic (including image token counts) implemented in single-controller, add a professional warning log to clarify the actual code behavior and subsequent iteration plan, in response to the memory pressure risk of large-scale multimodal data full loading in single-controller. The warning information is fully consistent with the code logic, without any contradictory description, and no core filtering logic is modified.

Specific Changes:

  1. In the maybe_filter_out_long_prompts method of single-controller (verl/utils/dataset/rl_dataset.py), add a logger.warning log in the multimodal scene judgment branch (if processor is not None:)
  2. The log content clarifies three core points (100% consistent with code behavior):
    • Current status: Multimodal length filtering (including image token counts) has been implemented on single-controller;
    • Potential risk: May cause high memory pressure when processing large-scale multimodal data;
    • Subsequent plan: Will migrate accurate multimodal filtering logic to AgentLoopWorker for optimal performance.
  3. The log is formatted in compliance with project code specification (multi-line splicing, no overlong single line, passed pre-commit format check)
  4. No modification to other code logic, no addition/delete of dependencies, and the original filtering function (switch control, parallel processing, exception tolerance, multimodal length calculation) remains fully effective.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always (All checks passed, no code format issues)
  • Add / Update the documentation. (No documentation update required for pure warning log addition with no API/function changes)
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: This PR only adds a warning log without changing any core business logic (the original multimodal length calculation logic remains unchanged), and the original CI test cases can fully cover the code behavior, so no additional test cases are needed.
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).) (Will send CI request after PR submission)
  • If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main. (No relation to the recipe submodule, no operation required)

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a warning to clarify that multimodal length filtering is currently text-only. However, my review found a critical issue: the warning message is factually incorrect and contradicts the code that immediately follows it. The implementation appears to perform a full multimodal length calculation, including image and video tokens, which is the opposite of what the warning claims. This discrepancy needs to be resolved to avoid confusion.

Comment on lines +191 to +195
logger.warning(
"Multimodal length filtering currently considers text tokens only; "
"image/video token counts are excluded. "
"Accurate multimodal filtering will be implemented in AgentLoop."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The added warning states that multimodal length filtering considers only text tokens and excludes image/video tokens. However, the implementation of doc2len immediately following this warning appears to perform a full multimodal length calculation. It processes images and videos, passes them to processor(), and the length of the resulting input_ids will include tokens for all modalities. This contradicts the warning message.

This discrepancy is critical as it can mislead developers and users about the function's behavior and its memory usage. If the intention is to only filter based on text length to avoid memory pressure (as mentioned in the PR description), then the implementation of doc2len for the multimodal case should be modified to reflect that. As it stands, the warning is incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant