[data] chore: add warning for multimodal length filter #5217
Silas-11 wants to merge 1 commit into verl-project:main
Code Review
This pull request adds a warning to clarify that multimodal length filtering is currently text-only. However, my review found a critical issue: the warning message is factually incorrect and contradicts the code that immediately follows it. The implementation appears to perform a full multimodal length calculation, including image and video tokens, which is the opposite of what the warning claims. This discrepancy needs to be resolved to avoid confusion.
    logger.warning(
        "Multimodal length filtering currently considers text tokens only; "
        "image/video token counts are excluded. "
        "Accurate multimodal filtering will be implemented in AgentLoop."
    )
The added warning states that multimodal length filtering considers only text tokens and excludes image/video tokens. However, the implementation of doc2len immediately following this warning appears to perform a full multimodal length calculation. It processes images and videos, passes them to processor(), and the length of the resulting input_ids will include tokens for all modalities. This contradicts the warning message.
This discrepancy is critical as it can mislead developers and users about the function's behavior and its memory usage. If the intention is to only filter based on text length to avoid memory pressure (as mentioned in the PR description), then the implementation of doc2len for the multimodal case should be modified to reflect that. As it stands, the warning is incorrect.
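To make the discrepancy concrete, here is a minimal, self-contained sketch of how a text-only length and a processor-based length can disagree for the same sample. It is not the actual `doc2len` from this repository: `toy_tokenize`, `toy_processor`, `IMAGE_PLACEHOLDER_ID`, and `IMAGE_TOKENS_PER_IMAGE` are hypothetical stand-ins, and real multimodal processors expand each image into a resolution-dependent number of tokens.

```python
# Toy illustration only; every name here is hypothetical.
IMAGE_PLACEHOLDER_ID = -1      # fake id standing in for an image token
IMAGE_TOKENS_PER_IMAGE = 4     # assumed fixed expansion per image


def toy_tokenize(text):
    """Whitespace 'tokenizer': one fake id per word."""
    return list(range(len(text.split())))


def toy_processor(text, images=None):
    """Stand-in for a multimodal processor: input_ids cover text AND images."""
    input_ids = toy_tokenize(text)
    for _ in images or []:
        input_ids.extend([IMAGE_PLACEHOLDER_ID] * IMAGE_TOKENS_PER_IMAGE)
    return {"input_ids": input_ids}


def text_only_len(doc):
    """Length as the warning describes it: text tokens only."""
    return len(toy_tokenize(doc["prompt"]))


def multimodal_len(doc):
    """Length as the review describes the code: all modalities counted."""
    return len(toy_processor(doc["prompt"], doc.get("images"))["input_ids"])


doc = {"prompt": "describe this picture", "images": ["img0", "img1"]}
max_prompt_length = 8

keep_text_only = text_only_len(doc) <= max_prompt_length    # 3 <= 8
keep_multimodal = multimodal_len(doc) <= max_prompt_length  # 11 <= 8
print(text_only_len(doc), multimodal_len(doc), keep_text_only, keep_multimodal)
```

Under these assumptions the same sample passes a text-only filter but is dropped by a processor-based one, so the warning text and the implementation need to describe the same computation.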
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
[data, single_controller] chore: add warning log for multimodal filtering in single-controller
data (data filtering) + single_controller (execution carrier); type chosen: chore (engineering cleanup / added logging, no core functionality changes, no API changes)
Test
API and Usage Example
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always (All checks passed, no code format issues)
ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).) (Will send CI request after PR submission)
recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main. (No relation to the recipe submodule, no operation required)