refactor: use hf Dataset instead of pandas DataFrame in RLHFDataset #890

0x404 · 2025-04-03T07:59:46Z

HF Dataset provides better memory management and can handle larger datasets. It also supports multi-process acceleration during map/filter operations (while pandas requires version >2.0).

Now we can enable `filter_overlong_prompts` on large-scale datasets by setting `filter_overlong_prompts_workers` to an appropriate number.
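As a rough illustration of the multi-process filtering described above, the sketch below builds a length predicate and hands it to `Dataset.filter` with `num_proc`. The helper names (`make_length_filter`, `filter_overlong_prompts`), the `prompt` column, and the whitespace tokenizer are assumptions made for this example and are not taken from the `RLHFDataset` code in this PR.

```python
from typing import Callable, Dict, List


def make_length_filter(
    tokenize: Callable[[str], List],
    max_prompt_length: int,
    prompt_key: str = "prompt",  # hypothetical column name
) -> Callable[[Dict], bool]:
    """Return a predicate that keeps examples whose prompt fits the token budget."""

    def keep(example: Dict) -> bool:
        return len(tokenize(example[prompt_key])) <= max_prompt_length

    return keep


def filter_overlong_prompts(dataset, tokenize, max_prompt_length, num_workers=1):
    """Filter a Hugging Face `datasets.Dataset`.

    `num_proc` asks `Dataset.filter` to spread the work over several processes;
    here `num_workers` stands in for `filter_overlong_prompts_workers`.
    """
    keep = make_length_filter(tokenize, max_prompt_length)
    return dataset.filter(keep, num_proc=num_workers, desc="Filtering overlong prompts")
```

With the `datasets` package installed, this could be called as `filter_overlong_prompts(Dataset.from_dict({"prompt": [...]}), str.split, max_prompt_length=1024, num_workers=8)`; a real setup would pass the model tokenizer instead of `str.split`.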


CLAassistant · 2025-04-03T07:59:52Z

All committers have signed the CLA.

…FDataset for speedup (volcengine#890) HF Dataset provides better memory management and can handle larger datasets. It also supports multi-process acceleration during map/filter operations (while pandas requires version >2.0). Now we can enable `filter_overlong_prompts` on large-scale datasets by setting `filter_overlong_prompts_workers` to an appropriate number. --------- Co-authored-by: hoshi-hiyouga <[email protected]>

twangnyc · 2025-04-11T17:15:35Z

One breaking behavior change here is that we can no longer point config.data.val_files or config.data.train_files at a directory, whereas previously pd.read_parquet supported reading directly from a directory... @0x404

0x404 · 2025-04-12T01:03:15Z

Hi @twangnyc, I think you can use a wildcard if you want to use an entire directory as your dataset. For example, you can set config.data.val_files to your_dir/*.parquet, and this should work.
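A minimal sketch of how a directory, a wildcard, or a single file could all be resolved to an explicit file list before loading. The helper name `expand_parquet_paths` is hypothetical and this is not necessarily how verl handles the path; it only illustrates the wildcard workaround suggested here.

```python
import glob
import os
from typing import List


def expand_parquet_paths(path: str) -> List[str]:
    """Resolve a file path, a wildcard pattern, or a directory to a sorted list of files."""
    if os.path.isdir(path):
        # Treat a bare directory as "every parquet file directly inside it",
        # which mirrors the your_dir/*.parquet wildcard.
        path = os.path.join(path, "*.parquet")
    # glob returns an empty list when nothing matches, so callers may want to check for that.
    return sorted(glob.glob(path))
```

The resulting list could then be passed on, for example as `data_files` to `datasets.load_dataset("parquet", data_files=...)`.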

twangnyc · 2025-04-13T15:34:40Z

@0x404 Thanks for your reply! Yes, we are using a wildcard as a temporary fix for our code, which worked previously and started failing. I noticed you opened a new PR to make this compatible with the pd.DataFrame behavior. Appreciate it!


### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR updates some outdated config docs:

- Add the `filter_overlong_prompts_workers` configuration option, which was introduced in #890
- Add documentation for the `actor_rollout_ref.rollout.val_kwargs` parameters, fixing #1352
- Fix attribution of several configuration options to their proper namespaces

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.



0x404 added 2 commits April 3, 2025 15:50

style: reformat code

06b63fa

hiyouga previously approved these changes Apr 3, 2025


fix ut

4c7f5ad

hiyouga dismissed their stale review via 4c7f5ad April 3, 2025 17:35

eric-haibin-lin approved these changes Apr 4, 2025


eric-haibin-lin merged commit 6974bba into volcengine:main Apr 4, 2025
24 checks passed

0x404 mentioned this pull request Apr 11, 2025

Truncation still does not work as intended #1015

Open

0x404 mentioned this pull request Apr 13, 2025

fix: filter overlong prompts should also consider multi modal inputs #1052

Open

0x404 mentioned this pull request May 2, 2025

docs: update config documentation with validation parameters #1355

Merged

6 tasks
