refactor: use hf Dataset instead of pandas DataFrame in RLHFDataset #890
HF Dataset provides better memory management than a pandas DataFrame and can handle larger datasets. It also supports multi-process acceleration for map/filter operations (whereas pandas requires version >2.0). We can now enable `filter_overlong_prompts` on large-scale datasets.
…FDataset for speedup (volcengine#890) HF Dataset provides better memory management and can handle larger datasets. It also supports multi-process acceleration during map/filter operations (while pandas requires version >2.0). Now we can specify `filter_overlong_prompts` on large-scale datasets when `filter_overlong_prompts_workers` is set to an appropriate number. --------- Co-authored-by: hoshi-hiyouga <[email protected]>
One breaking behavior change in this PR is that we can no longer point to a directory as the …
Hi @twangnyc, I think you can use a wildcard if you want to use the entire directory as your dataset. For example, you can set your …
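To make the wildcard suggestion concrete, here is a small standard-library sketch of how a pattern such as `<dir>/*.parquet` resolves to an explicit file list. The helper `expand_data_files` and the file names are hypothetical, and the exact config key the wildcard goes into is not shown in this thread, so none is assumed here.

```python
import glob
import os
import tempfile


def expand_data_files(pattern):
    """Return the sorted list of files matching a wildcard pattern."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Build a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as data_dir:
        for name in ("part-0.parquet", "part-1.parquet", "notes.txt"):
            open(os.path.join(data_dir, name), "w").close()

        files = expand_data_files(os.path.join(data_dir, "*.parquet"))
        # Only the parquet shards match; notes.txt is ignored.
        print([os.path.basename(f) for f in files])
```

A wildcard also keeps stray non-data files in the directory from being picked up, which a bare directory path would not guarantee.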
@0x404 Thanks for your reply! Yes, we are using a wildcard as a temporary fix for our code, which ran previously and is now failing. I noticed you raised a new PR to make it compatible with the pd.DataFrame behavior. Appreciate it!
…ine#1355)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR updates some outdated docs on config:

- Add the `filter_overlong_prompts_workers` configuration option, which was introduced in volcengine#890
- Add documentation for the `actor_rollout_ref.rollout.val_kwargs` parameters, fixing volcengine#1352
- Fix attribution of several configuration options to their proper namespaces

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.