0x404 commented Apr 3, 2025

HF Dataset provides better memory management and can handle larger datasets. It also supports multi-process acceleration during map/filter operations (while pandas requires version >2.0).

Now `filter_overlong_prompts` can be enabled on large-scale datasets by setting `filter_overlong_prompts_workers` to an appropriate number.

0x404 added 2 commits April 3, 2025 15:50

CLAassistant commented Apr 3, 2025

CLA assistant check
All committers have signed the CLA.

hiyouga previously approved these changes Apr 3, 2025
@eric-haibin-lin eric-haibin-lin merged commit 6974bba into volcengine:main Apr 4, 2025
24 checks passed
yushengsu-thu pushed a commit to yushengsu-thu/verl that referenced this pull request Apr 4, 2025
…FDataset for speedup (volcengine#890)

HF Dataset provides better memory management and can handle larger
datasets. It also supports multi-process acceleration during map/filter
operations (while pandas requires version >2.0).

Now `filter_overlong_prompts` can be enabled on large-scale datasets
by setting `filter_overlong_prompts_workers` to an appropriate number.

---------

Co-authored-by: hoshi-hiyouga <[email protected]>
twangnyc commented

One breaking behavior change in this PR: we can no longer point `config.data.val_files` or `config.data.train_files` at a directory. Previously this worked because `pd.read_parquet` supports reading directly from a directory. @0x404

0x404 commented Apr 12, 2025

Hi @twangnyc, I think you can use a wildcard if you want to use an entire directory as your dataset. For example, set `config.data.val_files` to `your_dir/*.parquet`, and this should work.
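
To make the wildcard suggestion concrete, here is a small sketch of what a pattern like `your_dir/*.parquet` selects. It uses only the standard library's `glob` module and assumes the loader expands shell-style wildcards the same way; `expand_data_files` is a hypothetical helper, not a function from this repository.

```python
import glob
import os
import tempfile


def expand_data_files(pattern):
    """Return the sorted list of paths matching a shell-style wildcard pattern."""
    return sorted(glob.glob(pattern))


# Build a throwaway directory with two parquet-named files and one other file.
with tempfile.TemporaryDirectory() as your_dir:
    for name in ("part-0.parquet", "part-1.parquet", "notes.txt"):
        open(os.path.join(your_dir, name), "w").close()

    matched = expand_data_files(os.path.join(your_dir, "*.parquet"))
    matched_names = [os.path.basename(path) for path in matched]
    print(matched_names)
```

Only the files ending in `.parquet` are picked up, so stray files in the same directory are ignored, which a bare directory path would not guarantee.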

twangnyc commented

@0x404 Thanks for your reply! Yes, we are using a wildcard as a temporary fix for our code, which started failing after this change. I noticed you raised a new PR to make it compatible with the `pd.DataFrame` behavior. Appreciate it!

yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
…FDataset for speedup (volcengine#890)

histmeisah pushed a commit to SJTU-IAAR/verl that referenced this pull request Apr 27, 2025
…FDataset for speedup (volcengine#890)

vermouth1992 pushed a commit that referenced this pull request May 7, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?
This PR updates some outdated config docs:
- Add the `filter_overlong_prompts_workers` configuration option, which
was introduced in #890
- Add documentation for `actor_rollout_ref.rollout.val_kwargs`
parameters, fixing #1352
- Fix attribution of several configuration options to their proper
namespaces

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…FDataset for speedup (volcengine#890)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…ine#1355)
