[trainer, data] feat: Dynamic Data Generation #2312

Merged
zhaochenyang20 merged 30 commits into verl-project:main from jwong8314:main on Jul 9, 2025

Conversation

@jwong8314 (Contributor) commented Jul 1, 2025

What does this PR do?

Add an interface to support dynamic data generation, which allows us to create new tasks between training steps.

To elaborate, this PR refactors the code and provides an interface that makes it easier to implement other dynamic data generation algorithms. In particular, we want the model to propose new tasks based on which tasks currently succeed or fail. This has been shown to be useful for web tasks and reasoning: https://arxiv.org/pdf/2506.14205, https://openreview.net/pdf?id=oVKEAFjEqv, https://arxiv.org/abs/2502.06776, https://arxiv.org/pdf/2505.03335.

A basic example of where this could be useful:

Imagine wanting to generate variations on the hardest tasks for the current training loop. We implement this as an LLM API call in a custom data generator, followed by a custom sampler that selects the desirable datapoints as they're generated.
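As a rough illustration, such a generator might look like the sketch below. The DataGen base class, the generate signature, the score column, and the llm.complete call are all hypothetical stand-ins; the abstract class added by this PR may differ.

```python
from abc import ABC, abstractmethod
import pandas as pd

class DataGen(ABC):
    """Hypothetical base class: return new rows to append at each step."""

    @abstractmethod
    def generate(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        ...

class HardTaskVariationGen(DataGen):
    """Ask an LLM for variations on the tasks the policy currently fails."""

    def __init__(self, llm, fail_threshold: float = 0.2, k: int = 8):
        self.llm = llm                       # hypothetical LLM API client
        self.fail_threshold = fail_threshold
        self.k = k

    def generate(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # Pick the hardest tasks by their latest success rate (assumes the
        # trainer records a per-row 'score'; verl may track this elsewhere).
        hard = dataframe[dataframe["score"] < self.fail_threshold].head(self.k)
        variations = [
            self.llm.complete(f"Write a harder variation of this task:\n{p}")
            for p in hard["prompt"]
        ]
        # A custom sampler can then filter these before they are appended.
        return pd.DataFrame({"prompt": variations})
```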

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: is:pr is:open data generation
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

bash examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn.sh

More details in the Usage Example section below.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

  1. Change the YAML to enable it:

```
--- a/verl/trainer/config/ppo_trainer.yaml
+++ b/verl/trainer/config/ppo_trainer.yaml
@@ -93,11 +93,11 @@ data:
 
     # The path to the file containing your customized data generation class.
     # E.g. 'verl.utils.dataset.datagen'
-    path: null 
+    path: 'verl.utils.dataset.datagen'
 
     # The class name of the data generation class within the specified file.
     # E.g. NoOpDataGen
-    name: null 
+    name: 'NoOpDataGen'
```
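For reference, a module path/class name pair like this is typically resolved via importlib; here is a minimal sketch (the helper name is invented, and verl's actual loading utility may differ, e.g. it may also accept file paths):

```python
import importlib

def load_datagen_class(path: str, name: str):
    """Hypothetical resolver for the (path, name) pair configured above."""
    module = importlib.import_module(path)  # e.g. 'verl.utils.dataset.datagen'
    return getattr(module, name)            # e.g. the NoOpDataGen class

# DataGenCls = load_datagen_class("verl.utils.dataset.datagen", "NoOpDataGen")
# datagen = DataGenCls()
```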

The no-op data generator simply re-appends the first datapoint to the end of the dataset. You can verify this happened correctly by printing the dataset size each epoch:

```
(TaskRunner pid=71298) step:0 - val-core/openai/gsm8k/reward/mean@1:0.668
(TaskRunner pid=71298) NoOpDataGen: No operation performed on the dataset.
Training Progress:   0%|          | 0/435 [00:00<?, ?it/s]
(WorkerDict pid=74307) /workplace/rl_workspace/src/AGIEmergeRL/vendor/verl_2/verl/verl/workers/rollout/sglang_rollout/utils.py:49: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.) [repeated 3x across cluster]
(WorkerDict pid=74307)   tensor_data = torch.ByteTensor(np.frombuffer(serialized_data, dtype=np.uint8)).to(device) [repeated 3x across cluster]
(TaskRunner pid=71298) filter dataset len: 1
(TaskRunner pid=71298) new dataset len: 7474
(TaskRunner pid=71298) 
Filtering prompts longer than 1024 tokens: 100%|██████████| 1/1 [00:00<00:00, 165.34 examples/s]
(TaskRunner pid=71298) 7474
(TaskRunner pid=71298) step:1 - global_seqlen/min:88786.000 - global_seqlen/max:101138.000 - global_seqlen/minmax_diff:12352.000 - global_seqlen/balanced_min:94905.000 - global_seqlen/balanced_max:94905.000 - global_seqlen/mean:94905.000 - actor/entropy:0.361 - actor/kl_loss:0.002 - actor/kl_coef:0.001 - actor/pg_loss:0.022 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:1.301 - perf/mfu/actor:0.107 - perf/max_memory_allocated_gb:7.201 - perf/max_memory_reserved_gb:12.896 - perf/cpu_memory_used_gb:57.490 - actor/lr:0.000 - training/global_step:1.000 - training/epoch:0.000 - critic/score/mean:0.677 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.677 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.021 - critic/advantages/max:1.500 - critic/advantages/min:-1.500 - critic/returns/mean:-0.021 - critic/returns/max:1.500 - critic/returns/min:-1.500 - response_length/mean:376.086 - response_length/max:1024.000 - response_length/min:58.000 - response_length/clip_ratio:0.020 - prompt_length/mean:365.359 - prompt_length/max:459.000 - prompt_length/min:327.000 - prompt_length/clip_ratio:0.000 - timing_s/generate_sequences:44.158 - timing_s/reshard:2.658 - timing_s/gen:47.161 - timing_s/reward:0.423 - timing_s/old_log_prob:15.347 - timing_s/ref:28.668 - timing_s/adv:0.039 - timing_s/update_actor:60.185 - timing_s/step:151.945 - timing_per_token_ms/gen:0.122 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.038 - timing_per_token_ms/update_actor:0.079 - perf/total_num_tokens:759240.000 - perf/time_per_step:151.945 - perf/throughput:624.599
(TaskRunner pid=71298) NoOpDataGen: No operation performed on the dataset.
Training Progress:   0%|          | 1/435 [02:32<18:24:31, 152.70s/it]
(TaskRunner pid=71298) filter dataset len: 1
(TaskRunner pid=71298) new dataset len: 7475
```

Note that the original dataset length is 7473 for gsm8k_w_tool.
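Judging from the log above (one new row per epoch, "filter dataset len: 1"), the no-op generator plausibly reduces to something like this sketch; the real class may differ:

```python
import pandas as pd

class NoOpDataGen:
    """Sketch of the observed behavior: re-append the first datapoint."""

    def generate(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        print("NoOpDataGen: No operation performed on the dataset.")
        # One returned row explains "filter dataset len: 1" and the growth
        # 7473 -> 7474 -> 7475 in the log above.
        return dataframe.iloc[[0]]
```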

High-Level Design

Demonstrate the high-level design if this PR is complex.

n/a

Specific Changes

List the specific changes.

  • Add an abstract datagen class that is used in ray_trainer.py to add data to the dataset.
  • Refactor filtering out of _read_files_and_tokenize in RLHFDataset.
  • Add append_dataframe to RLHFDataset (see the sketch after this list).
  • Add a util for getting the type from a file.
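
To make the dataset-side changes concrete, here is a rough sketch of how append_dataframe could compose with the refactored filtering. Everything except the append_dataframe and _read_files_and_tokenize names is an assumption, not verl's actual code:

```python
import datasets

class RLHFDataset:  # heavily abridged sketch, not the real class
    def __init__(self, dataframe: datasets.Dataset, max_prompt_length: int = 1024):
        self.max_prompt_length = max_prompt_length
        self.dataframe = self._filter(dataframe)

    def _filter(self, df: datasets.Dataset) -> datasets.Dataset:
        # Filtering now lives outside _read_files_and_tokenize so newly
        # generated rows can reuse it (this length check is a placeholder
        # for the real tokenizer-based filter).
        return df.filter(lambda row: len(row["prompt"]) <= self.max_prompt_length)

    def append_dataframe(self, new_rows: datasets.Dataset) -> None:
        new_rows = self._filter(new_rows)
        print(f"filter dataset len: {len(new_rows)}")
        self.dataframe = datasets.concatenate_datasets([self.dataframe, new_rows])
        print(f"new dataset len: {len(self.dataframe)}")
```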

Checklist Before Submitting

Important

Please check all the following items before requesting a review; otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace.

jwong8314 and others added 6 commits July 1, 2025 17:00
…to the batch during training

Co-authored-by: Justin Wong <wong.justin@berkeley.edu>
Co-authored-by: Frederick Robinson <frederick.robinson@frrad.com>
@jwong8314 (Contributor, Author)

Is there a way to unblock CI workflows while discussions are still ongoing?

jwong8314 requested a review from zhaochenyang20 on July 1, 2025 at 21:57
@zhaochenyang20 (Collaborator)

Is there a way to unblock CI workflows while discussions are still ongoing?

No, so we have to rerun many times. 🥲

@zhaochenyang20 (Collaborator)

Hey Justin, could you elaborate on what this PR is aiming at, especially what Dynamic Data Generation is?

I can help contact the verl team and get their feedback, but it's better for us to make it well-defined on our side first. Thanks!!

@jwong8314 (Contributor, Author)

To elaborate, this PR refactors the code and provides an interface that makes it easier to implement other dynamic data generation algorithms. In particular, we want the model to propose new tasks based on which tasks currently succeed or fail.

@jwong8314 (Contributor, Author)

Let me know if you have additional questions! It looks like the CI passed and it's ready to merge.

@eric-haibin-lin (Collaborator)

Do you intend to support dataloader save/resume with dynamic datagen?

@jwong8314 (Contributor, Author)

Although we currently do not support dataloader save and resume, this can be added in the future.

jwong8314 requested a review from eric-haibin-lin on July 7, 2025 at 06:28
@jwong8314 (Contributor, Author)

I'd additionally like to highlight that a custom dataset class won't be sufficient without the flags added in this PR to distinguish train vs. val datasets. Only the train dataset should be allowed to generate new training datapoints; the val dataset should remain fixed.
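
For illustration, the guard being argued for might look like this sketch (the function and flag are hypothetical, not the PR's API):

```python
def maybe_generate(dataset, datagen, is_train: bool) -> None:
    """Only the training split may grow; validation stays fixed."""
    if not is_train:
        return  # the val dataset must remain a stable benchmark
    new_rows = datagen.generate(dataset.dataframe)
    dataset.append_dataframe(new_rows)
```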

I do not think it's necessary to keep the DynamicGenDataset in verl. It can live in your private recipe repo, since verl already provides data.custom_cls to accept any custom dataset class.

@eric-haibin-lin (Collaborator) left a comment

ok with merging as soon as all tests pass

zhaochenyang20 merged commit ab11fff into verl-project:main on Jul 9, 2025
51 of 52 checks passed
lkc233 pushed a commit to lkc233/verl that referenced this pull request Jul 10, 2025
ArronHZG pushed a commit to imh966/verl that referenced this pull request Jul 10, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
@ASchneidman

I am fairly confident this implementation is incorrect. Data loader workers are on separate processes, and thus have their own in memory copies of the dataframe. Modifications to the dataframe after each batch will not reach the dataloader workers. Additionally, the dataloader will not see the change to the size of the dataset, so it will not get the new samples.
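
The concern can be reproduced outside verl with a toy PyTorch dataset; everything below is a hypothetical repro, not verl code:

```python
from torch.utils.data import Dataset, DataLoader

class GrowableDataset(Dataset):
    """Toy dataset whose backing list grows between epochs."""
    def __init__(self):
        self.items = list(range(4))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]

if __name__ == "__main__":
    ds = GrowableDataset()
    # num_workers > 0 gives each worker process its own private copy of `ds`;
    # persistent_workers keeps those stale copies alive across epochs.
    loader = DataLoader(ds, batch_size=1, num_workers=2, persistent_workers=True)

    _ = list(loader)      # epoch 0: workers start with a 4-item copy
    ds.items.append(999)  # grow the dataset in the main process only

    try:
        print([b.item() for b in loader])  # epoch 1
    except IndexError:
        # The main-process sampler now yields index 4, but each worker's
        # private copy still has only 4 items, so the fetch fails.
        print("workers never saw the appended row")
```

Without persistent_workers, workers are re-forked each epoch and would pick up the new row, but mid-epoch mutations still never reach already-running workers.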

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jan 20, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026