43 changes: 30 additions & 13 deletions docs/examples/config.rst
@@ -22,7 +22,8 @@ Data
return_raw_input_ids: False # This should be set to true when the tokenizer between policy and rm differs
return_raw_chat: False
shuffle: True
filter_overlong_prompts: False # for large-scale dataset, filtering overlong prompts could be timeconsuming. You should disable this and set `truncation='left'`
filter_overlong_prompts: False
filter_overlong_prompts_workers: 1
truncation: error
image_key: images
custom_cls:
@@ -42,7 +43,7 @@ Data
left-padded to this length. An error will be reported if the length is
too long
- ``data.max_response_length``: Maximum response length. Rollout in RL
algorithms (e.g. PPO) generates up to this length
- ``data.train_batch_size``: Batch size sampled for one training
iteration of different RL algorithms.
- ``data.return_raw_input_ids``: Whether to return the original
@@ -53,9 +54,10 @@ Data
chat_templates are different, this flag needs to be set
- ``data.return_raw_chat``:
- ``data.shuffle``: Whether to shuffle the data in the dataloader.
- ``data.filter_overlong_prompts``: Default don't filter. You can filter for small-scale dataset.
For large-scale dataset, filtering overlong prompts could be timeconsuming.
You should disable this and set ``truncation='left'``
- ``data.filter_overlong_prompts``: Whether to filter out prompts longer than ``max_prompt_length``. Disabled by default.
- ``data.filter_overlong_prompts_workers``: For large-scale datasets, filtering
overlong prompts can be time-consuming. You can set ``filter_overlong_prompts_workers``
to use multiprocessing and speed it up. Defaults to 1. See the sketch after this list.
- ``data.truncation``: Truncate the input_ids or prompt length if they
exceed max_prompt_length. Default is 'error', which does not allow exceeding
max_prompt_length. Users should increase the max_prompt_length if
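For illustration only (not part of this diff), here is a minimal sketch of a ``data`` block combining the filtering and truncation options above. The concrete lengths and the worker count are assumed example values, not recommendations:

data:
  max_prompt_length: 512
  max_response_length: 512
  # small/medium dataset: drop overlong prompts up front, using several worker processes
  filter_overlong_prompts: True
  filter_overlong_prompts_workers: 8
  truncation: error
  # large-scale dataset alternative: skip the (slow) filter and truncate from the left
  # filter_overlong_prompts: False
  # truncation: left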
@@ -162,6 +164,13 @@ Actor/Rollout/Reference Policy
swap_space: null # null means "use the engine default value" (usually 4 GB), setting it to, e.g., 32 means 32 GB
# number of responses (i.e. num sample times)
n: 1 # > 1 for grpo, rloo
val_kwargs:
# sampling parameters for validation
top_k: -1 # 0 for hf rollout, -1 for vllm rollout
top_p: 1.0
temperature: 0
n: 1
do_sample: False # greedy decoding by default for validation
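As a hedged example (not part of this diff), the ``val_kwargs`` block above can be switched from its greedy defaults to sampled validation; the temperature, top_p and n values below are arbitrary illustrations:

val_kwargs:
  top_k: -1          # keep the vLLM default (no top-k cutoff)
  top_p: 0.95
  temperature: 0.7
  n: 4               # several validation responses per prompt
  do_sample: True    # enable sampling; with False, decoding stays greedy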

**Common config for actor, rollout and reference model**

@@ -268,25 +277,33 @@ Reference model will be enabled when ``actor.use_kl_loss`` or/and ``algorithm.us
- ``temperature``, ``top_k``, ``top_p`` and others: Sampling
parameters in ``SamplingParams``.

- ``dtype``: Rollout model parameters type. This should be align with
- ``actor_rollout_ref.rollout.dtype``: Rollout model parameter type. This should be aligned with
the actor model parameter type in FSDP/Megatron backend.

- ``gpu_memory_utilization``: The proportion of the remaining GPU memory
- ``actor_rollout_ref.rollout.gpu_memory_utilization``: The proportion of the remaining GPU memory
allocated for kv cache after other models have initialized when using
vllm.

- ``tensor_model_parallel_size``: TP size for rollout. Only effective
- ``actor_rollout_ref.rollout.tensor_model_parallel_size``: TP size for rollout. Only effective
for vllm.

- ``actor_rollout_ref.ref.log_prob_micro_batch_size``: [Will be deprecate, use log_prob_micro_batch_size_per_gpu]
- ``actor_rollout_ref.rollout.log_prob_micro_batch_size``: [Will be deprecated, use log_prob_micro_batch_size_per_gpu]
The batch size for one forward pass in the computation of ``log_prob``. The value represents the global number.

- ``log_prob_micro_batch_size_per_gpu``: Micro batch size per gpu (The batch size for
- ``actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu``: Micro batch size per GPU (the batch size for
one forward pass) for recalculating ``log_prob``. The value represents the local number per GPU.

- ``do_sample``: Whether to sample. If set to False, the rollout model
will perform greedy sampling. We disable ``do_sample`` during
validation.
- ``actor_rollout_ref.rollout.do_sample``: Whether to sample during training rollout. If set to False, the rollout model
will perform greedy sampling.

- ``actor_rollout_ref.rollout.val_kwargs``: Sampling parameters used specifically during validation.
- ``top_k``: Top-k sampling parameter. Defaults to -1 for vLLM rollout or 0 for HF rollout.
- ``top_p``: Top-p sampling parameter. Default is 1.0 (disabled).
- ``temperature``: Sampling temperature. Default is 0 (deterministic greedy).
- ``n``: Number of responses to generate during validation. Default is 1.
- ``do_sample``: Whether to use sampling during validation. Default is False for
deterministic outputs. When set to True, the rollout will use the ``actor_rollout_ref.rollout.val_kwargs`` parameters
(top_k, top_p, temperature) to control the sampling behavior.

- ``actor_rollout_ref.rollout.engine_kwargs.swap_space``: Swap space in GB used by the inference engine; see the sketch below.
- ``null``: means the value is not set and the engine default is used (usually 4 GB for vLLM)
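To tie the rollout options above together, a hedged sketch of one possible ``actor_rollout_ref.rollout`` configuration; the dtype, memory fraction, TP size and 32 GB swap value are illustrative assumptions only:

actor_rollout_ref:
  rollout:
    dtype: bfloat16                      # keep aligned with the actor parameter dtype
    gpu_memory_utilization: 0.5          # share of remaining GPU memory for the vLLM kv cache
    tensor_model_parallel_size: 2        # TP size of the rollout engine (vLLM only)
    log_prob_micro_batch_size_per_gpu: 8 # local micro batch size per GPU
    engine_kwargs:
      swap_space: 32                     # 32 GB swap space; null keeps the engine default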