diff --git a/docs/examples/config.rst b/docs/examples/config.rst
index 7d1a9f80e4a..eae9c8db39c 100644
--- a/docs/examples/config.rst
+++ b/docs/examples/config.rst
@@ -22,7 +22,8 @@ Data
   return_raw_input_ids: False # This should be set to true when the tokenizer between policy and rm differs
   return_raw_chat: False
   shuffle: True
-  filter_overlong_prompts: False # for large-scale dataset, filtering overlong prompts could be timeconsuming. You should disable this and set `truncation='left'`
+  filter_overlong_prompts: False
+  filter_overlong_prompts_workers: 1
   truncation: error
   image_key: images
   custom_cls:
@@ -42,7 +43,7 @@ Data
   left-padded to this length. An error will be reported if the length is
   too long
 - ``data.max_response_length``: Maximum response length. Rollout in RL
-  algorithms (e.g. PPO) generates up to this length 
+  algorithms (e.g. PPO) generates up to this length
 - ``data.train_batch_size``: Batch size sampled for one training
   iteration of different RL algorithms.
 - ``data.return_raw_input_ids``: Whether to return the original
@@ -53,9 +54,10 @@ Data
   chat_templates are different, this flag needs to be set
 - ``data.return_raw_chat``:
 - ``data.shuffle``: Whether to shuffle the data in the dataloader.
-- ``data.filter_overlong_prompts``: Default don't filter. You can filter for small-scale dataset.
-  For large-scale dataset, filtering overlong prompts could be timeconsuming.
-  You should disable this and set ``truncation='left'``
+- ``data.filter_overlong_prompts``: Whether to filter out overlong prompts. Disabled by default.
+- ``data.filter_overlong_prompts_workers``: For large-scale datasets, filtering
+  overlong prompts can be time-consuming, so you can set ``filter_overlong_prompts_workers``
+  to use multiprocessing for a speed-up. Defaults to 1 (see the example below).
 - ``data.truncation``: Truncate the input_ids or prompt length if they
   exceed max_prompt_length. Default is 'error', not allow exceed the
   max_prompt_length. The users should increase the max_prompt_length if
@@ -162,6 +164,13 @@ Actor/Rollout/Reference Policy
       swap_space: null # null means "use the engine default value" (usually 4 GB), setting it to, e.g., 32 means 32 GB
     # number of responses (i.e. num sample times)
     n: 1 # > 1 for grpo, rloo
+    val_kwargs:
+      # sampling parameters for validation
+      top_k: -1 # 0 for hf rollout, -1 for vllm rollout
+      top_p: 1.0
+      temperature: 0
+      n: 1
+      do_sample: False # greedy decoding by default for validation
 
 **Common config for actor, rollout and reference model**
 
@@ -268,25 +277,33 @@ Reference model will be enabled when ``actor.use_kl_loss`` or/and ``algorithm.us
 
 - ``temperature``, ``top_k``, ``top_p`` and others: Sampling parameters in ``SamplingParams``.
 
-- ``dtype``: Rollout model parameters type. This should be align with
+- ``actor_rollout_ref.rollout.dtype``: Rollout model parameters type. This should align with
   the actor model parameter type in FSDP/Megatron backend.
 
-- ``gpu_memory_utilization``: The proportion of the remaining GPU memory
+- ``actor_rollout_ref.rollout.gpu_memory_utilization``: The proportion of the remaining GPU memory
   allocated for kv cache after other models have initialized when using vllm.
 
-- ``tensor_model_parallel_size``: TP size for rollout. Only effective
+- ``actor_rollout_ref.rollout.tensor_model_parallel_size``: TP size for rollout. Only effective
   for vllm.
 
-- ``actor_rollout_ref.ref.log_prob_micro_batch_size``: [Will be deprecate, use log_prob_micro_batch_size_per_gpu]
+- ``actor_rollout_ref.rollout.log_prob_micro_batch_size``: [Will be deprecated; use log_prob_micro_batch_size_per_gpu]
   The batch size for one forward pass in the computation of ``log_prob``.
   The value represent the global num.
 
-- ``log_prob_micro_batch_size_per_gpu``: Micro batch size per gpu (The batch size for
+- ``actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu``: Micro batch size per gpu (the batch size for
   one forward pass) for recalculating ``log_prob``. The value represent the local num per gpu.
 
-- ``do_sample``: Whether to sample. If set to False, the rollout model
-  will perform greedy sampling. We disable ``do_sample`` during
-  validation.
+- ``actor_rollout_ref.rollout.do_sample``: Whether to sample during training rollouts. If set to False, the rollout model
+  will perform greedy sampling.
+
+- ``actor_rollout_ref.rollout.val_kwargs``: Sampling parameters used specifically during validation.
+
+  - ``top_k``: Top-k sampling parameter. Defaults to -1 for vLLM rollout or 0 for HF rollout.
+  - ``top_p``: Top-p sampling parameter. Defaults to 1.0 (disabled).
+  - ``temperature``: Sampling temperature. Defaults to 0 (deterministic greedy decoding).
+  - ``n``: Number of responses to generate during validation. Defaults to 1.
+  - ``do_sample``: Whether to use sampling during validation. Defaults to False for
+    deterministic outputs. When set to True, validation rollouts use the ``val_kwargs``
+    parameters (``top_k``, ``top_p``, ``temperature``) to control sampling behavior (see the example below).
 
 - ``actor_rollout_ref.rollout.engine_kwargs.swap_space``: swap space in GB used by the inference engine.
 - ``null``: means not setting and using the engine default value (usually, e.g., 4 GB for vLLM)
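+
+For example, here is a minimal sketch of enabling sampled validation rollouts via
+``val_kwargs`` (the specific values are illustrative, not tuned recommendations):
+
+.. code-block:: yaml
+
+    actor_rollout_ref:
+      rollout:
+        val_kwargs:
+          do_sample: True     # enable sampling at validation time
+          temperature: 0.7    # illustrative value, not a tuned recommendation
+          top_p: 0.9          # illustrative value
+          top_k: -1           # -1 disables top-k filtering for vLLM rollout
+          n: 2                # generate two responses per validation prompt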
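+
+Similarly, a minimal sketch of the ``data`` options above, enabling overlong-prompt
+filtering with multiprocessing (the worker count of 8 is illustrative):
+
+.. code-block:: yaml
+
+    data:
+      filter_overlong_prompts: True        # drop prompts that exceed max_prompt_length
+      filter_overlong_prompts_workers: 8   # illustrative worker count; defaults to 1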