Author: Yingru Li Last updated: 2025-11-04
📖 Documentation Structure
- This document - Mathematical theory: formulations, derivations, and algorithmic foundations
- Rollout Correction Usage Guide - Practical implementation: configurations, presets, troubleshooting
Start here for theory and design rationale; refer to the usage guide for implementation details.
This document provides the definitive mathematical formulations for rollout correction methods in verl, following the natural progression from REINFORCE to PPO to Decoupled PPO.
Rollout correction provides a unified framework to handle general off-policy problems in RL training - any scenario where the data collection distribution differs from the training distribution.
Applicable scenarios include:
- Policy mismatch: Different precision (FP8 vs FP16 vs BF16 vs FP32), different backends (vLLM vs SGLang vs FSDP vs Megatron)
- Temporal lag: Model staleness, asynchronous rollout workers
- Replay buffers: Training on historical trajectories from earlier policy versions
- Off-policy algorithms: Behavioral cloning, DAPO, expert demonstrations
- Data filtering: Reweighting, preference learning, curriculum learning
- Theoretical Foundation: From REINFORCE to Decoupled PPO
- Implementation in verl: The Three-Policy Framework
- Algorithmic Components and Combinations
- Off-Policy Diagnostic Metrics
- Summary and Decision Guide
- Implementation References
This section establishes the theoretical progression that verl implements.
The REINFORCE algorithm (Williams, 1992) is the foundation of policy gradient methods.
Vanilla REINFORCE (On-Policy)
For trajectories $\tau \sim \pi_\theta$ sampled from the current policy, the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \in T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, A_t\right]$$

where $A_t$ is the advantage at token $t$ and $T$ is the set of valid tokens.
Off-Policy REINFORCE
When trajectories are sampled from a different behavior policy $\mu \neq \pi_\theta$, importance sampling corrects the gradient estimate:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \mu}\left[w(\tau) \sum_{t \in T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, A_t\right]$$

where the trajectory-level importance weight is:

$$w(\tau) = \prod_{t \in T} \frac{\pi_\theta(a_t|s_t)}{\mu(a_t|s_t)}$$

The transition dynamics cancel between numerator and denominator, so the weight depends only on per-token policy ratios.
Key properties:
- Off-policy capable: Can learn from any behavior policy via importance sampling
- No trust region: Policy updates not constrained
Implementation in verl: The pg_is method implements off-policy REINFORCE with truncated importance sampling.
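As an illustrative sketch (not the verl implementation), the trajectory-level weight can be computed in log space from per-token log probabilities; `trajectory_is_weight` and the truncation constant `c_is` are hypothetical names here:

```python
import math
import torch

def trajectory_is_weight(logp_target, logp_behavior, mask, c_is=2.0):
    # log w(tau) = sum over valid tokens of (log pi_theta - log mu);
    # summing in log space avoids under/overflow from long products.
    log_w = ((logp_target - logp_behavior) * mask).sum(dim=-1)
    # Truncate at c_is to bound the variance of the estimator.
    return torch.exp(log_w).clamp(max=c_is)

# Two toy sequences of 3 tokens each.
logp_pi = torch.tensor([[-1.0, -1.2, -0.8], [-0.5, -0.6, -0.7]])
logp_mu = torch.tensor([[-1.1, -1.2, -0.9], [-1.5, -1.6, -1.7]])
mask = torch.ones(2, 3)
w = trajectory_is_weight(logp_pi, logp_mu, mask)
# Sequence 1: exp(0.2) ~ 1.22; sequence 2: exp(3.0) ~ 20.1, truncated to 2.0.
```

Truncation is what distinguishes the practical estimator from the unbiased one: it trades a small bias for bounded variance.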
Proximal Policy Optimization (Schulman et al., 2017) adds a clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\mu(a_t|s_t)}$ is the probability ratio against the reference policy $\mu$ and $\epsilon$ is the clip range.
Key properties:
- Two policies: $\mu$ (reference for clipping) and $\pi_\theta$ (being updated)
- Trust region via clipping: Limits policy update magnitude via the ratio $r_t(\theta) = \frac{\pi_\theta}{\mu}$
Decoupled PPO (Hilton et al., 2021) solves PPO's batch size sensitivity by decoupling two roles:
- Proximal policy $\pi_{\text{prox}}$: The anchor policy for PPO clipping (controls policy update size)
- Behavior policy $\mu$: The policy that collected the data (for off-policy correction via importance sampling)

The problem: Standard PPO controls policy update size via the single ratio $r_t(\theta) = \frac{\pi_\theta}{\mu}$, conflating the clipping anchor with the data collection policy; the effective update size then depends on how data is aggregated into batches.

The solution: Decouple these two roles, leading to a three-policy formulation:

$$L^{\text{decoupled}}(\theta) = \mathbb{E}_t\left[w_t \min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$$
where:
- $w_t = \frac{\pi_{\text{prox}}(a_t|s_t)}{\mu(a_t|s_t)}$: Importance sampling weight (corrects for behavior policy $\mu$)
- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{prox}}(a_t|s_t)}$: PPO ratio (controls policy update size against proximal policy $\pi_{\text{prox}}$)
Key properties gained by decoupling:
- Batch size invariance: Policy update control (via $\pi_{\text{prox}}$) is independent of data aggregation
- Flexible behavior policy: Any $\mu$ can be used (different workers, replay buffers, or stale checkpoints)
- Stale data utilization: Older trajectories can be corrected via importance sampling
- Clipping preserved: Clipping against $\pi_{\text{prox}}$ limits update magnitude
This is the algorithm that verl implements via its three-policy framework.
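A minimal PyTorch sketch of the decoupled loss, using hypothetical tensor names (per-token log probs under the three policies); the actual verl loss lives in `core_algos.py`:

```python
import torch

def decoupled_ppo_loss(logp_theta, logp_old, logp_rollout, adv, mask,
                       eps=0.2, c_is=2.0):
    # Drift 1: truncated IS weight against the behavior policy (no gradient).
    w = torch.exp(logp_old - logp_rollout).clamp(max=c_is).detach()
    # Drift 2: PPO ratio against the proximal (anchor) policy.
    r = torch.exp(logp_theta - logp_old)
    surrogate = torch.min(r * adv, r.clamp(1 - eps, 1 + eps) * adv)
    # Negated masked mean: minimizing the loss maximizes the objective.
    return -(w * surrogate * mask).sum() / mask.sum()
```

When all three log probs coincide (fully on-policy), $w = r = 1$ and the loss reduces to the negated mean advantage, as expected.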
The verl library implements decoupled PPO using three distinct policies, each serving a specific role.
$\pi_{\text{rollout}}$ (behavior policy):
- When created: During rollout/data collection phase
- Purpose: Generate trajectories for training
- Common sources:
- Policy mismatch: Same weights, different implementation (precision, backend)
- Temporal lag: Stale checkpoint from async workers
- Replay buffer: Historical data from earlier iterations
- Off-policy algorithms: Expert demonstrations, auxiliary policies (DAPO)
- Data filtering: Reweighted or filtered data
- Fixed: Frozen during training on a batch
$\pi_{\text{old}}$ (proximal policy):
- When created:
  - Decoupled mode: Computed at the start of each training epoch via `actor.compute_log_prob()`
  - Bypass mode: Set equal to $\pi_{\text{rollout}}$ (skips separate computation)
- Purpose:
  - Anchor point for PPO clipping (controls policy update size)
  - When separate from $\pi_{\text{rollout}}$: Enables batch size invariance and efficient use of stale data
- Fixed: Frozen during all PPO update epochs on the same batch
$\pi_{\theta}$ (current policy):
- Updated: Every gradient step
- Purpose: The policy we're improving
The three-policy framework can operate in two modes:
Decoupled Mode (Three Policies)
- Computes $\pi_{\text{old}}$ separately at the start of each training epoch
- Algorithm: Full decoupled PPO with three policies (mathematically correct)
- Properties: Achieves batch size invariance; separately corrects Drift 1 (rollout→old) and Drift 2 (old→current)
Bypass Mode (Two Policies)
- Sets $\pi_{\text{old}} = \pi_{\text{rollout}}$ (skips separate computation)
- Algorithm: Uses $\pi_{\text{rollout}}$ as both behavior policy and proximal policy (mathematically correct)
- Key difference: Proximal policy equals behavior policy, so no IS correction is needed between them
- Properties: Faster (skips the `actor.compute_log_prob()` call); does not achieve batch size invariance
The three-policy framework handles two types of distribution drift:
Drift 1: $\pi_{\text{rollout}} \to \pi_{\text{old}}$

This is the distribution shift between the data collection policy and the training reference policy.
- Nature: Ranges from negligible (same checkpoint, minor numerical differences) to severe (replay buffers, expert data)
- Correction: Importance sampling weight $w_t = \frac{\pi_{\text{old}}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)}$
- Optional: Can be ignored (bypass mode) when negligible
Drift 2: $\pi_{\text{old}} \to \pi_{\theta}$

This is the drift from policy parameter updates during training.
- Nature: Occurs as $\pi_\theta$ is updated via gradient descent
- Correction: PPO clipping on the ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$
- Universal: Applies to both on-policy and off-policy training
Notation used throughout:
- $\pi_{\text{rollout}}$: Behavior policy (data collection)
- $\pi_{\text{old}}$: Proximal policy (PPO anchor)
- $\pi_{\theta}$: Current policy (being updated)
- $\rho_t = \frac{\pi_{\text{old}}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)}$: Per-token IS ratio (corrects Drift 1)
- $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$: PPO ratio (corrects Drift 2)
- $A_t$: Advantage at token $t$
- $T$: Set of valid tokens in a sequence
- $C_{\text{IS}}$: Upper threshold for IS weights (e.g., 2.0)
- $C_{\text{RS-upper}}$: Upper threshold for RS mask (e.g., 2.0)
- $C_{\text{RS-lower}}$: Lower threshold for RS mask (typically $1/C_{\text{RS-upper}}$)
- $\epsilon$: PPO clip range (typically 0.2)
The rollout correction framework in verl is built from orthogonal components that can be combined flexibly:
- Operating Mode: How $\pi_{\text{old}}$ is computed (Decoupled vs Bypass)
- Loss Function: PPO (with clipping) vs Pure IS (policy gradient only)
- IS/RS Aggregation Level: Token, Sequence, or Geometric
- Safety Mechanisms: Veto for catastrophic outliers
This section explains each component and their valid combinations.
The operating mode determines how the proximal policy $\pi_{\text{old}}$ is computed.
Configuration: bypass_old_logprob_for_rollout = false
Policy setup:
- $\pi_{\text{rollout}}$: Behavior policy (data collection)
- $\pi_{\text{old}}$: Proximal policy (computed via `actor.compute_log_prob()` at start of training epoch)
- $\pi_{\theta}$: Current policy (being updated)

IS ratio: $\rho_t = \frac{\pi_{\text{old}}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)}$

PPO ratio: $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$
Properties:
- ✅ Achieves batch size invariance
- ✅ Separately corrects the two distribution drifts
- ✅ Efficient stale data utilization
- ❌ Extra forward pass needed (`actor.compute_log_prob()`)
Configuration: bypass_old_logprob_for_rollout = true
Policy setup:
- $\pi_{\text{rollout}}$: Behavior policy (data collection)
- $\pi_{\text{old}} = \pi_{\text{rollout}}$: Proximal policy equals behavior policy
- $\pi_{\theta}$: Current policy (being updated)

IS ratio: $\rho_t = \frac{\pi_{\text{old}}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)} = 1$ (no correction needed)

PPO ratio: $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)}$
Properties:
- ✅ Faster: Skips the `actor.compute_log_prob()` call
- ✅ Mathematically correct: Uses the actual behavior policy as the proximal policy
- ❌ Does not achieve batch size invariance
Configuration: use_pure_rollout_correction = false
Loss function:

$$L(\theta) = \mathbb{E}_t\left[w_t \min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$$

where:
- $w_t$: IS weight (depends on aggregation level, see Section 3.3)
- $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$: PPO ratio
- $\epsilon$: Clip range (typically 0.2)
Properties:
- Trust region control via clipping
- Limits policy update magnitude
- Standard in RL training
Configuration: use_pure_rollout_correction = true (requires bypass_old_logprob_for_rollout = true)
Loss function:

$$L(\theta) = -\mathbb{E}\left[w_{\text{seq}}(\theta) \sum_{t \in T} \log \pi_{\theta}(a_t|s_t) \, A_t\right]$$

where:
- $w_{\text{seq}}(\theta) = \min\left( \prod_{t \in T} \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)}, C_{\text{IS}} \right)$: Sequence-level IS weight
- The IS weight is detached from the gradient (treated as a constant)

Effective gradient:

$$\nabla_\theta L(\theta) = -\mathbb{E}\left[w_{\text{seq}} \sum_{t \in T} \nabla_\theta \log \pi_{\theta}(a_t|s_t) \, A_t\right]$$
Properties:
- Algorithm: Off-policy REINFORCE + IS
- No PPO clipping: Pure policy gradient
- Always uses bypass mode: Direct $\pi_\theta$ to $\pi_{\text{rollout}}$ comparison
- Fast: Single forward pass
Implementation: compute_policy_loss_with_rollout_correction() in core_algos.py
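A sketch of the pure-IS policy gradient loss under the definitions above (hypothetical tensor and function names; see `compute_policy_loss_with_rollout_correction()` for the real code):

```python
import torch

def pure_is_pg_loss(logp_theta, logp_rollout, adv, mask, c_is=2.0):
    # Sequence-level weight: product of per-token ratios, computed in log space.
    log_w = ((logp_theta - logp_rollout) * mask).sum(dim=-1, keepdim=True)
    # Truncate and detach: the weight is treated as a constant in the gradient.
    w_seq = torch.exp(log_w).clamp(max=c_is).detach()
    # REINFORCE term: -E[ w_seq * sum_t log pi_theta(a_t|s_t) * A_t ].
    return -(w_seq * logp_theta * adv * mask).sum() / mask.sum()
```

The `detach()` is what makes this match the stated effective gradient: differentiating the loss hits only the $\log \pi_\theta$ terms, not the weight.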
The aggregation level determines how per-token probability ratios are combined into IS weights and/or rejection masks. This choice is orthogonal to the operating mode - you can use any aggregation level in either decoupled or bypass mode.
IS weights: $w_t = \min(\rho_t, C_{\text{IS}})$, truncated independently per token

Configuration:

```
rollout_is = "token"  # IS weights
rollout_rs = "token"  # Optional: rejection sampling
```

Properties:
- Independent truncation per token
- Stable for moderate distribution shifts
- Typical threshold: 1.5 - 5.0
- Optional batch normalization (§3.6): Normalizes over all token weights to ensure $\mathbb{E}[\tilde{w}_t] = 1$ (reduces variance)
Loss function (REINFORCE + Token IS):

$$L(\theta) = -\mathbb{E}\left[\sum_{t \in T} w_t \log \pi_{\theta}(a_t|s_t) \, A_t\right]$$

where $w_t = \min(\rho_t, C_{\text{IS}})$ is the truncated per-token IS weight.
Implementation:
- IS weights: `compute_rollout_correction_weights()` in rollout_corr_helper.py
- Loss: `compute_policy_loss()` in core_algos.py
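For illustration (hypothetical helper; the real one is `compute_rollout_correction_weights()`), token-level truncation is simply a per-token clamp on the ratio:

```python
import math
import torch

def token_is_weights(logp_old, logp_rollout, c_is=2.0):
    # Per-token ratio rho_t = pi_old / pi_rollout, truncated independently.
    rho = torch.exp(logp_old - logp_rollout)
    return rho.clamp(max=c_is)

# Log-ratio 0 -> weight 1; log-ratio 1.5 -> exp(1.5) ~ 4.48, truncated to 2.0.
rho_w = token_is_weights(torch.tensor([0.0, 1.5, -0.5]),
                         torch.tensor([0.0, 0.0, 0.0]))
```

Because each token is truncated on its own, one outlier token cannot zero out or blow up the whole sequence, which is why this level tolerates moderate drift well.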
IS weights: $w_{\text{seq}} = \min\left(\prod_{t \in T} \rho_t, \; C_{\text{IS}}\right)$, applied uniformly to every token in the sequence

Configuration:

```
rollout_is = "sequence"  # IS weights
rollout_rs = "sequence"  # Optional: rejection sampling
```

Properties:
- Multiplicative aggregation across the sequence
- More sensitive to outliers than token-level
- Typical threshold: 2.0 - 10.0
- Optional batch normalization (§3.6): Normalizes over sequence means (one weight per sequence)
Loss function (REINFORCE + Sequence IS):

$$L(\theta) = -\mathbb{E}\left[w_{\text{seq}} \sum_{t \in T} \log \pi_{\theta}(a_t|s_t) \, A_t\right]$$

where $w_{\text{seq}} = \min\left(\prod_{t \in T} \rho_t, \; C_{\text{IS}}\right)$.
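A sketch of the sequence-level aggregation (hypothetical helper name), broadcasting one truncated weight per sequence to all of its tokens:

```python
import math
import torch

def sequence_is_weights(logp_old, logp_rollout, mask, c_is=3.0):
    # Log of the product of per-token ratios over valid tokens.
    log_w = ((logp_old - logp_rollout) * mask).sum(dim=-1, keepdim=True)
    w_seq = torch.exp(log_w).clamp(max=c_is)
    # Every token in a sequence shares the same sequence-level weight.
    return w_seq.expand_as(logp_old)

lo = torch.tensor([[0.5, 0.5], [0.0, 0.0]])
w = sequence_is_weights(lo, torch.zeros(2, 2), torch.ones(2, 2))
# Sequence 1: exp(1.0) ~ 2.72 on both tokens; sequence 2: 1.0 on both tokens.
```

Summing log ratios before exponentiating is the numerically stable way to form the product; it also makes the outlier sensitivity explicit, since one large log ratio shifts the whole sequence weight.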
IS weights (for rejection only): $\rho_{\text{geo}} = \left(\prod_{t \in T} \rho_t\right)^{1/|T|}$, the geometric mean of per-token ratios

Configuration:

```
rollout_is = null         # No IS weights, pure rejection
rollout_rs = "geometric"  # Rejection sampling only
```

Properties:
- Geometric mean of per-token ratios
- Extremely sensitive to outliers
- Typical threshold: 1.0001 - 1.001 (very tight!)
- Used for rejection sampling only, not IS weighting
Why tight thresholds?

For 100 tokens, each with per-token ratio $\rho_t = 1.01$:
- Arithmetic product: $\prod_{t=1}^{100} \rho_t = 1.01^{100} \approx 2.7$
- Geometric mean: $\left(1.01^{100}\right)^{1/100} = 1.01$

A threshold of 1.001 therefore means rejecting sequences whose average per-token deviation exceeds 0.1%.
Loss function (REINFORCE + Geometric RS):

$$L(\theta) = -\mathbb{E}\left[\mathbb{1}[\text{seq} \in \mathcal{A}_{\text{geo}}] \sum_{t \in T} \log \pi_{\theta}(a_t|s_t) \, A_t\right]$$

where $\mathcal{A}_{\text{geo}} = \{ \text{seq} : C_{\text{RS-lower}} \leq \rho_{\text{geo}} \leq C_{\text{RS-upper}} \}$ is the acceptance set. This formulation can also be combined with PPO clipping.
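The rejection test can be sketched as follows (hypothetical helper; the actual mask is built by `compute_rollout_rejection_mask()`):

```python
import math
import torch

def geometric_rs_accept(logp_old, logp_rollout, mask, upper=1.001, lower=None):
    # Lower threshold defaults to the reciprocal of the upper one.
    lower = 1.0 / upper if lower is None else lower
    n_tok = mask.sum(dim=-1).clamp(min=1)
    # Geometric mean = exp(mean of per-token log ratios over valid tokens).
    rho_geo = torch.exp(((logp_old - logp_rollout) * mask).sum(dim=-1) / n_tok)
    return (rho_geo >= lower) & (rho_geo <= upper)

# 100 tokens with a uniform per-token ratio of 1.01: the geometric mean is
# 1.01, which an upper threshold of 1.001 rejects.
diff = math.log(1.01) * torch.ones(1, 100)
accept = geometric_rs_accept(diff, torch.zeros(1, 100), torch.ones(1, 100))
```

This makes the tight-threshold point concrete: the geometric mean normalizes out sequence length, so the threshold bounds the average per-token deviation directly.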
Rejection sampling can be added to any combination of operating mode and aggregation level. It modifies the response_mask to exclude outlier tokens/sequences.
Configuration:

```
rollout_rs = "token"              # or "sequence" or "geometric"
rollout_rs_threshold = 2.0        # Upper threshold
rollout_rs_threshold_lower = 0.5  # Lower threshold (auto-reciprocal if null)
```

Acceptance sets:
- Token-level: $\mathcal{A}_{\text{token}} = \{ t : C_{\text{RS-lower}} \leq \rho_t \leq C_{\text{RS-upper}} \}$
- Sequence-level: $\mathcal{A}_{\text{seq}} = \{ \text{seq} : C_{\text{RS-lower}} \leq \prod_{t \in T} \rho_t \leq C_{\text{RS-upper}} \}$
- Geometric: $\mathcal{A}_{\text{geo}} = \{ \text{seq} : C_{\text{RS-lower}} \leq \rho_{\text{geo}} \leq C_{\text{RS-upper}} \}$
Properties:
- Separate from IS weighting (can use RS without IS)
- Reduces effective sample size
- Filters extreme outliers
Implementation: compute_rollout_rejection_mask() in rollout_corr_helper.py
An independent safety layer that rejects sequences with catastrophically low token probabilities.
Configuration:

```
rollout_token_veto_threshold = 1e-4  # null = disabled
```

Veto condition: a sequence is rejected if any valid token's unclamped ratio falls below the threshold:

$$\exists\, t \in T : \rho_t < C_{\text{veto}}$$
Properties:
- Prevents catastrophic updates from tokens with near-zero probability
- Independent of IS/RS settings (always applied if enabled)
- Checks unclamped per-token ratios before safety bounds
- Typical values: $10^{-4}$ to $10^{-6}$
Implementation: rollout_corr_helper.py
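A sketch of the veto check (hypothetical helper name), applied to unclamped per-token ratios:

```python
import math
import torch

def veto_keep_mask(logp_old, logp_rollout, mask, threshold=1e-4):
    rho = torch.exp(logp_old - logp_rollout)
    # Padded positions get ratio 1 so they can never trigger the veto.
    rho = torch.where(mask.bool(), rho, torch.ones_like(rho))
    # Keep a sequence only if every valid token's ratio clears the threshold.
    return (rho >= threshold).all(dim=-1)
```

Because the check runs on unclamped ratios, a single near-zero-probability token vetoes the whole sequence even when truncation would otherwise have hidden it.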
An optional variance reduction technique that normalizes IS weights to have mean 1.0 within each batch.
Configuration:

```
rollout_is_batch_normalize = True  # Default: False
```

Normalization formula (aggregation-aware):

For token-level IS (§3.3.1):

$$\tilde{w}_t = \frac{w_t}{\frac{1}{\sum_{i,t} m_{i,t}} \sum_{i,t} w_{i,t} \cdot m_{i,t}}$$

where $m_{i,t}$ is the response mask and normalization is over all valid tokens in the batch.

For sequence-level IS (§3.3.2):

$$\tilde{w}_i = \frac{w_i}{\frac{1}{B}\sum_{j=1}^B \bar{w}_j}$$

where $\bar{w}_j = \frac{1}{T_j}\sum_{t=1}^{T_j} w_{j,t} \cdot m_{j,t}$ is the per-sequence mean (all tokens in a sequence have the same weight), and normalization is over sequences.
Properties:
- Applied after truncation to preserve truncation semantics
- Ensures $\mathbb{E}[\tilde{w}] = 1$ within each batch
- Aggregation-aware: Token-level normalizes over tokens; sequence-level normalizes over sequences
- Uses `masked_mean` to respect padding tokens
- Reduces gradient magnitude variance by removing random batch-level scale fluctuations
Metrics:
- `rollout_is_batch_norm_factor`: The normalization factor applied (batch mean before normalization)
Implementation: rollout_corr_helper.py
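A sketch of the token-level normalization under these assumptions (hypothetical function name; verl's version goes through `masked_mean`):

```python
import torch

def batch_normalize_token_is(w, mask):
    # Masked batch mean of the (already truncated) weights.
    norm = (w * mask).sum() / mask.sum()
    # Divide so the weights have mean 1.0 over valid tokens in the batch.
    return w / norm, norm

w = torch.tensor([[2.0, 2.0], [2.0, 2.0]])
w_norm, factor = batch_normalize_token_is(w, torch.ones(2, 2))
# factor is 2.0; the normalized weights are all 1.0
```

Normalizing after truncation keeps the truncation threshold meaningful: clamped weights stay clamped relative to the rest of the batch, and only the overall scale is removed.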
| Preset Method | Mode | IS Level | RS Level | Properties |
|---|---|---|---|---|
| `decoupled_token_is()` | Decoupled | token | - | Per-token IS weights |
| `decoupled_seq_is()` | Decoupled | sequence | - | Sequence-level IS weights |
| `decoupled_seq_is_rs()` | Decoupled | sequence | sequence | Sequence IS + sequence RS |
| `decoupled_geo_rs()` | Decoupled | - | geometric + veto | Geometric RS + veto, no IS weights |
| `ppo_is_bypass()` | Bypass | - | - | Bypass mode, skips old_log_prob |
| `pg_rs()` | Bypass | - | geometric + veto | Policy gradient with RS (no IS weights) |
| `pg_is()` | Bypass | sequence | - | Policy gradient with IS |
| `disabled()` | - | - | - | Metrics only, no correction |

Note: All presets use the PPO loss except `pg_is()` and `pg_rs()`, which use the pure policy gradient (both require `use_pure_rollout_correction=True`).
These combinations are fully supported but require manual configuration:
1. Token IS + Token RS
```python
config = RolloutCorrectionConfig(
    rollout_is="token",
    rollout_is_threshold=2.0,
    rollout_rs="token",
    rollout_rs_threshold=2.0,
)
```

Properties: Token-level IS weights + token-level RS mask.
2. Pure Token RS
```python
config = RolloutCorrectionConfig(
    rollout_is=None,
    rollout_rs="token",
    rollout_rs_threshold=2.0,
)
```

Properties: Token-level RS mask only, no IS weights.
3. Pure Sequence RS
```python
config = RolloutCorrectionConfig(
    rollout_is=None,
    rollout_rs="sequence",
    rollout_rs_threshold=2.0,
)
```

Properties: Sequence-level RS mask only, no IS weights.
Key properties:
- Any IS aggregation level (token/sequence) can be used in either decoupled or bypass mode
- Rejection sampling can be added to any combination
- Veto is independent and can be added to any combination
- Geometric aggregation is typically used for RS only (not IS weighting)
- Pure RS (`pg_rs`) uses bypass + geometric RS with `use_pure_rollout_correction=True` for a pure policy gradient (no IS weights)
- All combinations in the table above are valid and supported by the implementation
Theory: A naive LLM-RL implementation incorrectly applies PPO by ignoring the actual rollout policy, implicitly assuming $\pi_{\text{rollout}} = \pi_{\text{old}}$.
Note: This incorrect implementation pattern was identified in Liu, Li, et al. (2025) as a key cause of training instability in LLM-RL systems, motivating the development of this rollout correction framework.
Loss Function:

$$L(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$, with no correction for $\pi_{\text{rollout}}$.
Why it's wrong:
- Ignores $\pi_{\text{rollout}}$: Uses $\pi_{\text{old}}$ as the behavior policy instead of the actual $\pi_{\text{rollout}}$
- Policy mismatch: In LLM-RL, rollout typically uses a different precision/backend/checkpoint than training, causing $\pi_{\text{rollout}} \neq \pi_{\text{old}}$ even with the same model weights
- Not PPO's fault: PPO itself is correct; the issue is the incorrect assumption
Correct alternatives:
- Decoupled mode: Three policies with IS correction from $\pi_{\text{rollout}}$ to $\pi_{\text{old}}$
- Bypass mode: Two policies using $\pi_{\text{rollout}}$ as both behavior policy and proximal policy
- Pure IS mode: Two policies with IS correction and no PPO clipping
Implementation: compute_policy_loss() in core_algos.py
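A toy contrast of the two ratio computations (illustrative values only):

```python
import torch

# Per-token log probs where rollout and training engines disagree slightly
# (e.g. different precision), so pi_rollout != pi_old despite identical weights.
logp_rollout = torch.tensor([-1.05, -0.98, -1.10])
logp_old     = torch.tensor([-1.00, -1.00, -1.00])
logp_theta   = torch.tensor([-0.95, -1.02, -0.99])

# Naive: ratio against pi_old only; the rollout/old mismatch is silently ignored.
r_naive = torch.exp(logp_theta - logp_old)

# Decoupled: the IS weight corrects Drift 1, the PPO ratio handles Drift 2.
w = torch.exp(logp_old - logp_rollout)  # != 1 whenever the engines disagree
r = torch.exp(logp_theta - logp_old)    # same ratio, but now paired with w
```

The PPO ratio itself is identical in both cases; what the naive version drops is the weight `w`, which is exactly the correction for the rollout/training mismatch.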
These metrics quantify the severity of off-policy drift.
Note on notation: Metrics use the per-token ratio $\rho_t = \frac{\pi_{\text{old}}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)}$ defined in the notation section above.
Direct KL estimator:

$$\widehat{D}_{\text{KL}}(\pi_{\text{rollout}} \,\|\, \pi_{\text{old}}) = -\frac{1}{|T|}\sum_{t \in T} \log \rho_t$$

K3 KL estimator (alternative formulation):

$$\widehat{D}_{\text{KL}}^{k3} = \frac{1}{|T|}\sum_{t \in T} \left[(\rho_t - 1) - \log \rho_t\right]$$

where $\rho_t = \frac{\pi_{\text{old}}(a_t|s_t)}{\pi_{\text{rollout}}(a_t|s_t)}$. The K3 estimator is non-negative per token and has lower variance when the KL is small.
Old policy perplexity:

$$\text{PPL}_{\text{old}} = \exp\left(-\frac{1}{|T|}\sum_{t \in T} \log \pi_{\text{old}}(a_t|s_t)\right)$$

Rollout policy perplexity:

$$\text{PPL}_{\text{rollout}} = \exp\left(-\frac{1}{|T|}\sum_{t \in T} \log \pi_{\text{rollout}}(a_t|s_t)\right)$$

PPL ratio (inverse of geometric mean IS weight):

$$\frac{\text{PPL}_{\text{old}}}{\text{PPL}_{\text{rollout}}} = \exp\left(-\frac{1}{|T|}\sum_{t \in T} \log \rho_t\right)$$

Interpretation: Values > 1 mean $\pi_{\text{old}}$ assigns lower probability to the sampled tokens than $\pi_{\text{rollout}}$ did.
Measures the second moment of the IS weight distribution.

Token-level:

$$\chi^2_{\text{token}} = \frac{1}{|T|}\sum_{t \in T} \rho_t^2 - 1$$

Sequence-level:

$$\chi^2_{\text{seq}} = \mathbb{E}\left[\left(\prod_{t \in T} \rho_t\right)^2\right] - 1$$
Interpretation:
- $\chi^2 = 0$: Policies are identical
- $\chi^2 > 0$: Higher values indicate more severe off-policy distribution shift
Implementation: compute_offpolicy_metrics() in rollout_corr_helper.py
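A sketch of these diagnostics from per-token log probs (hypothetical helper; the real metrics come from `compute_offpolicy_metrics()`):

```python
import torch

def offpolicy_diagnostics(logp_old, logp_rollout, mask):
    log_rho = logp_old - logp_rollout
    rho = torch.exp(log_rho)
    n = mask.sum()
    return {
        # Direct estimator of KL(rollout || old): E[-log rho].
        "kl": -(log_rho * mask).sum() / n,
        # K3 estimator: (rho - 1) - log rho, non-negative per token.
        "kl_k3": (((rho - 1) - log_rho) * mask).sum() / n,
        # PPL ratio = inverse geometric mean of the IS ratio.
        "ppl_ratio": torch.exp(-(log_rho * mask).sum() / n),
        # Token-level chi-squared: E[rho^2] - 1.
        "chi2": (rho.pow(2) * mask).sum() / n - 1,
    }
```

With identical policies every metric collapses to its baseline (0 for KL and $\chi^2$, 1 for the PPL ratio), so anything above baseline quantifies Drift 1 directly.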
| Method | Theory | Policies | PPO Clip | IS Correction | Correctness | Speed |
|---|---|---|---|---|---|---|
| `pg_is` | Off-policy REINFORCE | 2 (rollout, θ) | ❌ | ✅ Seq-level | ✅ Correct | Fast |
| `pg_rs` | Pure PG + Geo RS | 2 (rollout, θ) | ❌ | Rejection only | ✅ Correct | Fast |
| Naive LLM-RL | Incorrect PPO usage | 2 (old, θ) | ✅ | ❌ | ❌ Incorrect | Standard |
| `ppo_is_bypass` | PPO (rollout as prox) | 2 (rollout, θ) | ✅ | ❌ | ✅ Correct | Fast |
| `decoupled_token_is` | Decoupled PPO | 3 (rollout, old, θ) | ✅ | ✅ Token-level | ✅ Correct | Standard |
| `decoupled_seq_is` | Decoupled PPO | 3 (rollout, old, θ) | ✅ | ✅ Seq-level | ✅ Correct | Standard |
| `decoupled_seq_is_rs` | Decoupled PPO + RS | 3 (rollout, old, θ) | ✅ | ✅ + Rejection | ✅ Correct | Standard |
| `decoupled_geo_rs` | Decoupled PPO + Geo RS | 3 (rollout, old, θ) | ✅ | Rejection only | ✅ Correct | Standard |
Off-policy severity:
- Negligible (same checkpoint, minor differences): `ppo_is_bypass` uses $\pi_{\text{rollout}}$ as the proximal policy (mathematically correct); naive LLM-RL implementations use $\pi_{\text{old}}$ instead of $\pi_{\text{rollout}}$ (mathematically incorrect when $\pi_{\text{rollout}} \neq \pi_{\text{old}}$)
- Moderate (async workers, slight staleness): `decoupled_token_is` provides per-token IS correction with a separate proximal policy
- Severe (replay buffers, old data): `decoupled_seq_is` and `decoupled_seq_is_rs` provide sequence-level IS correction with optional rejection sampling
Algorithm properties:
- Batch size invariance: Decoupled mode with three policies (`decoupled_token_is`, `decoupled_seq_is`) achieves batch size invariance
- Computational efficiency: Bypass mode (`ppo_is_bypass`) skips the old_log_prob computation
- Pure policy gradient: `pg_is` implements off-policy REINFORCE without PPO clipping
Decoupled mode (computes old_log_prob separately):
- Implements full decoupled PPO with three policies (mathematically correct)
- Separately measures and corrects Drift 1 (rollout→old) and Drift 2 (old→current)
- Achieves batch size invariance and efficient stale data utilization
- Enables accurate off-policy metrics monitoring
Bypass mode (sets $\pi_{\text{old}} = \pi_{\text{rollout}}$):
- Uses $\pi_{\text{rollout}}$ as both behavior policy and proximal policy (mathematically correct)
- Computational efficiency: Skips the separate old_log_prob computation
- Does not achieve batch size invariance (the proximal policy depends on data collection)
- Rollout Correction Usage Guide - Practical configuration and troubleshooting
- Config: verl/trainer/config/algorithm.py
- IS/RS Helper: verl/trainer/ppo/rollout_corr_helper.py
- PPO Loss: verl/trainer/ppo/core_algos.py
- Tests: tests/trainer/ppo/test_rollout_corr.py
- Williams, R. J. (1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning, 8(3-4), 229-256. https://doi.org/10.1007/BF00992696
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
- Hilton, J., Cobbe, K., & Schulman, J. (2021). "Batch size-invariance for policy optimization." arXiv preprint arXiv:2110.00641. https://arxiv.org/abs/2110.00641
- Introduced decoupled PPO: separating proximal policy (for controlling policy update size) from behavior policy (for off-policy correction) to achieve batch size invariance
- Liu, J., Li, Y., et al. (2025). "When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch"