wangboxiong320
diff --git a/docs/advance/rollout_is_migration.md b/docs/advance/rollout_is_migration.md
Lines changed: 642 additions & 0 deletions
diff --git a/docs/examples/config.rst b/docs/examples/config.rst
Lines changed: 21 additions & 1 deletion
diff --git a/docs/index.rst b/docs/index.rst
Lines changed: 1 addition & 0 deletions
diff --git a/examples/rollout_importance_sampling/README.md b/examples/rollout_importance_sampling/README.md
Lines changed: 242 additions & 0 deletions
diff --git a/examples/rollout_importance_sampling/run_with_rollout_is.sh b/examples/rollout_importance_sampling/run_with_rollout_is.sh
Lines changed: 99 additions & 0 deletions
@@ -118,7 +118,13 @@ Actor/Rollout/Reference Policy
       clip_ratio: 0.2
       entropy_coeff: 0.0
       use_kl_loss: False # True for GRPO
-      tis_imp_ratio_cap: -1 # set to positive values for Truncated Importance Sampling (requires setting `rollout.calculate_log_probs` as True)
+      # Rollout Importance Sampling (corrects distribution mismatch between rollout and training)
+      rollout_is: False # Enable IS correction
+      rollout_is_threshold: null # Upper threshold for IS weights (null to disable)
+      rollout_is_threshold_lower: null # Lower threshold (null = auto 1/upper)
+      rollout_is_level: token # Aggregation: token/sequence/geometric
+      rollout_is_mode: truncate # Bounding: truncate/clip
+      rollout_is_veto_threshold: 1e-4 # Catastrophic outlier threshold
       use_torch_compile: True # False to disable torch compile
       kl_loss_coef: 0.001 # for grpo
       kl_loss_type: low_var_kl # for grpo
@@ -498,6 +504,13 @@ Algorithm
        kl_coef: 0.005
        horizon: 10000
        target_kl: 0.1
+     # Rollout Importance Sampling
+     rollout_is: False
+     rollout_is_threshold: null
+     rollout_is_threshold_lower: null
+     rollout_is_level: token
+     rollout_is_mode: truncate
+     rollout_is_veto_threshold: 1e-4
 
 - ``gamma``: discount factor
 - ``lam``: Trade-off between bias and variance in the GAE estimator
@@ -510,6 +523,13 @@ Algorithm
   - ``kl_coef``: The (initial) coefficient of in-reward kl_penalty. Default is 0.001.
   - ``type``: 'fixed' for FixedKLController and 'adaptive' for AdaptiveKLController.
   - ``horizon`` and ``target_kl``: See source code of AdaptiveKLController for details.
+- ``rollout_is``: Whether to apply the IS weights to the policy loss (True) or only compute mismatch metrics (False). Default is False.
+- ``rollout_is_threshold``: Upper threshold for IS weights. Set to ``null`` to disable rollout IS completely.
+- ``rollout_is_threshold_lower``: Lower threshold for IS weights. If ``null``, defaults to the reciprocal of the upper threshold (1/upper).
+- ``rollout_is_level``: Aggregation level: ``token`` (biased), ``sequence`` (unbiased), or ``geometric`` (experimental).
+- ``rollout_is_mode``: Bounding mode: ``truncate`` (cap at the upper threshold only) or ``clip`` (zero out weights outside the bounds).
+- ``rollout_is_veto_threshold``: Per-token veto threshold for catastrophic outliers. Default is 1e-4.
+
+Note: Rollout IS requires setting ``actor_rollout_ref.rollout.calculate_log_probs=True``.
 
 Trainer
 ~~~~~~~
 
@@ -121,6 +121,7 @@ verl is fast with:
    examples/sandbox_fusion_example
    advance/rollout_trace.rst
    advance/rollout_skip.rst
+   advance/rollout_is_migration.md
    advance/one_step_off
    advance/agent_loop
 
 
@@ -0,0 +1,242 @@
+# Rollout Importance Sampling (IS) Examples
+
+This directory contains examples and documentation for using Rollout Importance Sampling to correct distribution mismatch between rollout and training policies.
+
+**References:**
+- When Speed Kills Stability: https://yingru.notion.site/When-Speed-Kills-Stability-271211a558b7808d8b12d403fd15edda
+- Off-policy RL: https://fengyao.notion.site/off-policy-rl
+
+## Overview
+
+Rollout Importance Sampling corrects for the distribution mismatch that arises when:
+1. **Rollout generation** uses one policy (e.g., vLLM with BFloat16)
+2. **Training** uses another policy (e.g., FSDP with FP32)
+
+Left uncorrected, this mismatch leads to biased gradient estimates.
+
+## Quick Start
+
+### Basic Configuration
+
+```yaml
+algorithm:
+  # Main control: set threshold to enable (null = disabled)
+  rollout_is_threshold: 2.0
+  # Whether to apply weights to policy loss (true) or just compute metrics (false)
+  rollout_is: true
+  rollout_is_level: token
+  rollout_is_mode: truncate
+
+# IMPORTANT: Must enable log prob calculation
+actor_rollout_ref:
+  rollout:
+    calculate_log_probs: true
+```
+
+### Running the Example
+
+```bash
+# Basic example with token-level truncate
+bash examples/rollout_importance_sampling/run_with_rollout_is.sh
+```
+
+## Configuration Options
+
+### Aggregation Levels (`rollout_is_level`)
+
+| Level | Properties | Threshold Range |
+|-------|-----------|-----------------|
+| **token** | Per-token | 1.5 - 5.0 |
+| **sequence** | Per-sequence | 2.0 - 10.0 |
+| **geometric** | Geometric mean | 1.0002 - 1.001 |
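
A minimal sketch of how the three levels could turn per-token log probabilities into IS weights, assuming the weight is the ratio π_training / π_rollout. The function name `aggregate_is_weights` and its arguments are hypothetical and are not taken from verl's `mismatch_helper.py`.

```python
import math


def aggregate_is_weights(train_logp, rollout_logp, level="token"):
    """Return one IS weight per token for a single response (illustrative only)."""
    # Per-token log ratio: log pi_training(token) - log pi_rollout(token).
    log_ratios = [t - r for t, r in zip(train_logp, rollout_logp)]
    if level == "token":
        # Each token keeps its own ratio.
        return [math.exp(lr) for lr in log_ratios]
    if level == "sequence":
        # Product of token ratios, computed as a sum in log space.
        weight = math.exp(sum(log_ratios))
    elif level == "geometric":
        # Geometric mean of token ratios: average the log ratios instead.
        weight = math.exp(sum(log_ratios) / len(log_ratios))
    else:
        raise ValueError(f"unknown level: {level}")
    # Sequence and geometric levels share one weight across all tokens.
    return [weight] * len(log_ratios)


train = [-1.0, -2.0, -0.5]      # made-up training-policy log probs
rollout = [-1.1, -2.1, -0.6]    # made-up rollout-policy log probs
token_w = aggregate_is_weights(train, rollout, "token")
seq_w = aggregate_is_weights(train, rollout, "sequence")
geo_w = aggregate_is_weights(train, rollout, "geometric")
```

Because the geometric mean averages the log ratios over the response length, it stays much closer to 1 than the sequence-level product, which is consistent with the far tighter threshold range listed for `geometric`.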
+
+### Bounding Modes (`rollout_is_mode`)
+
+| Mode | Behavior |
+|------|----------|
+| **truncate** | Cap weights at upper threshold only |
+| **clip** | Zero out weights outside [lower, upper] |
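
The two bounding modes can be sketched as follows. This is an illustration of the table above rather than verl's code; `bound_is_weights` is a hypothetical helper, and the auto-reciprocal lower bound follows the `rollout_is_threshold_lower` description.

```python
def bound_is_weights(weights, upper, lower=None, mode="truncate"):
    """Bound IS weights following the table above (illustrative only)."""
    if lower is None:
        # Auto-reciprocal default described for rollout_is_threshold_lower.
        lower = 1.0 / upper
    if mode == "truncate":
        # Cap at the upper threshold; small weights pass through unchanged.
        return [min(w, upper) for w in weights]
    if mode == "clip":
        # Zero out anything outside [lower, upper].
        return [w if lower <= w <= upper else 0.0 for w in weights]
    raise ValueError(f"unknown mode: {mode}")


weights = [0.1, 1.0, 3.0]
truncated = bound_is_weights(weights, upper=2.0, mode="truncate")
clipped = bound_is_weights(weights, upper=2.0, mode="clip")
print(truncated)  # [0.1, 1.0, 2.0]
print(clipped)    # [0.0, 1.0, 0.0]
```

Note that `truncate` keeps every sample and only limits how large a weight can get, while `clip` discards samples on both sides of the interval.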
+
+### Key Parameters
+
+- `rollout_is_threshold`: Upper threshold for IS weights (null = disabled, float = enabled). **Main on/off switch.**
+- `rollout_is`: Whether to apply weights to loss (true) or just compute metrics (false). Default: false.
+- `rollout_is_threshold_lower`: Lower threshold (null = auto 1/upper)
+- `rollout_is_veto_threshold`: Catastrophic outlier threshold (default: 1e-4)
+
+## Configuration Examples
+
+### Example 1: Full IS Correction (Apply Weights)
+
+```yaml
+algorithm:
+  rollout_is_threshold: 2.0
+  rollout_is: true  # Apply to loss
+  rollout_is_level: token
+  rollout_is_mode: truncate
+  rollout_is_veto_threshold: 1e-4
+```
+
+### Example 2: Metrics Only (No Weight Application)
+
+```yaml
+algorithm:
+  rollout_is_threshold: 2.0
+  rollout_is: false  # Compute metrics only, don't apply to loss
+  rollout_is_level: token
+  rollout_is_mode: truncate
+```
+
+### Example 3: Geometric Mean with Clip
+
+```yaml
+algorithm:
+  rollout_is_threshold: 1.0002
+  rollout_is: true
+  rollout_is_threshold_lower: 0.9998
+  rollout_is_level: geometric
+  rollout_is_mode: clip
+  rollout_is_veto_threshold: 1e-4
+```
+
+### Example 4: Sequence-level with Truncate
+
+```yaml
+algorithm:
+  rollout_is_threshold: 5.0
+  rollout_is: true
+  rollout_is_threshold_lower: null  # Auto-reciprocal: 0.2
+  rollout_is_level: sequence
+  rollout_is_mode: truncate
+  rollout_is_veto_threshold: 1e-4
+```
+
+### Example 5: Asymmetric Thresholds
+
+```yaml
+algorithm:
+  rollout_is_threshold: 5.0
+  rollout_is: true
+  rollout_is_threshold_lower: 0.8
+  rollout_is_level: token
+  rollout_is_mode: clip
+```
+
+## Monitoring Metrics
+
+Key metrics to watch (all prefixed with `mismatch/` in logs):
+
+### Health Indicators
+- `rollout_is_mean`: Mean IS weight across sequences
+- `rollout_is_eff_sample_size`: Effective sample size after weighting
+- `rollout_is_veto_fraction`: Fraction of sequences vetoed
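
For intuition about `rollout_is_eff_sample_size`, here is a sketch that assumes the metric is the normalized Kish effective sample size, (Σw)² / (n · Σw²). The README does not state the exact formula, so treat both the formula and the helper name `normalized_ess` as assumptions.

```python
def normalized_ess(weights):
    """Normalized effective sample size in [0, 1] (assumed definition)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        # All weights are zero: no effective samples remain.
        return 0.0
    return (total * total) / (len(weights) * total_sq)


uniform = normalized_ess([1.0, 1.0, 1.0, 1.0])    # every sample counts equally
collapsed = normalized_ess([1.0, 0.0, 0.0, 0.0])  # one sample carries all weight
print(uniform)    # 1.0
print(collapsed)  # 0.25
```

Under this definition, a value near 1.0 means the weights are nearly uniform, and a low value means a few samples dominate the update.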
+
+### Distribution Metrics
+- `rollout_is_max`, `rollout_is_min`: Weight extremes
+- `rollout_is_std`: Standard deviation
+- `rollout_is_p50`, `rollout_is_p95`, `rollout_is_p99`: Percentiles
+
+### Diagnostic Metrics
+- `rollout_is_ratio_fraction_high`: Fraction exceeding upper threshold
+- `rollout_is_ratio_fraction_low`: Fraction below lower threshold
+- `rollout_is_catastrophic_token_fraction`: Catastrophic tokens detected
+
+### Mismatch Metrics (Training vs Rollout Policy)
+
+These metrics help diagnose the distribution mismatch between rollout and training policies:
+
+**Perplexity Metrics:**
+- `mismatch_training_ppl`: Perplexity of training policy
+- `mismatch_rollout_ppl`: Perplexity of rollout policy
+- `mismatch_ppl_ratio`: Ratio of training PPL to rollout PPL
+- `mismatch_log_ppl_diff`: Log perplexity difference
+
+**KL Divergence Metrics:**
+- `mismatch_kl`: KL divergence KL(π_rollout || π_training)
+- `mismatch_k3_kl`: K3 KL estimator
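
A rough sketch of how these quantities relate, computed over the tokens of one response. The formulas are the standard ones (perplexity as exp of the negative mean log probability, and the K3 estimator as (ratio - 1) - log(ratio)); the direction of `mismatch_log_ppl_diff` and the helper name `mismatch_metrics` are assumptions, not verl's documented behavior.

```python
import math


def mismatch_metrics(train_logp, rollout_logp):
    """Per-response mismatch metrics from token log probs (illustrative only)."""
    n = len(train_logp)
    # Perplexity = exp(-mean log prob) over the sampled tokens.
    train_ppl = math.exp(-sum(train_logp) / n)
    rollout_ppl = math.exp(-sum(rollout_logp) / n)
    # log(pi_training / pi_rollout) for each token sampled from the rollout policy.
    log_ratios = [t - r for t, r in zip(train_logp, rollout_logp)]
    # Direct estimator of KL(pi_rollout || pi_training) on rollout samples.
    kl = -sum(log_ratios) / n
    # K3 estimator: (ratio - 1) - log(ratio), non-negative for every token.
    k3 = sum(math.exp(lr) - 1.0 - lr for lr in log_ratios) / n
    return {
        "mismatch_training_ppl": train_ppl,
        "mismatch_rollout_ppl": rollout_ppl,
        "mismatch_ppl_ratio": train_ppl / rollout_ppl,
        # Assumed direction: log(training PPL) - log(rollout PPL).
        "mismatch_log_ppl_diff": math.log(train_ppl) - math.log(rollout_ppl),
        "mismatch_kl": kl,
        "mismatch_k3_kl": k3,
    }


same = mismatch_metrics([-1.0, -2.0], [-1.0, -2.0])
shifted = mismatch_metrics([-1.2, -2.3], [-1.0, -2.0])
```

When the two policies agree exactly, both KL estimates are 0 and the PPL ratio is 1; the K3 estimate stays non-negative even when individual log ratios are negative.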
+
+## Troubleshooting
+
+### Issue: High Variance in IS Weights
+
+**Symptoms**: `rollout_is_std` > 1.0, `rollout_is_eff_sample_size` < 0.3
+
+**Solutions**:
+1. Switch from `sequence` to `geometric` level
+2. Tighten thresholds
+3. Check whether the rollout and training policies have diverged too far (e.g., via `mismatch_kl`)
+
+### Issue: Too Many Sequences Vetoed
+
+**Symptoms**: `rollout_is_veto_fraction` > 0.1
+
+**Solutions**:
+1. Relax veto threshold: `rollout_is_veto_threshold: 1e-3`
+2. Check for numerical issues in log prob computation
+3. Verify rollout and training policies aren't completely different
+
+### Issue: Mean IS Weight Far from 1.0
+
+**Symptoms**: `rollout_is_mean` < 0.5 or > 2.0
+
+**Solutions**:
+1. Check that `calculate_log_probs=True` is set
+2. Verify rollout_log_probs are correctly passed
+3. Check for systematic bias in rollout vs training
+
+### Issue: Too Much Data Discarded (Clip Mode)
+
+**Symptoms**: `rollout_is_clipped_fraction` > 0.5
+
+**Solutions**:
+1. Widen thresholds
+2. Switch to `truncate` mode
+3. Use `geometric` level for better stability
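
The symptom thresholds in this section can be bundled into a small check. This is a hypothetical helper, not part of verl: it assumes the metrics arrive as a plain dict keyed by the names above (without the `mismatch/` prefix), and it treats the two high-variance symptoms as an "either/or" condition.

```python
def check_rollout_is_health(metrics):
    """Return warnings based on the symptom thresholds in this section."""
    warnings = []
    std = metrics.get("rollout_is_std", 0.0)
    ess = metrics.get("rollout_is_eff_sample_size", 1.0)
    if std > 1.0 or ess < 0.3:
        warnings.append("high variance in IS weights")
    if metrics.get("rollout_is_veto_fraction", 0.0) > 0.1:
        warnings.append("too many sequences vetoed")
    mean = metrics.get("rollout_is_mean", 1.0)
    if mean < 0.5 or mean > 2.0:
        warnings.append("mean IS weight far from 1.0")
    if metrics.get("rollout_is_clipped_fraction", 0.0) > 0.5:
        warnings.append("too much data discarded (clip mode)")
    return warnings


healthy = check_rollout_is_health(
    {"rollout_is_mean": 1.02, "rollout_is_std": 0.2, "rollout_is_eff_sample_size": 0.9}
)
unhealthy = check_rollout_is_health(
    {"rollout_is_mean": 2.5, "rollout_is_veto_fraction": 0.2}
)
print(healthy)    # []
print(unhealthy)  # ['too many sequences vetoed', 'mean IS weight far from 1.0']
```

Missing keys fall back to values that do not trigger a warning, so the check can run on a partial set of logged metrics.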
+
+## Performance Considerations
+
+### Memory Usage
+- Rollout IS adds minimal memory overhead (~1% of model memory)
+- Log-space computation prevents numerical overflow
+
+### Computational Cost
+- Token-level: ~1-2% overhead
+- Sequence-level: ~2-3% overhead
+- Geometric: ~2-3% overhead
+
+## Advanced Topics
+
+### Dual Thresholds
+
+Specify both upper and lower explicitly:
+
+```yaml
+rollout_is_threshold: 2.0      # Upper
+rollout_is_threshold_lower: 0.5  # Lower, set explicitly (here it happens to equal 1/2.0 = 0.5)
+```
+
+Or use auto-reciprocal:
+
+```yaml
+rollout_is_threshold: 2.0      # Upper = 2.0, Lower = 0.5 (auto)
+rollout_is_threshold_lower: null
+```
+
+### Veto Mechanism
+
+The veto mechanism zeros out entire sequences containing catastrophic outliers:
+
+- If any token has ratio < `rollout_is_veto_threshold`, the entire sequence is rejected
+- This prevents extreme outliers from dominating training
+- Default threshold: 1e-4 (a token whose IS ratio is off by a factor of 10,000)
+- Set to `null` to disable: `rollout_is_veto_threshold: null`
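
The veto rule described above can be sketched as follows, assuming the comparison is done on per-token log ratios log(π_training / π_rollout). The helper `apply_veto` is hypothetical and only mirrors the behavior listed in the bullets.

```python
import math


def apply_veto(token_log_ratios, weights, veto_threshold=1e-4):
    """Zero out a whole sequence if any token ratio is below the veto threshold."""
    if veto_threshold is None:
        # Veto disabled (rollout_is_veto_threshold: null).
        return list(weights)
    # Compare in log space so tiny ratios never need to be exponentiated.
    log_threshold = math.log(veto_threshold)
    if any(lr < log_threshold for lr in token_log_ratios):
        # One catastrophic token rejects the entire sequence.
        return [0.0] * len(weights)
    return list(weights)


ok = apply_veto([0.0, -1.0], [1.0, 0.4])
vetoed = apply_veto([0.0, math.log(1e-5)], [1.0, 1.0])
disabled = apply_veto([0.0, math.log(1e-5)], [1.0, 1.0], veto_threshold=None)
print(ok)        # [1.0, 0.4]
print(vetoed)    # [0.0, 0.0]
print(disabled)  # [1.0, 1.0]
```

The veto is independent of the upper and lower thresholds: a sequence can pass the bounding step and still be rejected here because of a single extreme token.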
+
+## Examples
+
+See the script in this directory:
+- `run_with_rollout_is.sh`: Basic example with token-level truncate mode
+
+## References
+
+- Implementation: `verl/trainer/ppo/mismatch_helper.py`
+- Core algorithm: `verl/trainer/ppo/core_algos.py`
+- Paper: "Your Efficient RL Framework Secretly Brings You Off-Policy RL Training"
@@ -0,0 +1,99 @@
+#!/usr/bin/env bash
+# Example: Basic PPO training with Rollout Importance Sampling
+# This demonstrates the standard setup for correcting distribution mismatch
+
+set -xeuo pipefail
+
+# ==============================================================================
+# Rollout Importance Sampling Configuration
+# ==============================================================================
+
+# Main control: Upper threshold for IS weights (null = disabled, float = enabled)
+rollout_is_threshold=2.0
+
+# Whether to apply IS weights to policy loss
+# true = apply weights to loss, false = compute metrics only
+rollout_is=true
+
+# Lower threshold (null = auto-reciprocal, i.e., 1/upper = 0.5)
+rollout_is_threshold_lower=null
+
+# Aggregation level: token | sequence | geometric (experimental)
+rollout_is_level=token
+
+# Bounding mode: truncate (cap upper) | clip (zero outside bounds)
+rollout_is_mode=truncate
+
+# Catastrophic outlier veto threshold
+rollout_is_veto_threshold=1e-4
+
+# ==============================================================================
+# Model and Data Configuration
+# ==============================================================================
+
+MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen2.5-7B"}
+TRAIN_FILE=${TRAIN_FILE:-"data/train.parquet"}
+TEST_FILE=${TEST_FILE:-"data/test.parquet"}
+
+max_prompt_length=512
+max_response_length=1024
+
+# ==============================================================================
+# Training Configuration
+# ==============================================================================
+
+train_batch_size=128
+ppo_mini_batch_size=32
+ppo_epochs=1
+learning_rate=5e-7
+
+# ==============================================================================
+# Algorithm Configuration
+# ==============================================================================
+
+adv_estimator=gae
+gamma=1.0
+lam=0.95
+
+# ==============================================================================
+# Launch Training
+# ==============================================================================
+
+python3 -m verl.trainer.main_ppo \
+    data.train_files="${TRAIN_FILE}" \
+    data.val_files="${TEST_FILE}" \
+    data.max_prompt_length=${max_prompt_length} \
+    data.max_response_length=${max_response_length} \
+    data.train_batch_size=${train_batch_size} \
+    algorithm.adv_estimator=${adv_estimator} \
+    algorithm.gamma=${gamma} \
+    algorithm.lam=${lam} \
+    algorithm.rollout_is=${rollout_is} \
+    algorithm.rollout_is_threshold=${rollout_is_threshold} \
+    algorithm.rollout_is_threshold_lower=${rollout_is_threshold_lower} \
+    algorithm.rollout_is_level=${rollout_is_level} \
+    algorithm.rollout_is_mode=${rollout_is_mode} \
+    algorithm.rollout_is_veto_threshold=${rollout_is_veto_threshold} \
+    actor_rollout_ref.model.path="${MODEL_PATH}" \
+    actor_rollout_ref.actor.optim.lr=${learning_rate} \
+    actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size} \
+    actor_rollout_ref.actor.ppo_epochs=${ppo_epochs} \
+    actor_rollout_ref.rollout.calculate_log_probs=True \
+    actor_rollout_ref.rollout.name=vllm \
+    trainer.logger='["console","wandb"]' \
+    trainer.project_name="rollout_is_example" \
+    trainer.experiment_name="basic_token_truncate" \
+    trainer.total_epochs=10
+
+echo "Training completed!"
+echo ""
+echo "Rollout IS Configuration:"
+echo "  - Threshold: ${rollout_is_threshold}"
+echo "  - Apply to loss: ${rollout_is}"
+echo "  - Level: ${rollout_is_level}"
+echo "  - Mode: ${rollout_is_mode}"
+echo ""
+echo "Monitor these key metrics in wandb:"
+echo "  - mismatch/rollout_is_mean (should be ~1.0)"
+echo "  - mismatch/rollout_is_eff_sample_size (should be >0.5)"
+echo "  - mismatch/rollout_is_veto_fraction (should be <0.1)"