SPEAR
is a curriculum-based self-imitation learning (SIL) framework for training agentic LLMs on long-horizon, sparse-reward tasks. It balances exploration and exploitation by first leveraging auxiliary tool-use rewards to encourage broad skill-level exploration, and later strengthening self-imitation to exploit successful trajectories from replayed experiences. This adaptive curriculum stabilizes training and improves efficiency while maintaining well-controlled entropy.
- Jan. 2026: SPEAR is accepted to ICLR 2026.
- Sep. 2025: 🔥🔥🔥 Code for SPEAR is released.
The core concept of our proposed SPEAR for training long-horizon LLM agents via group-based RL. Compared with vanilla GRPO-like algorithms, we introduce curriculum-based self-imitation learning with intrinsic reward shaping. Given the same data input, a group of trajectories is generated with multi-turn tool interactions and used for episode-level reward computation and advantage estimation. Valuable trajectories are then filtered into a replay buffer, where the stored past experiences guide the agent to explore effectively on sparsely rewarded tasks via self-imitation. The total training batch thus contains both on-policy data and off-policy data from the replay buffer.
Overview of SPEAR in terms of data flow. During each episode, the agent interacts with the environment to generate a set of trajectories. These trajectories are processed along two complementary paths. First, they are used for intrinsic reward shaping, advantage estimation, and on-policy updates, following a mechanism similar to vanilla GRPO. Second, they are selectively filtered and stored in a replay buffer, enabling off-policy updates through the proposed self-imitation scheme with advantage recalibration and regularization. This dual integration allows the agent to maximize the utility of rewarding past experiences, thereby expanding the exploration space effectively, while simultaneously mitigating persistent over-uncertainty in decision-making under shifting distributions of external feedback. As a result, SPEAR achieves a stable balance between exploration and exploitation through self-guided policy adaptation.
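The curriculum reduces to two coefficient schedules: the auxiliary tool-call reward is annealed toward zero while the self-imitation (replay) loss weight ramps up, shifting the agent from exploration to exploitation. A minimal sketch, assuming half-cosine schedules over `max_toolcall_steps` and `max_replay_loss_steps` (the exact scheduler shapes in the released code may differ):

```python
import math

def cosine_decay(step: int, max_steps: int) -> float:
    """Anneal a coefficient from 1 to 0 over max_steps with a half-cosine."""
    t = min(step, max_steps) / max_steps
    return 0.5 * (1.0 + math.cos(math.pi * t))

def cosine_ramp(step: int, max_steps: int) -> float:
    """Increase a coefficient from 0 to 1 over max_steps with a half-cosine."""
    return 1.0 - cosine_decay(step, max_steps)

# Early on, the tool-call bonus drives skill-level exploration;
# later, self-imitation of replayed successes drives exploitation.
for step in (0, 50, 100, 200):
    toolcall_coef = cosine_decay(step, max_steps=100)     # cf. max_toolcall_steps
    replay_coef = 1.0 * cosine_ramp(step, max_steps=200)  # cf. replay_loss_coef * scheduler
    print(f"step={step:3d}  toolcall_coef={toolcall_coef:.2f}  replay_coef={replay_coef:.2f}")
```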
Results using Qwen2.5-1.5B-Instruct on ALFWorld and WebShop:
| Method | ALFWorld | WebShop(SR) |
|---|---|---|
| GRPO | 72.8 | 56.8 |
| +SPEAR(ours) | 88.9(+16.1) | 77.5(+20.7) |
| Dr.BoT(GRPO) | 79.1 | 62.9 |
| +SPEAR(ours) | 87.7(+8.6) | 76.8(+13.9) |
| GiGPO | 86.1 | 67.4 |
| +SPEAR(ours) | 91.2(+5.1) | 79.3(+11.8) |
| Dr.BoT(GiGPO) | 90.6 | 68.8 |
| +SPEAR(ours) | 93.2(+2.6) | 81.1(+12.3) |
Results using Qwen2.5-32B-Instruct and Qwen3-32B-Instruct on AIME24 and AIME25:
| Method | Model | AIME24 | AIME25 |
|---|---|---|---|
| PPO | Qwen2.5-32B-Instruct | - | 55.0 |
| GRPO | Qwen2.5-32B-Instruct | - | 60.0 |
| Dr.BoT(GRPO) | Qwen2.5-32B-Instruct | 64.7 | 54.0 |
| +SPEAR(ours) | Qwen2.5-32B-Instruct | 66.3(+1.6) | 60.1(+6.1) |
| Dr.BoT(GRPO) | Qwen3-32B-Instruct | 82.5 | 77.3 |
| +SPEAR(ours) | Qwen3-32B-Instruct | 85.6(+3.1) | 80.5(+3.2) |
actor_rollout_ref:
actor:
# Whether to enable self-imitation loss
enable_trajectory_replay: False
# Maximum number of trajectories stored in the self-imitation buffer
trajectory_buffer_size: 2048
# Only trajectories with an advantage larger than this threshold will be saved
trajectory_score_threshold: 1
# Only trajectories with a step delay less than this tolerance will be retained
trajectory_tolerate_steps: 10
# PPO loss coefficient for self-imitation learning
replay_loss_coef: 1
# Number of steps for increasing the PPO loss coefficient using a cosine scheduler
max_replay_loss_steps: 200

actor_rollout_ref:
actor:
# How the advantage of trajectories in the replay buffer is re-estimated
weight_decay_trajectory_replay: -1
# Number of trajectories' rewards used to calculate the 50th percentile baseline
baseline_buffer_size: 10240

- `weight_decay_trajectory_replay` controls how the advantage of trajectories in the replay buffer is recalibrated (see the sketch below).
- If `weight_decay_trajectory_replay` is -1, the 50th percentile baseline will be used to re-estimate the advantage.
- If `weight_decay_trajectory_replay` is in (0, 1], the advantage will decay as: $$\text{advantage} = \text{old advantage} \times \text{weight\_decay\_trajectory\_replay}$$
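As a rough illustration of the two recalibration modes, the sketch below keeps only high-advantage trajectories in a bounded buffer, discards stale ones beyond `trajectory_tolerate_steps`, and re-estimates advantages either against a 50th-percentile reward baseline or by geometric decay. The class and function names are illustrative, not the recipe's actual API.

```python
from collections import deque
from dataclasses import dataclass
import statistics

@dataclass
class StoredTrajectory:
    tokens: list        # rollout tokens (placeholder)
    reward: float       # episode-level reward
    advantage: float    # advantage at storage time
    step: int           # training step when the trajectory was collected

def maybe_store(buffer: deque, traj: StoredTrajectory, score_threshold: float = 1.0) -> None:
    """Keep only trajectories whose advantage exceeds trajectory_score_threshold."""
    if traj.advantage > score_threshold:
        buffer.append(traj)  # a deque(maxlen=trajectory_buffer_size) evicts the oldest entries

def recalibrate_advantage(traj: StoredTrajectory, current_step: int,
                          recent_rewards: list[float],
                          weight_decay: float = -1.0,
                          tolerate_steps: int = 10) -> float | None:
    """Re-estimate a replayed trajectory's advantage before the off-policy SIL update."""
    if current_step - traj.step > tolerate_steps:
        return None  # too stale: drop instead of replaying
    if weight_decay == -1.0:
        # Mode 1: subtract the 50th-percentile (median) baseline of recently seen rewards
        # (recent_rewards stands in for the last baseline_buffer_size rewards).
        return traj.reward - statistics.median(recent_rewards)
    # Mode 2: geometrically decay the stored advantage.
    return traj.advantage * weight_decay
```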
actor_rollout_ref:
actor:
policy_loss:
# Loss mode for regularization. Options: (see https://arxiv.org/abs/2505.22617)
# - vanilla
# - clip-cov, default = clip-cov for Dr.BoT
# - kl-cov
# - gpg
loss_mode: "vanilla"
# ================== Hyperparameters for On-policy RL Loss ==================
# Ratio of tokens to be clipped for clip-cov loss
clip_cov_ratio: 0.02
# Lower bound for clip-cov loss
clip_cov_lb: 1.0
# Upper bound for clip-cov loss
clip_cov_ub: 40.0
# Ratio of tokens to apply KL penalty for kl-cov loss
kl_cov_ratio: 0.02
# ================== Hyperparameters for SIL Loss ==========================
# [Replay Only] Ratio of tokens to be clipped for clip-cov loss
clip_cov_ratio_replay: 0.02
# [Replay Only] Lower bound for clip-cov loss
clip_cov_lb_replay: 1.0
# [Replay Only] Upper bound for clip-cov loss
clip_cov_ub_replay: 40.0
# [Replay Only] Ratio of tokens to apply KL penalty for kl-cov loss
kl_cov_ratio_replay: 0.02

algorithm:
# Tool-call reward mode:
# - "none" : Do not use tool-call reward
# - "constant" : Use a fixed tool-call reward coefficient (1) during training
# - "cosine" : Decay the tool-call reward coefficient with a cosine scheduler
use_toolcall_reward: "cosine"
# Maximum number of steps for the cosine scheduler
max_toolcall_steps: 100

Removing KL divergence to the reference model:
actor_rollout_ref:
actor:
# Whether to use KL loss against the reference model
use_kl_loss: False
# Coefficient for KL loss (set to 0.0 if disabled)
kl_loss_coef: 0.0
# KL loss type (e.g., "low_var_kl" for GRPO)
kl_loss_type: low_var_kl

Clip higher:
actor_rollout_ref:
actor:
# Lower bound of the clipping ratio
clip_ratio_low: 0.2
# Upper bound of the clipping ratio
clip_ratio_high: 0.28
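The decoupled clipping range above follows the DAPO-style surrogate, where the upper bound exceeds the lower bound so that low-probability tokens are not clipped too aggressively on the upside. A minimal PyTorch sketch of that objective (not the exact verl implementation):

```python
import torch

def asymmetric_clip_loss(log_prob, old_log_prob, advantages,
                         clip_ratio_low=0.2, clip_ratio_high=0.28):
    """PPO-style surrogate with a decoupled clipping range [1 - low, 1 + high]."""
    ratio = torch.exp(log_prob - old_log_prob)          # importance ratio per token
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_ratio_low, 1.0 + clip_ratio_high) * advantages
    return -torch.mean(torch.min(surr1, surr2))         # maximize the clipped surrogate
```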
Removing intra-group normalization:

algorithm:
# Whether to normalize advantages by group standard deviation in GRPO
norm_adv_by_std_in_grpo: False
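Concretely, disabling the normalization leaves only the group-mean baseline in the advantage. A small sketch (shapes are assumed for illustration):

```python
import torch

def group_relative_advantage(rewards: torch.Tensor, norm_by_std: bool = False,
                             eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) episode rewards for one prompt's rollout group."""
    centered = rewards - rewards.mean()
    if norm_by_std:
        return centered / (rewards.std() + eps)  # vanilla GRPO
    return centered                              # no intra-group std scaling
```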
Removing length bias:

actor_rollout_ref:
actor:
# Aggregation mode for loss:
# - "token-mean" (DAPO)
# - "seq-mean-token-sum"
# - "seq-mean-token-mean"
# - "seq-mean-token-sum-norm" (Dr.GRPO)
loss_agg_mode: "seq-mean-token-sum-norm"Filtering low-quality samples:
Filtering low-quality samples:

algorithm:
# Filter out overlong responses, default = True for Dr.BoT
filter_overlong_responses: True
# Filter out incomplete responses (void-turn), default = True for Dr.BoT
filter_incomplete_responses: True
# Filter out repetitive responses, default = True for Dr.BoT
filter_repetitive_responses: True
# Filter out unreadable responses, default = True for Dr.BoT
filter_unreadable_responses: True
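Purely for illustration, a repetitive-response filter can be approximated with an n-gram uniqueness check; the actual heuristics and thresholds used by Dr.BoT are defined in the recipe code and may differ:

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of repeated word n-grams; values near 1 indicate degenerate loops."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def is_low_quality(response: str, max_words: int = 4096, rep_threshold: float = 0.3) -> bool:
    """Example gate combining an overlong check (in words, for simplicity) and a repetition check."""
    return len(response.split()) > max_words or repetition_ratio(response) > rep_threshold
```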
Filtering low-variance groups:

actor_rollout_ref:
rollout:
# Rollout filtering ratio by standard deviation. We use 0.75 in Dr.BoT
rollout_filter_ratio: 0.75
# Rollout filter type: "std" (standard deviation)
rollout_filter_type: std
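Groups whose rewards have (near-)zero spread carry no learning signal under group-relative advantages, which is what the std-based filter removes. A minimal sketch of keeping the top `rollout_filter_ratio` of groups by reward standard deviation (the function name is illustrative):

```python
import torch

def filter_groups_by_std(group_rewards: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """group_rewards: (num_groups, group_size). Returns indices of the groups to keep."""
    stds = group_rewards.std(dim=-1)                         # per-group reward spread
    num_keep = max(1, int(keep_ratio * group_rewards.shape[0]))
    return torch.topk(stds, num_keep).indices                # most informative groups
```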
We follow the installation instructions in the verl documentation to install the necessary environment.

Install CUDA>=12.4:
# change to any directory you like; installing inside the verl source code directory is not recommended
mkdir tmp
cd tmp
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cuda-toolkit-12-4
update-alternatives --set cuda /usr/local/cuda-12.4

Install cuDNN>=9.8.0:
# change to any directory you like; installing inside the verl source code directory is not recommended
mkdir tmp
cd tmp
wget https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cudnn-cuda-12

Install NVIDIA Apex. You can increase MAX_JOBS to speed up the build, but do not set it too high or the build may run out of memory:
# change to any directory you like; installing inside the verl source code directory is not recommended
mkdir tmp
cd tmp
git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
MAX_JOBS=32 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Create a new environment:
conda create -n verl python==3.10 -y
conda activate verl

Then, execute the installation script provided in verl:
cd repo_root/verl
USE_MEGATRON=0 USE_SGLANG=0 bash scripts/install_vllm_sglang_mcore.sh

cd repo_root/verl
pip install --no-deps -e .

- Preparing data:
python3 recipe/spear/sft_preprocess.py

- Getting the cold-start model:
bash recipe/spear/run_qwen2-32b_sft.sh

- Converting to HuggingFace format:
python -m verl.model_merger merge --backend fsdp \
--local_dir <SFT_SAVE_PATH>/global_step_372/actor \
    --target_dir <SFT_SAVE_PATH>/global_step_372_merge

Training with GRPO baseline:
bash recipe/spear/run_qwen2-32b.sh

Training with Dr.BoT:
bash recipe/spear/run_qwen2-32b_drbot.sh

Training with SPEAR:
bash recipe/spear/run_qwen2-32b_spear.sh

We follow the installation instructions in the verl-agent documentation to install the necessary environment.
Extract the environments:
cd verl-agent/agent_system/
tar -xvf environments.tar
Due to potential package version conflicts, we recommend setting up separate conda environments for different agent environments.
Install verl and ALFWorld dependencies
## Install verl dependencies
conda create -n verl-agent-alfworld python==3.12 -y
conda activate verl-agent-alfworld
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -r requirements.txt
pip3 install -e .
pip3 install vllm==0.8.5
## Install ALFWorld dependencies
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld
pip install vllm==0.8.5

Download PDDL & Game files and the pre-trained MaskRCNN detector (will be stored in ~/.cache/alfworld/):
alfworld-download -f

Use `--extra` to download pre-trained checkpoints and seq2seq data.
Play a Textworld game:
alfworld-play-tw

WebShop requires Python <= 3.10, so begin by creating a new `verl-agent-webshop` environment:
conda create -n verl-agent-webshop python==3.10 -y
conda activate verl-agent-webshop

Install WebShop:
cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all

Note: If you encounter issues with gdown, you may need to visit https://drive.google.com/, get your Google Drive cookie, and paste it into ~/.cache/gdown/cookies.txt, or download the files manually.
After WebShop is installed, return to the root directory of the repository and install the verl package in verl-agent:
cd repo_root/verl-agent
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2 --no-deps
pip3 install -r requirements-vllm-0.8.2.txt --no-deps
# vllm 0.8.2 requires mistral_common[opencv]>=1.5.4, which is not installed.
# spacy 3.7.2 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.
# weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.
# The warnings can be safely ignored.

Installing mistral_common would update numpy and cause errors in the subsequent training, so we do not install it here.
Training with GRPO:
# ALFWorld
bash examples/grpo_trainer/run_alfworld.sh # GRPO baseline
bash examples/grpo_trainer/run_alfworld_drbot.sh # Dr.BoT
bash examples/grpo_trainer/run_alfworld_spear.sh # SPEAR
# WebShop
bash examples/grpo_trainer/run_webshop.sh # GRPO baseline
bash examples/grpo_trainer/run_webshop_drbot.sh # Dr.BoT
bash examples/grpo_trainer/run_webshop_spear.sh # SPEAR

Training with GiGPO:
# ALFWorld
bash examples/gigpo_trainer/run_alfworld.sh # GiGPO baseline
bash examples/gigpo_trainer/run_alfworld_drbot.sh # Dr.BoT
bash examples/gigpo_trainer/run_alfworld_spear.sh # SPEAR
# WebShop
bash examples/gigpo_trainer/run_webshop.sh # GiGPO baseline
bash examples/gigpo_trainer/run_webshop_drbot.sh # Dr.BoT
bash examples/gigpo_trainer/run_webshop_spear.sh # SPEAR

Our codebase is built upon verl and verl-agent. We greatly appreciate their awesome work and the dedication of the contributors who made these projects available to the community.
If you find this project useful, please consider the following citation:
@article{qin2025learn,
title={Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning},
author={Qin, Yulei and Tan, Xiaoyu and He, Zhengbao and Li, Gang and Lin, Haojia and Li, Zongyi and Xu, Zihan and Shi, Yuchen and Cai, Siqi and Rui, Renting and others},
journal={arXiv preprint arXiv:2509.22601},
year={2025}
}


