Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

2025.3.18 Update: We have added support for process rewards! You can now assign rewards for each tool call based on its effectiveness. To balance process rewards with outcome rewards, we implemented reward normalization inspired by PRIME.
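As a rough illustration of how per-tool-call (process) rewards could be put on the same scale as outcome rewards, the sketch below standardizes each kind of reward separately before combining them. This is an assumption made for illustration only: it is not Agent-R1's or PRIME's actual normalization, and the names `zscore` and `combine_rewards` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def zscore(values: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards to zero mean and roughly unit variance."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combine_rewards(
    process_rewards: List[List[float]],
    outcome_rewards: List[float],
) -> List[List[float]]:
    """Hypothetical combination scheme (not the framework's own).

    process_rewards: one list per trajectory, one reward per tool call.
    outcome_rewards: one final-answer reward per trajectory.
    Returns, per trajectory, the normalized tool-call rewards followed by
    the normalized outcome reward as the last element.
    """
    # Normalize process rewards over every tool call in the batch.
    flat = [r for traj in process_rewards for r in traj]
    flat_norm = iter(zscore(flat))
    # Normalize outcome rewards over trajectories.
    outcome_norm = zscore(outcome_rewards)

    combined = []
    for traj, outcome in zip(process_rewards, outcome_norm):
        steps = [next(flat_norm) for _ in traj]
        combined.append(steps + [outcome])
    return combined


batch = combine_rewards([[1.0, 0.0], [0.0]], [1.0, 0.0])
```

Standardizing the two reward streams separately keeps a dense stream of tool-call rewards from swamping the single outcome reward simply because there are more of them.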

Overview

Agent-R1 is an open-source framework designed to accelerate research and development at the intersection of reinforcement learning and LLM agents. It uses end-to-end reinforcement learning to train agents in specific environments. Developers need only define domain-specific tools and reward functions to extend Agent-R1 to their own use cases, with no complex workflow engineering required. We hope this modest contribution benefits the open-source community by making it easier for researchers and developers to create and explore agents in their own domains, collectively advancing the development of autonomous agents. For more details on the algorithm, see the algorithm doc.

Key Features

Multi-turn Tool Calling: End-to-end reinforcement learning on complete interaction trajectories, allowing agents to learn from sequences of actions
Custom Tools and Environments: Compatible with mainstream LLM tool calling formats, making it easy to extend with your own tools and scenarios
Multiple RL Algorithms: Supports diverse reinforcement learning approaches including PPO, GRPO, and REINFORCE++
Reasoning before Action: Joint optimization of reasoning and action strategies over entire trajectories
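To make the idea of a complete multi-turn interaction trajectory concrete, here is a minimal, self-contained sketch of a rollout loop in which a policy reasons, calls a tool, reads the observation, and finally answers. Everything in it is hypothetical and for illustration only: `toy_policy` stands in for an LLM, and `rollout`, `outcome_reward`, and the dictionary-based step format are not Agent-R1's actual interfaces.

```python
from typing import Callable, Dict, List


def calculator(expression: str) -> str:
    """Toy tool: adds two integers written as 'a+b'."""
    a, b = expression.split("+")
    return str(int(a) + int(b))


TOOLS: Dict[str, Callable[..., str]] = {"calculator": calculator}


def toy_policy(history: List[dict]) -> dict:
    """Stand-in for an LLM: call the calculator once, then answer."""
    observations = [s for s in history if s["type"] == "observation"]
    if not observations:
        return {
            "type": "tool_call",
            "thought": "I need to add the two numbers.",
            "tool": "calculator",
            "args": {"expression": "2+3"},
        }
    return {
        "type": "answer",
        "thought": "The tool returned the sum.",
        "content": observations[-1]["content"],
    }


def rollout(policy, tools, question: str, max_turns: int = 4) -> List[dict]:
    """Collect one trajectory: reasoning, tool calls, observations, answer."""
    history = [{"type": "question", "content": question}]
    for _ in range(max_turns):
        action = policy(history)
        history.append(action)
        if action["type"] == "answer":
            break
        # Execute the requested tool and feed the result back to the policy.
        result = tools[action["tool"]](**action["args"])
        history.append({"type": "observation", "content": result})
    return history


def outcome_reward(history: List[dict], gold: str) -> float:
    """Trajectory-level reward: 1.0 if the final answer matches, else 0.0."""
    final = history[-1]
    return 1.0 if final["type"] == "answer" and final["content"] == gold else 0.0


trajectory = rollout(toy_policy, TOOLS, "What is 2+3?")
reward = outcome_reward(trajectory, "5")
```

In an end-to-end setup, the reward computed on the whole trajectory is what drives the policy update, so both the reasoning steps and the tool calls that led to the answer are optimized together.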

Upcoming Features

Immediate Action Rewards: Per-action reward mechanisms to complement trajectory-level reinforcement
Expanded Model Support: Integration with more foundation models beyond the currently supported Qwen
Additional Use Cases: More example implementations across diverse scenarios and domains

Get Started

Results on HotpotQA

PPO

REINFORCE++

GRPO

We can see that the model (Qwen2.5-1.5B-Instruct) effectively learns to think and then invoke the tool over multiple rounds when faced with challenging multi-hop questions, ultimately achieving improved exact-match (EM) results. The effectiveness of the different reinforcement learning algorithms varies, but the general trend is the same.

Notably, our experiments reveal a striking correlation: EM scores, the number of tool calls (turns), and final response length all display consistent trends across training. This points to a new dimension of scaling, one that relates to the frequency of agent-environment interactions. As the agent learns to interact more effectively with its environment through multiple tool calls, performance improves in step, suggesting that the ability to engage in multiple rounds of environment interaction may be as crucial to agent performance as traditional scaling factors.

Extending Agent-R1 with Your Own Tools and Environments

Extending Agent-R1 is straightforward: create custom tools by extending the Tool base class, implement data preprocessing scripts to format your dataset, and define reward functions for task-specific evaluation. Register these components in their respective directories, and configure a training script to adapt Agent-R1 to your use case.
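The first of these steps, creating a tool by extending a base class, might look roughly like the sketch below. The `Tool` class here is a hypothetical stand-in defined so the example runs on its own; its attributes and method names (`name`, `description`, `parameters`, `execute`, `schema`) are assumptions, not Agent-R1's actual base class, so check the files listed below for the real interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Tool(ABC):
    """Hypothetical stand-in for a tool base class (not Agent-R1's own)."""

    name: str = ""
    description: str = ""
    parameters: Dict[str, Any] = {}

    @abstractmethod
    def execute(self, args: Dict[str, Any]) -> str:
        """Run the tool and return its result as text for the model."""

    def schema(self) -> Dict[str, Any]:
        """Describe the tool in the JSON-schema style of common tool-calling formats."""
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters,
            },
        }


class WordCountTool(Tool):
    """Example custom tool: counts whitespace-separated words."""

    name = "word_count"
    description = "Count the words in a piece of text."
    parameters = {
        "type": "object",
        "properties": {"text": {"type": "string", "description": "Text to count."}},
        "required": ["text"],
    }

    def execute(self, args: Dict[str, Any]) -> str:
        text = args.get("text")
        if not isinstance(text, str):
            # Return errors as text so the agent can observe and recover.
            return "Error: 'text' must be a string."
        return str(len(text.split()))


tool = WordCountTool()
result = tool.execute({"text": "agents learn from tool feedback"})
```

Returning error messages as ordinary observations, rather than raising exceptions, lets the agent see a malformed call during training and learn to correct it.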

For detailed implementation guidance, examine the existing code:

Tools: agent_r1/tool/tools/calculator_tool.py, search_tool.py
Data processing: examples/data_preprocess/hotpotqa.py
Reward functions: verl/utils/reward_score/qa_em_and_format.py

See the extending doc for details.
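For a sense of what a task-specific reward function can look like, the sketch below combines an exact-match (EM) check with a simple format check. The answer normalization follows the convention commonly used in QA evaluation (lowercasing, then stripping punctuation, articles, and extra whitespace). The `<answer>...</answer>` tag format, the `format_bonus` value, and the function names are assumptions for illustration and are not taken from qa_em_and_format.py.

```python
import re
import string
from typing import List

# Hypothetical output format: the final answer is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_and_format_reward(
    response: str, gold_answers: List[str], format_bonus: float = 0.1
) -> float:
    """Return 1.0 for a correct, well-formatted answer; a small bonus for
    correct format with a wrong answer; 0.0 if no answer tag is found."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0
    if exact_match(match.group(1), gold_answers):
        return 1.0
    return format_bonus


score = em_and_format_reward(
    "I searched twice. <answer>The Eiffel Tower.</answer>", ["Eiffel Tower"]
)
```

Giving a small reward for correct format alone is one common design choice: it helps the policy learn the output structure early, before it can answer reliably.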

Feedback

We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.

Contributors

Jie Ouyang*, Ruiran Yan*, Yucong Luo*, Zirui Liu, Shuo Yu

Acknowledgements

We extend our gratitude to DeepSeek for providing the DeepSeek-R1 model and inspiring ideas. We are also thankful to the veRL team for their robust infrastructure support. Additionally, we acknowledge the RAGEN team for their groundbreaking discoveries, which significantly influenced our early exploration. Lastly, we deeply appreciate the insightful discussions and contributions from Jie Ouyang, Ruiran Yan, and Yucong Luo.

Citation

@misc{Agent-R1,
  author       = {Jie Ouyang and Ruiran Yan and Yucong Luo and Zirui Liu and Shuo Yu},
  title        = {Training Powerful LLM Agents with End-to-End Reinforcement Learning},
  year         = {2025},
  organization = {GitHub},
  url          = {https://github.com/0russwest0/Agent-R1},
}
