Commit 6db61dc
[algo] feat: Add RateLimitedRewardLoopManager with three-layer rate limiting for API-based rewards (verl-project#4107)
### What does this PR do?
This PR implements a **three-layer rate limiting system** for API-based
reward functions in the reward loop manager, specifically designed for
LLM-as-judge scenarios. The new `RateLimitedRewardLoopManager` provides:
1. **Concurrency limiting** (`max_concurrent`) - Controls maximum
parallel requests
2. **Request rate limiting** (`max_rpm`) - Limits requests per minute
(RPM)
3. **Token rate limiting** (`max_tpm`) - Limits tokens per minute (TPM)
This is essential for integrating with external API-based reward models
(e.g., OpenAI, Anthropic) that enforce rate limits: throttling on the
client side helps avoid rejected requests and keeps training workflows
running smoothly.
The implementation uses a custom `AsyncTokenBucket` algorithm for smooth
rate limiting and includes timeout handling and error recovery
mechanisms.
**Related context:** This addresses the need for controlled API usage
during RL training when using external LLM-as-judge reward models.
### Checklist Before Starting
- [x] Search for similar PRs: [GitHub search for "rate
limit"](https://github.com/volcengine/verl/search?q=rate+limit&type=pullrequests)
- [x] Format the PR title as `[reward] feat: Add
RateLimitedRewardLoopManager with three-layer rate limiting`
### Test
**Testing approach:**
- Unit tests should cover the `AsyncTokenBucket` rate limiting logic
- Integration tests should validate rate limiting with mock API reward
functions
- Manual validation: Configure different rate limits and verify request
patterns comply with limits
### API and Usage Example
**Configuration:**
```yaml
# In config file or via command line overrides
reward_model:
  manager: rate_limited               # Use the new rate-limited manager
  max_concurrent: 10                  # Max parallel requests
  max_rpm: 100                        # Max 100 requests per minute
  max_tpm: 20000                      # Max 20,000 tokens per minute
  estimated_tokens_per_request: 2000  # Token estimate for TPM limiting
  timeout: 300.0                      # Timeout in seconds
```
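With the example values above, the TPM layer rather than the RPM layer sets the sustained request rate. The arithmetic can be sketched as follows; the helper name `effective_requests_per_minute` is hypothetical and is not part of the PR, and the calculation assumes every request is charged exactly `estimated_tokens_per_request` tokens.

```python
def effective_requests_per_minute(max_rpm, max_tpm, estimated_tokens_per_request):
    """Return the sustained request rate allowed by the tighter of the RPM and TPM limits."""
    # Each request is assumed to consume a fixed token estimate from the TPM budget.
    tpm_bound = max_tpm / estimated_tokens_per_request
    return min(max_rpm, tpm_bound)


rate = effective_requests_per_minute(
    max_rpm=100, max_tpm=20000, estimated_tokens_per_request=2000
)
print(rate)  # 10.0
```

In this configuration roughly 10 requests per minute are sustainable, so `max_rpm: 100` only matters if the token estimate is lowered or `max_tpm` is raised.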
### Design & Code Changes
**High-level design:**
1. **AsyncTokenBucket class**
([limited.py:30-63](verl/experimental/reward/reward_loop/limited.py#L30-L63))
   - Implements the token bucket algorithm for smooth rate limiting
   - Supports variable token consumption (useful for TPM limiting)
   - Safe for concurrent coroutines: state is guarded by an `asyncio.Lock`
   (note that `asyncio.Lock` alone does not provide OS-thread safety)
   - Auto-refills tokens based on the configured rate
2. **RateLimitedRewardLoopManager class**
([limited.py:66-235](verl/experimental/reward/reward_loop/limited.py#L66-L235))
   - **Class-level state**: Rate limiters are shared globally across all
   worker instances to ensure limits apply system-wide
   - **Three-layer limiting**:
     - Layer 1 (Concurrency): `asyncio.Semaphore` limits parallel requests
     - Layer 2 (RPM): `AsyncTokenBucket` limits requests per minute
     - Layer 3 (TPM): `AsyncTokenBucket` limits tokens per minute
   - **Initialization guard**: `_class_initialized` flag prevents duplicate
   initialization
   - **Error handling**: Timeout and exception handling with fallback
   rewards
   - **Logging**: Detailed configuration logging on initialization
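The token bucket idea behind the RPM and TPM layers can be illustrated with a minimal sketch. This is not the PR's `AsyncTokenBucket`: the class name `SketchTokenBucket`, the parameter `rate_per_sec`, the choice to start with a full bucket, and the "go into debt, then sleep" strategy for oversized requests are all assumptions made for illustration.

```python
import asyncio
import time
from typing import Optional


class SketchTokenBucket:
    """Illustrative token bucket (hypothetical; not the PR's AsyncTokenBucket)."""

    def __init__(self, rate_per_sec: float, max_tokens: Optional[float] = None):
        self.rate_per_sec = rate_per_sec
        # Capacity defaults to the rate when no explicit cap is given.
        self.max_tokens = rate_per_sec if max_tokens is None else max_tokens
        self.tokens = self.max_tokens  # assumption: the bucket starts full
        self.last_refill = time.monotonic()
        self._lock = asyncio.Lock()  # serializes coroutines on one event loop

    def _refill(self) -> None:
        # Credit tokens earned since the last refill, capped at capacity.
        now = time.monotonic()
        earned = (now - self.last_refill) * self.rate_per_sec
        self.tokens = min(self.max_tokens, self.tokens + earned)
        self.last_refill = now

    async def acquire(self, num_tokens: float = 1.0) -> None:
        async with self._lock:
            self._refill()
            self.tokens -= num_tokens
            if self.tokens < 0:
                # Pay off the deficit while holding the lock,
                # so later callers queue behind this one.
                await asyncio.sleep(-self.tokens / self.rate_per_sec)


async def main() -> float:
    # Created inside the coroutine so the lock belongs to the running loop.
    bucket = SketchTokenBucket(rate_per_sec=50.0, max_tokens=5.0)
    start = time.monotonic()
    for _ in range(10):
        await bucket.acquire()  # first 5 are immediate, the rest are paced
    return time.monotonic() - start


elapsed = asyncio.run(main())
print(f"10 acquisitions took {elapsed:.2f}s")
```

An RPM limiter would pass `rate_per_sec = max_rpm / 60` and acquire one token per request, while a TPM limiter would pass `max_tpm / 60` and acquire the per-request token estimate.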
**Specific changes:**
- Added `verl/experimental/reward/reward_loop/limited.py` (235 lines)
- Updated `verl/experimental/reward/reward_loop/__init__.py` to export
`RateLimitedRewardLoopManager`
- Added `CLAUDE.md` to `.gitignore`
**Key implementation details:**
- Rate limiters acquire in order: RPM → TPM → Concurrency, ensuring
smooth throttling
- Timeout handling returns reward=0.0 with metadata for debugging
- Supports both sync and async reward functions via
`inspect.iscoroutinefunction`
- Estimated tokens per request used for TPM limiting (configurable)
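The timeout fallback and the sync/async dispatch described above can be sketched as follows. This is a simplified illustration, not the PR's code: it covers only the concurrency layer (the RPM and TPM buckets are left out), the function name `call_reward_fn` and the result keys `score` and `error` are hypothetical, and running synchronous functions in the default executor is an assumption.

```python
import asyncio
import inspect


async def call_reward_fn(reward_fn, *args, semaphore, timeout, **kwargs):
    """Run reward_fn under a concurrency cap and a timeout; fall back to 0.0 on failure."""

    async def _invoke():
        if inspect.iscoroutinefunction(reward_fn):
            return await reward_fn(*args, **kwargs)
        # Assumption: sync functions run in a worker thread so the loop stays responsive.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, lambda: reward_fn(*args, **kwargs))

    async with semaphore:
        try:
            score = await asyncio.wait_for(_invoke(), timeout=timeout)
            return {"score": float(score), "error": None}
        except asyncio.TimeoutError:
            return {"score": 0.0, "error": "timeout"}
        except Exception as exc:  # keep training alive if the judge fails
            return {"score": 0.0, "error": repr(exc)}


def sync_judge(text):
    return 0.5


async def slow_judge(text):
    await asyncio.sleep(1.0)
    return 1.0


async def failing_judge(text):
    raise RuntimeError("boom")


async def main():
    sem = asyncio.Semaphore(2)  # at most two calls in flight
    return await asyncio.gather(
        call_reward_fn(sync_judge, "a", semaphore=sem, timeout=0.5),
        call_reward_fn(slow_judge, "b", semaphore=sem, timeout=0.05),
        call_reward_fn(failing_judge, "c", semaphore=sem, timeout=0.5),
    )


results = asyncio.run(main())
for r in results:
    print(r)
```

One caveat of this sketch: `asyncio.wait_for` cancels a timed-out coroutine, but a synchronous function already running in a worker thread keeps running until it returns.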
### Documentation
**Added comprehensive docstrings** to
[verl/experimental/reward/reward_loop/limited.py](verl/experimental/reward/reward_loop/limited.py):
1. **AsyncTokenBucket class** (lines 30-137)
- Detailed algorithm explanation (token bucket rate limiting)
- Complete Args and Attributes documentation with types
- Thread safety guarantees
- Usage examples for RPM and TPM limiting scenarios
- Step-by-step algorithm details
- Implementation notes about asyncio event loop usage
2. **RateLimitedRewardLoopManager class** (lines 140-219)
- Comprehensive overview of three-layer rate limiting architecture
- Detailed Rate Limiting Flow explanation
- Full configuration parameters with defaults and descriptions
- Global class-level state management documentation
- Example configuration with concrete values
- Thread safety notes for distributed training
- Cross-references to related classes and functions
**Documentation coverage:**
- All public methods have detailed docstrings with Args, Returns,
Raises, and Examples
- Class-level documentation explains design patterns and use cases
- Code examples demonstrate common usage patterns
- Algorithm details help developers understand implementation
### Test Coverage
**Created comprehensive test suite** with 35+ test cases covering both
unit and integration scenarios:
#### 1. **Unit Tests for AsyncTokenBucket**
([tests/experimental/reward/test_async_token_bucket_on_cpu.py](tests/experimental/reward/test_async_token_bucket_on_cpu.py))
19 test cases covering:
- ✅ Basic token acquisition and refill mechanism
- ✅ Waiting behavior when tokens are insufficient
- ✅ Max tokens capacity cap enforcement
- ✅ Fractional token consumption
- ✅ Concurrent acquires with race condition handling
- ✅ High rate limit scenarios (1000 tokens/sec)
- ✅ Rate limit accuracy verification (within 20% margin)
- ✅ Sequential vs concurrent acquisition patterns
- ✅ Large token acquisitions (5x capacity)
- ✅ Multiple wait cycles in refill loop
- ✅ Thread safety with locks under high concurrency
- ✅ Default parameters (max_tokens = rate_limit)
- ✅ Zero initial state and first-use behavior
- ✅ Rapid small acquisitions (50x 2-token requests)
#### 2. **Integration Tests with Mock API Functions**
([tests/experimental/reward/test_rate_limited_reward_manager_on_cpu.py](tests/experimental/reward/test_rate_limited_reward_manager_on_cpu.py))
16 test cases covering:
- ✅ Basic reward computation (sync and async)
- ✅ RPM rate limiting validation (60 RPM = 1 req/sec)
- ✅ TPM rate limiting validation (6000 TPM with 2000 tokens/req)
- ✅ Concurrency limiting (max 2 concurrent requests)
- ✅ Timeout handling for slow APIs (500ms timeout)
- ✅ Error handling for failing APIs with exception catching
- ✅ Dict vs float return format handling
- ✅ Combined multi-layer rate limits (all 3 layers active)
- ✅ Correct vs incorrect answer scoring
- ✅ High throughput scenarios (50+ concurrent requests)
- ✅ Class initialization idempotency
- ✅ Extra info propagation through reward pipeline
- ✅ Synchronous reward function compatibility
- ✅ Global rate limiter sharing across instances
**Mock API functions implemented:**
- `mock_sync_reward_function` - Synchronous API simulation
- `mock_async_reward_function` - Asynchronous API with call tracking
- `mock_slow_api_function` - Timeout testing (2s delay)
- `mock_failing_api_function` - Error handling testing
- `mock_dict_result_function` - Complex result format testing
- `MockAPICounter` - Global call tracking and rate measurement
**CI Integration:**
- Both test files follow `*_on_cpu.py` naming convention
- Automatically discovered by
[.github/workflows/cpu_unit_tests.yml](.github/workflows/cpu_unit_tests.yml)
- No workflow changes needed - tests run automatically on every PR
- Tests validated for syntax correctness and async compatibility
**Test execution:**
```bash
# Run AsyncTokenBucket unit tests
pytest tests/experimental/reward/test_async_token_bucket_on_cpu.py -v --asyncio-mode=auto
# Run integration tests with mock APIs
pytest tests/experimental/reward/test_rate_limited_reward_manager_on_cpu.py -v --asyncio-mode=auto
# Run all reward loop tests
pytest tests/experimental/reward/ -v --asyncio-mode=auto
```
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs). ✅
**Added comprehensive docstrings to AsyncTokenBucket and
RateLimitedRewardLoopManager classes**
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. ✅ **Added 35+ async unit tests and integration
tests with mock API functions, automatically run in CPU CI workflow**
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
---
**Notes for reviewers:**
- The class-level state pattern ensures rate limits are enforced
globally across all workers in distributed training
- The token bucket implementation provides smooth rate limiting, with
bursts bounded by the bucket capacity
- Timeout and error handling ensure training continues even if reward
computation fails
---------
Co-authored-by: Claude <noreply@anthropic.com>
1 parent c460541, commit 6db61dc
File tree: 8 files changed, +1322 -10 lines changed
- tests/experimental/reward
- verl
  - experimental/reward/reward_loop
  - trainer/ppo
  - workers/reward_manager