[RFC] Dynamic Load Balancing with VLLM Capacity Management #3432

@LiangTengwen

Description

Abstract

This RFC proposes a dynamic load balancing solution for VeRL to address the imbalance problem in multi-turn rollout. The solution introduces capacity-aware request distribution and dynamic VLLM instance capacity scaling based on GPU resource utilization.

Motivation

Problem Statement

The current VeRL implementation suffers from load-imbalance issues that degrade system performance and resource utilization:

  1. Resource underutilization and slow training: Some GPUs sit idle waiting for the longest rollout to finish, since rollout lengths vary. Multi-turn scenarios exacerbate this: tasks differ in both the number of turns and the decode length per turn.
  2. Static request distribution: Rollouts in a batch are distributed statically and evenly across instances, regardless of actual load.
  3. Concurrency bottlenecks: Load-balancing effectiveness is limited by fixed concurrency settings, a problem also raised in "server based multi-turn training load balance" #2520.

Solution Overview

The proposed solution introduces a three-tier architecture for dynamic load balancing:

  1. AgentLoopManager: Central coordinator for request distribution with capacity awareness
  2. AgentLoopWorker: Local request handler that processes requests and manages local capacity tracking
  3. LLMServerManager: Resource monitor and capacity updater that distributes requests to selected VLLM instances, supporting least-request priority and session affinity
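The selection policy named above, least-request priority with session affinity, can be sketched in a few lines. `SessionRouter` and its fields are illustrative names for this RFC, not identifiers from the VeRL codebase:

```python
class SessionRouter:
    """Illustrative sketch: least-request routing with session affinity."""

    def __init__(self, instance_ids):
        self.in_flight = {iid: 0 for iid in instance_ids}  # active requests per instance
        self.session_map = {}  # session_id -> pinned instance

    def pick_instance(self, session_id=None):
        # Session affinity: reuse the instance that already holds this
        # session's KV cache, so multi-turn context is not recomputed.
        if session_id in self.session_map:
            return self.session_map[session_id]
        # Least-request priority: pick the instance with fewest in-flight requests.
        best = min(self.in_flight, key=self.in_flight.get)
        if session_id is not None:
            self.session_map[session_id] = best
        return best

    def submit(self, session_id=None):
        iid = self.pick_instance(session_id)
        self.in_flight[iid] += 1
        return iid

    def complete(self, instance_id):
        self.in_flight[instance_id] -= 1
```

Pinning a session to one instance trades some balancing freedom for KV-cache reuse; the least-request rule only applies to sessions not yet pinned.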

Architecture

(architecture diagram)

Detailed Design

1. AgentLoopManager

Responsibilities:

  • Central request distribution coordinator with capacity constraints
  • Maintains global view of all AgentLoopWorker capacities
  • Supports request-level processing with iterative distribution

Key Features:

class AgentLoopManager:
    def __init__(self):
        self.workers = {}  # worker_id -> AgentLoopWorker
        self.capacities = {}  # worker_id -> current capacity
        self.pending_requests = []

    def distribute_requests(self, batch_requests):
        """Distribute requests considering worker capacities."""
        while batch_requests:
            dispatched = False
            for worker_id, worker in self.workers.items():
                if self.capacities[worker_id] > 0 and batch_requests:
                    request = batch_requests.pop(0)
                    worker.submit_request(request)
                    self.capacities[worker_id] -= 1
                    dispatched = True
            if not dispatched:
                # All workers are at capacity; hold the remainder until
                # the LLMServerManager refreshes capacities, instead of
                # spinning in a busy loop.
                self.pending_requests.extend(batch_requests)
                batch_requests.clear()

2. AgentLoopWorker

Responsibilities:

  • Local request processing
  • Capacity updates

Key Features:

class AgentLoopWorker:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.request_queue = []
        self.current_capacity = 0
    
    def process_request(self, request, target_instance):
        """Process a request on the specified VLLM instance."""
        if self.current_capacity > 0:
            self._send_to_vllm(request, target_instance)
            self.current_capacity -= 1
        else:
            # No local headroom: queue until the next capacity update.
            self.request_queue.append(request)

    def update_capacity(self, new_capacity):
        """Update worker capacity from LLMServerManager"""
        self.current_capacity = new_capacity

3. LLMServerManager

Responsibilities:

  • Periodic monitoring of VLLM instance GPU block availability
  • Concurrent request capacity calculation based on free GPU blocks
  • Request routing to available VLLM instances
  • Concurrent capacity updates to AgentLoopWorkers
  • Resource utilization reporting

Key Features:

import time

class LLMServerManager:
    def __init__(self, monitoring_interval=5):
        self.monitoring_interval = monitoring_interval
        self.vllm_instances = {}       # instance_id -> instance handle
        self.workers = {}              # instance_id -> responsible worker
        self.instance_capacities = {}  # instance_id -> concurrent request budget
        self.blocks_per_request = 1    # estimated KV-cache blocks one request needs

    def start_monitoring(self):
        """Start periodic capacity monitoring"""
        while True:
            self._update_all_capacities()
            time.sleep(self.monitoring_interval)

    def route_request(self, request):
        """Route request to best available VLLM instance"""
        best_instance = self._select_best_instance()
        if best_instance is not None and self.instance_capacities[best_instance] > 0:
            worker = self._get_worker_for_instance(best_instance)
            worker.process_request(request, best_instance)
            self.instance_capacities[best_instance] -= 1

    def _select_best_instance(self):
        """Select VLLM instance with highest available capacity"""
        if not self.instance_capacities:
            return None
        return max(self.instance_capacities, key=self.instance_capacities.get)

    def _get_worker_for_instance(self, instance_id):
        """Get worker responsible for specific VLLM instance"""
        return self.workers[instance_id]

    def _update_all_capacities(self):
        """Update concurrent request capacities for all VLLM instances"""
        for instance_id, instance in self.vllm_instances.items():
            free_blocks = self._get_free_gpu_blocks(instance)
            capacity = self._calculate_capacity(free_blocks)
            self.instance_capacities[instance_id] = capacity
            self._notify_workers(instance_id, capacity)

    def _calculate_capacity(self, free_blocks):
        """Estimate maximum concurrent request capacity from free GPU blocks"""
        # Each request consumes roughly blocks_per_request KV-cache blocks,
        # so the concurrency budget is free blocks divided by that cost.
        # The exact estimate depends on VLLM's block allocation strategy.
        return free_blocks // self.blocks_per_request
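As a rough illustration of the capacity formula, `blocks_per_request` can be derived from vLLM's KV-cache block size and an expected per-request sequence length. The concrete numbers below (16-token blocks, 4096-token requests) are assumptions for the sketch, not values from this RFC:

```python
import math

BLOCK_SIZE = 16          # tokens per KV-cache block (common vLLM default; assumption)
EXPECTED_SEQ_LEN = 4096  # expected prompt + decode length per request (assumption)

def blocks_per_request(seq_len=EXPECTED_SEQ_LEN, block_size=BLOCK_SIZE):
    """KV-cache blocks one request of the given length occupies."""
    return math.ceil(seq_len / block_size)

def capacity(free_blocks, seq_len=EXPECTED_SEQ_LEN):
    """Concurrent requests an instance can admit given its free blocks."""
    return free_blocks // blocks_per_request(seq_len)

# Under these assumptions, a 4096-token request occupies 256 blocks, so an
# instance reporting 10,240 free blocks can admit 10240 // 256 = 40 requests.
```

In practice the per-request block estimate should track the actual length distribution of rollouts; overestimating it leaves capacity unused, underestimating it risks preemption when the KV cache fills.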

Extension Features

Request Abort & Migration (Depends on PR #2200)

Once PR #2200 is merged, the following features can be implemented:

  1. Request Abort: Ability to cancel requests stuck in overloaded instances
  2. Request Migration: Move requests from overloaded to underutilized instances
  3. Dynamic Rebalancing: Real-time load redistribution based on current conditions
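The migration feature above could follow an abort-and-resubmit flow; the sketch below is purely illustrative, with `InstanceStub.abort` standing in for the abort API that PR #2200 would provide, and all class and field names hypothetical:

```python
class MigratableRequest:
    """Illustrative request record; field names are hypothetical."""
    def __init__(self, request_id, prompt, generated=""):
        self.request_id = request_id
        self.prompt = prompt
        self.generated = generated  # text decoded so far


class InstanceStub:
    """Hypothetical stand-in for a VLLM server endpoint."""
    def __init__(self):
        self.partial_outputs = {}  # request_id -> text decoded so far
        self.queue = []

    def abort(self, request_id):
        # Stand-in for the abort API from PR #2200: cancel the request
        # and return whatever was generated before cancellation.
        return self.partial_outputs.pop(request_id, "")

    def submit(self, req):
        self.queue.append(req)


def migrate_request(req, source, target):
    """Abort on the overloaded instance, resubmit on an underutilized one.

    The partial generation is carried over so the target resumes decoding
    instead of restarting from the original prompt.
    """
    partial = source.abort(req.request_id)
    resumed = MigratableRequest(req.request_id, req.prompt, partial)
    target.submit(resumed)
    return resumed
```

Carrying the partial output across instances avoids redoing decode work, though the target still pays a prefill cost to rebuild the KV cache for the prompt plus the carried-over text.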
