Supporting SGLang's native endpoints via HTTP Server

Introduction

The SGLang HTTP server exposes a REST API for managing and monitoring SGLang components running in a dynamo distributed environment. It uses dynamo's service discovery mechanism to find and communicate with SGLang workers across the cluster automatically.

How it works under the hood

Architecture Overview

The HTTP server (sgl_http_server.py) is built on FastAPI and integrates with dynamo's DistributedRuntime to discover and interact with SGLang components. It uses the following discovery flow:

Service Discovery: Queries dynamo's etcd instance to find components that expose specific endpoints
Dynamic Targeting: Automatically discovers all matching components across namespaces without requiring manual configuration
Direct Communication: Establishes direct connections to discovered component instances using dynamo's client infrastructure
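The flow above can be sketched as a discover-then-fan-out loop. This is a minimal illustration only: `StubInstance`, `discover`, and `fan_out` are hypothetical names, and the stubs stand in for dynamo's real discovery and client calls, which are not shown in this document.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubInstance:
    """Hypothetical stand-in for one discovered SGLang worker instance."""

    instance_id: int

    async def call(self, endpoint: str) -> dict:
        # A real client would send the request through dynamo's client
        # infrastructure; this stub only echoes what it was asked to do.
        await asyncio.sleep(0)
        return {"instance": self.instance_id, "endpoint": endpoint, "ok": True}


async def discover(endpoint: str) -> list[StubInstance]:
    """Service discovery and dynamic targeting, replaced by a fixed list."""
    return [StubInstance(1), StubInstance(2), StubInstance(3)]


async def fan_out(endpoint: str) -> list[dict]:
    """Direct communication: call every discovered instance concurrently."""
    instances = await discover(endpoint)
    return await asyncio.gather(*(inst.call(endpoint) for inst in instances))


if __name__ == "__main__":
    for reply in asyncio.run(fan_out("flush_cache")):
        print(reply)
```

Because discovery runs on each request in this sketch, instances that join or leave the cluster are picked up without restarting the HTTP server.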

Discovery Mechanism

The server uses dynamo's hierarchical service discovery structure:

DistributedRuntime: Maintains connections to etcd (service discovery) and NATS (messaging)
Namespace: Logical grouping of components (default: "dynamo")
Component: Individual SGLang workers or services
Endpoint: Specific functionality exposed by each component

The discovery process queries etcd with the prefix instances/ to find all registered components that expose the target endpoint. Each component is identified by its namespace, component name, and endpoint, so the server can apply a single operation to every matching instance, however many are running.
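To make the identification scheme concrete, the sketch below groups a handful of etcd keys by namespace and component. The key layout `instances/<namespace>/<component>/<endpoint>:<id>` is an assumption made for illustration (only the `instances/` prefix comes from this document), and the component names in the usage example are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class InstanceKey(NamedTuple):
    namespace: str
    component: str
    endpoint: str
    instance_id: str


def parse_key(key: str) -> InstanceKey:
    """Split an assumed 'instances/<ns>/<component>/<endpoint>:<id>' key."""
    prefix, namespace, component, tail = key.split("/", 3)
    if prefix != "instances":
        raise ValueError(f"unexpected prefix in {key!r}")
    endpoint, _, instance_id = tail.partition(":")
    return InstanceKey(namespace, component, endpoint, instance_id)


def components_exposing(keys, endpoint):
    """Group instance ids by (namespace, component) for one target endpoint."""
    found = defaultdict(list)
    for key in keys:
        parsed = parse_key(key)
        if parsed.endpoint == endpoint:
            found[(parsed.namespace, parsed.component)].append(parsed.instance_id)
    return dict(found)


if __name__ == "__main__":
    # Hypothetical keys; real component names and ids will differ.
    sample = [
        "instances/dynamo/worker/flush_cache:a1",
        "instances/dynamo/worker/flush_cache:a2",
        "instances/dynamo/worker/generate:a1",
        "instances/dynamo/decode/flush_cache:b1",
    ]
    print(components_exposing(sample, "flush_cache"))
```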

Supported Endpoints

Each of the endpoints below is invoked with a POST request:

curl -X POST http://<ip>:9001/<endpoint>
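The same request can be issued from Python with only the standard library. This is a sketch: `build_request` and `call_endpoint` are hypothetical helper names, and the response is returned as plain text because its format is not specified here.

```python
import urllib.request


def build_request(host: str, endpoint: str, port: int = 9001) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    url = f"http://{host}:{port}/{endpoint.lstrip('/')}"
    # An empty body with method="POST" mirrors `curl -X POST` without data.
    return urllib.request.Request(url, data=b"", method="POST")


def call_endpoint(host: str, endpoint: str, port: int = 9001, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    request = build_request(host, endpoint, port)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With a server running locally, `call_endpoint("127.0.0.1", "/flush_cache")` would send the request; `urlopen` raises `urllib.error.URLError` if the server is unreachable.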

`/flush_cache`

Flushes the KV cache across all SGLang components. Useful for resetting state after a warmup or a benchmarking run.

`/start_expert_distribution_record`

Begins recording expert distribution metrics across SGLang components.

`/stop_expert_distribution_record`

Stops the expert distribution recording process.

`/dump_expert_distribution_record`

Dumps the collected expert distribution data.
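The three expert distribution endpoints are naturally used in sequence around a workload. The sketch below wraps that sequence in a context manager, assuming the order start, stop, dump implied by the endpoint names. `record_expert_distribution` and the `post` callable are hypothetical; `post` is whatever function sends a POST to the given path on the server.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def record_expert_distribution(post: Callable[[str], object]) -> Iterator[None]:
    """Start recording on entry; stop and dump on exit, even if the workload fails."""
    post("/start_expert_distribution_record")
    try:
        yield
    finally:
        post("/stop_expert_distribution_record")
        post("/dump_expert_distribution_record")


if __name__ == "__main__":
    calls = []
    # list.append stands in for a real HTTP call so the order can be inspected.
    with record_expert_distribution(calls.append):
        pass  # run the workload to be measured here
    print(calls)
```

Stopping and dumping in a `finally` block is a design choice in this sketch: it keeps a failed workload from leaving recording switched on.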

Configuration

The server accepts the following command-line arguments:

--port: HTTP server port (default: 9001)
--ns/--namespace: Target dynamo namespace (default: "dynamo")
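For reference, flags with these names and defaults could be declared with `argparse` as follows. This is an illustrative sketch, not the actual parser in sgl_http_server.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="SGLang HTTP server (sketch)")
    parser.add_argument("--port", type=int, default=9001, help="HTTP server port")
    # Both spellings write to the same attribute, args.namespace.
    parser.add_argument(
        "--ns",
        "--namespace",
        dest="namespace",
        default="dynamo",
        help="Target dynamo namespace",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.port, args.namespace)
```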

Usage

Start the server:

python src/dynamo/sglang/utils/sgl_http_server.py --port 9001 --namespace dynamo

The server will automatically discover all SGLang components in the specified namespace and provide HTTP endpoints for managing them.
