This directory contains the core components that make up the Dynamo inference framework. Each component serves a specific role in the distributed LLM serving architecture, enabling high-throughput, low-latency inference across multiple nodes and GPUs.
Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and TensorRT-LLM), each with its own deployment configurations and capabilities:
- vLLM - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
- SGLang - Structured generation language framework with ZMQ-based communication
- TensorRT-LLM - NVIDIA's optimized LLM inference engine with TensorRT acceleration
Each engine provides launch scripts for different deployment patterns in its respective launch/ and deploy/ directories.
The backends directory contains inference engine integrations and implementations, with a key focus on:
- vLLM - Full-featured vLLM integration with disaggregated serving, KV-aware routing, and SLA-based planning
- SGLang - SGLang engine integration supporting disaggregated serving and KV-aware routing
- TensorRT-LLM - TensorRT-LLM integration with disaggregated serving capabilities
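Disaggregated serving, which all three backends support, splits a request into a prefill phase (processing the prompt) and a decode phase (generating tokens), run on separate workers with the KV cache transferred between them. The sketch below illustrates that flow in-process; every class and name is hypothetical, and real engines transfer KV blocks over mechanisms such as NIXL rather than returning them as Python objects:

```python
# Toy illustration of disaggregated serving: a prefill worker computes the
# KV cache for the prompt, then a decode worker continues generation from it.
# All names here are illustrative, not Dynamo or engine APIs.

from dataclasses import dataclass, field

@dataclass
class KVCache:
    # One entry per processed token; real caches hold per-layer key/value tensors.
    entries: list = field(default_factory=list)

class PrefillWorker:
    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # Process the whole prompt in one compute-bound batch.
        return KVCache(entries=[("kv", t) for t in prompt_tokens])

class DecodeWorker:
    def decode(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        # Generate tokens one at a time, appending to the transferred cache.
        out = []
        for _ in range(max_new_tokens):
            nxt = len(kv.entries)  # stand-in for a model forward pass
            out.append(nxt)
            kv.entries.append(("kv", nxt))
        return out

def serve(prompt_tokens, max_new_tokens=4):
    kv = PrefillWorker().prefill(prompt_tokens)       # compute-bound phase
    return DecodeWorker().decode(kv, max_new_tokens)  # bandwidth-bound phase

print(serve([101, 102, 103]))  # → [3, 4, 5, 6]
```

Separating the two phases lets each worker pool be sized and batched for its own bottleneck, which is why the KV transfer mechanism matters so much.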
The frontend component provides the HTTP API layer and request processing:
- OpenAI-compatible HTTP server - RESTful API endpoint for LLM inference requests
- Pre-processor - Handles request preprocessing and validation
- Router - Routes requests to appropriate workers based on load and KV cache state
- Auto-discovery - Automatically discovers and registers available workers
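KV-aware routing means preferring the worker that already holds a cached prefix of the incoming prompt, traded off against that worker's current load. A simplified scoring sketch of the idea follows; the function names, the weights, and the scoring formula are all illustrative, not Dynamo's actual router:

```python
# Illustrative KV-aware routing: pick the worker with the longest cached
# prefix of the request, penalized by its current load. Hypothetical sketch,
# not Dynamo's implementation.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def choose_worker(workers: dict, prompt: list[int], load_weight: float = 1.0) -> str:
    """workers maps worker_id -> {"cached": [tokens...], "load": queued requests}."""
    def score(wid):
        w = workers[wid]
        # Tokens we could skip recomputing, minus a penalty for queued work.
        return common_prefix_len(w["cached"], prompt) - load_weight * w["load"]
    return max(workers, key=score)

workers = {
    "w0": {"cached": [1, 2, 3, 4], "load": 1},  # long prefix hit, some load
    "w1": {"cached": [1, 2],       "load": 0},  # shorter hit, idle
    "w2": {"cached": [],           "load": 0},  # cold cache
}
print(choose_worker(workers, [1, 2, 3, 4, 5]))  # → w0
```

Raising `load_weight` shifts the policy from cache reuse toward load balancing; with a high enough penalty, an idle cold worker beats a busy one with a cached prefix.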
The planner component monitors system state and dynamically adjusts worker allocation:
- Dynamic scaling - Scales prefill/decode workers up and down based on metrics
- SLA-based planning - Ensures inference performance targets are met
- Load-based planning - Optimizes resource utilization based on demand
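A load-based scaling decision can be sketched as a simple feedback rule: choose the replica count that keeps per-worker load near a target, clamped to configured bounds. The rule and all names below are illustrative assumptions, not the planner's actual policy or metrics:

```python
# Illustrative load-based scaling rule: size the worker pool so each replica
# handles roughly target_per_worker queued requests. Hypothetical sketch,
# not Dynamo's planner.
import math

def desired_replicas(queued_requests: int,
                     target_per_worker: int = 8,
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    want = math.ceil(queued_requests / target_per_worker) if queued_requests else min_replicas
    # Clamp to the allowed range so the planner never over- or under-provisions.
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(40))   # → 5
print(desired_replicas(3))    # → 1
print(desired_replicas(200))  # → 16 (capped at max_replicas)
```

An SLA-based policy would replace the queue-depth target with latency or throughput targets, but the clamp-to-bounds structure stays the same.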
To get started with Dynamo components:
- Choose an inference engine from the supported backends
- Set up required services (etcd and NATS) using Docker Compose
- Install your chosen engine, either from Python wheels or by building a container image
- Run deployment scripts from the engine's launch directory
- Monitor performance using the metrics component
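For step 2, the etcd and NATS services are commonly brought up with Docker Compose. The repository ships its own compose file; the fragment below is only an illustrative sketch, and the image tags, environment variables, and flags are assumptions to adapt:

```yaml
# Illustrative compose fragment for Dynamo's service dependencies.
services:
  etcd:
    image: bitnami/etcd:latest
    environment:
      - ALLOW_NONE_AUTHENTICATION=yes   # dev-only: no auth
    ports:
      - "2379:2379"                     # etcd client port
  nats:
    image: nats:latest
    command: ["-js"]                    # enable JetStream
    ports:
      - "4222:4222"                     # NATS client port
```

etcd backs worker discovery and NATS carries inter-component messaging, so both must be reachable before launching workers.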
For detailed instructions, see the README files in each component directory and the main Dynamo documentation.