llm-d-inference-sim is a lightweight, configurable, and real-time simulator designed to mimic the behavior of vLLM without the need for GPUs or running actual heavy models. It operates as a fully OpenAI-compliant server, allowing developers to test clients, schedulers, and infrastructure using realistic request-response cycles, token streaming, and latency patterns.
Running full LLM inference requires significant GPU resources and introduces non-deterministic latency, making it difficult to isolate infrastructure bugs or iterate quickly on control-plane logic. This simulator decouples development from heavy inference, offering a controlled environment to:
- Accelerate Infrastructure Development: Test routing, scheduling, and KV cache locality logic without waiting for slow, expensive GPU operations.
- Ensure Deterministic Testing: Simulate precise token timing and latency to isolate performance regressions and bugs in a way that is impossible with non-deterministic real models.
- Validate Observability: Mirror vLLM’s Prometheus metrics to ensure monitoring and alerting systems are functioning correctly before deploying to production.
- Test Advanced Features: Safely develop complex logic such as LoRA adapter lifecycles (loading, unloading, and switching) and Disaggregated Prefill integrations.
The simulator is designed to act as a drop-in replacement for vLLM, sitting between your client/infrastructure and the void where the GPU usually resides. It processes requests through a configurable simulation engine that governs what is returned and when it is returned.
For detailed configuration definitions, see the Configuration Guide.
The simulator decides the content of the response based on two primary modes:
- Echo Mode (--mode echo):
Acts as a loopback. The response content mirrors the input (e.g., the last user message in a chat request). Useful for network throughput testing where content validity is irrelevant.
- Random Mode (--mode random):
The default mode. Generates synthetic responses based on requested parameters (like max_tokens). Utilizes probabilistic histograms to determine response length. Content is sourced from either a set of pre-defined sentences or a custom dataset (see below).
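For example, the mode is selected at startup; a minimal sketch, where the model name and port are placeholders:

```bash
# Echo mode: responses mirror the request content
./bin/llm-d-inference-sim --model my-test-model --mode echo --port 8000

# Random mode (the default): synthetic responses whose length is drawn
# from probabilistic histograms, bounded by parameters such as max_tokens
./bin/llm-d-inference-sim --model my-test-model --mode random --port 8000
```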
Natively supports both HTTP (OpenAI-compatible) and gRPC (vLLM-compatible) interfaces on the same port, allowing for versatile integration testing across different client architectures.
For detailed API definitions see the APIs Guide.
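As a quick smoke test of the HTTP side, any OpenAI-compatible client call should work; a minimal sketch, assuming the standard /v1/models listing endpoint is exposed and the simulator is running on port 8000:

```bash
# Query the OpenAI-compatible model listing over plain HTTP
curl http://localhost:8000/v1/models
```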
In Random Mode, the simulator can generate content in two ways:
- Predefined Text: By default, it constructs responses by concatenating random sentences from a built-in list until the target token length is met.
- Real Datasets: If a dataset is provided (via --dataset-path or --dataset-url), the simulator attempts to match the hash of the incoming prompt to a conversation history in the database. If a match is found, it returns the stored response. If no match is found, it falls back to a random response from the dataset or predefined text.
Supports downloading SQLite datasets directly from HuggingFace.
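A minimal sketch of both options; the local path and the HuggingFace URL below are hypothetical placeholders:

```bash
# Use a local SQLite conversation dataset
./bin/llm-d-inference-sim --model my-test-model --port 8000 \
  --dataset-path ./conversations.sqlite3

# Or let the simulator download a SQLite dataset first (hypothetical URL)
./bin/llm-d-inference-sim --model my-test-model --port 8000 \
  --dataset-url https://huggingface.co/datasets/example-org/example-dataset/resolve/main/conversations.sqlite3
```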
For details on the response generation algorithms, see the Response Generation Guide.
Unlike simple mock servers that just "sleep" for a fixed time, this simulator models the physics of LLM inference:
- Time to first token (TTFT): Simulates the prefill phase latency, including configurable standard deviation (jitter) for realism.
- Inter-token latency: Simulates the decode phase, adding a delay between every subsequent token generation.
- Load Simulation: The simulator automatically increases latency as the number of concurrent requests grows.
- Disaggregated Prefill (PD): Can simulate KV-cache transfer latency instead of standard TTFT when mimicking Prefill/Decode disaggregation architectures.
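A sketch of how these knobs might look on the command line; the flag names and millisecond units here are assumptions, so consult the Configuration Guide for the authoritative names:

```bash
# Roughly 200ms prefill (TTFT) and 10ms between decoded tokens
# (flag names and units are illustrative assumptions)
./bin/llm-d-inference-sim --model my-test-model --port 8000 \
  --time-to-first-token 200 \
  --inter-token-latency 10
```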
The simulator offers flexible tokenization to balance accuracy against performance, automatically selecting between two tokenization modes based on the provided --model name:
- HuggingFace Mode: Used for real models (e.g., meta-llama/Llama-3.1-8B-Instruct). Downloads actual tokenizers for exact accuracy.
- Simulated Mode: Used for dummy/non-existent model names. Uses a fast regex tokenizer for maximum performance with zero startup overhead.
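In other words, the mode is chosen purely by the name passed to --model:

```bash
# HuggingFace mode: a real model name, so the actual tokenizer is fetched
./bin/llm-d-inference-sim --model meta-llama/Llama-3.1-8B-Instruct --port 8000

# Simulated mode: a made-up model name, so the fast regex tokenizer is used
./bin/llm-d-inference-sim --model my-fake-model --port 8000
```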
For details on caching and performance tuning, see the Tokenization Guide.
Simulates the lifecycle (loading/unloading) of LoRA adapters without occupying actual memory. Reports LoRA-related Prometheus metrics.
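Assuming the simulator mirrors vLLM's dynamic LoRA endpoints (an assumption; the APIs Guide is authoritative), a lifecycle test might look like:

```bash
# Load an adapter (endpoint and payload assume vLLM's dynamic LoRA API)
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/adapters/my-adapter"}'

# ...send requests with "model": "my-adapter", then unload it...
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'
```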
Tracks simulated memory usage and publishes ZMQ events for cache block allocation and eviction.
The configuration for P/D disaggregation deployment can be found in manifests/disaggregation.
Can randomly inject specific errors (e.g., rate_limit, model_not_found) to test client resilience.
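A sketch, assuming error injection is driven by configuration flags; the flag names below are assumptions, so check the Configuration Guide:

```bash
# Inject rate_limit errors into ~10% of responses
# (flag names are illustrative assumptions)
./bin/llm-d-inference-sim --model my-test-model --port 8000 \
  --failure-injection-rate 10 \
  --failure-types rate_limit
```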
The simulator is designed to run either as a standalone binary or within a Kubernetes Pod (e.g., for testing with Kind).
The simulator supports a subset of standard vLLM Prometheus metrics.
For details, see the Metrics Guide.
If you do not wish to build the simulator from source, you can use the pre-built container images hosted on the GitHub Container Registry.
Image Repository: ghcr.io/llm-d/llm-d-inference-sim
- Pull the Image: You can pull the latest version of the simulator directly via Docker:

```bash
docker pull ghcr.io/llm-d/llm-d-inference-sim:v0.8.0
```

- Deployment via Kubernetes:
To deploy the simulator in a cluster, update your deployment manifest to point to the official image. An example configuration can be found in manifests/deployment.yaml.
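For instance, once the manifest references the image above:

```bash
# Deploy the example manifest and check that the pod comes up
kubectl apply -f manifests/deployment.yaml
kubectl get pods   # pod names/labels depend on your manifest
```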
To build a Docker image of the vLLM Simulator, run:

```bash
make image-build
```

Please note that the default image tag is ghcr.io/llm-d/llm-d-inference-sim:dev.
The following environment variables can be used to change the image tag:
| Variable | Description | Default Value |
|---|---|---|
| IMAGE_REGISTRY | Image registry | ghcr.io/llm-d |
| IMAGE_TAG_BASE | Image base name | $(IMAGE_REGISTRY)/llm-d-inference-sim |
| SIM_TAG | Image tag | dev |
| IMG | The full image specification | $(IMAGE_TAG_BASE):$(SIM_TAG) |
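For example, to build an image destined for your own registry (the registry name here is a placeholder):

```bash
# Produces quay.io/myorg/llm-d-inference-sim:v0.8.0 instead of the default tag
IMAGE_REGISTRY=quay.io/myorg SIM_TAG=v0.8.0 make image-build
```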
To build the vLLM simulator to run locally as an executable, run:

```bash
make build
```

To run the vLLM simulator in a standalone test environment with a real model:
- Run the UDS tokenizer (see details here):

  a. Clone the kv-cache project:

  ```bash
  git clone git@github.com:llm-d/llm-d-kv-cache.git
  ```

  b. Create and activate a Python virtual environment:

  ```bash
  python -m venv <virt env folder>
  source <virt env folder>/bin/activate
  ```

  c. Navigate to llm-d-kv-cache/services/uds_tokenizer:

  ```bash
  cd llm-d-kv-cache/services/uds_tokenizer
  ```

  d. Install the requirements:

  ```bash
  pip install -e .
  ```

  e. Run the UDS tokenizer:

  ```bash
  python ./run_grpc_server.py
  ```
- Start the simulator:

  ```bash
  ./bin/llm-d-inference-sim --model Qwen/Qwen2.5-0.5B-Instruct --port 8000
  ```

Note: If the model is not a real model, there is no need to run the UDS tokenizer.
Set these environment variables before running tests:
```bash
export TESTCONTAINERS_RYUK_DISABLED=true
export DOCKER_HOST="unix://${HOME}/.colima/default/docker.sock"
```

Note:

- TESTCONTAINERS_RYUK_DISABLED=true disables Testcontainers' resource cleanup daemon (recommended for local testing).
- DOCKER_HOST should point to your Docker socket:
| Provider | Docker Host Path |
|---|---|
| Colima (macOS) | unix://${HOME}/.colima/default/docker.sock |
| Docker Desktop (macOS/Windows) | unix:///var/run/docker.sock |
| Linux | unix:///var/run/docker.sock (default, may omit DOCKER_HOST) |
Verify socket accessibility with docker info. Adjust DOCKER_HOST based on your Docker provider.
Then run the tests:

```bash
make test
```

To run the vLLM simulator in a Kind cluster, run:

```bash
make dev-env-kind
```

Check the Makefile for environment variables to tune the process. For example:

```bash
KIND_CLUSTER_NAME=mytest UDS_TOKENIZER_TAG=v0.6.0 SIM_TAG=dev make dev-env-kind
```

To verify the deployment is available, run:
```bash
kubectl get deployment vllm-sim
kubectl get service vllm-sim
```

Test the API with curl:
```bash
curl -X POST http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

To delete the cluster, run:
```bash
make clean-dev-env-kind
```