Skip to content
Merged
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -237,4 +237,4 @@ We currently provide deployment examples for Kubernetes and SLURM.
- **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**

## SLURM
- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
12 changes: 12 additions & 0 deletions components/backends/vllm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,18 @@ See `args.py` for the full list of configuration options and their defaults.

The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.

### Hashing Consistency for KV Events

When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:

- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:

```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
```
See the high-level notes in [KV Cache Routing](../../../docs/architecture/kv_cache_routing.md) on deterministic event IDs.

## Request Migration

You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
Expand Down
4 changes: 4 additions & 0 deletions docs/architecture/kv_cache_routing.md
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,10 @@ The two types of events are:

The publisher can be initialized and used through C bindings or Python bindings.

### Deterministic Event IDs

For KV-aware routing to work across multiple workers and restarts, engines must emit deterministic block identifiers in KV events. Ensure all workers use identical engine versions/configuration so that block IDs for the same token content remain consistent. If your engine relies on Python's builtin `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect. The router recomputes local block hashes from tokens for matching, but parent/child links and removals depend on engine-provided IDs being stable.

### KVIndexer
The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.

Expand Down
Loading