Merged · Changes from 1 commit (113 commits in this pull request)
ac7e888
docs: fix helm chart urls (#2033)
nealvaidya Jul 21, 2025
76fd471
refactor: support for turning prefix cache off (#2034)
alec-flowers Jul 22, 2025
4449f3d
fix: never sleep on the eos (#2039)
alec-flowers Jul 22, 2025
20c5daf
fix: install torch distribution matching container cuda version (#2027)
ptarasiewiczNV Jul 22, 2025
e5a8628
feat: add a hierarchical Prometheus MetricsRegistry trait for Distrib…
keivenchang Jul 22, 2025
7882693
feat: use atomic transactions when creating etcd kv (#2044)
PeaBrane Jul 22, 2025
d65ce1b
chore(sglang): Move examples/sglang to components/backends/sglang (#2…
grahamking Jul 22, 2025
73505c7
fix: correct Nixl plugin paths in Dockerfile. (#2048)
karya0 Jul 22, 2025
c49a13e
docs: Cleanup index.rst (#2007)
atchernych Jul 22, 2025
9f2356c
chore: Remove unused portion of kv bindings test (#2052)
rmccorm4 Jul 22, 2025
f3e3d94
refactor: vLLM to new Python UX (#1983)
alec-flowers Jul 22, 2025
9cfaa7b
chore: Bump genai-perf to v0.0.15 (#2051)
ptarasiewiczNV Jul 22, 2025
22e6c96
chore: Change vllm K8s from dynamo-run to python -m dynamo.frontend (…
grahamking Jul 22, 2025
b127d95
feat: health check changes based on endpoint served (#1996)
nnshah1 Jul 23, 2025
1958b3a
build: Fixes for vLLM Blackwell Builds (#2020)
zaristei Jul 23, 2025
2c642fd
fix: vllm deployment examples (#2062)
biswapanda Jul 23, 2025
6a69ef4
fix: cryptic error message for empty messages list in /chat/completio…
heisenberglit Jul 23, 2025
c6f12f6
ci: Add RUN_SGLANG to CI variables (#1928)
pvijayakrish Jul 23, 2025
e0a5194
feat: Connect Library (#1478)
whoisj Jul 23, 2025
ffb5409
fix: endpoint changes should be prioritized over new requests in kv s…
PeaBrane Jul 23, 2025
eebc741
docs: Adjust the path to examples (#2056)
atchernych Jul 23, 2025
f9b1757
fix: Bring back ignore_eos/min_tokens support in trtllm component (#2…
rmccorm4 Jul 23, 2025
66b7d2c
fix: updates versions and adds ahashmap to BPE (#2072)
paulhendricks Jul 23, 2025
9bdceac
fix: github ci triggers (#2075)
biswapanda Jul 23, 2025
7a0013b
chore: update attributions for 0.3.2 release (#1837) (#2032)
nv-anants Jul 23, 2025
13560ab
feat: sglang examples launch and deploy (#2068)
biswapanda Jul 23, 2025
f3d784f
feat: query instance_id based on routing strategy (#1787)
biswapanda Jul 23, 2025
3c500ae
docs: Update docs for new UX (#2070)
grahamking Jul 23, 2025
19a77ae
chore(dynamo-run): Remove out=sglang|vllm|trtllm (#1920)
grahamking Jul 24, 2025
ee3a8e4
feat: add initial Grove support (#2012)
julienmancuso Jul 24, 2025
cde8db3
docs: Replace a sym link with and actual markdown link (#2074)
atchernych Jul 24, 2025
13d3cc1
feat: add nixl benchmark deployment instructions (#2060)
biswapanda Jul 24, 2025
2fc65ad
feat: dump radix tree as router events (#2057)
PeaBrane Jul 24, 2025
ba3ac23
test: add router e2e test with mockers to per-merge ci (#2073)
PeaBrane Jul 24, 2025
fe718fd
feat: deploy SLA profiler to k8s (#2030)
hhzhang16 Jul 24, 2025
a2874fd
feat: add possibility to use grove in dynamo graph helm chart (#1954)
julienmancuso Jul 24, 2025
f03f8be
docs: hello_world python binding example (#2083)
nealvaidya Jul 24, 2025
2bbbd44
chore: Remove unused trtllm requirements.txt (#2098)
rmccorm4 Jul 24, 2025
f0e382a
fix: Merge env vars correctly (#2096)
julienmancuso Jul 24, 2025
3094278
docs: Create a guide for writing dynamo deployments CR (#1999)
atchernych Jul 24, 2025
ff92053
docs: add NAMESPACE (#2105)
atchernych Jul 25, 2025
a2cb1c3
feat: update python packaging for new dynamo UX (#2054)
grahamking Jul 25, 2025
24cb926
docs: Clean index.rst (#2104)
atchernych Jul 25, 2025
412a12a
fix: rm enforce eager from vllm deploy - prefer perf over pod launch …
biswapanda Jul 25, 2025
2cd96ec
build: Add TensorRT-LLM to optional dependency and corresponding inst…
tanmayv25 Jul 25, 2025
384e449
fix: agg router test (#2123)
alec-flowers Jul 25, 2025
4dc529a
chore: remove vLLM v0 multimodal example (#2099)
GuanLuo Jul 25, 2025
4498a77
fix: move docker-compose.yml to deploy/, and update frontend port (#2…
keivenchang Jul 25, 2025
222245e
refactor: Move engine and publisher from dynamo.llm.tensorrt_llm to d…
tanmayv25 Jul 26, 2025
b8461b6
chore: updated health checks to use new probes (#2124)
nnshah1 Jul 27, 2025
e2a514b
fix: remove prints (#2142)
alec-flowers Jul 28, 2025
615580d
feat: Base metrics: add generic ingress handler metrics (#2090)
keivenchang Jul 28, 2025
e82bc4e
chore: update vLLM to 0.10.0 (#2114)
ptarasiewiczNV Jul 28, 2025
803bfa8
feat: proper local hashes for mockers + router watches endpoints (#2132)
PeaBrane Jul 28, 2025
0cb01b3
feat: updates to structured logging (#2061)
nnshah1 Jul 28, 2025
ca0035f
fix: copy whole workspace for pre-merge vllm tests (#2146)
nv-anants Jul 28, 2025
d23d48b
feat: Deploy SLA planner to Kubernetes (#2135)
hhzhang16 Jul 28, 2025
708d7c3
docs: add Llama4 eagle3 one model example and configs (#2087)
jhaotingc Jul 28, 2025
096d117
docs: update router docs (#2148)
PeaBrane Jul 28, 2025
1e6709d
feat: allow to override any podSpec property (#2116)
julienmancuso Jul 28, 2025
f809659
docs: hello world deploy example (#2102)
atchernych Jul 28, 2025
cfc6178
feat: add sglang disagg deployment examples (#2137)
biswapanda Jul 28, 2025
bbe8dbb
fix: remove containers from required property of extraPodSpec (#2153)
julienmancuso Jul 28, 2025
fdcf611
chore: Add Request Migration docs and minor enhancements (#2038)
kthui Jul 28, 2025
095ea3e
chore: updating and removing tests (#2130)
nnshah1 Jul 29, 2025
4747790
feat: deprecate sdk as dependency (#2149)
biswapanda Jul 29, 2025
3175b10
docs: Update to README.md (#2141)
athreesh Jul 29, 2025
7fbd43a
docs: Update dynamo_glossary.md (#2082)
athreesh Jul 29, 2025
358e908
docs: Adding document for running Dynamo on Azure Kubernetes Services…
saurabh-nvidia Jul 29, 2025
195c4c4
docs: Quickstart with new UX (#2005)
nealvaidya Jul 29, 2025
291df28
docs: add disagg example + explanation (#2086)
nealvaidya Jul 29, 2025
ca5b681
docs: add multinode example (#2155)
nealvaidya Jul 29, 2025
a8cb655
docs: update readme install instructions (#2170)
nv-anants Jul 29, 2025
5be23eb
Readmes + eks additions (#2157)
athreesh Jul 29, 2025
2befa38
feat: claim support for AL2023 x86_64 (#2150)
saturley-hall Jul 29, 2025
e542f00
chore: cleanup examples codeowners (#2171)
nealvaidya Jul 29, 2025
12a7b83
docs: Examples README/restructuring, framework READMEs, EKS examples …
athreesh Jul 29, 2025
8b0a035
docs: Update the operator docs (#2172)
atchernych Jul 29, 2025
8248a11
feat: gaie helm chart based example (#2168)
biswapanda Jul 29, 2025
157714a
chore: add instructions to modify SLA to profile_sla doc; update comp…
tedzhouhk Jul 29, 2025
30d4612
fix: install rdma libs in runtime image. (#2163)
karya0 Jul 29, 2025
da0c572
chore: update sgl version and fix h100 wideep example (#2169)
ishandhanani Jul 30, 2025
4c90b1b
chore: Version bump to 0.4.0 (#2179)
dmitry-tokarev-nv Jul 30, 2025
ee09de0
fix: link to point to bindings/python/README.md (#2186)
keivenchang Jul 30, 2025
dabfea3
chore: address QA broken links comments (#2184)
athreesh Jul 30, 2025
b69c507
fix: add better port logic (#2175)
alec-flowers Jul 30, 2025
7fc94da
fix(container): update sgl dockerfile install commands (#2194)
ishandhanani Jul 30, 2025
57482dc
docs: Bug 5424387 (#2196)
atchernych Jul 30, 2025
f3868b1
fix: support config without resource limit for profile sla script (#2…
tedzhouhk Jul 31, 2025
f8b0a5a
feat: Add trtllm deploy examples for k8s (#2133)
tanmayv25 Jul 31, 2025
62c7898
fix: add curl and jq for health checks (#2203)
biswapanda Jul 31, 2025
c546b63
fix: update SGLang version in instructions and Dockerfile to revert t…
ishandhanani Jul 31, 2025
97390ac
fix(k8s): sglang disagg now uses decode worker (#2206)
ishandhanani Jul 31, 2025
f10aab3
fix: Migrating trtllm examples from `1.0.0rc0` to `1.0.4rc4` (#2217)
KrishnanPrash Jul 31, 2025
3bf22bb
feat: reorganize sglang and add expert distribution endpoints (#2181)
ishandhanani Jul 31, 2025
bae25dc
feat: skip downloading model weights if using mocker (only tokenizer)…
PeaBrane Jul 31, 2025
cbc0e20
fix: fix endpoint run to return error DIS-325 (#2156)
keivenchang Jul 31, 2025
625578c
chore: update nixl version to 0.4.1 (#2221)
nv-anants Jul 31, 2025
7e3b3fa
fix: Add default configs in LLMAPI. Fixes OOM issues (#2198)
tanmayv25 Jul 31, 2025
f10e44c
fix: Integration tests fixes (#2161)
keivenchang Jul 31, 2025
f14f59c
chore: Remove multimodal readme. (#2212)
krishung5 Jul 31, 2025
dbd33df
fix: handle groveTerminationDelay and auto-detect grove installation …
julienmancuso Aug 1, 2025
66231cf
feat: reduce / revert routing overheads, do not consider output token…
PeaBrane Aug 1, 2025
8c75ed7
fix: frontend metrics to be renamed from nv_llm_http_service_* => dyn…
keivenchang Aug 1, 2025
1ad6abe
feat: add sgl deploy readme (#2238)
ishandhanani Aug 1, 2025
efd863d
fix: dynamo_component to be added in metric names (#2180)
keivenchang Aug 1, 2025
faafa5f
docs: add a docs/guides/metrics.md (#2160)
keivenchang Aug 1, 2025
cb1492a
rebase main
ziqifan617 Aug 1, 2025
ae51b3f
test: Request Migration Docs and E2E vLLM Tests (#2177)
kthui Aug 1, 2025
959f810
feat: sglang + gb200 (#2223)
ishandhanani Aug 1, 2025
fa492bb
docs: Dyn 591 (#2247)
atchernych Aug 2, 2025
357f34b
cleanup (#2250)
ziqifan617 Aug 2, 2025
2954005
Merge branch 'main' into ziqi/connector-250801
ziqifan617 Aug 2, 2025
test: Request Migration Docs and E2E vLLM Tests (#2177)
Signed-off-by: Jacky <[email protected]>
Co-authored-by: Neelay Shah <[email protected]>
kthui and nnshah1 authored Aug 1, 2025
commit ae51b3f43b63389bad4e5c67db093b9036ed14cd
10 changes: 2 additions & 8 deletions components/backends/llama_cpp/README.md
@@ -13,16 +13,10 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

The Frontend automatically tracks which Backends are having connectivity issues with it and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request can be migrated to another Backend if connectivity to the current Backend is lost.

For example,
```bash
python3 -m dynamo.llama_cpp ... --migration-limit=3
```
indicates that a request to this model may be migrated up to 3 times to another Backend before the request fails, should the Frontend detect a connectivity issue to the current Backend.

The migrated request continues responding to the original request, allowing a seamless transition between Backends and reducing the overall request failure rate at the Frontend.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
10 changes: 2 additions & 8 deletions components/backends/sglang/README.md
@@ -143,19 +143,13 @@ When using MoE models, you can also use our implementation of the native SGL

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

The Frontend automatically tracks which Backends are having connectivity issues with it and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request can be migrated to another Backend if connectivity to the current Backend is lost.

For example,
```bash
python3 -m dynamo.sglang ... --migration-limit=3
```
indicates that a request to this model may be migrated up to 3 times to another Backend before the request fails, should the Frontend detect a connectivity issue to the current Backend.

The migrated request continues responding to the original request, allowing a seamless transition between Backends and reducing the overall request failure rate at the Frontend.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.

## Advanced Examples

10 changes: 2 additions & 8 deletions components/backends/trtllm/README.md
@@ -263,19 +263,13 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

The Frontend automatically tracks which Backends are having connectivity issues with it and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request can be migrated to another Backend if connectivity to the current Backend is lost.

For example,
```bash
python3 -m dynamo.trtllm ... --migration-limit=3
```
indicates that a request to this model may be migrated up to 3 times to another Backend before the request fails, should the Frontend detect a connectivity issue to the current Backend.

The migrated request continues responding to the original request, allowing a seamless transition between Backends and reducing the overall request failure rate at the Frontend.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.

## Client

10 changes: 2 additions & 8 deletions components/backends/vllm/README.md
@@ -235,16 +235,10 @@ The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

The Frontend automatically tracks which Backends are having connectivity issues with it and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Backend, tells the Frontend how many times a request can be migrated to another Backend if connectivity to the current Backend is lost.

For example,
```bash
python3 -m dynamo.vllm ... --migration-limit=3
```
indicates that a request to this model may be migrated up to 3 times to another Backend before the request fails, should the Frontend detect a connectivity issue to the current Backend.

The migrated request continues responding to the original request, allowing a seamless transition between Backends and reducing the overall request failure rate at the Frontend.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
3 changes: 3 additions & 0 deletions components/backends/vllm/src/dynamo/vllm/handlers.py
@@ -106,6 +106,7 @@ def cleanup(self):

async def generate(self, request):
request_id = str(uuid.uuid4().hex)
logger.debug(f"New Request ID: {request_id}")

prompt = TokensPrompt(prompt_token_ids=request["token_ids"])

@@ -164,6 +164,8 @@ def __init__(self, component, engine, default_sampling_params):

async def generate(self, request):
request_id = request["request_id"]
logger.debug(f"New Prefill Request ID: {request_id}")

prompt = TokensPrompt(prompt_token_ids=request["token_ids"])
sampling_params = msgspec.convert(request["sampling_params"], SamplingParams)

114 changes: 114 additions & 0 deletions docs/architecture/request_migration.md
@@ -0,0 +1,114 @@
# Request Migration Architecture

This document describes how Dynamo implements request migration to handle worker failures gracefully during LLM text generation. Request migration allows in-progress requests to continue on different workers when the original worker becomes unavailable, providing fault tolerance and improved user experience.

## Overview

Request migration is implemented through a Migration operator that sits in the LLM processing pipeline between the Backend operator and the service backend. When a worker fails during request processing, the migration system preserves the partial generation state and recreates the request on a new worker to continue from where the previous worker left off.

## Architecture Components

### Migrator

The migration system is integrated into the LLM processing pipeline between the frontend preprocessing and the actual service backends. This positioning allows it to intercept all communication flows and manage failure scenarios transparently.

Key responsibilities:
- Intercepts all requests and responses flowing through the pipeline
- Detects worker failure scenarios through error pattern matching
- Manages retry logic with configurable migration limits
- Tracks partial response state for seamless continuation

### Migration Limit Configuration

Each model can be configured with a migration limit parameter that specifies the maximum number of times a request can be migrated to another worker:

- Default behavior: no migration allowed
- Can be set independently for different engine types
- Applicable to LLM worker nodes that perform inference
- Allows engines to override user-specified limits for compatibility
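
For example, the backend READMEs updated in this commit set the limit with a per-worker flag; a request to that worker may then be migrated up to three times before it fails:

```bash
# Flag shown in the backend READMEs updated by this commit; the other
# backends (sglang, trtllm, llama_cpp) accept the same flag.
python3 -m dynamo.vllm ... --migration-limit=3
```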

## Token State Tracking and Request Migration

The core of the migration system is the ability to preserve and continue partial generations through token state management. This ensures that when a worker fails mid-generation, the new worker can seamlessly continue from the exact point of failure.

### Token Accumulation Process

When a request is being processed and responses are flowing back from a worker, the migration system tracks every token that has been successfully generated:

1. **Initial Request State**: The system starts with the original preprocessed request containing the initial prompt tokens.

2. **Response Tracking**: As each response arrives from the worker, the migration system extracts the newly generated tokens and appends them to the request's token sequence, accumulating every token generated so far.

3. **Token Count Management**: The system also updates the remaining token budget to reflect the number of tokens already generated, ensuring that the total generation stays within the originally requested limits.
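
A minimal sketch of this bookkeeping, assuming a request shaped like the preprocessed requests elsewhere in this change (a `token_ids` list plus a max-token budget; the field names are illustrative):

```python
def accumulate(request: dict, response: dict) -> None:
    """Fold one worker response into the accumulated request state (sketch).

    After a mid-stream failure, `request` already carries the prompt plus
    every token generated so far, so it can be resent to a new worker as-is.
    """
    new_tokens = response.get("token_ids", [])
    # 1. Append the newly generated tokens to the running sequence.
    request["token_ids"].extend(new_tokens)
    # 2. Shrink the remaining token budget so total generation stays within
    #    the originally requested limit ("max_tokens" is a hypothetical name).
    request["max_tokens"] -= len(new_tokens)
```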

### Migration Trigger Scenarios

The migration system handles two distinct failure scenarios:

#### 1. New Request Migration (Initial Connection Failure)

**Scenario**: Worker is unreachable when creating the initial connection.

**Error Pattern**: The communication system reports that the chosen worker instance is unavailable.

**Migration Process**:
- Detects connection failure during initial stream setup
- Decrements migration retry count
- Attempts to create a new stream with the original request
- No partial state to preserve since generation hasn't started

#### 2. Ongoing Request Migration (Mid-Stream Disconnection)

**Scenario**: Connection lost during active generation after partial responses have been received.

**Error Pattern**: Stream termination detected before generation completion.

**Migration Process**:

1. **Failure Detection**: The system detects the stream disconnection through error monitoring.

2. **State Preservation**: At this point, the request's token sequence contains both the original prompt tokens and all successfully generated tokens from the failed worker.

3. **New Stream Creation**: A fresh stream is created with the accumulated request state, ensuring the new worker has complete context.

4. **Continuation**: The new worker receives the request with the full token context and continues generation from the exact point where the previous worker left off.
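
A sketch of the retry loop that ties both scenarios together, assuming an async stream API; `open_stream` and `WorkerDisconnected` are illustrative stand-ins, not the actual Dynamo internals, and `accumulate` is the bookkeeping sketch from the previous section:

```python
from typing import AsyncIterator, Callable

class WorkerDisconnected(Exception):
    """Stand-in for the failure patterns the Migration operator matches on."""

async def generate_with_migration(
    request: dict,
    open_stream: Callable[[dict], AsyncIterator[dict]],  # hypothetical stream factory
    migration_limit: int,
) -> AsyncIterator[dict]:
    retries_left = migration_limit
    while True:
        try:
            # Scenario 1: opening the stream fails if the chosen worker
            # is unreachable before any token is produced.
            stream = open_stream(request)
            async for response in stream:
                accumulate(request, response)  # preserve partial state
                yield response
            return  # stream ended normally: generation complete
        except WorkerDisconnected:
            if retries_left == 0:
                raise  # migration budget exhausted: fail the request
            retries_left -= 1
            # Scenario 2: `request` now holds the prompt plus all generated
            # tokens, so the next worker continues from the failure point.
```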

### Seamless Token Flow and Request State Evolution

From the client's perspective, the token stream appears continuous and uninterrupted. The client receives tokens from the first worker until failure occurs, then seamlessly continues receiving tokens from the backup worker without any indication of the underlying migration.

The request state evolves dynamically during processing. Initially, the request contains only the original prompt tokens. As generation proceeds, each successfully generated token is appended to the request's token sequence, creating a growing record of the complete conversation context.

When a migration occurs, this accumulated state is transferred to the new worker, which uses it to reconstruct the complete context. The new worker then continues generation as if it had been processing the request from the beginning, but starting from the current position in the sequence.

The migration is transparent because:
1. No tokens are lost or duplicated during the transition
2. The new worker has complete context via the accumulated token sequence
3. Generation continues from the exact failure point
4. Response streaming maintains consistent format and timing

This token accumulation mechanism ensures that migrations are truly seamless, preserving all computational work and maintaining generation quality across worker transitions.

## Benefits

1. **Fault Tolerance**: System continues operating during individual worker failures
2. **Resource Efficiency**: Partial generations are preserved rather than restarted
3. **Seamless User Experience**: Users experience no interruption during worker failures
4. **Configurable Behavior**: Migration limits allow tuning based on deployment requirements
5. **No Token Loss**: Complete preservation of generation state across migrations

## Design Considerations

The migration system is designed with several important architectural considerations:

**Engine Compatibility**: Different LLM engines may have varying capabilities for handling migrated requests. The system allows engines to override migration settings to ensure compatibility and correctness.

**Multi-Model Support**: Since a frontend may serve multiple models simultaneously, migration limits can be configured at the engine level, providing flexibility for different model types with varying reliability characteristics.

**State Management**: The system carefully tracks not only token sequences but also metadata such as remaining token budgets, stop conditions, and sampling parameters to ensure complete state preservation.

**Error Handling**: The migration system distinguishes between different types of failures and applies appropriate recovery strategies for each scenario.

## Operational Impact

Request migration fundamentally changes how the system handles failures, moving from a "fail-fast" approach to a "graceful degradation" model. This architectural shift enables higher availability and better resource utilization while maintaining the same external API contract for clients.
1 change: 1 addition & 0 deletions docs/guides/backend.md
@@ -71,6 +71,7 @@ The `model_type` can be:
- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name, the folder name, or the GGUF file name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `migration_limit`: Maximum number of times a request may be [migrated to another Instance](../architecture/request_migration.md). Defaults to 0.

See `components/backends` for full code examples.
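
A hedged sketch of how these options might be passed at registration time, assuming the `register_llm` call this guide is documenting; treat the exact signature and import path as assumptions and check the bindings for the real ones:

```python
from dynamo.llm import ModelType, register_llm  # import path assumed

# Illustrative only: keyword names mirror the option list above.
await register_llm(
    ModelType.Backend,          # model_type, as described above
    endpoint,                   # the dynamo endpoint this worker serves
    "Qwen/Qwen3-0.6B",          # model path or Hugging Face repo
    kv_cache_block_size=16,     # engine KV block size, in tokens
    migration_limit=3,          # allow up to 3 migrations per request
)
```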

12 changes: 3 additions & 9 deletions docs/guides/dynamo_run.md
@@ -211,19 +211,13 @@ The KV-aware routing arguments:

### Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the HTTP Server and the Worker Engine.
In a [Distributed System](#distributed-system), you can enable [request migration](../architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

The HTTP Server automatically tracks which Worker Engines are having connectivity issues with it and avoids routing new requests to them.

For ongoing requests, the `--migration-limit` flag, set on the Worker Engines, tells the HTTP Server how many times a request can be migrated to another Engine if connectivity to the current Engine is lost.

For example,
```bash
dynamo-run in=dyn://... out=vllm ... --migration-limit=3
dynamo-run in=dyn://... out=<engine> ... --migration-limit=3
```
indicates that a request to this model may be migrated up to 3 times to another Engine before the request fails, should the HTTP Server detect a connectivity issue to the current Engine.

The migrated request continues responding to the original request, allowing a seamless transition between Engines and reducing the overall request failure rate at the HTTP Server.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../architecture/request_migration.md) documentation for details on how this works.

## Full usage details

80 changes: 80 additions & 0 deletions tests/fault_tolerance/README.md
@@ -0,0 +1,80 @@
# Fault Tolerance Tests

This directory contains end-to-end tests for Dynamo's fault tolerance capabilities.

## Tests

### `test_request_migration.py`

Tests worker fault tolerance with migration support using the `test_request_migration_vllm` function. This test:

0. Downloads the DeepSeek-R1-Distill-Llama-8B model from HuggingFace if not already cached
1. Starts a Dynamo frontend using `python -m dynamo.frontend` with round-robin routing
2. Starts 2 workers sequentially using `python3 -m dynamo.vllm` with specific configuration:
- Model: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`
- `--enforce-eager`, `--gpu-memory-utilization 0.45`
- `--max-model-len 8192`, `--migration-limit 3`
3. Waits for both workers to be fully ready (looking for "Reading Events from" messages)
4. Sends a test request ("Who are you?", 100 tokens) to determine which worker handles requests
5. Determines primary/backup worker roles based on round-robin routing and log analysis
6. Sends a long completion request ("Tell me a long long long story about yourself?", 8000 tokens) in a separate thread
7. Waits 0.5 seconds, then kills the primary worker using SIGKILL process group termination
8. Verifies the request completes successfully despite the worker failure (with 240s timeout)
9. Checks that the frontend logs contain "Stream disconnected... recreating stream..." indicating migration occurred
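
The same failure can be reproduced by hand; a rough sketch follows (ports, flag names, and the request payload are assumptions, see the test source for the exact calls):

```bash
# Terminal 1: frontend (the test uses round-robin routing)
python -m dynamo.frontend --router-mode round-robin   # flag name assumed

# Terminals 2 and 3: two identical workers with migration enabled
python3 -m dynamo.vllm --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager --gpu-memory-utilization 0.45 \
    --max-model-len 8192 --migration-limit 3

# Terminal 4: start a long completion, then SIGKILL the busy worker's
# process group mid-stream; the request should still complete on the
# surviving worker, and the frontend log should show
# "Stream disconnected... recreating stream...".
curl localhost:8000/v1/chat/completions -d '{...}' &   # port and payload assumed
kill -9 -- -<primary_worker_pgid>
```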

## Prerequisites

- vLLM backend installed (`pip install ai-dynamo-vllm`)
- NATS and etcd services running (provided by `runtime_services` fixture)
- Access to DeepSeek-R1-Distill-Llama-8B model (automatically downloaded from HuggingFace)
- Sufficient GPU memory (test uses 0.45 GPU memory utilization)

## Running the Tests

To run the fault tolerance tests:

```bash
# Run all fault tolerance tests
pytest /workspace/tests/fault_tolerance

# Run specific test with verbose output
pytest /workspace/tests/fault_tolerance/test_request_migration.py::test_request_migration_vllm -v

# Run with specific markers
pytest -m "e2e and vllm" /workspace/tests/fault_tolerance

# Run with debug logging
pytest /workspace/tests/fault_tolerance/test_request_migration.py::test_request_migration_vllm -v -s
```

## Test Markers

- `@pytest.mark.e2e`: End-to-end test
- `@pytest.mark.vllm`: Requires vLLM backend
- `@pytest.mark.gpu_1`: Requires single GPU access
- `@pytest.mark.slow`: Known to be slow (due to model loading and inference)

## Environment Variables

- `DYN_LOG`: Set to `debug` or `trace` for verbose logging (automatically set to `debug` by worker processes)
- `CUDA_VISIBLE_DEVICES`: Control which GPUs are used for testing

## Expected Test Duration

The test typically takes 2-3 minutes to complete, including:
- Model download/loading time (if not cached) - can take 1-2 minutes for first run
- Worker startup and registration
- Request processing and response validation
- Worker failure simulation and migration
- Cleanup

## Troubleshooting

If tests fail:

1. Check that NATS and etcd services are running
2. Verify vLLM backend is properly installed
3. Ensure sufficient GPU memory is available (test requires ~45% GPU memory)
4. Check internet connectivity for model download from HuggingFace
5. Review test logs for specific error messages
6. Verify that the DeepSeek-R1-Distill-Llama-8B model can be accessed