feat: FT Request Cancellation feature and test for 0.5.0 #2500
Merged
11 commits
4fcb1c5 fix: Relay engine_ctx between streams (kthui)
1c8188a fix: Retain context_id between requests (kthui)
8f12b18 debug: Add debug print on received migration limit (kthui)
e75ca6d feat: FT Python Context and Unit Tests (#2677) (kthui)
b4603fa feat: vLLM abort on stream stop (#2717) (kthui)
2678c5b test: HTTP Request Cancellation E2E Testing (#2350) (kthui)
d5a5a5c docs: Request Cancellation (kthui)
29b3829 Pass context as kwargs (kthui)
a0047e2 Enhance stop stop_generating and kill methods logic (kthui)
8b06e6c docs: Remove Future Roadmap section (kthui)
ba47834 test: Combine NATS and ETCD start/stop into conftest.py (kthui)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| # Request Cancellation Architecture | ||
|
|
||
| This document describes how Dynamo cancels in-flight requests between Dynamo workers. Request cancellation lets in-flight requests terminate early, saving computational resources that would otherwise be spent on responses that are no longer needed. | ||
|
|
||
| ## AsyncEngineContext Trait | ||
|
|
||
| At the core of Dynamo's request cancellation system is the `AsyncEngineContext` trait. This trait is associated with every request stream and provides lifecycle management for async operations: stream identification, graceful shutdown, and immediate termination. | ||
|
|
||
| ### Key Methods | ||
|
|
||
| #### Identification | ||
| - **`id()`**: Returns the unique identifier for the stream. This ID is set by the user for request identification, and the same ID can be used for sub-requests to associate them with the original user request. | ||
|
|
||
| #### Status Checking | ||
| - **`is_stopped()`**: Returns `true` if graceful cancellation has been requested via `stop_generating()`. This represents a signal to the worker that the request has been cancelled and it should return early. | ||
| - **`is_killed()`**: Returns `true` if a hard stop has been issued via `kill()`. This typically indicates that the network connection between client and server has been cut or an immediate termination is required. | ||
|
|
||
| #### Async Status Monitoring | ||
| - **`stopped()`**: An async method that completes when the context becomes stopped. If already stopped, returns immediately. | ||
| - **`killed()`**: An async method that completes when the context becomes killed. If already killed, returns immediately. | ||
|
|
||
| #### Cancellation Control | ||
| - **`stop_generating()`**: The recommended method for cancelling a request. This informs the engine to stop producing results for the stream gracefully. This method is idempotent and does not invalidate results currently in the stream. | ||
| - **`stop()`**: Alias for `stop_generating()`. | ||
| - **`kill()`**: Extends `stop_generating()` but also indicates a preference to terminate without draining remaining items in the stream. This is implementation-specific and may not be supported by all engines. | ||
|
|
||
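The status, monitoring, and control methods listed above can be modelled in a few lines of Python. The following is a minimal sketch, not Dynamo's implementation: `ToyContext` and `demo` are hypothetical names, the two `asyncio.Event` objects stand in for whatever state the Rust trait implementation keeps, and the choice to have `kill()` also mark the context as stopped is an assumption drawn from the description of `kill()` as extending `stop_generating()`.

```python
import asyncio


class ToyContext:
    """Hypothetical stand-in for a request context (not Dynamo's code)."""

    def __init__(self, request_id):
        self._id = request_id
        self._stopped = asyncio.Event()
        self._killed = asyncio.Event()

    def id(self):
        return self._id

    def is_stopped(self):
        return self._stopped.is_set()

    def is_killed(self):
        return self._killed.is_set()

    async def stopped(self):
        # Event.wait() returns at once if the event is already set.
        await self._stopped.wait()

    async def killed(self):
        await self._killed.wait()

    def stop_generating(self):
        # Setting an already-set event is a no-op, so repeated calls are safe.
        self._stopped.set()

    def kill(self):
        # Assumption of this sketch: a kill also marks the context as stopped.
        self._stopped.set()
        self._killed.set()


async def demo():
    ctx = ToyContext("req-1")
    waiter = asyncio.create_task(ctx.stopped())
    await asyncio.sleep(0)   # give the waiter a chance to start waiting
    ctx.stop_generating()
    ctx.stop_generating()    # second call changes nothing
    await asyncio.wait_for(waiter, timeout=1)
    return ctx.is_stopped(), ctx.is_killed()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model a graceful stop leaves `is_killed()` false, which mirrors the distinction the trait draws between a stop request and a hard stop.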
| #### Child Request Management | ||
| - **`link_child(child: Arc<dyn AsyncEngineContext>)`**: Links a child `AsyncEngineContext` to this context. When `stop_generating()`, `stop()`, or `kill()` is called on the parent context, the same method is automatically called on all linked child contexts in the order they were linked. This is especially useful in disaggregated serving scenarios where a frontend receives cancellation notification and needs to cancel requests to workers, and the worker can then cancel its sub-requests (e.g., remote prefill operations). | ||
|
|
||
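To make the parent-to-child propagation concrete, here is a small Python model of linking. It is a sketch under stated assumptions: `LinkedContext` is a hypothetical class, plain booleans replace the real context state, and the frontend, worker, and prefill roles are only illustrative labels. What happens to a child that is linked after its parent was already cancelled is not covered by the text above and is not modelled here.

```python
class LinkedContext:
    """Hypothetical model of parent/child cancellation linking."""

    def __init__(self, request_id):
        self.request_id = request_id
        self._stopped = False
        self._killed = False
        self._children = []  # kept in the order the children were linked

    def link_child(self, child):
        self._children.append(child)

    def is_stopped(self):
        return self._stopped

    def is_killed(self):
        return self._killed

    def stop_generating(self):
        self._stopped = True
        # Call the same method on every linked child, in link order.
        for child in self._children:
            child.stop_generating()

    def kill(self):
        self._stopped = True  # assumption: kill extends stop_generating
        self._killed = True
        for child in self._children:
            child.kill()


# A three-tier chain: frontend -> worker -> remote prefill sub-request.
frontend = LinkedContext("user-request")
worker = LinkedContext("user-request")
prefill = LinkedContext("user-request")
frontend.link_child(worker)
worker.link_child(prefill)

# Cancelling at the top reaches every tier below it.
frontend.stop_generating()
```

Because each context forwards the call to its own children, a single `stop_generating()` on the frontend context reaches the prefill sub-request without the frontend knowing it exists.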
| ### Thread Safety | ||
|
|
||
| The `AsyncEngineContext` trait ensures thread-safety with `Send + Sync` bounds, allowing safe concurrent access across multiple threads and async tasks. | ||
|
|
||
| ## Python Bindings | ||
|
|
||
| The `AsyncEngineContext` functionality is exposed to Python through the `Context` class, which provides a largely one-to-one mapping from Rust methods to Python methods. | ||
|
|
||
| ### Python Context Class | ||
|
|
||
| The Python `Context` class wraps the Rust `AsyncEngineContext` and exposes the following methods: | ||
|
|
||
| - **`id()`**: Returns the unique identifier for the context | ||
| - **`is_stopped()`**: Synchronous method equivalent to the Rust `is_stopped()` | ||
| - **`is_killed()`**: Synchronous method equivalent to the Rust `is_killed()` | ||
| - **`stop_generating()`**: Issues a stop generating signal, equivalent to the Rust method | ||
| - **`async_killed_or_stopped()`**: An async method that completes when the context becomes either killed or stopped, whichever happens first. This combines the functionality of the Rust `killed()` and `stopped()` async methods using `tokio::select!`. | ||
|
|
||
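The "whichever happens first" behaviour of `async_killed_or_stopped()` has a close analogue in plain `asyncio`. The sketch below is an illustration only: `killed_or_stopped` and `demo` are hypothetical helpers, the two events stand in for the context's stopped and killed states, and the returned label is added for demonstration (the text above only says the real method completes).

```python
import asyncio


async def killed_or_stopped(stopped, killed):
    """Wait on two asyncio.Event objects and report which fired first."""
    waiters = {
        asyncio.create_task(stopped.wait()): "stopped",
        asyncio.create_task(killed.wait()): "killed",
    }
    done, pending = await asyncio.wait(
        list(waiters), return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the branch that lost the race
    # If both completed in the same loop iteration, report one of them.
    return waiters[next(iter(done))]


async def demo():
    stopped, killed = asyncio.Event(), asyncio.Event()
    # Simulate a hard stop arriving shortly after the wait begins.
    asyncio.get_running_loop().call_later(0.01, killed.set)
    return await killed_or_stopped(stopped, killed)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```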
| ### Context Usage in Python | ||
|
|
||
| The context is optionally available for both incoming and outgoing requests: | ||
|
|
||
| #### Incoming Requests | ||
| For incoming requests, the `generate` method may optionally accept a `context` argument after the `request` argument. If the `context` parameter is present in the method signature, it receives the context object of the incoming request. Request handlers can: | ||
|
|
||
| - Check for cancellation synchronously using `context.is_stopped()` before beginning expensive operations | ||
| - Listen for cancellation asynchronously using `await context.async_killed_or_stopped()` | ||
|
|
||
| Example: | ||
| ```python | ||
| import asyncio | ||
|
|
||
||
| async def generate(self, request, context): | ||
|     for i in range(1000): | ||
|         # Check for cancellation before expensive work | ||
|         if context.is_stopped(): | ||
|             raise asyncio.CancelledError | ||
|
|
||
||
|         # Perform work (expensive_computation is a placeholder) | ||
|         result = await expensive_computation() | ||
|         yield result | ||
| ``` | ||
|
|
||
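The example above checks `is_stopped()` between steps. The second option in the list, awaiting `context.async_killed_or_stopped()`, lets a handler react while a step is still in flight. The following is one possible sketch of that pattern: `FakeContext`, `slow_step`, and `demo` are hypothetical stand-ins so the snippet runs without Dynamo, and the handler returns early on cancellation rather than raising `asyncio.CancelledError`, purely to keep the demo short.

```python
import asyncio


class FakeContext:
    """Hypothetical stub exposing the Context methods used below."""

    def __init__(self):
        self._event = asyncio.Event()

    def stop_generating(self):
        self._event.set()

    def is_stopped(self):
        return self._event.is_set()

    async def async_killed_or_stopped(self):
        await self._event.wait()


async def slow_step(i):
    await asyncio.sleep(0.05)  # placeholder for expensive work
    return i


async def generate(request, context):
    # One long-lived task watches for cancellation during the whole request.
    cancel = asyncio.create_task(context.async_killed_or_stopped())
    try:
        for i in range(request["steps"]):
            work = asyncio.create_task(slow_step(i))
            await asyncio.wait({work, cancel}, return_when=asyncio.FIRST_COMPLETED)
            if cancel.done():
                work.cancel()  # abandon the in-flight step
                return
            yield work.result()
    finally:
        cancel.cancel()  # no-op if the watcher already finished


async def demo():
    context = FakeContext()
    received = []
    async for item in generate({"steps": 100}, context):
        received.append(item)
        if len(received) == 2:
            context.stop_generating()  # simulate the client cancelling
    return received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Racing each step against the watcher task means a long-running step is interrupted as soon as the cancellation arrives, instead of at the next loop iteration.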
| #### Outgoing Requests | ||
| For outgoing requests, Python scripts may optionally pass a context object as a keyword argument to the runtime endpoint client's router operations (such as the `generate`, `round_robin`, `random`, and `direct` methods). The script can then cancel the outgoing request through that context object. | ||
|
|
||
| This is especially useful when child outgoing requests need to be cancelled when the parent incoming request is cancelled. In such cases, the script can simply pass the incoming context object to the outgoing request, automatically linking the cancellation behavior. | ||
|
|
||
| Example: | ||
| ```python | ||
| async def generate(self, request, context): | ||
|     # Forward the incoming context to outgoing request | ||
|     # If the incoming request is cancelled, the outgoing request will be too | ||
|     stream = await self.client.generate(request, context=context) | ||
|     async for response in stream: | ||
|         yield response | ||
| ``` | ||
|
|
||
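The forwarding pattern above can be exercised end to end without a running deployment. The sketch below is an assumption-laden illustration: `FakeContext`, `FakeClient`, `Handler`, and `demo` are hypothetical, and the fake client simply polls the shared context, whereas the real runtime delivers cancellation to the downstream worker through its own mechanism.

```python
import asyncio


class FakeContext:
    """Hypothetical stub for the Python Context object."""

    def __init__(self):
        self._stopped = False

    def stop_generating(self):
        self._stopped = True

    def is_stopped(self):
        return self._stopped


class FakeClient:
    """Hypothetical downstream client that honours the shared context."""

    async def generate(self, request, context=None):
        async def stream():
            for token in request:
                if context is not None and context.is_stopped():
                    return  # downstream stops producing once cancelled
                await asyncio.sleep(0)
                yield token

        return stream()


class Handler:
    def __init__(self, client):
        self.client = client

    async def generate(self, request, context):
        # Same shape as the example above: pass the incoming context along.
        stream = await self.client.generate(request, context=context)
        async for response in stream:
            yield response


async def demo():
    context = FakeContext()
    handler = Handler(FakeClient())
    seen = []
    async for response in handler.generate(["a", "b", "c", "d"], context):
        seen.append(response)
        if response == "b":
            context.stop_generating()  # cancel the incoming request
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The handler itself contains no cancellation logic; sharing one context object is what lets the downstream stream end early.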
| This design propagates cancellation through multi-tier request chains: when a client cancels a request, all associated sub-requests are cancelled as well, saving computational resources across the entire request pipeline. |