[Roadmap]: H1 '26 timeline and roadmap #5506

@harryskim


Hi Dynamo developers!

We want to share the following key timelines and focus areas for Dynamo in H1 2026.

πŸ”¨ Update to Minor Release Process

We plan three major releases (0.8 through 1.0), reaching 1.0 by GTC ’26. Dynamo will continue to be released on a biweekly cadence as before, but we are changing our approach to minor releases.

Previously, minor releases (i.e., 0.x.1) for Dynamo were cut from main just like our major releases (i.e., 0.x.0), which included all changes made to main between code freeze of the two releases.

Going forward, minor releases (e.g., 0.8.1) will be based on the previous major release (e.g., 0.8.0) rather than main, enabling us to focus on critical bug fixes and important feature updates.

πŸ“… Timeline

Planned dates for future releases are shown below.

| v0.8 | v0.8.1 | v0.9.0 | v0.9.1 | v1.0 | v1.0+ |
| ---- | ------ | ------ | ------ | ---- | ----- |
| 1/14 | 1/28 | 2/11 | 2/25 | 3/11 | Dates to be shared after GTC |

We will share more details about each major release in GitHub issues pinned next to this overall H1 ’26 roadmap.

🎯 H1’26 Focus Areas

Our goal is for Dynamo to provide a seamless configuration and deployment experience. To achieve this, we are focused on five key areas:

  1. Performance
  2. Production Grade Serving & Scaling
  3. Core (including Router and KV Caching)
  4. Agents
  5. Multimodality and Diffusion (Omni)
  • Performance

    • AIConfigurator
      • Improve prediction accuracy for all LLM inference engines (SGLang, TRT-LLM, vLLM)
        • Special thanks to the Mooncake team for contributing SGLang support to AIC 🙏.
      • Support for popular models, including upcoming DeepSeek models
      • Support for Blackwell GPUs
    • Multi-Feature Recipes
      • Add more recipes that combine KV-aware routing, disaggregated serving and KV cache offloading to maximize performance for the following use cases:
        • Agents (Qwen3 32B)
        • Coding (Qwen3 235B or DSV3)
        • Multimodality (Qwen3-VL 30B)
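To make the AIConfigurator goals above concrete, here is a minimal sketch of the kind of analytical estimate such a tool might compute when predicting performance for a model/GPU/parallelism combination. The formula, function name, and constants are illustrative assumptions, not AIC's actual model.

```python
# Hypothetical sketch of a compute-bound TTFT estimate, in the spirit of
# what a configurator predicts; NOT AIConfigurator's actual methodology.

def estimate_ttft_ms(isl: int, params_b: float, gpu_tflops: float,
                     tp: int, mfu: float = 0.5) -> float:
    """Rough prefill-latency estimate.

    isl        -- input sequence length (tokens)
    params_b   -- model parameters, in billions
    gpu_tflops -- peak per-GPU throughput (dense TFLOPS)
    tp         -- tensor-parallel degree
    mfu        -- assumed model FLOPs utilization (illustrative)
    """
    # A dense transformer forward pass costs roughly 2 FLOPs per
    # parameter per token.
    flops = 2 * params_b * 1e9 * isl
    achievable = gpu_tflops * 1e12 * tp * mfu
    return flops / achievable * 1e3  # milliseconds

# Example: a 32B model, 4k-token prompt, 4-way TP on ~1000-TFLOPS GPUs.
ttft = estimate_ttft_ms(isl=4096, params_b=32, gpu_tflops=1000, tp=4)
```

A real configurator must also model attention cost, memory bandwidth, and communication overhead; this sketch only shows the shape of the prediction problem.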
  • Production Grade Serving & Scaling

    • Planner

      • Hierarchical Planner for optimizing heterogeneous worker pools, for example:
        • Requests have identical TTFT SLAs but different ISLs, requiring distinct prefill parallelization
        • Requests have identical ISL/OSL but different ITL SLAs, requiring distinct decode parallelization
      • Dynamic scaling interval between autoscaling decisions
      • Support for aggregated serving workloads
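The hierarchical-planner examples above (identical ISL/OSL but different ITL SLAs needing distinct decode parallelization) can be sketched as a simple SLA-to-pool mapping. The pool names and thresholds below are hypothetical, purely to illustrate the idea:

```python
# Illustrative sketch: steer requests with different ITL SLAs to decode
# pools with different parallelization. Pool names and thresholds are
# made up for this example, not Dynamo's planner logic.

def pick_decode_pool(itl_sla_ms: float) -> str:
    """Map a per-request ITL SLA to a decode worker pool."""
    if itl_sla_ms < 20:
        return "decode-tp8"   # tight SLA: wider tensor parallelism
    if itl_sla_ms < 50:
        return "decode-tp4"
    return "decode-tp2"       # relaxed SLA: cheaper configuration

# Three requests, same ISL/OSL, different ITL SLAs.
pools = [pick_decode_pool(ms) for ms in (10, 30, 100)]
```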
    • Fault Tolerance

      • Request rejection to prevent system overload
      • Proactive Health Monitoring via NVSentinel for parallel detection of critical XID errors (e.g., Xid 45) and automated node cordoning
      • Fast recovery for full restart
        • Continuous Availability: User-facing HTTP service stays online, and queue holds incoming requests during restarts.
        • Persistent Networking: ZMQ and load balancer connections remain stable, allowing backend engines to reconnect instantly.
        • Zero-Copy Recovery (via Model Express): Decouples worker processes from model weights using a GPU Memory Service, skipping disk I/O entirely to reduce restart latency from minutes to milliseconds.
        • KVBM Resilience: Automatically reclaims memory and supports multi-node KV cache restoring via RDMA to prevent lost context.
      • HW fault injection framework to test fault tolerance of Dynamo deployments without breaking real cluster setups
      • Elastic EP (aka WideEP) fault tolerance for vLLM + SGLang:
        • In-Place Reconfiguration: Preserves healthy ranks and rebuilds communication groups (NCCL/Gloo) without terminating the entire instance.
        • NIXL EP Kernels: Integration of non-blocking kernels to prevent driver deadlocks and enable safe scale-down/up.
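Of the fault-tolerance items above, request rejection is the simplest to illustrate: shed load before the system saturates instead of letting queues grow unboundedly. The counter-based controller below is a minimal sketch under assumed semantics, not Dynamo's implementation:

```python
# Minimal admission-control sketch: reject requests (e.g., with HTTP 429)
# once a concurrency cap is hit. Threshold and bookkeeping are
# illustrative only.

class AdmissionController:
    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_admit(self) -> bool:
        """Admit a request, or reject when the system is saturated."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when a request completes, freeing a slot."""
        self.in_flight -= 1

ctl = AdmissionController(max_in_flight=2)
results = [ctl.try_admit() for _ in range(3)]  # third request is rejected
```

A production controller would likely gate on richer signals (queue depth, KV-cache pressure, SLA headroom) rather than a bare counter.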
    • Grove - Kubernetes-native AI inference orchestration

      • Full topology-aware serving support
      • GB200 setup automation (e.g., DRA, MNNVL)
      • Training support collaborating with ByteDance Seed
      • Support for rolling upgrade strategies
      • Compatibility with default kube-scheduler + additional schedulers via plugins
    • ModelExpress - Reduces latency of artifact downloads and writes

      • Performance optimization for model weight loading via NIXL integration
      • Integration with GPU memory service to support fault tolerance and autoscaling use cases
  • Core (including Router and KV Caching)

    • Removing NATS and etcd Dependencies from Dynamo
      • As of 0.8.0, NATS and etcd are optional for the request and discovery planes: they have been replaced with transport-agnostic requests over TCP and Kubernetes-native service discovery via EndpointSlices. Removal of the NATS requirement from the KV events plane, used for KV-aware routing, is in progress.
    • Router
      • Hierarchical routing to enable a high-performance downstream module that integrates with upstream schedulers by exchanging real-time metadata and granular feedback metrics
    • KV Caching
      • Performant KV offloading from HBM to host memory and SSD; performance optimization for remote storage in progress.
      • Distributed KV cache management across multiple nodes via P2P mesh or global object and file storage
      • Laying groundwork for CUDA Memory Extension (CME) support to enable future hardware to efficiently share KV cache across nodes via unified memory access over NVLink fabric
      • Support for SGLang
    • Multi-LoRA Support
      • An initial implementation is available, and we will complete the design outlined here.
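The KV offloading hierarchy described above (HBM to host memory to SSD, with remote storage in progress) can be pictured as a tiered lookup that promotes hits back to the fast tier. The classes and promotion policy below are hypothetical illustrations, not KVBM's actual design:

```python
# Illustrative tiered KV-cache sketch: search HBM, then host memory,
# then SSD, and promote hits back to HBM. Data structures and policy
# are assumptions for this example only.

class TieredKVCache:
    def __init__(self):
        # Fastest tier first; a real system also tracks per-tier
        # capacity and evicts cold blocks downward.
        self.tiers = {"hbm": {}, "host": {}, "ssd": {}}

    def put(self, block_id: str, data: bytes, tier: str = "hbm") -> None:
        self.tiers[tier][block_id] = data

    def get(self, block_id: str):
        """Search tiers in order; promote a hit back to HBM."""
        for name in ("hbm", "host", "ssd"):
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self.tiers["hbm"][block_id] = data
                return data, name
        return None, None

cache = TieredKVCache()
cache.put("blk-1", b"kv", tier="ssd")     # block was offloaded to SSD
data, hit_tier = cache.get("blk-1")       # found on SSD, promoted to HBM
```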
  • Agents

    • Predictive Routing
      • Proactive Load Balancing: Decisions are informed by expected future load rather than just current system saturation.
      • Intelligent Cache Retention: Router prioritizes retaining KV cache blocks that Nemo Agentic Toolkit predicts will have high reuse, rather than using standard eviction policies.
      • Nuanced Session Affinity: Instead of binary "stickiness," Router can maintain affinity for sessions with high predicted reuse or allow migration for sessions nearing completion.
    • KV cache offloading and prefetching for tool calls
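The predictive-routing ideas above (scoring by expected future load and predicted KV reuse instead of current saturation alone) can be sketched as a weighted worker score. The weights and fields are invented for illustration and do not reflect the Router's actual heuristics:

```python
# Hypothetical predictive-routing score: lower is better. Penalize
# forecast load more than current load, and reward predicted KV reuse.
# All weights are illustrative assumptions.

def score_worker(current_load: float, predicted_load: float,
                 predicted_reuse: float) -> float:
    return 0.3 * current_load + 0.7 * predicted_load - 0.5 * predicted_reuse

workers = {
    "w0": score_worker(0.9, 0.2, 0.0),  # busy now, but soon idle
    "w1": score_worker(0.2, 0.9, 0.0),  # idle now, but load incoming
    "w2": score_worker(0.5, 0.5, 0.8),  # high predicted KV reuse
}
best = min(workers, key=workers.get)    # picks the reuse-heavy worker
```

A purely reactive balancer would favor w1 (lowest current load); factoring in forecasts and reuse flips the decision.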
  • Multimodality and Diffusion

    • Multimodality
      • Multimodal hash router support for vLLM and SGLang (already enabled for TRT-LLM)
      • E/P/D disaggregation performance optimization
    • Diffusion
      • Support for SGLang Diffusion/Omni and vLLM Omni
      • Extend Planner to support autoscaling for Omni models (e.g., UniVideo)

Please let us know in the comments if there are additional features the Dynamo team should prioritize. Thank you for your ongoing feedback; we will do our best to deliver the best possible Dynamo for the community. 🙏
