Skip to content

Conversation

@Wenhan-Tan
Copy link

@Wenhan-Tan Wenhan-Tan commented Nov 18, 2025

Overview:

Update EKS guide for Dynamo 0.6.0

Summary by CodeRabbit

  • Documentation
    • Added comprehensive deployment guide for EKS environments, covering cluster creation, storage configuration, platform installation, and inference graph deployment with testing procedures.

@Wenhan-Tan Wenhan-Tan requested review from a team as code owners November 18, 2025 22:48
@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link

👋 Hi Wenhan-Tan! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Nov 18, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 18, 2025

Walkthrough

Adds comprehensive EKS deployment documentation for Dynamo, covering cluster creation, EFS configuration, platform installation, inference graph deployment, and endpoint testing procedures.

Changes

Cohort / File(s) Summary
EKS Deployment Documentation
examples/deployments/EKS/README.md
New guide detailing deployment of Dynamo on EKS clusters, including prerequisites, cluster creation, EFS storage setup, Kubernetes platform installation, inference graph deployment, and endpoint verification

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

  • Documentation-only addition with no code or API changes
  • Single file addition requiring straightforward content verification

Poem

🐰 A guide through the clouds, so clear and bright,
EKS paths unfold, with storage right,
Dynamo's deployed with every step,
From cluster to inference, what perfect prep!
Documentation hops, making deployments light! 🚀

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is largely incomplete, providing only an Overview section with minimal detail. Required sections like Details, Where should the reviewer start, and Related Issues are missing. Complete the pull request description by adding Details explaining specific changes made, identifying key files for review, and including any related GitHub issue references using action keywords.
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title 'docs: Update EKS guide for Dynamo v0.6.0' directly and accurately summarizes the main change: updating documentation for the EKS deployment guide to reflect version 0.6.0.

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
examples/deployments/EKS/README.md (3)

1-11: Fix heading hierarchy to follow Markdown standards.

The document starts with # Dynamo Deployment on EKS (h1) but jumps directly to ### Step 1. Create EKS cluster (h3), skipping the h2 level. Heading levels should increment by only one level at a time.

Apply this diff to fix the heading hierarchy:

-# Dynamo Deployment on EKS
+# Dynamo Deployment on EKS
 
 This guide covers steps of creating an Amazon EKS cluster...
 
-[Step 1. Create EKS cluster](#step-1-create-eks-cluster)
+## Steps
 
-[Step 2. Install Dynamo Kubernetes Platform](#step-2-install-dynamo-kubernetes-platform)
+[Step 1. Create EKS cluster](#step-1-create-eks-cluster)
 
-[Step 3. Deploy Dynamo Inference Graph](#step-3-deploy-dynamo-inference-graph)
+[Step 2. Install Dynamo Kubernetes Platform](#step-2-install-dynamo-kubernetes-platform)
 
-### Step 1. Create EKS cluster
+[Step 3. Deploy Dynamo Inference Graph](#step-3-deploy-dynamo-inference-graph)
+
+## Step 1. Create EKS cluster

94-94: Fix bare URL and improve phrasing.

Line 94 contains a bare URL that should be wrapped in Markdown link syntax. Additionally, the phrasing "able to access" is more clearly expressed as "can access".

Apply this diff:

-Follow the steps to create an EFS file system: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md. Make sure you mount subnets in the last step correctly. This will affect whether your nodes are able to access the created EFS file system.
+Follow the [steps to create an EFS file system](https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md). Make sure you mount subnets correctly in the final step. This will affect whether your nodes can access the created EFS file system.

Also applies to: 94-94


17-84: Add language identifiers to all fenced code blocks.

All code blocks lack language specifications, which impacts syntax highlighting and readability in documentation viewers. Based on context, add appropriate language identifiers: yaml for Kubernetes manifests and bash for shell commands.

Examples of the fixes needed:

-```
+```yaml
 apiVersion: eksctl.io/v1alpha5
 kind: ClusterConfig
-```
+```bash
 eksctl create cluster -f <FILENAME>.yaml
-```
+```bash
 # Create Image Pull Secret
 export DOCKER_SERVER=<ECR_REGISTRY>

Apply similar fixes to all remaining code blocks. Use yaml for Kubernetes manifests and YAML configs, and bash for shell commands and command examples.

Also applies to: 88-90, 100-114, 118-120, 126-142, 145-153, 157-159, 163-169, 177-187, 191-199, 205-396, 400-402, 406-408, 412-433, 437-455, 459-461

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c65f1ac and a649b96.

📒 Files selected for processing (1)
  • examples/deployments/EKS/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/deployments/EKS/README.md

[style] ~94-~94: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...ly. This will affect whether your nodes are able to access the created EFS file system. ##...

(BE_ABLE_TO)

🪛 markdownlint-cli2 (0.18.1)
examples/deployments/EKS/README.md

11-11: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)


17-17: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


88-88: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


94-94: Bare URL used

(MD034, no-bare-urls)


100-100: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


118-118: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


126-126: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


145-145: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


157-157: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


163-163: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


191-191: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


205-205: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


400-400: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


406-406: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


412-412: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


437-437: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


459-459: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo


#### c) Create Dynamo Inference Graph

For this example, we'll deploy `Qwen/Qwen3-32B` in disaggregated mode. We'll create 12 prefill workers with TP1 and 1 decode worker with TP4 for a total of 16 x H200 GPUs (p5en.48xlarge). Please change `<DYNAMO_TRTLLM_IMAGE>`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let us note, this is disagg with kv router

fsx: true
- name: p5en-ng
instanceType: p5en.48xlarge

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add note saying that you can switch instance type to g6e or any other instance if p5en is not available


### Step 3. Deploy Dynamo Inference Graph

#### a) Build Dynamo TRTLLM runtime image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also show prebuilt NGC container way, show both ways - NGC container and build image

@rmccorm4 rmccorm4 changed the title Wenhant update EKS guide docs: Update EKS guide for Dynamo v0.6.0 Nov 19, 2025
@github-actions github-actions bot added the docs label Nov 19, 2025
#### b) Run below to install Dynamo Kubernetes Platform
```
# Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-0.6.0.tgz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious - v0.6.1 is already out, with v0.7.0 coming soon, should this just be updated in preparation for v0.7.0 instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we can wait for 0.7.0 since it'll be out this week

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wenhan, Kshitiz - v0.7.0 is out, so this can be updated: #4448 (comment)

Can we close out this PR if updates are still needed? It'll get progressively harder to merge the longer it stays open.

sharedMemory:
size: 64Gi
envs:
- name: DYN_LOG
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these debug log levels intentional or we'd comment it out and leave a note to enable for diagnosis?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already left a note for all these env vars. I can definitely comment them out

-n dynamo-system
```

#### b) Run below to install Dynamo Kubernetes Platform
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we shouldn't just point to the production install doc here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works too. Probably even better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs external-contribution Pull request is from an external contributor size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants