Conversation

AsadShahid04 commented Nov 11, 2025

Summary

This PR restores the lost benchmarking guide for the benchmarks/llm scripts, addressing issue #2031. The guide was accidentally removed when the examples/llm directory was deleted in PR #1899.

Changes

  • Restored comprehensive benchmarking guide at benchmarks/llm/README.md

    • Detailed instructions for using perf.sh and plot_pareto.py scripts
    • Updated deployment methods (replaced outdated dynamo serve with current Kubernetes and local deployment approaches)
    • Added prerequisites, hardware configuration notes, and troubleshooting sections
    • Included examples for both aggregated and disaggregated serving modes
    • Added instructions for single-node and multi-node deployments
  • Updated main benchmarks README at benchmarks/README.md

    • Added reference to the new LLM benchmarking guide
  • Fixed bug in perf.sh (bonus fix)

    • Modified the script to create per-concurrency subdirectories (-concurrency1/, -concurrency2/, etc.) as expected by plot_pareto.py (see the sketch after this list)
    • This ensures the documented workflow works end-to-end
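
A minimal sketch of the per-concurrency layout this fix produces (not the actual perf.sh code; the base directory name and concurrency values here are placeholders):

```bash
#!/usr/bin/env bash
# Sketch only: illustrates the "<base>-concurrency<N>/" layout that plot_pareto.py expects.
# Directory names and concurrency values are placeholders, not the real perf.sh defaults.
set -euo pipefail

base_dir="results"                      # placeholder base name
for concurrency in 1 2 4 8 16; do
  out_dir="${base_dir}-concurrency${concurrency}"
  mkdir -p "${out_dir}"
  # ... run the benchmark at this concurrency and write its artifacts into "${out_dir}" ...
done
```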

Testing

  • ✅ Tested locally on macOS (Docker setup)
  • ✅ Tested on brev.dev cloud workspace (Ubuntu 22.04, NVIDIA L40S GPU)
  • ✅ Verified perf.sh creates correct directory structure
  • ✅ Verified plot_pareto.py can parse and generate plots from results
  • ✅ Tested with Qwen/Qwen3-0.6B model

Reference

The original guide content was retrieved from commit 35c56065bb490e12bba84a6abf8107dc1f2c7529 and updated with current deployment methods.

Fixes #2031

@hhzhang16 @athreesh

Summary by CodeRabbit

  • Documentation
    • Enhanced benchmarking documentation with detailed tools and framework information.
    • Added comprehensive LLM benchmarking guide covering deployment options (Kubernetes and local), setup prerequisites, hardware recommendations, and multi-tool workflows.
    • Included troubleshooting, monitoring guidance, and Pareto frontier plot interpretation for performance analysis.

- Restore benchmarking guide for perf.sh and plot_pareto.py scripts
- Replace outdated dynamo serve references with current deployment methods
  - Add Kubernetes deployment examples using DynamoGraphDeployment
  - Add local deployment examples using python -m dynamo.frontend + workers (sketched below)
- Document script usage, command-line options, and result interpretation
- Add comprehensive examples for single-node and multi-node benchmarking
- Update benchmarks/README.md to reference the new LLM benchmarking guide
- Include troubleshooting section and additional resources

TODO - Still needs to be done:
- Test all commands locally to verify they work as documented
- Test deployment examples on brev.dev to ensure cloud compatibility
- Verify hardware configuration section is still accurate
- Test perf.sh and plot_pareto.py scripts with actual deployments
- Validate all links and references are correct

Fixes ai-dynamo#2031
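
A rough sketch of the "python -m dynamo.frontend + workers" local flow mentioned in the commit message above; the worker entrypoint and flags are assumptions rather than verified commands (the restored benchmarks/llm/README.md is the authoritative reference), and the model reuses the Qwen/Qwen3-0.6B example from this PR's testing:

```bash
# Sketch only: local (non-Kubernetes) deployment, assuming a vLLM worker entrypoint.
# Module names other than dynamo.frontend and all flags are assumptions, not verified.

# 1. Start the OpenAI-compatible frontend.
python -m dynamo.frontend &

# 2. Start a backend worker serving the example model (assumed entrypoint and flag).
python -m dynamo.vllm --model Qwen/Qwen3-0.6B &

# 3. When the frontend is ready, run perf.sh against its /v1/chat/completions endpoint.
```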

copy-pr-bot bot commented Nov 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi AsadShahid04! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

github-actions bot added the external-contribution label (Pull request is from an external contributor) on Nov 11, 2025

coderabbitai bot commented Nov 11, 2025

Walkthrough

This PR restores and expands the LLM benchmarking documentation previously lost when the examples/llm directory was deleted. It adds a "Benchmarking Tools" subsection to benchmarks/README.md and replaces a placeholder in benchmarks/llm/README.md with comprehensive documentation covering deployment options, benchmarking workflows, and troubleshooting guidance.

Changes

  • Cohort: Documentation Restoration and Enhancement
  • Files: benchmarks/README.md, benchmarks/llm/README.md
  • Summary: Adds "Benchmarking Tools" subsection with framework and script details to README.md; replaces "Coming soon." placeholder with comprehensive LLM benchmarking guide including prerequisites, Kubernetes/local deployment steps, disaggregated/aggregated configurations, perf.sh and plot_pareto.py usage documentation, and troubleshooting guidance.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Documentation-only changes with no code logic or interdependencies to verify
  • Focus review on accuracy of deployment instructions (Kubernetes and local configurations)
  • Verify completeness and clarity of perf.sh and plot_pareto.py command-line examples
  • Confirm hardware recommendations and prerequisites are current and accurate
  • Check consistency of cross-references between the two README files

Poem

🐰 A guide once lost, now hops back to light,
Benchmarks and baselines, restored just right!
From Pareto plots to deployment's dance,
The tools are documented—give benchmarking a chance!

Pre-merge checks

✅ Passed checks (5 passed)
  • Title check (✅ Passed): The title clearly and concisely summarizes the main change: restoring the LLM benchmarking guide and fixing issue #2031.
  • Linked Issues check (✅ Passed): The PR successfully meets all objectives from issue #2031: restores the benchmarking guide, updates deprecated deployment methods, documents perf.sh and plot_pareto.py usage, and fixes the bonus perf.sh bug.
  • Out of Scope Changes check (✅ Passed): All changes are directly related to restoring and improving the benchmarking guide and fixing perf.sh as outlined in issue #2031; no out-of-scope modifications detected.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check (✅ Passed): The pull request provides a comprehensive description that covers all template sections: Overview (summary of changes), Details (specific file changes and improvements), Where to start (specific files mentioned: benchmarks/llm/README.md, benchmarks/README.md, perf.sh), Related Issues (uses 'Fixes #2031' action keyword).


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
benchmarks/llm/README.md (3)

213-213: Convert emphasized section headers to Markdown headings.

Lines 213 and 217 use bold emphasis (**text**) instead of proper Markdown heading syntax, which violates MD036 style guidelines. Convert these to proper headings:

- **Option 1: Kubernetes (Recommended)**
+ ### Option 1: Kubernetes (Recommended)

- **Option 2: Local**
+ ### Option 2: Local

This improves document structure and consistency with the rest of the guide.

Also applies to: 217-217


418-418: Hyphenate compound modifier "two-node".

Line 418 should use "two-node" as a hyphenated compound modifier before the noun:

- **Two node comparison**:
+ **Two-node comparison**:

374-374: Specify language identifier for code fence.

Line 374 defines a code block without a language identifier. Add bash or text to improve syntax highlighting and readability:

- ```
+ ```bash
 artifacts_root/
 ├── artifacts_0/
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d9b674b and 8738150.

📒 Files selected for processing (2)
  • benchmarks/README.md (1 hunks)
  • benchmarks/llm/README.md (1 hunks)
🧰 Additional context used
🪛 GitHub Check: Check for broken markdown links
benchmarks/llm/README.md

[failure] 563-563:
Broken link: Metrics and Visualization - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/benchmarks/llm/README.md?plain=1#L563


[failure] 562-562:
Broken link: Performance Tuning Guide - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/benchmarks/llm/README.md?plain=1#L562


[failure] 548-548:
Broken link: Performance Tuning Guide - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/benchmarks/llm/README.md?plain=1#L548

🪛 LanguageTool
benchmarks/llm/README.md

[grammar] ~418-~418: Use a hyphen to join words.
Context: ...facts-root-dir artifacts_root **Two node comparison**: bash python3 benc...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
benchmarks/llm/README.md

213-213: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


217-217: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


374-374: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
benchmarks/llm/README.md (1)

1-563: Comprehensive LLM benchmarking guide restoration is well-structured and thorough.

This guide successfully restores and significantly expands the lost benchmarking documentation. The content is well-organized with clear sections covering:

  • Prerequisites with hardware specifications
  • Multiple deployment options (Kubernetes and local)
  • Single-node and multi-node disaggregated deployments
  • vLLM aggregated baselines
  • Detailed perf.sh and plot_pareto.py usage instructions
  • Result interpretation and metrics explanations
  • Troubleshooting guidance

The documentation structure flows well and provides actionable examples. The cross-reference from benchmarks/README.md correctly points to this comprehensive guide.

Verify that the three broken links (lines 548, 562, 563) are corrected or removed before merge, as flagged in the previous review comment.

benchmarks/README.md (1)

72-85: Well-structured addition of benchmarking tools index.

The new "Benchmarking Tools" section provides a clear index to different benchmarking capabilities in the directory:

  • Links to general framework (with reference to complete guide)
  • Links to LLM benchmarking scripts with Pareto plots
  • Links to router and profiler tools

The cross-reference to the LLM benchmarking guide (line 85) correctly directs users to the comprehensive documentation restored in benchmarks/llm/README.md. This improves documentation discoverability and user experience.

- Replace ../../docs/guides/disagg_perf_tuning.md with ../../docs/performance/tuning.md (2 occurrences)
- Replace ../../deploy/metrics/README.md with ../../deploy/metrics/k8s/README.md

Fixes broken links that were pointing to non-existent files.
hhzhang16 left a comment

The flow is tightly tailored for vLLM with a specific model and hardware. Have you tested with other models and backends?

@hhzhang16

There seems to be some good overlap with this guide: https://github.com/AsadShahid04/dynamo/blob/docs/restore-llm-benchmarking-guide/docs/benchmarks/benchmarking.md

Could you look into what it could take to merge the two benchmarking guides and scripts?

…hardware note

- Replace DeepSeek-R1-Distill-Llama-70B-FP8-dynamic with Qwen/Qwen3-0.6B throughout
  (smaller model better for examples and testing)
- Change 'suboptimal results' to 'different results' for less judgmental wording

Addresses review comments from PR ai-dynamo#4234
AsadShahid04 commented Nov 13, 2025

> There seems to be some good overlap with this guide: https://github.com/AsadShahid04/dynamo/blob/docs/restore-llm-benchmarking-guide/docs/benchmarks/benchmarking.md
>
> Could you look into what it could take to merge the two benchmarking guides and scripts?

Thanks for pointing that out! I've analyzed the overlap between the two guides. Here's what I found:

Analysis

Overlap:

  • Both use AIPerf and have similar prerequisites
  • Both support Kubernetes and local deployments
  • Similar troubleshooting content

Key differences:

The benchmarks/llm/README.md guide (using perf.sh + plot_pareto.py; a workflow sketch follows after these lists) is focused on LLM benchmarking with:

  • Pareto frontier plots (unique to this tool)
  • Detailed disaggregated/aggregated deployment examples
  • Parallelism parameter tracking (TP, DP, prefill-TP, decode-TP)
  • Bash script simplicity

The docs/benchmarks/benchmarking.md guide (using benchmarks.utils) is more general with:

  • Server-side (in-cluster) benchmarking support
  • Works with any HTTP endpoint, not just LLM
  • Multiple plot types (not just Pareto)
  • More flexible Python API
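
For concreteness, the two-step perf.sh + plot_pareto.py workflow looks roughly like this; perf.sh's own arguments are omitted here (they are documented in the restored guide), while the plot_pareto.py invocation mirrors the --artifacts-root-dir option that appears in the review output earlier in this thread:

```bash
# Sketch of the two-step workflow (perf.sh arguments omitted; see benchmarks/llm/README.md).

# 1. Run the benchmark sweep; results land under artifacts_root/ with
#    per-concurrency subdirectories for each run.
bash benchmarks/llm/perf.sh    # supply the model/endpoint arguments from the guide

# 2. Generate Pareto frontier plots from the collected results.
python3 benchmarks/llm/plot_pareto.py --artifacts-root-dir artifacts_root
```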

Options

Option 1: Unified guide with tool selection (my recommendation)

  • Create a single guide that helps users choose the right tool upfront
  • Preserve both tools since they serve different needs
  • Consolidate shared content (prerequisites, troubleshooting)
  • Keep detailed examples in tool-specific sections

This would create a structure like:

  • Overview and tool selection guide
  • Shared prerequisites
  • Section for perf.sh (LLM-focused, Pareto plots)
  • Section for benchmarks.utils (general, server-side, multiple plots)
  • Common topics (result interpretation, troubleshooting)

Option 2: Migrate to single tool

  • Deprecate perf.sh and enhance benchmarks.utils with Pareto plots and parallelism params
  • Pros: Single tool to maintain
  • Cons: Breaking change, significant development effort

Option 3: Keep separate, add cross-references

  • Minimal changes, just add cross-references and a "choosing your tool" section
  • Pros: No breaking changes, minimal work
  • Cons: Still some duplication, potential user confusion

Recommendation

I think Option 1 makes the most sense because both tools are valuable for different use cases. The perf.sh tool is simpler for LLM benchmarking with Pareto analysis, while benchmarks.utils is more flexible for general endpoints and server-side benchmarking.

Next Steps

I can create a unified guide that consolidates the shared content and provides clear guidance on when to use each tool. This would involve:

  • Creating a unified structure at docs/benchmarks/README.md
  • Moving shared content to common sections
  • Keeping benchmarks/llm/README.md as a quick reference for perf.sh users
  • Adding cross-references between the guides

Question: Should we create a separate issue for this merge work and approve this PR in the meantime? The current PR restores the LLM benchmarking guide which was accidentally removed, and the merge work is a separate improvement that can be done afterward. @hhzhang16

@AsadShahid04

> The flow is tightly tailored for vLLM with a specific model and hardware. Have you tested with other models and backends?

perf.sh only sends HTTP requests to /v1/chat/completions (OpenAI-compatible), so it works with any backend that exposes that API. The examples use vLLM for deployment, but the benchmarking step is the same.
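
To make that concrete, a request to the endpoint looks roughly like this; the host and port are assumptions (use whatever address the frontend is serving on), and the model name reuses the Qwen/Qwen3-0.6B example from this PR:

```bash
# Example request shape for an OpenAI-compatible /v1/chat/completions endpoint.
# The host/port below is an assumption; substitute your deployment's frontend address.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```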

Clarify that perf.sh workflow works with vLLM, SGLang, and TensorRT-LLM
since they all expose the same OpenAI-compatible HTTP API. Examples use
vLLM for clarity, but the same workflow applies to other backends.

Addresses review comment about testing with other models and backends.
@hhzhang16

I'm okay with merging this first, but I would like to see Option 1 implemented in the medium-long term! Taking another look over the MR now

@AsadShahid04

> I'm okay with merging this first, but I would like to see Option 1 implemented in the medium-long term! Taking another look over the MR now

Sounds good! Let me know if you want me to make another issue once this MR is closed. Thanks!

hhzhang16 left a comment

@hhzhang16

> I'm okay with merging this first, but I would like to see Option 1 implemented in the medium-long term! Taking another look over the MR now
>
> Sounds good! Let me know if you want me to make another issue once this MR is closed. Thanks!

That would be amazing, thanks 🙇

AsadShahid04 and others added 2 commits November 18, 2025 18:43
- Fix broken link from deploy/metrics/k8s/README.md to docs/observability/prometheus-grafana.md
- Addresses review comment from PR ai-dynamo#4234
AsadShahid04 commented Nov 19, 2025

> Quick note: seeing this, could you double check? Broken link: Metrics and Visualization - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/benchmarks/llm/README.md?plain=1#L566

Just fixed!

Who else needs to review to close this pull request?

@hhzhang16

My approval is enough, you just need to fix the CI issues!

@dagil-nvidia

@BenHamm - can you take a look at this PR?


Labels

external-contribution (Pull request is from an external contributor), size/XL

Development

Successfully merging this pull request may close these issues: [DOCS]: Bring back benchmarking guide