add corresponding qwen3-32b-fp8 aic based disagg performance tuning md guide #4655
base: main
Walkthrough
A new markdown document for the qwen3-32b-fp8 recipe detailing disaggregated serving performance tuning. Contents include QPS matching methodology via worker parallelism and batch sizing, AIC-driven automation deployment workflow, manual fine-tuning guidance, and deployment case studies demonstrating performance gains.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md (1)
25-25: Minor wording suggestion: simplify "in view of".
Line 25 uses "in view of" which is somewhat wordy. Consider a shorter alternative like "based on" or simply restructuring the sentence for greater clarity.
Example:
-### __Match__ the N prefill worker candidates with M decode worker candidates in view of __sequence throughput seq/s__
+### __Match__ the N prefill worker candidates with M decode worker candidates by __sequence throughput seq/s__
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (7)
recipes/qwen3-32b-fp8/images/agg_allignment.png is excluded by !**/*.png
recipes/qwen3-32b-fp8/images/challenges_in_disagg.png is excluded by !**/*.png
recipes/qwen3-32b-fp8/images/disagg_aic_allignment.png is excluded by !**/*.png
recipes/qwen3-32b-fp8/images/disagg_allignment.png is excluded by !**/*.png
recipes/qwen3-32b-fp8/images/find_worker_SLA.png is excluded by !**/*.png
recipes/qwen3-32b-fp8/images/local_deploy_k8s.png is excluded by !**/*.png
recipes/qwen3-32b-fp8/images/qps_match.png is excluded by !**/*.png
📒 Files selected for processing (1)
recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4655/merge) by davilu-nvidia.
recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md
[error] 1-1: Trailing whitespace found. pre-commit hook trailing-whitespace failed and modified the file in place. Fix the trailing spaces in this file.
🪛 LanguageTool
recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md
[grammar] ~16-~16: Ensure spelling is correct
Context: ...n: 0 auto;"> ## Disagg pd QPS Matching Methology ### We can firstly __find a worker that meet...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[style] ~17-~17: Consider using “who” when you are referring to a person instead of an object.
Context: ...ogy ### We can firstly find a worker that meets SLA and under constraints - En...
(THAT_WHO)
[style] ~25-~25: ‘in view of’ might be wordy. Consider a shorter alternative.
Context: ...didates with M decode worker candidates in view of sequence throughput seq/s - Seq/s ...
(EN_WORDINESS_PREMIUM_IN_VIEW_OF)
[grammar] ~43-~43: Use a hyphen to join words.
Context: ... >= 60 - disable prefix caching ### AIC based full automation deployment [AIC a...
(QB_NEW_EN_HYPHEN)
[grammar] ~51-~51: Ensure spelling is correct
Context: ...- AIC projection and ai-perf actual run allignment
What's the problem here: this tps/user -...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~56-~56: Use a hyphen to join words.
Context: ...the most important SLAs ### Manual fine tuning based on AIC suggestions __agg a...
(QB_NEW_EN_HYPHEN)
[grammar] ~72-~72: Ensure spelling is correct
Context: ...t (60), which means we need more decode wokers and to tune decode max_batch_size (us...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~74-~74: Use a hyphen to join words.
Context: ...o enhance decoding capability. We fine tuned with more decode GPUs (2 x tp2 a...
(QB_NEW_EN_HYPHEN)
[grammar] ~74-~74: Ensure spelling is correct
Context: ...ker max_batch_size and prefill worker parallism setting, finally we found best disagg ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~80-~80: Use a hyphen to join words.
Context: ...Based on AIC run and minimum manual fine tuning process - Under TTFT constraint...
(QB_NEW_EN_HYPHEN)
[grammar] ~86-~86: Ensure spelling is correct
Context: ...e been working on fine-grained AIC perf allignment, advanced feature such as prefix cachin...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md
88-88: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
90-90: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
| # Advanced Disagg Perf Tuning | ||
| ## Challenges in Disaggregated Serving Deployment | ||
| __Challenge1__ – Is disaggregated serving always better than aggregated serving? How much perf gain is reasonable? | ||
|
|
||
| A: For example, considering __ISL:OSL=1:4000__, do we have perf gain by using disaggregated serving? – __NO__ | ||
|
|
||
| __Challenge2__ – How should disaggregated serving be configured to solve the __throughput @ latency__ problem? | ||
|
|
||
| - Parallelism of the worker | ||
| - How many prefill (p) and decode (d) workers | ||
| - Depends on ISL, OSL, TTFT, TPOT | ||
| - The tuning effort is tremendous | ||
|
|
||
| <img src="images/challenges_in_disagg.png" width="700" alt="challenges_in_disagg" style="display: block; margin: 0 auto;"> | ||
|
|
||
| ## Disagg pd QPS Matching Methodology | ||
| ### First, __find a worker that meets the SLA under the given constraints__ | ||
|
|
||
| - Enumerate the parallelism combinations of a worker: tp x pp x attn dp x moe tp x moe ep | ||
| - Find the max batch size of the worker that meets TTFT and TPOT respectively (Disagg is awesome! We can achieve this separately) | ||
| - Ensure there's no OOM | ||
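The worker search above can be sketched in a few lines. This is only an illustration under stated assumptions: `enumerate_parallelism` and `find_max_batch_size` are hypothetical helper names (not AIC APIs), the latency and memory models are supplied by the caller, and only tp and pp are enumerated (attn dp, moe tp and moe ep are left out to keep the sketch short).

```python
from itertools import product
from typing import Callable, Iterable, List, Optional, Tuple


def enumerate_parallelism(
    max_gpus: int,
    tp_options: Iterable[int] = (1, 2, 4, 8),
    pp_options: Iterable[int] = (1, 2),
) -> List[Tuple[int, int]]:
    """List the (tp, pp) combinations that fit in one worker's GPU budget."""
    return [
        (tp, pp)
        for tp, pp in product(tp_options, pp_options)
        if tp * pp <= max_gpus
    ]


def find_max_batch_size(
    latency_ms: Callable[[int], float],
    fits_in_memory: Callable[[int], bool],
    sla_ms: float,
    candidates: Iterable[int],
) -> Optional[int]:
    """Return the largest candidate batch size that fits in memory and meets the SLA.

    latency_ms and fits_in_memory are caller-supplied (hypothetical) models,
    e.g. a TTFT model for a prefill worker or a TPOT model for a decode worker.
    """
    best = None
    for bs in sorted(candidates):
        if not fits_in_memory(bs):
            break  # assumption: memory use grows with batch size, so stop at first OOM
        if latency_ms(bs) <= sla_ms:
            best = bs
    return best
```

For a decode worker, `latency_ms` would model TPOT and `sla_ms` would be the TPOT budget; for a prefill worker, it would model TTFT. Running the search once per parallelism candidate yields the N prefill and M decode candidates used in the matching step.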
|
|
||
| <img src="images/find_worker_SLA.png" width="600" alt="find_worker_SLA" style="display: block; margin: 0 auto;"> | ||
|
|
||
| ### __Match__ the N prefill worker candidates with M decode worker candidates by __sequence throughput seq/s__ | ||
|
|
||
| - Seq/s of prefill = how many sequences a prefill worker can process through the context phase per second => __producer__ | ||
| - Seq/s of decode = how many sequences a decode worker can process through the whole generation phase per second => __consumer__ | ||
| - The throughput should __match__ between xP and yD | ||
| - Finally, sweep x and y for a given (prefill, decode) worker combination to find the best seq/s/gpu, and thus the best tokens/s/gpu | ||
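The producer/consumer matching above can be sketched as follows. This is a minimal sketch, not AIC's implementation: `best_xpyd` is a hypothetical helper, and it assumes the system throughput of an xPyD deployment is bounded by the slower side, i.e. `min(x * prefill_seq_s, y * decode_seq_s)`.

```python
from typing import Optional, Tuple


def best_xpyd(
    prefill_seq_s: float,
    prefill_gpus: int,
    decode_seq_s: float,
    decode_gpus: int,
    max_x: int = 8,
    max_y: int = 8,
    max_total_gpus: Optional[int] = None,
) -> Optional[Tuple[int, int, float]]:
    """Sweep x prefill and y decode workers; return (x, y, seq/s/gpu) of the best pair."""
    best = None
    for x in range(1, max_x + 1):
        for y in range(1, max_y + 1):
            total_gpus = x * prefill_gpus + y * decode_gpus
            if max_total_gpus is not None and total_gpus > max_total_gpus:
                continue
            # The slower of producer (prefill) and consumer (decode) limits the system.
            system_seq_s = min(x * prefill_seq_s, y * decode_seq_s)
            per_gpu = system_seq_s / total_gpus
            # Strict comparison keeps the smallest deployment among ties.
            if best is None or per_gpu > best[2]:
                best = (x, y, per_gpu)
    return best
```

With a prefill worker at 2 seq/s on 1 GPU and a decode worker at 3 seq/s on 2 GPUs (made-up numbers), the sweep picks 3P2D, where both sides deliver 6 seq/s. Under the assumption that every sequence generates OSL tokens, tokens/s/gpu is simply seq/s/gpu multiplied by OSL.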
|
|
||
| <img src="images/qps_match.png" width="500" alt="qps_match" style="display: block; margin: 0 auto;"> | ||
|
|
||
| # agg/disagg best perf tuning based on AIC (aiconfigurator) | ||
| ## Case Study | ||
| ### Settings | ||
| - model: qwen3-32b-fp8-per-block | ||
| - ISL/OSL = 4000/500 | ||
| - TTFT SLA = 600 ms or 1200 ms (two scenarios) | ||
| - TPS/user SLA >= 60 | ||
| - disable prefix caching | ||
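To relate these settings to the per-worker search, the TPS/user target can be turned into a TPOT budget. The sketch below assumes tps/user is the reciprocal of TPOT and that end-to-end latency is roughly TTFT plus one TPOT per remaining output token; both are simplifying assumptions, and the function names are hypothetical.

```python
def tpot_budget_ms(tps_per_user: float) -> float:
    """Per-token latency budget implied by a tokens/s/user target (assumes tps/user = 1000 / TPOT)."""
    return 1000.0 / tps_per_user


def request_latency_ms(ttft_ms: float, tpot_ms: float, osl: int) -> float:
    """Approximate end-to-end latency: first token, then (osl - 1) decode steps."""
    return ttft_ms + (osl - 1) * tpot_ms
```

Under these assumptions, TPS/user >= 60 corresponds to a TPOT of at most about 16.7 ms, which is the decode-side constraint used when searching for a decode worker.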
|
|
||
| ### AIC-based fully automated deployment | ||
| [AIC automation deploy guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/dynamo_deployment_guide.md) | ||
|
|
||
| AIC now supports automating everything in one script: configuring the deployment, generating configs, preparing the docker image and container, pulling model checkpoints, deploying the service, benchmarking, and summarizing. Refer to [Automation](https://github.com/ai-dynamo/aiconfigurator/blob/main/tools/automation/README.md) for more details. | ||
|
|
||
| ### local deployment vs. k8s deployment | ||
| <img src="images/local_deploy_k8s.png" width="900" alt="local_deploy_k8s" style="display: block; margin: 0 auto;"> | ||
|
|
||
| ### disagg - AIC projection and ai-perf actual run alignment | ||
| <img src="images/disagg_aic_allignment.png" width="800" alt="disagg_aic_allignment" style="display: block; margin: 0 auto;"> | ||
|
|
||
| The problem here: this tps/user vs. tps/gpu Pareto plot __does not have TTFT info at all__, while TTFT is one of the most important SLAs. | ||
|
|
||
| ### Manual fine-tuning based on AIC suggestions | ||
| __agg alignment__ | ||
|
|
||
| <img src="images/agg_allignment.png" width="500" alt="agg_allignment" style="display: block; margin: 0 auto;"> | ||
|
|
||
| TTFT estimation is complicated. | ||
| Currently AIC models the TTFT from engine execution, but not other online serving overheads such as request queuing. | ||
|
|
||
| The actual TTFT is higher than projected, so we need to reduce `max_batch_size` with `TP2` to meet the TTFT SLA. | ||
|
|
||
| We also tried other combinations of `TP_size` and `max_batch_size`, and AIC was right: `TP2` is the best choice. | ||
|
|
||
| __disagg alignment__ | ||
|
|
||
| <img src="images/disagg_allignment.png" width="500" alt="disagg_allignment" style="display: block; margin: 0 auto;"> | ||
|
|
||
| The actual run based on AIC's suggestion is marked in yellow. As we can observe, tps/user is slightly below the SLA requirement (60), which means we need more decode workers and need to tune the decode `max_batch_size` (usually equal to `request concurrency`) to enhance decoding capability. | ||
|
|
||
| We fine-tuned with more decode GPUs (`2 x tp2` and `1 x tp4`), with the corresponding decode worker `max_batch_size` and prefill worker `parallelism setting`, and finally found the best disagg config __within a minimal search area__. | ||
|
|
||
| Regarding prefill, considering that `tp1` has less communication overhead, we tried `4 x tp1`, and it performs better than `2 x tp2`. | ||
|
|
||
| __Conclusion__ | ||
|
|
||
| Based on the AIC run and a minimal manual fine-tuning process: | ||
|
|
||
| - Under TTFT constraint of 600 ms, disagg delivers a __148%__ tps/gpu perf gain over agg | ||
|
|
||
| - Under TTFT constraint of 1200 ms, disagg delivers a __102%__ tps/gpu perf gain over agg | ||
|
|
||
| We've been working on fine-grained AIC perf alignment, advanced features such as prefix caching modeling, and vllm/sglang (including wide ep) backend support. Stay tuned! | ||
|
|
||
| __Corresponding recipe__ | ||
|
|
||
| https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b-fp8 | ||
Remove trailing whitespace.
The pre-commit hook detected trailing whitespace in the file. Ensure all lines, including blank lines at the end of the file, have no trailing spaces.
🤖 Prompt for AI Agents
recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md lines 1-97: the file
contains trailing whitespace (including possible blank lines at EOF) which trips
the pre-commit hook; remove all trailing spaces from every line and ensure the
file ends with a single newline and no trailing spaces—use your editor's "trim
trailing whitespace" feature or run a command like `sed -i 's/[ \t]*$//' <file>`
(or enable trim on save) and re-save, then re-run the pre-commit checks.
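For environments without `sed`, the same cleanup can be sketched in Python. This is an illustrative equivalent of the command above, not part of the pre-commit hook; `strip_trailing_whitespace` is a hypothetical helper name, and it also normalizes line endings to `\n`, which the `sed` command does not do.

```python
import re


def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces/tabs on every line and end the text with a single newline."""
    lines = [re.sub(r"[ \t]+$", "", line) for line in text.splitlines()]
    # Drop blank lines at the end of the file.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n" if lines else ""
```

Reading the file, passing its contents through this function, and writing the result back should leave nothing for the `trailing-whitespace` check to flag; re-running the pre-commit checks afterwards confirms it.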
|
|
||
| <img src="images/challenges_in_disagg.png" width="700" alt="challenges_in_disagg" style="display: block; margin: 0 auto;"> | ||
|
|
||
| ## Disagg pd QPS Matching Methology |
Fix hyphenation and spelling errors in headings and text.
Several spelling and grammar errors need correction throughout the document:
- Line 16: "Methology" → "Methodology"
- Line 43: "AIC based" → "AIC-based"
- Line 56: "fine tuning" → "fine-tuning"
- Line 80: "fine tuning" → "fine-tuning"
Apply this diff:
-## Disagg pd QPS Matching Methology
+## Disagg pd QPS Matching Methodology
-### AIC based full automation deployment
+### AIC-based full automation deployment
-### Manual fine tuning based on AIC suggestions
+### Manual fine-tuning based on AIC suggestions
-Based on AIC run and minimum manual fine tuning process
+Based on AIC run and minimum manual fine-tuning process
Also applies to: 43-43, 56-56, 80-80
🤖 Prompt for AI Agents
In recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md around lines 16, 43, 56
and 80, fix heading and text hyphenation/spelling: change "Methology" to
"Methodology" on line 16, "AIC based" to "AIC-based" on line 43, and both
occurrences of "fine tuning" on lines 56 and 80 to "fine-tuning"; update the
content in-place ensuring punctuation and capitalization remain consistent.
| ### local deployment vs. k8s deployment | ||
| <img src="images/local_deploy_k8s.png" width="900" alt="local_deploy_k8s" style="display: block; margin: 0 auto;"> | ||
|
|
||
| ### disagg - AIC projection and ai-perf actual run allignment |
Fix spelling errors and hyphenation in case study and conclusion.
Multiple spelling errors need correction:
- Line 51: "allignment" → "alignment"
- Line 72: "wokers" → "workers"
- Line 74: "parallism" → "parallelism", "fine tuned" → "fine-tuned"
- Line 86: "allignment" → "alignment"
Apply this diff:
-### disagg - AIC projection and ai-perf actual run allignment
+### disagg - AIC projection and ai-perf actual run alignment
-Actual run based on AIC's suggestion is noted in yellow, as we can observe, tps/user is a little bit less than SLA requirement (60), which means we need more decode wokers and to tune decode `max_batch_size` (usually equals to `request concurrency`) to enhance decoding capability.
+Actual run based on AIC's suggestion is noted in yellow, as we can observe, tps/user is a little bit less than SLA requirement (60), which means we need more decode workers and to tune decode `max_batch_size` (usually equals to `request concurrency`) to enhance decoding capability.
-We fine tuned with more decode GPUs (`2 x tp2` and `1 x tp4`) with corresponding decode worker `max_batch_size` and prefill worker `parallism setting`, finally we found best disagg config __within minimum search area__
+We fine-tuned with more decode GPUs (`2 x tp2` and `1 x tp4`) with corresponding decode worker `max_batch_size` and prefill worker `parallelism setting`, finally we found best disagg config __within minimum search area__
-We've been working on fine-grained AIC perf allignment, advanced feature such as prefix caching modeling, vllm/sglang (including wide ep) backend supports etc. Stay tuned!
+We've been working on fine-grained AIC perf alignment, advanced feature such as prefix caching modeling, vllm/sglang (including wide ep) backend supports etc. Stay tuned!
Also applies to: 72-72, 74-74, 86-86
🤖 Prompt for AI Agents
In recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md around lines 51, 72,
74, and 86, there are spelling and hyphenation issues: change "allignment" to
"alignment" on lines 51 and 86, change "wokers" to "workers" on line 72, and on
line 74 change "parallism" to "parallelism" and "fine tuned" to the hyphenated
"fine-tuned"; update each instance accordingly to correct spelling and apply
hyphenation.
| __Corresponding recipe__ | ||
|
|
||
| https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b-fp8 |
Convert emphasis to proper markdown heading and format bare URL.
Line 88 uses emphasis as a heading (violates MD036), and line 90 contains a bare URL without markdown link formatting (violates MD034).
Apply this diff:
-__Corresponding recipe__
+## Corresponding Recipe
-https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b-fp8
+[qwen3-32b-fp8 Recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b-fp8)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| __Corresponding recipe__ | |
| https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b-fp8 | |
| ## Corresponding Recipe | |
| [qwen3-32b-fp8 Recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b-fp8) |
🤖 Prompt for AI Agents
In recipes/qwen3-32b-fp8/aic_based_disagg_perf_tuning.md around lines 88 to 90,
the text uses emphasis as a heading (MD036) and leaves a bare URL (MD034);
change the emphasized line to a proper Markdown heading (prepend one or more #
as appropriate) and replace the bare URL with a Markdown link (e.g.
[Corresponding
recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b-fp8)) so
the heading and link follow MD style rules.
Seeing a mismatch between this guide and AIC 0.4.0 results: https://gist.github.com/BenHamm/3ec1e1e92312302e966ee75606fe1931
Overview:
Add the corresponding AIC-based disagg performance tuning guide for qwen3-32b-fp8.
Details:
Add the related markdown document, which accompanies the qwen3-32b-fp8 recipe.