@tanmayv25 tanmayv25 commented Aug 9, 2025

Overview:

Declare these variables:

SERVED_MODEL_NAME=DeepSeek-R1-FP4
MODEL_PATH=<local path to the DS R1 model weights>
IMAGE=<path to dynamo image with trtllm>
export SLURM_JOB_NAME=<job_name>
export SLURM_ACCOUNT=<account_name>
export SLURM_PARTITION=<partition_name>
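A missing or unfilled variable tends to surface only after the job has been queued, so a quick pre-flight check can save a round trip. The sketch below is not part of this PR; `missing_vars` and `REQUIRED_VARS` are hypothetical names, and the check simply flags names that are unset, empty, or still hold a `<...>` placeholder. Note that if `os.environ` is passed in, only exported variables are visible, and three of the variables above are declared without `export`.

```python
# Hypothetical pre-flight check; not part of the scripts in this PR.
REQUIRED_VARS = (
    "SERVED_MODEL_NAME",
    "MODEL_PATH",
    "IMAGE",
    "SLURM_JOB_NAME",
    "SLURM_ACCOUNT",
    "SLURM_PARTITION",
)


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset, empty, or placeholders.

    `env` is any mapping of variable name to string value, e.g. os.environ.
    """
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        # Treat "<local path ...>"-style values as not filled in yet.
        is_placeholder = value.startswith("<") and value.endswith(">")
        if not value or is_placeholder:
            missing.append(name)
    return missing
```

Calling `missing_vars(os.environ)` before `./submit.sh` and aborting on a non-empty result would be one way to use it.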

Run the disaggregated serving MTP0 sweeps with the following command:

./submit.sh mtp=off all

For aggregated serving, use the following:

./submit_agg.sh

Run post-processing on the performance numbers:

python3 post_process.py dynamo_disagg-bm-8150-1024 --output-file disagg_results.json

Summary by CodeRabbit

  • New Features

    • End-to-end TensorRT-LLM benchmarking workflows on SLURM, supporting aggregated and disaggregated setups with submission helpers.
    • Automated result collation to JSON and visualization via Pareto comparison plots.
    • Convenience scripts to start services/workers, run benchmarks, and tune GPU clocks.
    • Config generation for multi-node runs with flexible parallelism and optional speculative decoding.
  • Documentation

    • Comprehensive guide covering prerequisites, environment setup, example commands, and result interpretation, including notes on model weights and configuration variants.
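The Pareto comparison mentioned in the summary can be illustrated with a generic frontier computation. This is a minimal sketch, not the plotting code from this PR: `pareto_frontier` is a hypothetical helper, and the choice of two metrics that are both "higher is better" (for example, throughput per GPU and throughput per user) is an assumption about what the plots compare.

```python
def pareto_frontier(points):
    """Return the Pareto-optimal subset of (x, y) points, maximizing both axes.

    A point is kept if no other point is at least as good on both axes
    and strictly better on one. The result is sorted by ascending x.
    """
    frontier = []
    best_y = float("-inf")
    # Visit points from largest x to smallest; for equal x, largest y first.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        # Everything seen so far has x >= this x, so the point survives
        # only if it strictly improves on the best y seen so far.
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return frontier[::-1]
```

Each benchmark configuration contributes one point; configurations that fall off the frontier are dominated by some other configuration on both metrics.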

@tanmayv25 tanmayv25 marked this pull request as ready for review August 12, 2025 00:22
@tanmayv25 tanmayv25 requested a review from nnshah1 as a code owner August 12, 2025 00:22
@tanmayv25 tanmayv25 merged commit 1327e3b into main Aug 12, 2025
13 of 14 checks passed
@tanmayv25 tanmayv25 deleted the tanmayv-sweeps branch August 12, 2025 22:55

@rmccorm4 rmccorm4 left a comment

Non-blocking thoughts for the future:

  • Can submit.sh and submit_agg.sh be a single intuitive script?
  • If not, I think submit_disagg.sh and submit_agg.sh would be a bit more intuitive/symmetrical when scanning the names of the scripts. Not clear that submit.sh means disagg.
  • What's the minimum number of GB200 nodes you need available to run the largest config used in the scripts?
  • Can you add a note somewhere in the relevant README about your findings on the performance impact of using srun --exclusive?
  • Do you see the submit*.sh scripts converging with the multinode scripts (ex: srun_disaggregated.sh) into a single general-purpose set?

@tanmayv25

Can submit.sh and submit_agg.sh be a single intuitive script?

I believe it would be too many configs in a single file.

If not, I think submit_disagg.sh and submit_agg.sh would be a bit more intuitive/symmetrical when scanning the names of the scripts. Not clear that submit.sh means disagg.

Will rename it in the follow-up PR.

What's the minimum number of GB200 nodes you need available to run the largest config used in the scripts?

Will include in the notes.

Can you add a note somewhere in the relevant README about your findings on the performance impact of using srun --exclusive?

Sure!

Do you see the submit*.sh scripts converging with the multinode scripts (ex: srun_disaggregated.sh) into a single general-purpose set?

Yes, I have this in mind. I will look into refactoring in the near future.
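Until the node requirement is documented, the minimum-node question above can be estimated by hand. The sketch below is only an illustration: `min_nodes` is a hypothetical helper, the worker counts and GPU sizes are not taken from the scripts in this PR, and it assumes each worker occupies whole nodes and does not share a node with another worker.

```python
import math


def min_nodes(workers, gpus_per_node):
    """Estimate the nodes needed for a set of workers.

    `workers` is a list of (worker_count, gpus_per_worker) pairs, e.g. one
    entry for prefill workers and one for decode workers. Assumes every
    worker is rounded up to whole nodes and nodes are not shared.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return sum(
        count * math.ceil(gpus_per_worker / gpus_per_node)
        for count, gpus_per_worker in workers
    )
```

For example, with an assumed 4 GPUs per node, one 4-GPU prefill worker plus one 8-GPU decode worker gives `min_nodes([(1, 4), (1, 8)], 4) == 3`.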
