
Conversation

@ilyasher ilyasher commented Sep 17, 2025

Overview:

This PR adds a --use-ai-configurator flag to profile_sla.py. When this flag is passed, profile_sla.py uses aiconfigurator to estimate model performance for the various configurations it tests, instead of running an actual dynamo deployment.

Advantages of --use-ai-configurator:

  • Running profile_sla.py takes seconds instead of hours
  • No GPU access needed

Disadvantages:

  • Limited support for models and GPU types
  • Estimated performance may be inaccurate

Example usage

python3 benchmarks/profiler/profile_sla.py --config components/backends/trtllm/deploy/disagg.yaml --backend trtllm --use-ai-configurator --aic-system h200_sxm --aic-model-name QWEN3_32B --backend-version 0.20.0

Details:

  • Add the --use-ai-configurator flag, along with --backend-version, --aic-model-name, and --aic-system
  • Change profile_sla.py to call aiconfigurator when --use-ai-configurator is passed
  • Add estimate_perf.py, which contains a helper class for using aiconfigurator (a rough usage sketch follows this list)
  • Add tests for profile_sla.py with --use-ai-configurator
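
A rough usage sketch of the new helper class (the method names and result keys are taken from the review summary further down; the constructor arguments and exact signatures shown here are illustrative and may not match estimate_perf.py exactly):

from benchmarks.profiler.utils.estimate_perf import AIConfiguratorPerfEstimator

# Hypothetical construction; the real constructor arguments may differ.
estimator = AIConfiguratorPerfEstimator(
    "QWEN3_32B",   # --aic-model-name
    "h200_sxm",    # --aic-system
    "trtllm",      # --backend
    "0.20.0",      # --backend-version
)

# Prefill: estimate TTFT (reported via a "context_latency" field).
prefill_perf = estimator.estimate_prefill_perf(3000, tp_size=2)
ttft_ms = prefill_perf["context_latency"]

# Decode: derive the concurrency sweep from the estimated KV-token budget,
# then read the estimated ITL and throughput for a candidate batch size.
max_kv_tokens = estimator.get_max_kv_tokens(3000, 150, tp_size=2)
decode_perf = estimator.estimate_perf(3000, 150, 32, tp_size=2)
itl_ms = decode_perf["tpot"]
thpt_per_gpu = decode_perf["tokens/s/gpu"]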

Where should the reviewer start?

Reviewers can start with the added tests, then the changes to profile_sla.py, and then the remaining files.

Summary by CodeRabbit

  • New Features

    • Optional AI-based performance estimation for profiling without real deployments.
    • Added CLI options: --use-ai-configurator, --aic-system, --aic-model-name, --backend-version, with input validation and warnings when unused.
    • Prefill and decode profiling now support AI-driven estimates while preserving existing outputs and plots.
  • Tests

    • New integration tests covering AI-configurator flows, missing/invalid arguments, and multiple backend/model versions.
    • Updated dry-run tests to include the new CLI flags.

copy-pr-bot bot commented Sep 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Sep 17, 2025

Walkthrough

Introduces AI-configurator-based performance estimation into the profiling flow, adding a new estimator module, pluggable callbacks in prefill/decode utilities, CLI flags to select AI-based paths, and tests validating AI-configurator integration and argument handling. Legacy live-deployment behavior remains unchanged when the AI-configurator option is not used.

Changes

  • Profiler CLI and orchestration (benchmarks/profiler/profile_sla.py): Adds AI-configurator mode with new CLI flags (--use-ai-configurator, --aic-system, --aic-model-name, --backend-version); routes prefill/decode to AI-based estimators when enabled; validates required inputs; skips deployment in the AI path; preserves the legacy path otherwise.
  • AI Configurator Estimator (benchmarks/profiler/utils/estimate_perf.py): New module providing AIConfiguratorPerfEstimator with lazy import, database/backend resolution, and methods estimate_perf, estimate_prefill_perf, get_max_batch_size, and get_max_kv_tokens.
  • Decode profiling utils (benchmarks/profiler/utils/profile_decode.py): Refactors the sweep into a helper that takes a callback for ITL/throughput; adds profile_decode_aiconfigurator and an optional estimator parameter; preserves data collection/plotting; skips entries with missing metrics (a generic sketch of this callback pattern follows this list).
  • Prefill profiling utils (benchmarks/profiler/utils/profile_prefill.py): Refactors into a helper using a TTFT callback; adds profile_prefill_aiconfigurator; keeps interpolation/plotting; the inner get_ttft in the AI path computes TTFT but lacks an explicit return.
  • Tests: AI-configurator integration (tests/profiler/test_profile_sla_aiconfigurator.py): New tests covering missing/invalid AI-configurator args and successful runs across versions/models without GPU.
  • Tests: Dry-run arg parity (tests/profiler/test_profile_sla_dryrun.py): Extends fixtures to include use_ai_configurator, aic_system, aic_model_name, and backend_version defaults.
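
A minimal sketch of the callback pattern referenced in the profile_decode.py item above (function and parameter names here are illustrative, not the exact ones in this PR):

from typing import Callable, Optional, Tuple

# Illustrative only: the real helpers in profile_decode.py / profile_prefill.py
# may use different names and signatures. The shared loop sweeps candidate
# request counts and delegates metric collection to a callback, so the same
# code serves both the live-benchmark path and the aiconfigurator path.
def sweep_decode(
    num_request_range: list,
    get_itl_and_thpt: Callable[[int], Optional[Tuple[float, float]]],
):
    itl_values, thpt_values = [], []
    for num_request in num_request_range:
        result = get_itl_and_thpt(num_request)
        if result is None:
            # Skip entries with missing metrics, as the summary above describes.
            continue
        itl, thpt = result
        itl_values.append(itl)
        thpt_values.append(thpt)
    return itl_values, thpt_values

# Live path: the callback would wrap benchmark_decode(...) from genai_perf.
# AI path: the callback would wrap estimator.estimate_perf(...) and read the
# estimated TPOT and tokens-per-second-per-GPU from its result.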

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as profile_sla CLI
  participant Prefill as profile_prefill*
  participant Decode as profile_decode*
  participant Est as AIConfiguratorPerfEstimator
  participant Bench as Live Benchmark

  rect rgba(230,245,255,0.5)
    Note over CLI: --use-ai-configurator=true
    User->>CLI: run_profile(args)
    CLI->>Est: init(system, backend, version, model)
    CLI->>Prefill: profile_prefill_aiconfigurator(get_ttft via Est)
    Prefill->>Est: estimate_prefill_perf(isl)
    Est-->>Prefill: {context_latency}
    CLI->>Est: get_max_kv_tokens(isl, osl)
    CLI->>Decode: profile_decode_aiconfigurator(get_itl/throughput via Est)
    Decode->>Est: estimate_perf(isl, osl, batch)
    Est-->>Decode: {tpot, tokens/s/gpu}
    Decode-->>CLI: raw_data + plots
    Prefill-->>CLI: raw_data + plots
  end

  rect rgba(240,240,240,0.5)
    Note over CLI: --use-ai-configurator=false
    User->>CLI: run_profile(args)
    CLI->>Prefill: profile_prefill (benchmark)
    Prefill->>Bench: benchmark_prefill(isl)
    Bench-->>Prefill: TTFT
    CLI->>Decode: profile_decode (benchmark)
    Decode->>Bench: benchmark_decode(isl, osl, num_requests)
    Bench-->>Decode: ITL, throughput
    Decode-->>CLI: raw_data + plots
    Prefill-->>CLI: raw_data + plots
  end

  note right of CLI: *Helpers use callbacks to obtain metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

In burrows of benchmarks I hop and I scheme,
Estimating tokens like a caffeinated dream.
No pods to deploy, just numbers to weigh—
TTFT, ITL, I plot all day.
With paws on the CLI, I configure with glee,
“AI, estimate!”—and off we flee. 🐇📈

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.63%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly and accurately describes the primary change—adding the --use-ai-configurator CLI flag to benchmarks/profiler/profile_sla.py—and is concise, focused, and directly related to the main changes in this PR. It clearly signals the main intent to a reviewer scanning history.
Description Check ✅ Passed The PR description follows the repository template by providing Overview, Details, and "Where should the reviewer start" sections and clearly explains the purpose, advantages/disadvantages, added CLI flags, example usage, and reviewer guidance; it also documents added files and tests for reviewers to inspect. This makes the description mostly complete and actionable for review. The only missing template element is a "Related Issues" section linking an issue or explicitly stating none.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
benchmarks/profiler/profile_sla.py (2)

399-404: Return early if no prefill results to avoid min() on empty list

When prefill produced no results, the code logs an error but still proceeds to call min(prefill_ttft), which crashes. Exit the recommendations phase early.

Apply this diff:

         else:
             logger.info("Analyzing results and generate recommendations...")
             # Safety guards: no results → exit early with a clear message
             if not (prefill_tp_size and prefill_ttft and prefill_thpt_per_gpu):
                 logger.error("No prefill results produced; skipping recommendations.")
+                return

461-470: Guard KV-cache utilization division by zero

decode_kv_cache_size[...] can be 0 (parser returned 0), causing a crash. Add a safe path.

Apply this diff:

-            selected_decode_kv_cache_utilization = (
-                decode_concurrency[selected_decode_idx]
-                * (args.isl + (args.osl / 2))
-                / decode_kv_cache_size[selected_decode_idx]
-            )
+            denom = decode_kv_cache_size[selected_decode_idx]
+            if denom <= 0:
+                logger.warning("KV cache size is non-positive; skipping utilization recommendation.")
+                selected_decode_kv_cache_utilization = 0.0
+            else:
+                selected_decode_kv_cache_utilization = (
+                    decode_concurrency[selected_decode_idx]
+                    * (args.isl + (args.osl / 2))
+                    / denom
+                )
🧹 Nitpick comments (10)
benchmarks/profiler/utils/estimate_perf.py (1)

157-159: Document backend's private API usage

Using self.backend._get_memory_usage relies on a private method (prefixed with underscore). This could break in future aiconfigurator updates.

Consider adding a comment acknowledging this dependency:

 def get_mem_usage(bs: int):
+    # Note: Using private API _get_memory_usage from aiconfigurator backend
     return self.backend._get_memory_usage(
         model, self.database, bs, 1, isl, osl
     )["total"]
tests/profiler/test_profile_sla_aiconfigurator.py (2)

20-20: Remove unused noqa directive

The noqa: E402 directive is unnecessary as E402 is not enabled in your Ruff configuration.

-from benchmarks.profiler.profile_sla import run_profile  # noqa: E402
+from benchmarks.profiler.profile_sla import run_profile

31-31: Consider using a more secure temporary directory

Using /tmp/test_profiling_results could potentially lead to security issues in shared environments. Consider using Python's tempfile module for test directories.

You could enhance test isolation by using a unique temporary directory per test:

import tempfile

@pytest.fixture
def trtllm_args(self, tmp_path):
    class Args:
        backend = "trtllm"
        config = "components/backends/trtllm/deploy/disagg.yaml"
        output_dir = str(tmp_path / "test_profiling_results")
        # ... rest of the args

This would leverage pytest's built-in tmp_path fixture for better test isolation.

benchmarks/profiler/profile_sla.py (7)

328-335: Handle empty concurrency sweep and fix log message

Skip TP sizes where max_concurrency <= 0 (or none of the values fit) and fix the misleading log message.

Apply this diff:

             if not args.dry_run:
                 sweep_num_request = [
                     num for num in DECODE_NUM_REQUESTS_RANGE if num <= max_concurrency
                 ]
-                logger.info(
-                    f"Sweeping num_request range based on maximum number of kv tokens: {sweep_num_request}"
-                )
+                if not sweep_num_request:
+                    logger.warning(
+                        f"No feasible concurrency values under max_concurrency={max_concurrency}; skipping TP{tp_size}."
+                    )
+                    continue
+                logger.info(
+                    f"Sweeping num_request range based on computed max concurrency ({max_concurrency}): {sweep_num_request}"
+                )

392-395: Filter out empty-engine results before plotting decode

Avoid plotting with empty series to prevent downstream errors/warnings.

Apply this diff:

-        if decode_results:
-            plot_decode_performance(decode_results, args.itl, args.output_dir)
+        if decode_results:
+            nonempty_results = [
+                (tp, itl_list, thpt_list)
+                for (tp, itl_list, thpt_list) in decode_results
+                if itl_list and thpt_list
+            ]
+            if nonempty_results:
+                plot_decode_performance(nonempty_results, args.itl, args.output_dir)
+            else:
+                logger.warning("No non-empty decode results to plot.")

96-109: Consolidate flag validation (addresses Ruff TRY003/TRY301)

Reduce repetition and produce a single actionable error listing missing flags when --use-ai-configurator is set.

Apply this diff:

-        if args.use_ai_configurator:
-            if not args.aic_system:
-                raise ValueError(
-                    "Must provide --aic-system when using --use-ai-configurator."
-                )
-            if not args.aic_model_name:
-                raise ValueError(
-                    "Must provide --aic-model-name when using --use-ai-configurator."
-                )
-            if not args.backend_version:
-                raise ValueError(
-                    "Must provide --backend-version when using --use-ai-configurator."
-                )
+        if args.use_ai_configurator:
+            missing = []
+            if not args.aic_system:
+                missing.append("--aic-system")
+            if not args.aic_model_name:
+                missing.append("--aic-model-name")
+            if not args.backend_version:
+                missing.append("--backend-version")
+            if missing:
+                raise ValueError(
+                    "Missing required flags for --use-ai-configurator: " + ", ".join(missing)
+                )

112-116: Avoid forcing lowercase for --aic-system

Lowercasing may not match database keys (depends on aiconfigurator DB naming). Prefer passing the value verbatim and normalize inside the estimator if needed.

Apply this diff:

-                args.aic_system.lower(),
+                args.aic_system,

To confirm expected casing, please verify what AIConfiguratorPerfEstimator (and the underlying DB) expects for system names (e.g., h100_sxm vs H100_SXM). If normalization is needed, do it centrally in the estimator.


571-583: Decode interpolation: compute max_kv_tokens once and skip when zero

Mirror the earlier guard: bail out early if max_kv_tokens <= 0 to avoid empty plots or later errors.

Apply this diff:

         elif args.use_ai_configurator:
             max_kv_tokens = ai_configurator_perf_estimator.get_max_kv_tokens(
                 args.isl, args.osl, tp_size=best_decode_tp
             )
+            if max_kv_tokens <= 0:
+                logger.warning("Estimator returned non-positive max_kv_tokens; skipping decode interpolation.")
+                max_kv_tokens = 0
+                # Optionally return early or continue without plotting; choose one:
+                return
             profile_decode_aiconfigurator(

627-629: Surface estimator DB-not-found errors with actionable guidance

Catching ValueError from the estimator (e.g., DB not found) can produce clearer logs before re-raising.

Apply this diff:

-    except Exception as e:
+    except ValueError as e:
+        logger.error(f"AI-configurator initialization failed: {e}. "
+                     "Check --aic-system/--aic-model-name/--backend-version are supported by the local DB.")
+        raise
+    except Exception as e:
         logger.error(f"Profile job failed with error: {e}")
         raise

321-327: Skip decode TP when log parsing yields zero KV cache

If get_kv_cache_size_from_dynamo_log returns 0, max_concurrency becomes 0 and later steps may misbehave. Guard and continue.

Apply this diff:

             max_kv_tokens = config_modifier.get_kv_cache_size_from_dynamo_log(
                 f"{work_dir}/{client.deployment_name}/{WORKER_COMPONENT_NAMES[args.backend].decode_worker_k8s_name.lower()}/0.log"
             )
-            max_concurrency = max_kv_tokens // (args.isl + args.osl)
+            max_concurrency = max_kv_tokens // (args.isl + args.osl) if (args.isl + args.osl) > 0 else 0
+            if max_concurrency <= 0:
+                logger.warning("Parsed KV cache size is zero; skipping this TP size.")
+                await client.delete_deployment()
+                deployment_clients.remove(client)
+                logger.info("Deployment deleted")
+                continue
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0373b89 and d1d3ee0.

📒 Files selected for processing (6)
  • benchmarks/profiler/profile_sla.py (12 hunks)
  • benchmarks/profiler/utils/estimate_perf.py (1 hunks)
  • benchmarks/profiler/utils/profile_decode.py (4 hunks)
  • benchmarks/profiler/utils/profile_prefill.py (4 hunks)
  • tests/profiler/test_profile_sla_aiconfigurator.py (1 hunks)
  • tests/profiler/test_profile_sla_dryrun.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/profiler/test_profile_sla_aiconfigurator.py (2)
benchmarks/profiler/profile_sla.py (1)
  • run_profile (63-634)
tests/profiler/test_profile_sla_dryrun.py (4)
  • trtllm_args (99-125)
  • Args (30-51)
  • Args (59-80)
  • Args (102-123)
benchmarks/profiler/profile_sla.py (6)
benchmarks/profiler/utils/estimate_perf.py (5)
  • estimate_perf (72-113)
  • AIConfiguratorPerfEstimator (29-210)
  • estimate_prefill_perf (115-135)
  • get_max_batch_size (137-188)
  • get_max_kv_tokens (190-210)
benchmarks/profiler/utils/profile_decode.py (2)
  • profile_decode (93-128)
  • profile_decode_aiconfigurator (130-155)
benchmarks/profiler/utils/profile_prefill.py (2)
  • profile_prefill (72-100)
  • profile_prefill_aiconfigurator (102-125)
benchmarks/profiler/utils/config.py (4)
  • get_kv_cache_size_from_dynamo_log (195-196)
  • get_kv_cache_size_from_dynamo_log (383-403)
  • get_kv_cache_size_from_dynamo_log (591-603)
  • get_kv_cache_size_from_dynamo_log (828-854)
deploy/utils/dynamo_deployment.py (2)
  • get_service_url (211-217)
  • delete_deployment (463-481)
benchmarks/profiler/utils/genai_perf.py (1)
  • benchmark_decode (189-247)
benchmarks/profiler/utils/profile_decode.py (2)
benchmarks/profiler/utils/estimate_perf.py (2)
  • estimate_perf (72-113)
  • AIConfiguratorPerfEstimator (29-210)
benchmarks/profiler/utils/genai_perf.py (1)
  • benchmark_decode (189-247)
benchmarks/profiler/utils/profile_prefill.py (2)
benchmarks/profiler/utils/estimate_perf.py (3)
  • estimate_perf (72-113)
  • AIConfiguratorPerfEstimator (29-210)
  • estimate_prefill_perf (115-135)
benchmarks/profiler/utils/genai_perf.py (1)
  • benchmark_prefill (154-186)
🪛 Ruff (0.12.2)
tests/profiler/test_profile_sla_aiconfigurator.py

20-20: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)


31-31: Probable insecure usage of temporary file or directory: "/tmp/test_profiling_results"

(S108)

benchmarks/profiler/utils/estimate_perf.py

53-55: Avoid specifying long messages outside the exception class

(TRY003)

benchmarks/profiler/profile_sla.py

98-100: Abstract raise to an inner function

(TRY301)


98-100: Avoid specifying long messages outside the exception class

(TRY003)


102-104: Abstract raise to an inner function

(TRY301)


102-104: Avoid specifying long messages outside the exception class

(TRY003)


106-108: Abstract raise to an inner function

(TRY301)


106-108: Avoid specifying long messages outside the exception class

(TRY003)

benchmarks/profiler/utils/profile_decode.py

26-26: Unused function argument: num_gpus

(ARG001)


102-102: Unused function argument: ai_configurator_perf_estimator

(ARG001)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3079/merge) by ilyasher.
benchmarks/profiler/utils/profile_decode.py

[error] 1-1: Ruff linting: 1 error found and auto-fixed by pre-commit; please re-run pre-commit to verify.

benchmarks/profiler/utils/profile_prefill.py

[error] 1-1: Ruff linting: 1 error found and auto-fixed by pre-commit; please re-run pre-commit to verify.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
benchmarks/profiler/utils/estimate_perf.py (1)

53-55: Move exception message construction into ValueError

Following TRY003, avoid constructing long error messages outside the exception class.

Apply this fix:

 if not self.database:
     raise ValueError(
-        f"Database not found for system: {system}, backend: {backend}, version: {version}"
+        f"Database not found for system: {system}, backend: {backend}, version: {version}"
     )

Wait, the code already has the message inside the ValueError. This appears to be a false positive from the static analysis tool.

tests/profiler/test_profile_sla_dryrun.py (1)

48-51: LGTM! Consistent addition of AI configurator fields

The new fields (use_ai_configurator, aic_system, aic_model_name, backend_version) are properly initialized across all three test fixtures, maintaining consistency with the new AI configurator functionality.

Also applies to: 77-80, 120-123
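
A hedged sketch of what that fixture extension could look like (attribute names come from the summary above; the default values are assumptions, not the exact ones in the test file):

class Args:
    # ... existing profiling arguments ...
    use_ai_configurator = False
    aic_system = None
    aic_model_name = None
    backend_version = None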

tests/profiler/test_profile_sla_aiconfigurator.py (1)

87-103: LGTM! Comprehensive test coverage for AI configurator

The parameterized tests provide good coverage across multiple backend versions and model names, ensuring the AI configurator integration works with various configurations.

benchmarks/profiler/utils/profile_decode.py (2)

130-155: LGTM! Clean implementation of AI configurator integration

The profile_decode_aiconfigurator function properly delegates to the helper with a clean callback implementation that extracts the required metrics from the estimator's response.


1-156: Run pre-commit to apply Ruff auto-fixes

Pipeline reported a Ruff lint error that pre-commit auto-fixed — run pre-commit locally (e.g., pre-commit run --all-files) and commit the resulting fixes for benchmarks/profiler/utils/profile_decode.py.

benchmarks/profiler/utils/profile_prefill.py (1)

1-126: Run pre-commit or remove unused import Tuple in benchmarks/profiler/utils/profile_prefill.py

Ruff flagged an unused import — the file still contains from typing import Callable, Optional, Tuple (line 5). Run pre-commit run --all-files or remove Tuple and commit.

@tedzhouhk
Contributor

shall we add doc to

@ilyasher
Contributor Author

@tedzhouhk I just updated the README

ilyasher and others added 11 commits September 18, 2025 13:58
Commits signed off by Ilya Sherstyuk <[email protected]>; two co-authored by Hongkuan Zhou <[email protected]>.
@ilyasher ilyasher force-pushed the dev-isherstyuk-use-llm-pet-in-profile-sla branch from dd3cace to 1e5dd89 on September 18, 2025 20:58
@ilyasher
Contributor Author

/ok to test 6b6c005

@tedzhouhk tedzhouhk merged commit 19948b7 into main Sep 19, 2025
13 checks passed
@tedzhouhk tedzhouhk deleted the dev-isherstyuk-use-llm-pet-in-profile-sla branch September 19, 2025 18:23