
fix: serialize disagg first_gen_log_probs int keys for Rust transport#7145

Merged
nv-yna merged 4 commits into ai-dynamo:main from nv-yna:fix/disagg-logprobs-int-key-roundtrip
Mar 11, 2026

Conversation

@nv-yna

@nv-yna nv-yna commented Mar 10, 2026

Summary

  • Fix disaggregated logprobs CI failure (test_deployment[disaggregated_logprobs-2])
  • TRT-LLM PR #11727 adds first_gen_log_probs to DisaggregatedParams with integer token-ID dict keys (e.g. {4710: Logprob(...)})
  • After dataclasses.asdict(), these int keys break Rust pythonize 0.23 depythonize (dict_key_not_string error) → cascades into Disconnected: Stream ended before generation completed
  • Add serialize_first_gen_log_probs / deserialize_first_gen_log_probs to DisaggregatedParamsCodec, using TRT-LLM's own list-of-dicts format ([{"token_id": id, "logprob": float, "rank": int}])
  • Add unit tests for the serialization round-trip
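
The int-key problem described above can be illustrated without TRT-LLM or the Rust transport. The following is a minimal sketch, assuming a stand-in `Logprob` dataclass and a hypothetical `Params` container (neither is the real TRT-LLM type); `find_non_string_keys` is a hypothetical helper that only mimics the string-key requirement, not pythonize itself.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Logprob:  # stand-in for illustration, not the real TRT-LLM Logprob
    logprob: float
    rank: Optional[int] = None


@dataclass
class Params:  # hypothetical stand-in for DisaggregatedParams
    first_gen_log_probs: list = field(default_factory=list)


def find_non_string_keys(obj: Any, path: str = "$") -> list:
    """Return the paths of all dict keys that are not strings (hypothetical checker)."""
    bad = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if not isinstance(key, str):
                bad.append(f"{path}[{key!r}]")
            bad.extend(find_non_string_keys(value, f"{path}[{key!r}]"))
    elif isinstance(obj, (list, tuple)):
        for i, item in enumerate(obj):
            bad.extend(find_non_string_keys(item, f"{path}[{i}]"))
    return bad


params = Params(first_gen_log_probs=[{4710: Logprob(logprob=-0.25, rank=1)}])

# asdict() turns the nested Logprob into a plain dict but keeps the int key 4710.
as_dict = asdict(params)
offending = find_non_string_keys(as_dict)
print(offending)
```

The point of the sketch is that `dataclasses.asdict()` recurses into values but leaves dict keys untouched, so a transport that requires string map keys still sees the integer token ID.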

Test plan

  • Reproduced original failure on H100x2 cluster
  • test_deployment[disaggregated_logprobs-2] passes with fix (all logprobs payloads validated — 300 tokens with logprobs each)
  • test_deployment[disaggregated-2] passes (no regression on non-logprobs disagg)
  • Unit tests for serialization round-trip (no GPU required)
  • CI multi-gpu tests pass

Fixes: DYN-2265 / NVBugs 5926823

Summary by CodeRabbit

  • Bug Fixes

    • Improved token log probability handling in distributed inference scenarios for better data format compatibility.
  • Tests

    • Added comprehensive test coverage for log probability serialization and deserialization.

@github-actions

👋 Hi nv-yna! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the fix, external-contribution (Pull request is from an external contributor), and backend::trtllm (Relates to the trtllm backend) labels Mar 10, 2026
@nv-yna nv-yna force-pushed the fix/disagg-logprobs-int-key-roundtrip branch from a5663fd to da8a7bd Compare March 10, 2026 21:37
@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 10, 2026
@nv-yna nv-yna force-pushed the fix/disagg-logprobs-int-key-roundtrip branch from da8a7bd to 2ea183a Compare March 10, 2026 21:56
@nv-yna nv-yna marked this pull request as ready for review March 10, 2026 21:57
@nv-yna nv-yna requested review from a team as code owners March 10, 2026 21:57
@nv-yna nv-yna requested review from indrajit96 and tanmayv25 March 10, 2026 21:57
@nv-yna nv-yna enabled auto-merge (squash) March 10, 2026 22:00
@coderabbitai

coderabbitai bot commented Mar 10, 2026

Walkthrough

This pull request adds serialization and deserialization logic for first_gen_log_probs in the disaggregated request handling pipeline. New codec methods transform log probabilities between TRT-LLM's internal dictionary format and a transport-compatible list format, with handler integration and comprehensive test coverage.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Codec Implementation: components/src/dynamo/trtllm/utils/disagg_utils.py | Added Logprob import and two static methods: serialize_first_gen_log_probs() converts internal dict-of-dicts to list-of-lists format; deserialize_first_gen_log_probs() converts back to dict mapping token_id to Logprob instances. Both handle missing keys gracefully. |
| Handler Integration: components/src/dynamo/trtllm/request_handlers/handler_base.py | Integrated codec methods into request handling pipeline: deserialization in _decode_disaggregated_params_from_prefill() and serialization in _encode_and_pack_disaggregated_params(). |
| Test Coverage: tests/serve/test_disagg_logprobs_serialization.py | New test module with six test cases validating round-trip serialization/deserialization: None passthrough, missing field handling, single/multi-token roundtrips, rank preservation, and empty list scenarios. Includes JSON safety verification. |
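
The round-trip summarized in the table can be sketched as follows. This is not the code from disagg_utils.py: the `Logprob` class is a stand-in, the two functions are free functions rather than static methods on DisaggregatedParamsCodec, and the input shape (one `{token_id: Logprob}` dict per position) is an assumption based on the PR description. Only the per-entry format `{"token_id", "logprob", "rank"}` comes from the PR itself.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Logprob:  # stand-in for illustration, not the real TRT-LLM Logprob
    logprob: float
    rank: Optional[int] = None


def serialize_first_gen_log_probs(log_probs):
    """Convert [{token_id: Logprob}] into a JSON-safe list of entry lists."""
    if log_probs is None:
        return None
    return [
        [
            {"token_id": token_id, "logprob": lp.logprob, "rank": lp.rank}
            for token_id, lp in position.items()
        ]
        for position in log_probs
    ]


def deserialize_first_gen_log_probs(payload):
    """Rebuild [{token_id: Logprob}] from the transport format."""
    if payload is None:
        return None
    return [
        {
            entry["token_id"]: Logprob(entry["logprob"], entry.get("rank"))
            for entry in position
        }
        for position in payload
    ]


original = [{4710: Logprob(-0.25, 1), 198: Logprob(-2.5, 2)}]

# Pass through JSON to confirm the transport form has no non-string keys.
wire = json.loads(json.dumps(serialize_first_gen_log_probs(original)))
recovered = deserialize_first_gen_log_probs(wire)
print(recovered == original)
```

Moving the token ID from the key position into a `"token_id"` value is what makes the payload safe: JSON object keys are always strings, but values can stay integers.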

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A hop through the logs, from dict to a list,
The codec transforms what transport won't miss—
Token by token, with rank shining bright,
The rabbits rejoice at serialization done right! 🐰✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main fix: serializing disaggregated first_gen_log_probs integer keys for Rust transport, which directly addresses the CI failure detailed in the description. |
| Description check | ✅ Passed | The description provides a clear summary, details the root cause and solution, includes specific test validation, and identifies related issues. However, it does not follow the repository's template structure with explicit 'Overview', 'Details', 'Where should the reviewer start', and 'Related Issues' sections. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/serve/test_disagg_logprobs_serialization.py (1)

104-107: Add strict=True to zip() for safer iteration.

The static analysis flagged this zip() call. Since original and recovered should have equal lengths after a successful round-trip, adding strict=True would catch any unexpected length mismatch.

🔧 Proposed fix
-        for orig_pos, rec_pos in zip(original, recovered):
+        for orig_pos, rec_pos in zip(original, recovered, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/serve/test_disagg_logprobs_serialization.py` around lines 104 - 107,
The zip over original and recovered should be strict to catch length mismatches;
update the loop that iterates with "for orig_pos, rec_pos in zip(original,
recovered):" to use "zip(original, recovered, strict=True)" so any unexpected
unequal lengths after the round-trip fail fast; keep the loop body (asserting
rec_pos[tid].logprob and rank) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/serve/test_disagg_logprobs_serialization.py`:
- Around line 36-38: The test class TestDisaggLogprobsSerializationRoundtrip
currently has only two markers; add the required GPU and type markers by
annotating the class with `@pytest.mark.gpu_0` and `@pytest.mark.unit` (in addition
to the existing `@pytest.mark.pre_merge` and `@pytest.mark.trtllm`) so the class has
a scheduling marker, a GPU marker, and a type marker per guidelines.
- Around line 30-33: The test imports optional dependencies Logprob and
DisaggregatedParamsCodec directly; wrap those imports in a try/except
ImportError at module import time and call pytest.skip("missing optional
dependency: tensorrt_llm", allow_module_level=True) inside the except so the
test module is not collected when tensorrt_llm (and its Logprob) or disagg utils
are absent; update the import block referencing Logprob and
DisaggregatedParamsCodec accordingly to perform the guarded import and early
skip.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8e1547d0-90eb-41c6-99b3-2453c325b748

📥 Commits

Reviewing files that changed from the base of the PR and between 5e51d6d and 2ea183a.

📒 Files selected for processing (3)
  • components/src/dynamo/trtllm/request_handlers/handler_base.py
  • components/src/dynamo/trtllm/utils/disagg_utils.py
  • tests/serve/test_disagg_logprobs_serialization.py


@indrajit96 indrajit96 left a comment


LGTM!
Some minor comments on the tests.

@nv-yna nv-yna force-pushed the fix/disagg-logprobs-int-key-roundtrip branch from 6d3fae5 to 69ffb5c Compare March 10, 2026 22:53
TRT-LLM PR #11727 adds first_gen_log_probs to DisaggregatedParams to
carry the first generated token's logprobs from prefill to decode in
disaggregated serving. This field uses dicts with integer token-ID keys
(e.g. {4710: Logprob(...)}).

After dataclasses.asdict(), these int keys break pythonize 0.23's
depythonize which requires string map keys for serde_json::Value
(dict_key_not_string error). This cascades into a "Disconnected: Stream
ended before generation completed" error because the error response
can't be published before the stream context stops.

Add serialize/deserialize methods to DisaggregatedParamsCodec that
convert between TRT-LLM's internal {int: Logprob} format and a
JSON-safe list-of-dicts transport format, matching TRT-LLM's own
_serialize_first_gen_log_probs in openai_protocol.py.

Add unit tests for the serialization round-trip.

Fixes: DYN-2265
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
@nv-yna nv-yna force-pushed the fix/disagg-logprobs-int-key-roundtrip branch from aa905a3 to afb12a4 Compare March 11, 2026 02:00
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Signed-off-by: Yuewei Na <248773860+nv-yna@users.noreply.github.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Signed-off-by: Yuewei Na <248773860+nv-yna@users.noreply.github.com>
@nv-yna nv-yna merged commit e930526 into ai-dynamo:main Mar 11, 2026
80 of 81 checks passed
nv-yna added a commit to nv-yna/dynamo that referenced this pull request Mar 11, 2026
…ai-dynamo#7145)

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>

Labels

backend::trtllm (Relates to the trtllm backend), external-contribution (Pull request is from an external contributor), fix, size/L
