[Inference] Add MCore inference examples and model wrappers by cuichenx · Pull Request #3897 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-05-20T00:15:02Z

Summary

Supersedes [Inference] Add MCore high-level inference examples #3896, which was closed when the branch was renamed from tess/inference-mcore4697 to chcui/inference-mcore4697.
Add Bridge/AutoBridge synchronous offline text generation under examples/inference/text_generation.py.
Add direct MCore-style concurrent async generation and OpenAI-compatible server examples under examples/inference/.
Add launcher scripts and README for the new generic inference examples.
Refactor text-only model inference wrappers to use examples/inference/text_generation.py as the efficient inference entry point.
Keep examples/conversion/hf_to_megatron_generate_text.py as a debugging/parity-forward path rather than the primary inference path.
Update the Megatron-LM submodule pointer to the MCore inference API PR head.

Dependency

Depends on unmerged MCore PR: NVIDIA/Megatron-LM#4697

The new examples import the high-level inference APIs from that PR, including MegatronLLM, MegatronAsyncLLM, and ServeConfig.
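Because these names come from an unmerged PR, it can help to confirm that the checked-out Megatron-LM submodule actually exposes them before running the examples. The following is a minimal sketch, not part of this PR: `missing_names` is a hypothetical helper, and the module path that exports `MegatronLLM`, `MegatronAsyncLLM`, and `ServeConfig` is not stated here, so the caller has to supply it.

```python
import importlib

# Names this PR says the examples import from the MCore inference API PR.
REQUIRED_NAMES = ["MegatronLLM", "MegatronAsyncLLM", "ServeConfig"]


def missing_names(module_name, names=REQUIRED_NAMES):
    """Return the entries of `names` that `module_name` does not expose.

    If the module itself cannot be imported, every name is reported missing.
    `module_name` is supplied by the caller; this PR does not state the path.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return list(names)
    return [name for name in names if not hasattr(module, name)]
```

An empty list means every required name is importable from the given module; a non-empty list suggests the submodule pointer does not yet include the dependency.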

Validation

uv run --no-sync pre-commit run --all-files
Static validation checks passed:
- bash -n for the new launcher scripts and updated model inference wrappers
- python -m py_compile for the new Python examples
- git diff --check
- targeted grep checks confirming the updated text-only model wrappers call examples/inference/text_generation.py and no longer call examples/conversion/hf_to_megatron_generate_text.py
- ruff check for the new inference examples
- ruff format --check for the new inference examples
Runtime validation passed:
- ran examples/inference/text_generation.py for synchronous AutoBridge text generation
- ran examples/inference/async_text_generation.py for direct MCore async text generation
- started examples/inference/openai_server.py and verified the OpenAI-compatible server reached readiness
- ran the GPT-OSS text generation path through examples/inference/text_generation.py with the same TP/PP/EP shape used by examples/models/gpt_oss/inference.sh
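The server readiness check above can be approximated with a small polling loop. This is a sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, and the PR does not say which host, port, or route was used to verify readiness, so the URL is left to the caller.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=60.0, interval_s=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    Returns True once a 200 response is seen, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx status: not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

A call such as `wait_until_ready("http://localhost:8000/v1/models")` would be a typical use against an OpenAI-compatible server, but that address and route are assumptions rather than values taken from `examples/inference/openai_server.py`.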

Model Wrapper Runtime Notes

| Wrapper | Runtime result |
| --- | --- |
| `examples/models/gpt_oss/inference.sh` | Passed one-node runtime validation via the new generic text generation entry point. A short raw-prompt quality sample generated text, though the output was repetitive rather than a strong answer. |
| `examples/models/bailing/inference.sh` | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| `examples/models/falcon_h1/inference.sh` | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| `examples/models/glm47/inference.sh` | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| `examples/models/sarvam/inference.sh` | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| `examples/models/glm/glm5/slurm_inference.sh` | Not launched because the available artifact is multi-node scale. Static validation passed. |
| `examples/models/glm47/slurm_inference.sh` | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| `examples/models/minimax/minimax_m2/slurm_inference.sh` | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
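For the wrappers that were only validated statically, the targeted grep check described under Validation can be sketched in Python as follows. The helper names `wrapper_problems` and `check_wrappers` are hypothetical and this is not the exact check that was run; it is a plain substring test, so a commented-out reference to the old script would also be flagged.

```python
from pathlib import Path

# Entry points named in this PR's summary.
NEW_ENTRY = "examples/inference/text_generation.py"
OLD_ENTRY = "examples/conversion/hf_to_megatron_generate_text.py"


def wrapper_problems(script_text):
    """Return a list of problems found in one wrapper script's text."""
    problems = []
    if NEW_ENTRY not in script_text:
        problems.append(f"does not call {NEW_ENTRY}")
    if OLD_ENTRY in script_text:
        problems.append(f"still calls {OLD_ENTRY}")
    return problems


def check_wrappers(paths):
    """Map each failing wrapper path to its problems; {} means all passed."""
    report = {}
    for path in paths:
        problems = wrapper_problems(Path(path).read_text(encoding="utf-8"))
        if problems:
            report[str(path)] = problems
    return report
```

Running `check_wrappers` over the wrapper paths in the table from the repository root would return an empty dictionary if every wrapper has been rerouted.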

Note: uv run pre-commit run --all-files without --no-sync was not usable in the local environment because dependency resolution requires a platform-specific nvidia-resiliency-ext==0.6.0 wheel that is unavailable there.

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-05-20T00:15:05Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Signed-off-by: Chen Cui <chcui@nvidia.com>

[Inference] Add MCore text generation examples

f695435

Signed-off-by: Chen Cui <chcui@nvidia.com>

[Inference] Route model examples through MCore text generation

28468f9

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx changed the title ~~[Inference] Add MCore high-level inference examples~~ [Inference] Add MCore inference examples and model wrappers May 20, 2026

cuichenx wants to merge 2 commits into main from chcui/inference-mcore4697