[Inference] Add MCore inference examples and model wrappers #3897

Draft
cuichenx wants to merge 2 commits into main from chcui/inference-mcore4697
Contributor

@cuichenx cuichenx commented May 20, 2026

Summary

  • Supersedes [Inference] Add MCore high-level inference examples #3896, which was closed when the branch was renamed from tess/inference-mcore4697 to chcui/inference-mcore4697.
  • Add Bridge/AutoBridge synchronous offline text generation under examples/inference/text_generation.py.
  • Add direct MCore-style concurrent async generation and OpenAI-compatible server examples under examples/inference/.
  • Add launcher scripts and README for the new generic inference examples.
  • Refactor text-only model inference wrappers to use examples/inference/text_generation.py as the efficient inference entry point.
  • Keep examples/conversion/hf_to_megatron_generate_text.py as a debugging/parity-forward path rather than the primary inference path.
  • Update the Megatron-LM submodule pointer to the MCore inference API PR head.
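
For readers unfamiliar with what "OpenAI-compatible" implies for a client, here is a minimal sketch of a completion request in the public OpenAI wire format. The host, port, model name, and the `/v1/completions` route are assumptions for illustration; this PR description does not state which routes `examples/inference/openai_server.py` exposes. The helper `build_completion_request` is hypothetical and only builds the request, it does not send it.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build (but do not send) a POST request in the OpenAI completions format."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address and model name; replace with the values used at launch.
req = build_completion_request("http://localhost:8000", "placeholder-model", "Hello")

# To actually send it against a running server:
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(json.loads(resp.read()))
```

Building the request separately from sending it keeps the payload easy to inspect before a server is available.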

Dependency

Depends on unmerged MCore PR: NVIDIA/Megatron-LM#4697

The new examples import the high-level inference APIs from that PR, including MegatronLLM, MegatronAsyncLLM, and ServeConfig.

Validation

  • uv run --no-sync pre-commit run --all-files
  • Static validation checks passed:
    • bash -n for the new launcher scripts and updated model inference wrappers
    • python -m py_compile for the new Python examples
    • git diff --check
    • targeted grep checks confirming the updated text-only model wrappers call examples/inference/text_generation.py and no longer call examples/conversion/hf_to_megatron_generate_text.py
    • ruff check for the new inference examples
    • ruff format --check for the new inference examples
  • Runtime validation passed:
    • ran examples/inference/text_generation.py for synchronous AutoBridge text generation
    • ran examples/inference/async_text_generation.py for direct MCore async text generation
    • started examples/inference/openai_server.py and verified the OpenAI-compatible server reached readiness
    • ran the GPT-OSS text generation path through examples/inference/text_generation.py with the same TP/PP/EP shape used by examples/models/gpt_oss/inference.sh
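
The syntax-only part of the static checks above (`bash -n` and `python -m py_compile`) could be scripted roughly as follows. This is a sketch, not the commands actually run for this PR: the helper name `passes_syntax_check` is hypothetical, and it assumes `bash` is on `PATH`.

```python
import py_compile
import subprocess
from pathlib import Path


def passes_syntax_check(path):
    """Return True if a .sh or .py file parses cleanly; nothing is executed."""
    path = Path(path)
    if path.suffix == ".sh":
        # `bash -n` reads and parses the script without running any command.
        result = subprocess.run(["bash", "-n", str(path)], capture_output=True)
        return result.returncode == 0
    if path.suffix == ".py":
        try:
            # doraise=True turns syntax errors into PyCompileError.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            return False
        return True
    raise ValueError(f"unsupported file type: {path}")


# Example usage, with paths taken from this PR:
#   passes_syntax_check("examples/inference/text_generation.py")
#   passes_syntax_check("examples/models/gpt_oss/inference.sh")
```

Like the original commands, this only catches syntax errors; it says nothing about whether imports resolve or the script behaves correctly at runtime.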

Model Wrapper Runtime Notes

| Wrapper | Runtime result |
| --- | --- |
| examples/models/gpt_oss/inference.sh | Passed one-node runtime validation via the new generic text generation entry point. A short raw-prompt quality sample generated text, though the output was repetitive rather than a strong answer. |
| examples/models/bailing/inference.sh | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| examples/models/falcon_h1/inference.sh | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| examples/models/glm47/inference.sh | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| examples/models/sarvam/inference.sh | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| examples/models/glm/glm5/slurm_inference.sh | Not launched because the available artifact is multi-node scale. Static validation passed. |
| examples/models/glm47/slurm_inference.sh | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
| examples/models/minimax/minimax_m2/slurm_inference.sh | Not run at runtime because a suitable cached artifact was not available. Static validation passed. |
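
For the wrappers that were validated statically only, the targeted grep check described under Validation can be approximated as below. This is a sketch under assumptions: the helper names are hypothetical, and a plain substring match stands in for whatever grep patterns were actually used.

```python
from pathlib import Path

NEW_ENTRY = "examples/inference/text_generation.py"
OLD_ENTRY = "examples/conversion/hf_to_megatron_generate_text.py"


def wrapper_is_migrated(script_text):
    """True if the script references the new entry point and not the old one."""
    return NEW_ENTRY in script_text and OLD_ENTRY not in script_text


def unmigrated_wrappers(paths):
    """Return the wrapper paths that still fail the migration check."""
    return [
        str(p)
        for p in paths
        if not wrapper_is_migrated(Path(p).read_text(encoding="utf-8"))
    ]


# Example usage from the repository root:
#   unmigrated_wrappers(["examples/models/gpt_oss/inference.sh"])
```

An empty list from `unmigrated_wrappers` means every listed wrapper references the new entry point and none references the old one.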

Note: uv run pre-commit run --all-files without --no-sync was not usable in the local environment because dependency resolution requires a platform-specific nvidia-resiliency-ext==0.6.0 wheel that is unavailable there.

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented May 20, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx changed the title from "[Inference] Add MCore high-level inference examples" to "[Inference] Add MCore inference examples and model wrappers" on May 20, 2026