[recipe][megatron] refactor: isolate megatron recipe and extend distillation losses #4797
process-cxr wants to merge 1 commit into verl-project:main
Conversation
…tion losses

- Add megatron_distill_losses with support for RKL, JSD and KL+RL losses
- Expose distillation losses as callable operators in Megatron workers
- Move all Megatron-related training code under recipe/gkd/megatron
- Prepare directory structure for future FSDP training framework
Code Review
This pull request refactors the Megatron-based GKD training pipeline by isolating Megatron-specific code and extending the distillation loss implementations. The new megatron_distill_losses.py file introduces several distillation loss functions (KL, RKL, JSD, etc.) as custom PyTorch autograd Functions.
My review focuses on the correctness and robustness of these new loss implementations. I've identified a critical issue in all custom autograd.Function implementations where an input tensor is modified in-place without being marked as dirty using ctx.mark_dirty(). This can lead to incorrect gradients and must be fixed. Additionally, I've found a high-severity issue in the configuration factory function where a broad except Exception can hide configuration errors, leading to silent failures.
The rest of the changes, which mainly involve refactoring file structures and updating call sites to use the new loss factory, look good and align with the goal of improving modularity.

    def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices):
        eps = 1e-20
The forward method modifies the vocab_parallel_logits tensor in-place (e.g., lines 61-62, 71), but it doesn't mark it as dirty. This can lead to incorrect gradient calculations. According to PyTorch documentation, you must use ctx.mark_dirty(vocab_parallel_logits) when modifying an input tensor in-place.
Suggested change:

    -def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices):
    -    eps = 1e-20
    +def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices):
    +    ctx.mark_dirty(vocab_parallel_logits)
    +    eps = 1e-20
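For readers less familiar with this contract, here is a minimal, self-contained sketch of a custom autograd.Function that mutates its input in-place and declares it via ctx.mark_dirty(); the InPlaceExp class is a toy illustration, not code from this PR:

```python
import torch


class InPlaceExp(torch.autograd.Function):
    """Toy Function that overwrites its input with exp(input) in-place."""

    @staticmethod
    def forward(ctx, x):
        x.exp_()                  # in-place mutation of an input tensor
        ctx.mark_dirty(x)         # tell autograd the input was mutated in-place
        ctx.save_for_backward(x)  # x now holds exp(original input)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y    # d/dx exp(x) = exp(x), i.e. the saved output


x = torch.randn(4, requires_grad=True)
y = InPlaceExp.apply(x.clone())  # clone(): a leaf that requires grad cannot be mutated in-place
y.sum().backward()               # x.grad now equals exp(x)
```

Mutating the input while skipping the mark_dirty call is exactly what the review flags: autograd is not told that the tensor changed, and gradients computed from it elsewhere in the graph can be silently wrong.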

    def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices):
        eps = 1e-20
The forward method modifies the vocab_parallel_logits tensor in-place (e.g., lines 153-154, 163), but it doesn't mark it as dirty. This can lead to incorrect gradient calculations. According to PyTorch documentation, you must use ctx.mark_dirty(vocab_parallel_logits) when modifying an input tensor in-place.
Suggested change:

    -def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices):
    -    eps = 1e-20
    +def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices):
    +    ctx.mark_dirty(vocab_parallel_logits)
    +    eps = 1e-20

    def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices, rkl_ratio: float = 0.1):
        eps = 1e-20
The forward method modifies the vocab_parallel_logits tensor in-place (e.g., lines 293-294, 303), but it doesn't mark it as dirty. This can lead to incorrect gradient calculations. According to PyTorch documentation, you must use ctx.mark_dirty(vocab_parallel_logits) when modifying an input tensor in-place.
Suggested change:

    -def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices, rkl_ratio: float = 0.1):
    -    eps = 1e-20
    +def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices, rkl_ratio: float = 0.1):
    +    ctx.mark_dirty(vocab_parallel_logits)
    +    eps = 1e-20

    def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices, beta: float):
        beta = min(max(float(beta), 1e-6), 1.0 - 1e-6)
The forward method modifies the vocab_parallel_logits tensor in-place (e.g., lines 449-450, 459), but it doesn't mark it as dirty. This can lead to incorrect gradient calculations. According to PyTorch documentation, you must use ctx.mark_dirty(vocab_parallel_logits) when modifying an input tensor in-place.
Suggested change:

    -def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices, beta: float):
    -    beta = min(max(float(beta), 1e-6), 1.0 - 1e-6)
    +def forward(ctx, vocab_parallel_logits, target_topk_logps, target_topk_indices, beta: float):
    +    ctx.mark_dirty(vocab_parallel_logits)
    +    beta = min(max(float(beta), 1e-6), 1.0 - 1e-6)
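For context on the clamped beta above: assuming the generalized Jensen-Shannon divergence commonly used for GKD-style distillation (an assumption about the intended objective, not something stated in this diff), the loss mixes teacher and student distributions as

$$
\mathrm{JSD}_\beta(P \,\|\, Q) = \beta\,\mathrm{KL}(P \,\|\, M) + (1-\beta)\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \beta P + (1-\beta) Q .
$$

Clamping beta to [1e-6, 1 - 1e-6] keeps M a strict mixture of both distributions, so neither KL term degenerates at the endpoints.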

                cfg = dict(loss_cfg)
            else:
                cfg = {}
        except Exception:
Using a broad except Exception: can mask important errors during configuration processing. For instance, if OmegaConf.to_container fails for a reason other than a missing import, this will be silently ignored, and the default loss configuration will be used. This can lead to misconfigured training runs that are hard to debug. It's better to catch specific exceptions like ImportError.
Suggested change:

    -except Exception:
    +except ImportError:
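To make the suggested narrowing concrete, here is a minimal sketch of a config-normalization helper that only swallows the missing-dependency case; the name normalize_loss_cfg and its exact branching are hypothetical and not taken from this PR:

```python
from typing import Any

try:
    from omegaconf import DictConfig, OmegaConf
except ImportError:  # treat omegaconf as an optional dependency in this sketch
    DictConfig = None
    OmegaConf = None


def normalize_loss_cfg(loss_cfg: Any) -> dict:
    """Convert a loss config (OmegaConf node, plain mapping, or None) into a dict."""
    if loss_cfg is None:
        return {}
    if OmegaConf is not None and isinstance(loss_cfg, DictConfig):
        # Let genuine conversion errors propagate instead of being swallowed
        # by a broad `except Exception` and silently falling back to defaults.
        return OmegaConf.to_container(loss_cfg, resolve=True)
    return dict(loss_cfg)
```

With this shape, a real failure inside OmegaConf.to_container surfaces immediately instead of producing a misconfigured run that is hard to trace back.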
The recipe has been moved to verl-project/verl-recipe as a submodule, #4795. Please submit a PR to verl-recipe.
This PR refactors the Megatron-based GKD training pipeline and extends the distillation loss implementation to improve modularity and future extensibility.
Specifically, it:

- Adds megatron_distill_losses with support for RKL, JSD, and KL+RL losses, implemented as reusable loss operators
- Moves the Megatron training code under recipe/gkd/megatron to clearly isolate it from other backends

What does this PR do?
This PR reorganizes the Megatron training recipe in GKD by isolating Megatron-specific code into a dedicated directory and extending the distillation loss module.
The refactor improves code clarity and separation of concerns, while the new loss operators make it easier to experiment with alternative distillation objectives. The new directory layout also lays the groundwork for adding an FSDP-based training backend.
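To illustrate what exposing distillation losses as callable operators can look like, here is a minimal registry sketch; DISTILL_LOSSES, register_distill_loss, and the truncated top-k reverse-KL body below are hypothetical illustrations, not the PR's Megatron tensor-parallel implementation:

```python
from typing import Callable, Dict

import torch

# Hypothetical registry mapping a config string to a distillation loss callable.
DISTILL_LOSSES: Dict[str, Callable[..., torch.Tensor]] = {}


def register_distill_loss(name: str):
    def decorator(fn: Callable[..., torch.Tensor]) -> Callable[..., torch.Tensor]:
        DISTILL_LOSSES[name] = fn
        return fn
    return decorator


@register_distill_loss("rkl")
def reverse_kl_loss(student_logits, teacher_topk_logps, teacher_topk_indices):
    """Reverse KL restricted to the teacher's top-k support (unnormalized truncation)."""
    student_logps = torch.log_softmax(student_logits, dim=-1)
    student_topk_logps = torch.gather(student_logps, -1, teacher_topk_indices)
    q = student_topk_logps.exp()  # student probabilities on the teacher's top-k tokens
    return (q * (student_topk_logps - teacher_topk_logps)).sum(-1).mean()


def get_distill_loss(name: str) -> Callable[..., torch.Tensor]:
    return DISTILL_LOSSES[name]
```

A worker can then select the objective by name from its config and switch between KL, RKL, JSD, or mixed losses without touching the training loop.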
Checklist Before Starting
[{modules}] {type}: {description}

Test
This PR mainly introduces refactoring and loss function extensions that are not fully covered by existing CI tests.
The changes have been validated by running Megatron-based GKD training and verifying correct loss computation and training stability when switching between KL, RKL, JSD, and KL+RL losses.
API and Usage Example
No public API is changed in this PR. The new distillation losses are internal to the Megatron training pipeline.
Design & Code Changes
- Megatron-related GKD training code is isolated under recipe/gkd/megatron

Checklist Before Submitting