
feat(moe): add use_expert_bias config for optional expert biases#2214

Closed
eous wants to merge 2 commits into pytorch:main from eous:moe-expert-bias

Conversation

@eous (Contributor) commented Jan 8, 2026

Add support for optional expert biases (mlp1_bias, mlp2_bias) in GptOssGroupedExperts, required for loading GPT-OSS pretrained models.

Changes to GptOssGroupedExperts:

  • Add use_expert_bias parameter (default=True for GPT-OSS)
  • Add compute_dtype parameter for configurable compute precision
  • Add caching (_cached_tp_degree, _is_dtensor) for performance
  • Add _get_tp_degree() method with cache lookup
  • Proper None checks in _run_experts_for_loop and _run_experts_grouped_mm
  • ScaleBiasForward custom autograd for TP bias scaling
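One plausible reading of the ScaleBiasForward bullet above, sketched without torch (the function names and the 1/tp_degree scaling here are assumptions for illustration, not the PR's actual implementation): under tensor parallelism each of the tp_degree ranks adds the bias to its partial output, and the row-wise all-reduce sums those partials, so pre-scaling the bias by 1/tp_degree leaves exactly one copy of the bias in the reduced result.

```python
def scale_bias_for_tp(bias: list, tp_degree: int) -> list:
    # Pre-scale the bias so that summing tp_degree copies of it during
    # the all-reduce reconstructs the original bias exactly once.
    return [b / tp_degree for b in bias]


def simulate_tp_allreduce(partials: list) -> list:
    # Element-wise sum across ranks, standing in for the TP all-reduce.
    return [sum(vals) for vals in zip(*partials)]
```

For example, with tp_degree=4 every rank contributes bias/4, and summing the four partial outputs restores the full bias.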

Changes to expert_parallel.py:

  • Handle optional bias distribution in GptossTensorParallel
  • Handle optional bias distribution in GptossExpertTensorParallel
  • Invalidate caches after parallelization

Changes to MoEArgs:

  • Add use_expert_bias: bool = False field

Config updates:

  • Set use_expert_bias=True for GPT-OSS 20B/120B models
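The flag-plus-optional-parameter pattern described above can be sketched as follows (a dependency-free sketch: the use_expert_bias field name and its default come from the PR, while the class and method names below are stand-ins for the real torch modules):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoEArgs:
    # Illustrative subset: use_expert_bias and its default come from the PR;
    # all other fields are omitted here.
    num_experts: int = 8
    use_expert_bias: bool = False  # GPT-OSS 20B/120B configs set this to True


class GroupedExpertsSketch:
    """Sketch of the optional-bias pattern: bias attributes are None when
    the flag is off, so checkpoints without biases load cleanly and every
    compute path must None-check (as the PR does in _run_experts_for_loop
    and _run_experts_grouped_mm)."""

    def __init__(self, hidden_dim: int, args: MoEArgs) -> None:
        # Stand-ins for nn.Parameter tensors of shape (num_experts, hidden_dim).
        self.mlp1_bias: Optional[list] = (
            [0.0] * hidden_dim if args.use_expert_bias else None
        )
        self.mlp2_bias: Optional[list] = (
            [0.0] * hidden_dim if args.use_expert_bias else None
        )

    def add_bias(self, expert_out: list) -> list:
        # Apply the bias only when it exists.
        if self.mlp2_bias is None:
            return expert_out
        return [x + b for x, b in zip(expert_out, self.mlp2_bias)]
```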

Copilot AI review requested due to automatic review settings January 8, 2026 23:50
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Jan 8, 2026
Copilot AI left a comment

Pull request overview

This PR adds support for optional expert biases in GPT-OSS MoE models, which is required for loading pretrained GPT-OSS models. The changes enable models to optionally include mlp1_bias and mlp2_bias parameters based on a configurable flag.

Key changes:

  • Added use_expert_bias configuration flag (default=False) to control whether expert biases are used
  • Added compute_dtype parameter for configurable compute precision in grouped matrix multiplications
  • Implemented caching mechanism (_cached_tp_degree, _is_dtensor) to avoid repeated isinstance checks and device mesh lookups
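The memoization pattern described in the last bullet can be sketched like this (an assumption-laden sketch: the attribute names mirror the PR, but FakeDTensor and the lookup logic stand in for torch.distributed's DTensor and its device mesh):

```python
from typing import Optional


class FakeDTensor:
    # Stand-in for torch.distributed's DTensor.
    def __init__(self, tp_degree: int) -> None:
        self.tp_degree = tp_degree


class CachedMeshInfo:
    """Sketch of the caching: the isinstance check and the tensor-parallel
    degree are computed once and memoized instead of being re-derived on
    every forward; invalidate() mirrors the PR's cache invalidation after
    parallelization rewraps the weights."""

    def __init__(self, weight: object) -> None:
        self.mlp1_weight = weight
        self._is_dtensor: Optional[bool] = None
        self._cached_tp_degree: Optional[int] = None

    def is_dtensor(self) -> bool:
        if self._is_dtensor is None:
            self._is_dtensor = isinstance(self.mlp1_weight, FakeDTensor)
        return self._is_dtensor

    def _get_tp_degree(self) -> int:
        if self._cached_tp_degree is None:
            # Stand-in for a device-mesh lookup such as mesh["tp"].size().
            self._cached_tp_degree = (
                self.mlp1_weight.tp_degree if self.is_dtensor() else 1
            )
        return self._cached_tp_degree

    def invalidate(self) -> None:
        # Called after parallelization replaces plain tensors with DTensors.
        self._is_dtensor = None
        self._cached_tp_degree = None
```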

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
torchtitan/models/moe/moe.py Added use_expert_bias field to MoEArgs dataclass with clear documentation
torchtitan/models/gpt_oss/model/moe.py Modified GptOssGroupedExperts to support optional biases, added compute_dtype parameter, implemented caching, and updated ScaleBiasForward with proper type annotations
torchtitan/models/gpt_oss/infra/expert_parallel.py Updated tensor parallel distribution to handle optional biases and invalidate caches after parallelization
torchtitan/models/gpt_oss/__init__.py Enabled use_expert_bias=True for 20B and 120B GPT-OSS model configurations


Comment on lines 271 to 274:

```diff
 if (
-    not isinstance(self.mlp1_weight, DTensor)
+    # pyrefly: ignore [not-iterable]
+    not self._is_dtensor
     or "ep" not in self.mlp1_weight.device_mesh.mesh_dim_names
 ):
```

Copilot AI Jan 8, 2026


Potential AttributeError when checking device_mesh on non-DTensor. The condition not self._is_dtensor or "ep" not in self.mlp1_weight.device_mesh.mesh_dim_names will attempt to access .device_mesh even when self._is_dtensor is False due to Python's evaluation order in logical OR expressions. This should be rewritten to short-circuit properly, such as: if not self._is_dtensor: followed by elif "ep" not in self.mlp1_weight.device_mesh.mesh_dim_names:

Follow-up commit: refactor the conditional to make the short-circuit logic explicit and add a comment explaining when device_mesh access is safe.
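A sketch of what that explicit form could look like (the helper name and the Fake* classes are hypothetical stand-ins for DTensor and its device mesh, not the PR's code):

```python
class FakeMesh:
    # Stand-in for a DTensor device mesh.
    def __init__(self, mesh_dim_names: tuple) -> None:
        self.mesh_dim_names = mesh_dim_names


class FakeWeight:
    def __init__(self, device_mesh: "FakeMesh | None" = None) -> None:
        self.device_mesh = device_mesh


def uses_expert_parallel(is_dtensor: bool, mlp1_weight: FakeWeight) -> bool:
    """Explicit short-circuit: device_mesh is only touched after the cached
    _is_dtensor flag confirms the weight is a DTensor, so a plain tensor
    can never reach the attribute access."""
    if not is_dtensor:
        # Plain tensor: there is no device mesh to inspect.
        return False
    # Safe here: the weight is known to carry a device mesh.
    return "ep" in mlp1_weight.device_mesh.mesh_dim_names
```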
@tianyu-l (Contributor) left a comment

existing implementation already supports expert bias, what is this PR trying to achieve?

@eous (Contributor, Author) commented Jan 20, 2026

> existing implementation already supports expert bias, what is this PR trying to achieve?

Apologies, this appears to be an experiment I was running that got swept into its own change. Closing it out.

@eous eous closed this Jan 20, 2026

Labels

CLA Signed

3 participants