feat(moe): add use_expert_bias config for optional expert biases #2214

eous wants to merge 2 commits into pytorch:main from
Conversation
Add support for optional expert biases (`mlp1_bias`, `mlp2_bias`) in `GptOssGroupedExperts`, required for loading GPT-OSS pretrained models.

Changes to `GptOssGroupedExperts`:
- Add `use_expert_bias` parameter (default=True for GPT-OSS)
- Add `compute_dtype` parameter for configurable compute precision
- Add caching (`_cached_tp_degree`, `_is_dtensor`) for performance
- Add `_get_tp_degree()` method with cache lookup
- Proper `None` checks in `_run_experts_for_loop` and `_run_experts_grouped_mm`
- `ScaleBiasForward` custom autograd for TP bias scaling

Changes to `expert_parallel.py`:
- Handle optional bias distribution in `GptossTensorParallel`
- Handle optional bias distribution in `GptossExpertTensorParallel`
- Invalidate caches after parallelization

Changes to `MoEArgs`:
- Add `use_expert_bias: bool = False` field

Config updates:
- Set `use_expert_bias=True` for GPT-OSS 20B/120B models
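As a rough sketch of the optional-bias wiring described above (names and shapes are illustrative, not the actual `GptOssGroupedExperts` implementation):

```python
import torch
import torch.nn as nn


class GroupedExpertsSketch(nn.Module):
    """Illustrative sketch only: per-expert biases gated by a config flag."""

    def __init__(
        self,
        num_experts: int,
        dim: int,
        hidden_dim: int,
        use_expert_bias: bool = False,
        compute_dtype: torch.dtype = torch.bfloat16,
    ):
        super().__init__()
        self.compute_dtype = compute_dtype
        self.mlp1_weight = nn.Parameter(torch.empty(num_experts, dim, hidden_dim))
        self.mlp2_weight = nn.Parameter(torch.empty(num_experts, hidden_dim, dim))
        if use_expert_bias:
            self.mlp1_bias = nn.Parameter(torch.zeros(num_experts, hidden_dim))
            self.mlp2_bias = nn.Parameter(torch.zeros(num_experts, dim))
        else:
            # register None so checkpoint-loading and forward code
            # can branch on the presence of the bias parameters
            self.register_parameter("mlp1_bias", None)
            self.register_parameter("mlp2_bias", None)
```

Registering the absent biases as `None` (rather than omitting the attributes) keeps the attribute layout stable, so the `None` checks mentioned in the forward paths stay simple.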
Pull request overview
This PR adds support for optional expert biases in GPT-OSS MoE models, which is required for loading pretrained GPT-OSS models. The changes enable models to optionally include mlp1_bias and mlp2_bias parameters based on a configurable flag.
Key changes:
- Added `use_expert_bias` configuration flag (default=False) to control whether expert biases are used
- Added `compute_dtype` parameter for configurable compute precision in grouped matrix multiplications
- Implemented caching mechanism (`_cached_tp_degree`, `_is_dtensor`) to avoid repeated isinstance checks and device-mesh lookups
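The caching pattern listed above can be sketched roughly as follows (class and attribute names are illustrative; the real code caches the result of a DTensor device-mesh inspection):

```python
class TpDegreeCacheSketch:
    """Illustrative only: cache a tensor-parallel degree so the
    expensive lookup runs once per parallelization, not every forward."""

    def __init__(self, mesh_lookup):
        # mesh_lookup is a hypothetical callable standing in for
        # inspecting `mlp1_weight.device_mesh` on a DTensor
        self._mesh_lookup = mesh_lookup
        self._cached_tp_degree = None
        self.lookups = 0  # instrumentation for this example only

    def invalidate_caches(self) -> None:
        # called after (re)parallelization changes the sharding
        self._cached_tp_degree = None

    def get_tp_degree(self) -> int:
        if self._cached_tp_degree is None:
            self.lookups += 1
            self._cached_tp_degree = self._mesh_lookup()
        return self._cached_tp_degree
```

The key invariant is that anything which changes the sharding (here, the parallelization pass) must call `invalidate_caches()`, which matches the PR's note about invalidating caches after parallelization.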
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| torchtitan/models/moe/moe.py | Added use_expert_bias field to MoEArgs dataclass with clear documentation |
| torchtitan/models/gpt_oss/model/moe.py | Modified GptOssGroupedExperts to support optional biases, added compute_dtype parameter, implemented caching, and updated ScaleBiasForward with proper type annotations |
| torchtitan/models/gpt_oss/infra/expert_parallel.py | Updated tensor parallel distribution to handle optional biases and invalidate caches after parallelization |
| torchtitan/models/gpt_oss/__init__.py | Enabled use_expert_bias=True for 20B and 120B GPT-OSS model configurations |
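The optional-bias handling in `expert_parallel.py` described above can be sketched as follows (`distribute_expert_params` and `shard_fn` are hypothetical names for illustration; the real code uses DTensor distribution utilities):

```python
def distribute_expert_params(module, shard_fn):
    """Sketch: shard expert weights, and biases only when present.

    `shard_fn` is a hypothetical stand-in for the DTensor distribution
    call; `module` mirrors the attribute layout described in the PR.
    """
    module.mlp1_weight = shard_fn(module.mlp1_weight)
    module.mlp2_weight = shard_fn(module.mlp2_weight)
    # biases are None when use_expert_bias=False, so guard each one
    if getattr(module, "mlp1_bias", None) is not None:
        module.mlp1_bias = shard_fn(module.mlp1_bias)
    if getattr(module, "mlp2_bias", None) is not None:
        module.mlp2_bias = shard_fn(module.mlp2_bias)
    # sharding changed, so any cached mesh-derived values are now stale
    if hasattr(module, "invalidate_caches"):
        module.invalidate_caches()
```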
```diff
 if (
-    not isinstance(self.mlp1_weight, DTensor)
+    # pyrefly: ignore [not-iterable]
+    not self._is_dtensor
     or "ep" not in self.mlp1_weight.device_mesh.mesh_dim_names
 ):
```
Potential AttributeError when checking `device_mesh` on a non-DTensor. The condition `not self._is_dtensor or "ep" not in self.mlp1_weight.device_mesh.mesh_dim_names` will attempt to access `.device_mesh` even when `self._is_dtensor` is False due to Python's evaluation order in logical OR expressions. This should be rewritten to short-circuit properly, such as `if not self._is_dtensor:` followed by `elif "ep" not in self.mlp1_weight.device_mesh.mesh_dim_names:`.
Refactor the conditional to make the short-circuit logic explicit, and add a comment explaining when `device_mesh` access is safe.
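The suggested explicit branching might look like the following sketch (the `experts_use_ep` helper name and the attribute layout are assumptions for illustration; note that Python's `or` already short-circuits when the left operand is truthy, so the explicit form mainly serves to document when the `device_mesh` access is safe):

```python
def experts_use_ep(module) -> bool:
    """Sketch of the reviewer's suggested explicit branching.

    `module` is assumed to carry `_is_dtensor` and, when that flag is
    True, a DTensor `mlp1_weight` with a populated `device_mesh`.
    """
    if not module._is_dtensor:
        # plain tensor: there is no device mesh to inspect
        return False
    # safe here: _is_dtensor guarantees mlp1_weight is a DTensor
    return "ep" in module.mlp1_weight.device_mesh.mesh_dim_names
```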
tianyu-l
left a comment
The existing implementation already supports expert bias; what is this PR trying to achieve?
Apologies, this appears to be an experiment I was running that got swept into its own change. Closing it out.