Initialize bias to zero by rthekini-aws · Pull Request #2450 · pytorch/torchtitan

rthekini-aws · 2026-02-27T02:31:52Z

Router.init_weights initializes self.gate.weight via trunc_normal_ but never initializes self.gate.bias. Under torch.use_deterministic_algorithms(True), torch.utils.deterministic.fill_uninitialized_memory (True by default) causes the uninitialized bias to be filled with NaN, which poisons all router scores and produces a NaN loss from step 1.

This PR also defensively zero-initializes the FeedForward biases. That path is not currently exercised, since bias=False by default.

Root cause

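nn.Linear allocates its bias with torch.empty. Outside deterministic mode the uninitialized values are arbitrary but usually finite, so training proceeds and the missing initialization goes unnoticed. When deterministic mode is on and torch.utils.deterministic.fill_uninitialized_memory is True (its default), uninitialized floating-point memory is filled with NaN, which surfaces exactly this kind of bug.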

Fix

Zero-initialize biases in Router.init_weights and FeedForward.init_weights.
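As a rough illustration of the pattern, here is a sketch assuming a simplified router. ToyRouter, its constructor arguments, and init_std are hypothetical and are not torchtitan's actual Router. The module is built on the meta device and materialized with to_empty so that its parameters start out uninitialized, then init_weights sets both the weight and the bias.

```python
import torch
import torch.nn as nn


class ToyRouter(nn.Module):
    """Hypothetical stand-in for a router with a biased gate."""

    def __init__(self, dim: int, num_experts: int, bias: bool = True):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=bias)

    def init_weights(self, init_std: float = 0.02) -> None:
        nn.init.trunc_normal_(self.gate.weight, mean=0.0, std=init_std)
        # The guard keeps this safe when the gate is built with bias=False.
        if self.gate.bias is not None:
            nn.init.zeros_(self.gate.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.gate(x), dim=-1)


torch.use_deterministic_algorithms(True)

# Build without allocating storage, then materialize uninitialized storage.
with torch.device("meta"):
    router = ToyRouter(dim=16, num_experts=4)
router = router.to_empty(device="cpu")

# Under deterministic mode the materialized bias is expected to be NaN-filled.
bias_was_nan = bool(torch.isnan(router.gate.bias).all())

router.init_weights()
scores = router(torch.randn(2, 16))

print("bias NaN before init_weights:", bias_was_nan)
print("scores finite after init_weights:", bool(torch.isfinite(scores).all()))
```

Zero is a neutral choice for the gate bias: at initialization the router scores depend only on the weights, so no expert starts with a built-in preference.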

Testing

Verified with the gpt_oss debug model (NGPU=1 MODULE=gpt_oss CONFIG=gpt_oss_debugmodel) and --debug.deterministic across seeds 0, 42, 123, and 999. All four runs produce a decreasing loss, where previously every step was NaN.

Seed	Loss at step 1	Loss at step 5
0	8.133	4.454
42	8.087	4.285
123	7.982	4.408
999	8.138	4.330

tianyu-l

Thanks a lot!

wwwjn

The CI failures are unrelated to this change.

rthekini-aws · 2026-02-27T22:40:20Z

@tianyu-l, @wwwjn Are there any steps I need to take to merge this?

Initialize bias to zero

c268250

rthekini-aws requested review from fegin, tianyu-l, wconstab and wwwjn as code owners February 27, 2026 02:31

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 27, 2026

rthekini-aws mentioned this pull request Feb 27, 2026

Remove device to host synchronizations from repeat_interleave and tail_slack #2440

Open

tianyu-l approved these changes Feb 27, 2026

wwwjn approved these changes Feb 27, 2026

tianyu-l merged commit d6a9434 into pytorch:main Feb 28, 2026
9 of 11 checks passed

tianyu-l merged 1 commit into pytorch:main from rthekini-aws:initialize-bias