fix(torch): use sum not mean in MaskedSoftmaxCELoss.forward #2707
Open
Chessing234 wants to merge 1 commit into
Conversation
`.mean(dim=1)` divides each sample's masked loss by the full sequence length (`num_steps`), including padding steps whose weight is 0. This under-scales the loss and, combined with `train_seq2seq` dividing again by `num_tokens`, causes the reported per-token loss to be `num_steps` times smaller than the true value. `.sum(dim=1)` is correct: padding steps contribute 0 (`weights=0`), so the sum equals the total loss over valid tokens only. `train_seq2seq` then divides this sum by `num_tokens` to obtain the correct per-token loss. Fixes d2l-ai#2076.
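For context, here is a minimal sketch of the corrected class, assuming the d2l-style implementation (the `sequence_mask` helper is inlined so the snippet is self-contained; names follow the book's code):

```python
import torch
from torch import nn

def sequence_mask(X, valid_len, value=0):
    """Zero out entries beyond each sequence's valid length (d2l-style helper)."""
    maxlen = X.size(1)
    mask = torch.arange(maxlen, dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """Softmax cross-entropy loss that ignores padding tokens."""
    # pred: (batch_size, num_steps, vocab_size)
    # label: (batch_size, num_steps)
    # valid_len: (batch_size,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction = 'none'
        unweighted_loss = super().forward(pred.permute(0, 2, 1), label)
        # Padding positions have weight 0, so summing over the time axis
        # yields the total loss over valid tokens only (the fix: sum, not mean).
        weighted_loss = (unweighted_loss * weights).sum(dim=1)
        return weighted_loss
```

With `reduction='none'`, `nn.CrossEntropyLoss` returns a per-position loss of shape `(batch_size, num_steps)`, so the masked sum is each sample's total loss over its valid tokens.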
Bug
`MaskedSoftmaxCELoss.forward` reduces the masked per-token losses with `.mean(dim=1)` instead of `.sum(dim=1)`.
Root cause

`.mean(dim=1)` divides each sample's loss by `num_steps` (the full sequence length), including padding positions whose weight is 0, so each sample's loss is under-scaled by a factor of `valid_len / num_steps` relative to its true per-token mean. `train_seq2seq` then accumulates the loss and reports `metric[0] / metric[1]`, where `metric[1] = Y_valid_len.sum()`. That second division is correct only if the accumulated loss is an unscaled sum over valid tokens. With `.mean`, the final metric is divided by `num_tokens` and by `num_steps`, making the loss `num_steps` times too small and causing the model to behave as though sequences should be as long as possible (no EOS is learned).
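A toy check of the scaling argument (illustrative values only; with uniform logits the true per-token loss is exactly `log(vocab_size)`):

```python
import math
import torch
from torch import nn

batch_size, num_steps, vocab_size = 2, 10, 5
pred = torch.zeros(batch_size, num_steps, vocab_size)   # uniform logits
label = torch.zeros(batch_size, num_steps, dtype=torch.long)
valid_len = torch.tensor([2, 3])                        # only 5 valid tokens total

# Per-position losses, masked as in MaskedSoftmaxCELoss
unweighted = nn.CrossEntropyLoss(reduction='none')(pred.permute(0, 2, 1), label)
weights = torch.ones_like(label)
weights[0, 2:] = 0   # mask padding of sample 0 (valid_len = 2)
weights[1, 3:] = 0   # mask padding of sample 1 (valid_len = 3)
masked = unweighted * weights

num_tokens = valid_len.sum()
print(math.log(vocab_size))                              # 1.6094, true per-token loss
print((masked.sum(dim=1).sum() / num_tokens).item())     # 1.6094 with .sum(dim=1)
print((masked.mean(dim=1).sum() / num_tokens).item())    # 0.1609 with .mean(dim=1)
```

The `.mean` path reproduces the double division: the reported value is `num_steps` (here 10) times smaller than the true per-token loss.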
Fix

`.sum(dim=1)` accumulates only the valid-token losses (padding weights are 0, so their contribution is exactly 0). `train_seq2seq` then correctly normalises by the total number of valid tokens.

Fixes #2076.
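The change itself is a one-line diff (reconstructed here from the description above; the surrounding expression follows the d2l implementation):

```diff
-        weighted_loss = (unweighted_loss * weights).mean(dim=1)
+        weighted_loss = (unweighted_loss * weights).sum(dim=1)
```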