
fix(torch): use sum not mean in MaskedSoftmaxCELoss.forward #2707

Open

Chessing234 wants to merge 1 commit into d2l-ai:master from Chessing234:fix/masked-softmax-ce-loss-mean-to-sum

Conversation

@Chessing234

Bug

MaskedSoftmaxCELoss.forward reduces the masked per-token losses with .mean(dim=1) instead of .sum(dim=1).

# d2l/torch.py — current
weighted_loss = (unweighted_loss * weights).mean(dim=1)

Root cause

.mean(dim=1) divides each sample's loss by num_steps (the full sequence length), even though padding positions carry weight 0 and contribute nothing. Each sample's loss is therefore scaled by valid_len / num_steps, so the more padding a sample has, the more its loss is understated.

train_seq2seq then accumulates the loss and reports metric[0] / metric[1] where metric[1] = Y_valid_len.sum(). That second division is correct only if the accumulated loss is an unscaled sum over valid tokens. With .mean, the final metric is divided by num_tokens and by num_steps, making the loss num_steps times too small and causing the model to behave as though sequences should be as long as possible (no EOS is learned).
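To make the scale mismatch concrete, here is a minimal numeric sketch (hypothetical values, not code from the repo): a batch of 2, num_steps = 4, valid lengths 2 and 3, and every token assigned a per-token loss of exactly 1.0.

import torch

# Hypothetical example: batch of 2, num_steps = 4, valid lengths 2 and 3.
num_steps = 4
valid_len = torch.tensor([2, 3])
# Pretend every token has a per-token cross-entropy loss of exactly 1.0.
unweighted_loss = torch.ones(2, num_steps)
# Masks derived from valid_len: weight 0 on padding positions.
weights = torch.tensor([[1., 1., 0., 0.],
                        [1., 1., 1., 0.]])

loss_mean = (unweighted_loss * weights).mean(dim=1)  # tensor([0.5000, 0.7500])
loss_sum = (unweighted_loss * weights).sum(dim=1)    # tensor([2., 3.])

# train_seq2seq-style normalisation: accumulated loss / total valid tokens.
print(loss_mean.sum() / valid_len.sum())  # 0.25 -> num_steps (4x) too small
print(loss_sum.sum() / valid_len.sum())   # 1.00 -> the true per-token loss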

Fix

# d2l/torch.py — after
weighted_loss = (unweighted_loss * weights).sum(dim=1)

.sum(dim=1) accumulates only the valid-token losses (padding weights are 0, so their contribution is exactly 0). train_seq2seq then correctly normalises by the total number of valid tokens.
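For context, this is how the patched line sits in the class; a sketch assuming the surrounding MaskedSoftmaxCELoss structure in d2l/torch.py, where sequence_mask is the d2l helper that zeroes positions beyond valid_len:

# d2l/torch.py — MaskedSoftmaxCELoss.forward with the proposed reduction (sketch)
class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """Softmax cross-entropy loss that ignores padding tokens."""
    # pred: (batch_size, num_steps, vocab_size)
    # label: (batch_size, num_steps)
    # valid_len: (batch_size,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)  # weight 0 beyond valid_len
        self.reduction = 'none'
        unweighted_loss = super().forward(pred.permute(0, 2, 1), label)
        weighted_loss = (unweighted_loss * weights).sum(dim=1)  # was .mean(dim=1)
        return weighted_loss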

Fixes #2076.

.mean(dim=1) divides each sample's masked loss by the full sequence
length (num_steps), including padding steps whose weight is 0. This
under-scales the loss and, combined with train_seq2seq dividing again
by num_tokens, causes the reported per-token loss to be num_steps times
smaller than the true value.

.sum(dim=1) is correct: padding steps contribute 0 (weights=0), so the
sum equals the total loss over valid tokens only. train_seq2seq then
divides this sum by num_tokens to obtain the correct per-token loss.

Fixes d2l-ai#2076.

