fix(torch): use sum not mean in MaskedSoftmaxCELoss.forward #2707
Open
Chessing234 wants to merge 1 commit into
Conversation
`.mean(dim=1)` divides each sample's masked loss by the full sequence length (`num_steps`), including padding steps whose weight is 0. This under-scales the loss and, combined with `train_seq2seq` dividing again by `num_tokens`, causes the reported per-token loss to be `num_steps` times smaller than the true value. `.sum(dim=1)` is correct: padding steps contribute 0 (`weights=0`), so the sum equals the total loss over valid tokens only. `train_seq2seq` then divides this sum by `num_tokens` to obtain the correct per-token loss. Fixes d2l-ai#2076.
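For context, here is a minimal sketch of the corrected class, assuming the d2l-style implementation (the `sequence_mask` helper is inlined so the snippet is self-contained; names follow the book's code):

```python
import torch
from torch import nn

def sequence_mask(X, valid_len, value=0):
    """Zero out entries beyond each sequence's valid length (d2l-style helper)."""
    maxlen = X.size(1)
    mask = torch.arange(maxlen, dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """Softmax cross-entropy loss that ignores padding tokens."""
    # pred: (batch_size, num_steps, vocab_size)
    # label: (batch_size, num_steps)
    # valid_len: (batch_size,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction = 'none'
        unweighted_loss = super().forward(pred.permute(0, 2, 1), label)
        # Padding positions have weight 0, so summing over the time axis
        # yields the total loss over valid tokens only (the fix: sum, not mean).
        weighted_loss = (unweighted_loss * weights).sum(dim=1)
        return weighted_loss
```

With `reduction='none'`, `nn.CrossEntropyLoss` returns a per-position loss of shape `(batch_size, num_steps)`, so the masked sum is each sample's total loss over its valid tokens.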
Bug
`MaskedSoftmaxCELoss.forward` reduces the masked per-token losses with `.mean(dim=1)` instead of `.sum(dim=1)`.
Root cause

`.mean(dim=1)` divides each sample's loss by `num_steps` (the full sequence length), including padding positions whose weight is 0, so each sample's loss is under-scaled by a factor of `valid_len / num_steps` relative to its true per-token mean. `train_seq2seq` then accumulates the loss and reports `metric[0] / metric[1]`, where `metric[1] = Y_valid_len.sum()`. That second division is correct only if the accumulated loss is an unscaled sum over valid tokens. With `.mean`, the final metric is divided by `num_tokens` and by `num_steps`, making the loss `num_steps` times too small and causing the model to behave as though sequences should be as long as possible (no EOS is learned).
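A toy check of the scaling argument (illustrative values only; with uniform logits the true per-token loss is exactly `log(vocab_size)`):

```python
import math
import torch
from torch import nn

batch_size, num_steps, vocab_size = 2, 10, 5
pred = torch.zeros(batch_size, num_steps, vocab_size)   # uniform logits
label = torch.zeros(batch_size, num_steps, dtype=torch.long)
valid_len = torch.tensor([2, 3])                        # only 5 valid tokens total

# Per-position losses, masked as in MaskedSoftmaxCELoss
unweighted = nn.CrossEntropyLoss(reduction='none')(pred.permute(0, 2, 1), label)
weights = torch.ones_like(label)
weights[0, 2:] = 0   # mask padding of sample 0 (valid_len = 2)
weights[1, 3:] = 0   # mask padding of sample 1 (valid_len = 3)
masked = unweighted * weights

num_tokens = valid_len.sum()
print(math.log(vocab_size))                              # 1.6094, true per-token loss
print((masked.sum(dim=1).sum() / num_tokens).item())     # 1.6094 with .sum(dim=1)
print((masked.mean(dim=1).sum() / num_tokens).item())    # 0.1609 with .mean(dim=1)
```

The `.mean` path reproduces the double division: the reported value is `num_steps` (here 10) times smaller than the true per-token loss.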
Fix

`.sum(dim=1)` accumulates only the valid-token losses (padding weights are 0, so their contribution is exactly 0). `train_seq2seq` then correctly normalises by the total number of valid tokens.

Fixes #2076.
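The change itself is a one-line diff (reconstructed here from the description above; the surrounding expression follows the d2l implementation):

```diff
-        weighted_loss = (unweighted_loss * weights).mean(dim=1)
+        weighted_loss = (unweighted_loss * weights).sum(dim=1)
```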