
Inconsistent tokenization and BLEU scores between AutoTokenizer and NllbTokenizerFast #44993

@AdrianSteene

Description

System Info

  • transformers version: 5.0.0
  • Platform: macOS-26.3.1-arm64-arm-64bit
  • Python version: 3.10.19
  • PyTorch version: 2.10.0

Information

I've been evaluating facebook/nllb-200-distilled-600M across 36 different language pairs and ran into a significant discrepancy depending on which tokenizer class is instantiated.

When using NllbTokenizerFast versus AutoTokenizer, the resulting BLEU scores are drastically different for the exact same generation parameters.

For example:

  • swe_Latn -> fra_Latn: drops from ~43.35 BLEU (NllbTokenizerFast) to ~9.02 BLEU (AutoTokenizer).
  • spa_Latn -> fra_Latn: jumps from ~33.97 BLEU (NllbTokenizerFast) to ~53.25 BLEU (AutoTokenizer).
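For readers unfamiliar with the metric, here is a minimal uniform-weight BLEU-4 sketch (illustration only; the scores above came from my evaluation harness, and real toolkits such as sacrebleu add tokenization and smoothing, so absolute values will differ). The point is that BLEU is built on n-gram overlap, so a systematic corruption of the hypotheses, such as the model translating into the wrong language, collapses the score:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Uniform-weight BLEU-4 with brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        if overlap == 0:
            return 0.0
        precisions.append(overlap / sum(h.values()))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "forskare tog fram ett nytt verktyg".split()
print(sentence_bleu(ref, ref))      # 100.0 for an exact match
print(sentence_bleu(ref[:2], ref))  # 0.0 (no matching 3-grams survive)
```

A mostly-wrong hypothesis scores near zero, which is why a tokenizer silently dropping the language-routing prefix can move a pair from ~43 to ~9 BLEU.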

To understand the massive gap in BLEU scores, I inspected the raw token outputs. I noticed that AutoTokenizer completely ignores the src_lang argument and drops the routing prefix.

However, when testing this on a second machine, both AutoTokenizer and NllbTokenizerFast produced the exact same output. After comparing the environments, I realized the only variable was the presence of the sentencepiece library:

  • With sentencepiece installed: AutoTokenizer fails to prepend the src_lang token and appends an <unk> token at the end
  • Without sentencepiece: AutoTokenizer and NllbTokenizerFast produce the same tokens
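I have not traced the v5 dispatch code, so treat this as a hypothesis: AutoTokenizer appears to resolve to a sentencepiece-backed implementation when that package is importable, and that path drops the src_lang prefix. A quick, dependency-free way to check which situation a given environment is in:

```python
import importlib.util

def sentencepiece_installed() -> bool:
    """True if the sentencepiece package is importable in this environment.

    Per the observations above, this seems to be the variable that flips
    AutoTokenizer's behavior for NLLB checkpoints.
    """
    return importlib.util.find_spec("sentencepiece") is not None

print("sentencepiece installed:", sentencepiece_installed())
```

Running this on both of my machines matched the behavior split described above.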

BLEU Score Heatmaps

Here is the side-by-side comparison of the 36 language pairs.

[Side-by-side BLEU heatmaps over the 36 language pairs: NllbTokenizerFast (left) vs AutoTokenizer (right)]

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer, NllbTokenizerFast

model_name = "facebook/nllb-200-distilled-600M"

tokenizer_auto = AutoTokenizer.from_pretrained(model_name)
tokenizer_fast = NllbTokenizerFast.from_pretrained(model_name)

sample_text = "i måndags meddelade forskare från stanford university school of medicine att man tagit fram ett nytt diagnostiskt verktyg..."

# Set the source language both as an attribute and as a call argument;
# the language prefix token should be prepended either way.
tokenizer_fast.src_lang = "swe_Latn"
tokenizer_auto.src_lang = "swe_Latn"

inputs_auto = tokenizer_auto(sample_text, src_lang="swe_Latn", return_tensors="pt")
inputs_fast = tokenizer_fast(sample_text, src_lang="swe_Latn", return_tensors="pt")

print("NllbTokenizerFast")
print("Input IDs:", inputs_fast["input_ids"][0].tolist())
print("Tokens:", tokenizer_fast.convert_ids_to_tokens(inputs_fast["input_ids"][0]))

print("\nAutoTokenizer")
print("Input IDs:", inputs_auto["input_ids"][0].tolist())
print("Tokens:", tokenizer_auto.convert_ids_to_tokens(inputs_auto["input_ids"][0]))

Expected behavior

With sentencepiece installed

NllbTokenizerFast:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']

AutoTokenizer:
Input IDs: [30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2, 3]
Tokens: ['▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>', '<unk>']
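The two ID sequences above differ only at the edges, which a small helper makes explicit (the ID lists are copied verbatim from the dumps above; 256167 is the swe_Latn prefix per the fast tokenizer's output, and 3 is the trailing <unk> noted earlier):

```python
# Token IDs copied from the dumps above (swe_Latn sample, sentencepiece installed).
FAST_IDS = [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861,
            11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520,
            3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
AUTO_IDS = [30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651,
            13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288,
            55723, 30650, 5536, 25424, 138458, 5733, 2, 3]

def diff_ids(a, b):
    """Return IDs present in one sequence but missing from the other."""
    return [t for t in a if t not in b], [t for t in b if t not in a]

missing_from_auto, extra_in_auto = diff_ids(FAST_IDS, AUTO_IDS)
print(missing_from_auto)  # [256167] -- the swe_Latn language prefix
print(extra_in_auto)      # [3] -- the trailing <unk>
```

So the body of the encoding is identical; only the routing prefix is lost and a spurious <unk> is appended, which is enough to derail generation.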

Without sentencepiece installed (this is what I would expect in both cases)

NllbTokenizerFast:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']

AutoTokenizer:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']
