System Info
- transformers version: 5.0.0
- Platform: macOS-26.3.1-arm64-arm-64bit
- Python version: 3.10.19
- PyTorch version: 2.10.0
Information
I've been evaluating facebook/nllb-200-distilled-600M across 36 different language pairs and ran into a significant discrepancy depending on which tokenizer class is instantiated.
When using NllbTokenizerFast versus AutoTokenizer, the resulting BLEU scores are drastically different for the exact same generation parameters.
For example:
swe_Latn -> fra_Latn: Drops from ~43.35 BLEU (Fast) to ~9.02 BLEU (Auto).
spa_Latn -> fra_Latn: Jumps from ~33.97 BLEU (Fast) to ~53.25 BLEU (Auto).
To understand the massive gap in BLEU scores, I inspected the raw token outputs. I noticed that AutoTokenizer completely ignores the src_lang argument and drops the routing prefix.
However, when testing this on a second machine, both AutoTokenizer and NllbTokenizerFast produced the exact same output. After comparing the environments, I realized the only variable was the presence of the sentencepiece library:
- With sentencepiece installed: AutoTokenizer fails to prepend the src_lang token and appends an <unk> token at the end.
- Without sentencepiece: AutoTokenizer and NllbTokenizerFast produce the same tokens.
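Since the divergence hinges on whether sentencepiece is importable, a quick environment check helps when comparing machines. A minimal sketch using only the standard library (no transformers needed); `sentencepiece_available` is a hypothetical helper name, not an existing API:

```python
import importlib.util

def sentencepiece_available() -> bool:
    """Report whether the sentencepiece package is importable,
    without actually importing it."""
    return importlib.util.find_spec("sentencepiece") is not None

print("sentencepiece installed:", sentencepiece_available())
```

Running this on both machines before loading any tokenizer confirms which code path AutoTokenizer will take in each environment.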
BLEU Score Heatmaps
Here is the side-by-side comparison of the 36 language pairs.
[Heatmap images: NllbTokenizerFast (left) | AutoTokenizer (right)]
Who can help?
@ArthurZucker @itazap
Reproduction
from transformers import AutoTokenizer, NllbTokenizerFast

model_name = "facebook/nllb-200-distilled-600M"
tokenizer_auto = AutoTokenizer.from_pretrained(model_name)
tokenizer_fast = NllbTokenizerFast.from_pretrained(model_name)

sample_text = "i måndags meddelade forskare från stanford university school of medicine att man tagit fram ett nytt diagnostiskt verktyg..."

# Set the source language both ways: as an attribute and as a call argument.
tokenizer_fast.src_lang = "swe_Latn"
tokenizer_auto.src_lang = "swe_Latn"
inputs_auto = tokenizer_auto(sample_text, src_lang="swe_Latn", return_tensors="pt")
inputs_fast = tokenizer_fast(sample_text, src_lang="swe_Latn", return_tensors="pt")

print("NllbTokenizerFast")
print("Input IDs:", inputs_fast["input_ids"][0].tolist())
print("Tokens:", tokenizer_fast.convert_ids_to_tokens(inputs_fast["input_ids"][0]))

print("\nAutoTokenizer")
print("Input IDs:", inputs_auto["input_ids"][0].tolist())
print("Tokens:", tokenizer_auto.convert_ids_to_tokens(inputs_auto["input_ids"][0]))
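To make the discrepancy easy to spot programmatically, the id sequences can be validated with a small helper. This is a sketch under the assumption that, in the NLLB-200 vocabulary, id 2 is </s> and id 3 is <unk> (consistent with the dumps below); `check_nllb_encoding` is a hypothetical name, not part of transformers:

```python
def check_nllb_encoding(input_ids, src_lang_id, eos_id=2, unk_id=3):
    """Return a list of problems found in an NLLB-encoded id sequence.
    Expected shape: [src_lang_id, ...subword ids..., eos_id].
    The defaults eos_id=2 (</s>) and unk_id=3 (<unk>) are assumptions
    about the NLLB-200 vocab - verify them against your tokenizer."""
    problems = []
    if not input_ids or input_ids[0] != src_lang_id:
        problems.append("missing src_lang prefix token")
    if input_ids and input_ids[-1] == unk_id:
        problems.append("trailing <unk> token")
    elif not input_ids or input_ids[-1] != eos_id:
        problems.append("missing trailing </s>")
    return problems

# The two failure modes reported below, in miniature:
print(check_nllb_encoding([256167, 30, 2], src_lang_id=256167))  # []
print(check_nllb_encoding([30, 2, 3], src_lang_id=256167))
```

Running this over the outputs of both tokenizers flags exactly the two defects described above: the missing swe_Latn prefix and the trailing <unk>.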
Expected behavior
With sentencepiece installed
NllbTokenizerFast:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']
AutoTokenizer:
Input IDs: [30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2, 3]
Tokens: ['▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>', '<unk>']
Without sentencepiece installed (this is the behavior I expect)
NllbTokenizerFast:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']
AutoTokenizer:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']