
added better guidance for if deprecated tokenizer path fails #1568

Merged
wesleytruong merged 2 commits into main from tokenizer_path_warning
Aug 15, 2025

Conversation

@wesleytruong
Contributor

Adds a check for whether the old tokenizer path is being used when loading from the tokenizer path fails. This lets the error guide people to update to the supported hf_assets_path and the download_hf_assets.py script.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 14, 2025
@wesleytruong wesleytruong changed the title added better guidance for when tokenizer path fails added better guidance for if deprecated tokenizer path fails Aug 14, 2025
Contributor

@XilunWu XilunWu left a comment

LGTM, but I have some suggestions.

if "assets/tokenizer" in tokenizer_path:
    raise FileNotFoundError(
        "Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path "
        "and download to --model.hf_assets_path using ./scripts/download_hf_assets.py"
if "assets/tokenizer" in tokenizer_path:
    raise FileNotFoundError(
        "Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path "
Contributor

What about adding more info about the path deprecation, for example:
"detected ... path that's deprecated in #1526"

Comment on lines 168 to 178
if "assets/tokenizer" in tokenizer_path:
    raise FileNotFoundError(
        "Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path "
        "and download to --model.hf_assets_path using ./scripts/download_hf_assets.py"
    )
else:
    raise FileNotFoundError(
        f"No supported tokenizer files found in '{tokenizer_path}'. "
        f"Available files: {available_files}. "
        "Looking for: tokenizer.json, vocab.txt+merges.txt, or vocab.json+merges.txt"
    )
Contributor

I think we don't need to change the check here, because the existing logic was:

  • check the "tokenizer_path" folder for available tokenizer files.

The new logic we are adding, by contrast, reminds users that the old tokenizer path is deprecated. Since we have already verified the correctness of "tokenizer_path", there's no need to perform that check again: if execution reaches the else branch, it simply means the specified tokenizer file is not available, and the existing message is detailed enough.

Contributor

If the logic is not required, as @XilunWu mentioned, please remove it. If the logic is required, this if/else and its error messages look almost identical to the error handling block above, so please consider factoring it into a function.
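
To illustrate the refactoring suggested here, the following is a minimal sketch and not code from the PR. The helper name `tokenizer_not_found_error`, its `available_files` parameter, and the "does not exist" message are hypothetical; the other two messages are copied from the diff under review.

```python
from typing import Optional, Sequence

# Substring the PR checks for to detect the legacy location.
DEPRECATED_FRAGMENT = "assets/tokenizer"


def tokenizer_not_found_error(
    tokenizer_path: str,
    available_files: Optional[Sequence[str]] = None,
) -> FileNotFoundError:
    """Build (not raise) the error for an unusable tokenizer path.

    Hypothetical helper: it centralizes the messages so that both
    failure sites could share one implementation.
    """
    if DEPRECATED_FRAGMENT in tokenizer_path:
        return FileNotFoundError(
            "Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path "
            "and download to --model.hf_assets_path using ./scripts/download_hf_assets.py"
        )
    if available_files is None:
        # Assumed wording for the "path is missing" case; not taken from the PR.
        return FileNotFoundError(f"Tokenizer path '{tokenizer_path}' does not exist")
    return FileNotFoundError(
        f"No supported tokenizer files found in '{tokenizer_path}'. "
        f"Available files: {list(available_files)}. "
        "Looking for: tokenizer.json, vocab.txt+merges.txt, or vocab.json+merges.txt"
    )
```

A call site would then read `raise tokenizer_not_found_error(tokenizer_path, available_files)`, which keeps the deprecation wording in one place.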

Contributor Author

> I think we don't need to change the check here, because the existing logic was:
>
>   • check the "tokenizer_path" folder for available tokenizer files.
>
> The new logic we are adding, by contrast, reminds users that the old tokenizer path is deprecated. Since we have already verified the correctness of "tokenizer_path", there's no need to perform that check again: if execution reaches the else branch, it simply means the specified tokenizer file is not available, and the existing message is detailed enough.

OK, I think I'll remove the second check. I added it to cover people who may have this path from using the legacy script but are missing the tokenizer file, but I realize people will probably either not have this path at all or have the path with the files in it.

Contributor

@fegin fegin left a comment

Stamp to unblock, please address the comments before landing.


if "assets/tokenizer" in tokenizer_path:
    raise FileNotFoundError(
        "Detected ./assets/tokenizer path which was deprecated in https://github.com/pytorch/torchtitan/pull/1540. \n"
Contributor

any reason for the extra space?

@wesleytruong wesleytruong force-pushed the tokenizer_path_warning branch 2 times, most recently from b768745 to 58bd436 Compare August 14, 2025 21:05
@wesleytruong wesleytruong force-pushed the tokenizer_path_warning branch from 58bd436 to 293876e Compare August 14, 2025 21:06
@wesleytruong wesleytruong merged commit a59abea into main Aug 15, 2025
5 of 7 checks passed
@ruisizhang123
Contributor

Hi, I found this PR after failing to load the tokenizer in deepseekv3 due to the assertion error with tokenizer_path = "./assets/tokenizer/deepseek-moe-16b-base". (Link)

I wonder if there is a reason for not updating it to assets/hf, or for not updating hf_assets_path in the deepseekv3 config here?

@wesleytruong
Contributor Author

> Hi, I found this PR after failing to load the tokenizer in deepseekv3 due to the assertion error with tokenizer_path = "./assets/tokenizer/deepseek-moe-16b-base". (Link)
>
> I wonder if there is a reason for not updating it to assets/hf, or for not updating hf_assets_path in the deepseekv3 config here?

Sorry for the inconvenience. I've been waiting until #1582 lands, since hf_assets_path currently isn't well supported for special multimodal nested repositories, such as in the FLUX experiment, without specifying initial_load_path separately from hf_assets_path when using initial_load_in_hf.

You bring up a good point though; I'll change the toml files now for non-FLUX models.

@tianyu-l tianyu-l deleted the tokenizer_path_warning branch August 19, 2025 03:18
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 13, 2026
…#1568)

Adds a check to see if the old tokenizer path is being used when
tokenizer path fails. This way it can provide guidance to people to
update to the supported `hf_assets_path` and `download_hf_assets.py`
script
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 25, 2026
…#1568)


4 participants