added better guidance for if deprecated tokenizer path fails#1568
added better guidance for if deprecated tokenizer path fails#1568wesleytruong merged 2 commits intomainfrom
Conversation
XilunWu
left a comment
There was a problem hiding this comment.
LGTM but i have some suggestions.
torchtitan/components/tokenizer.py
Outdated
| if "assets/tokenizer" in tokenizer_path: | ||
| raise FileNotFoundError( | ||
| "Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path " | ||
| "and download to --model.hf_assets_path using ./scripts/download_hf_assets.py" |
There was a problem hiding this comment.
I suggest we add link https://github.com/pytorch/torchtitan/tree/main/torchtitan/models/deepseek_v3#download-tokenizer for better user guidance.
torchtitan/components/tokenizer.py
Outdated
|
|
||
| if "assets/tokenizer" in tokenizer_path: | ||
| raise FileNotFoundError( | ||
| "Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path " |
There was a problem hiding this comment.
what about adding more info about path deprecation:
"detected ... path that's deprecated in #1526"
torchtitan/components/tokenizer.py
Outdated
| if "assets/tokenizer" in tokenizer_path: | ||
| raise FileNotFoundError( | ||
| "Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path " | ||
| "and download to --model.hf_assets_path using ./scripts/download_hf_assets.py" | ||
| ) | ||
| else: | ||
| raise FileNotFoundError( | ||
| f"No supported tokenizer files found in '{tokenizer_path}'. " | ||
| f"Available files: {available_files}. " | ||
| "Looking for: tokenizer.json, vocab.txt+merges.txt, or vocab.json+merges.txt" | ||
| ) |
There was a problem hiding this comment.
I think we don't need to change the check here because the logic was:
- check "tokenizer_path" folder for available tokenizer files.
While the new logic we add is to remind users of the deprecation of old tokenizer path. Since we have verified the correctness of "tokenizer_path", there's no need to perform it again -- if the execution goes to the else branch, that simply means the specified tokenizer file is not available and the existing information is detailed enough.
There was a problem hiding this comment.
If the logic is not required as @XilunWu mentioned, please removed it. If the logic is required, this if/else plus the error messages look almost identical to the error handling block above, please consider to make it as a function.
There was a problem hiding this comment.
I think we don't need to change the check here because the logic was:
- check "tokenizer_path" folder for available tokenizer files.
While the new logic we add is to remind users of the deprecation of old tokenizer path. Since we have verified the correctness of "tokenizer_path", there's no need to perform it again -- if the execution goes to the else branch, that simply means the specified tokenizer file is not available and the existing information is detailed enough.
Ok I think I'll remove the second check. I added it because I wanted to cover the use case of people who may have this path due to using legacy script but are missing the tokenizer file, but I realize people will probably either not have this path at all or have the path with the files as well.
fegin
left a comment
There was a problem hiding this comment.
Stamp to unblock, please address the comments before landing.
torchtitan/components/tokenizer.py
Outdated
|
|
||
| if "assets/tokenizer" in tokenizer_path: | ||
| raise FileNotFoundError( | ||
| "Detected ./assets/tokenizer path which was deprecated in https://github.com/pytorch/torchtitan/pull/1540. \n" |
There was a problem hiding this comment.
any reason for the extra space?
b768745 to
58bd436
Compare
58bd436 to
293876e
Compare
Sorry for the inconvenience, I've been waiting until #1582 lands since the current hf_asset_path currently isn't well supported for special multimodal nested repositories such as in FLUX experiment without specifying You bring up a good point though, I'll change the toml files now for non-flux models. |
…#1568) Adds a check to see if the old tokenizer path is being used when tokenizer path fails. This way it can provide guidance to people to update to the supported `hf_assets_path` and `download_hf_assets.py` script
…#1568) Adds a check to see if the old tokenizer path is being used when tokenizer path fails. This way it can provide guidance to people to update to the supported `hf_assets_path` and `download_hf_assets.py` script
Adds a check to see if the old tokenizer path is being used when tokenizer path fails. This way it can provide guidance to people to update to the supported
hf_assets_pathanddownload_hf_assets.pyscript