add infra support for HF checkpoint conversion #1404
Conversation
```python
    )
    list(map(func, self.states[MODEL].model))
else:
    dcp.load(state_dict, checkpoint_id=checkpoint_id)
```
Given that we now always flatten the model state_dict, no `model.load_state_dict()` will be called, so we should call `model.load_state_dict()` manually if the model state_dict is in `state_dict`. You can hard-code this information by passing a flag to this API. Please add a TODO so that I know what to fix after I'm back.
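A minimal sketch of the fix this comment asks for, with assumed names (`load_checkpoint`, the `model_only` flag, and the `MODEL` key are illustrative stand-ins, not the PR's actual code):

```python
# Hypothetical sketch: when the flattened model state_dict is loaded via
# dcp.load(), the Stateful.load_state_dict() hook is bypassed, so we call
# model.load_state_dict() ourselves, gated by a hard-coded flag.
from typing import Any

import torch


def load_checkpoint(
    model: torch.nn.Module,
    state_dict: dict[str, Any],
    model_only: bool,  # assumed flag: "state_dict holds flattened model keys"
) -> None:
    # In the real code path, dcp.load(state_dict, checkpoint_id=...)
    # would populate state_dict in place before this point.
    if model_only:
        # TODO: flattened keys skip the Stateful protocol, so restore
        # the model manually here.
        model.load_state_dict(state_dict, strict=False)
```

The flag avoids inspecting the state_dict shape at load time, matching the "hard code this information" suggestion above.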
torchtitan/protocols/hf_adapter.py
Outdated
```python
@staticmethod
@abstractmethod
def get_hf_state_dict(state_dict: dict[str, Any]) -> dict[str, Any]:
```
convert_to_hf_state_dict is a more accurate name.
Please see the new API. I'm taking a titan-centric approach -- I only define `to_hf` / `from_hf`, which I think should be clear from context: the class is called `StateDictAdapter`, and both the args and returns are state_dicts.
In the future we can add pairs like `to_meta` / `from_meta`, `to_vllm` / `from_vllm`, etc.
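A sketch of the adapter protocol as described in this thread; the `Llama3StateDictAdapter` body is purely illustrative (the PR leaves it as a placeholder, and the real key mapping is model-specific):

```python
# Sketch of the StateDictAdapter protocol discussed above. Method names
# (to_hf / from_hf) come from the thread; exact signatures may differ.
from abc import ABC, abstractmethod
from typing import Any


class StateDictAdapter(ABC):
    """Converts between a torchtitan state_dict and another format.

    Only the HF pair exists for now; to_meta/from_meta, to_vllm/from_vllm
    could be added later following the same pattern.
    """

    @staticmethod
    @abstractmethod
    def to_hf(state_dict: dict[str, Any]) -> dict[str, Any]:
        """Map torchtitan keys/tensors to the HuggingFace layout."""

    @staticmethod
    @abstractmethod
    def from_hf(hf_state_dict: dict[str, Any]) -> dict[str, Any]:
        """Map HuggingFace keys/tensors back to the torchtitan layout."""


class Llama3StateDictAdapter(StateDictAdapter):
    # Placeholder, as in the PR. A trivial key-renaming example so the
    # round trip is concrete -- NOT Llama3's real mapping.
    @staticmethod
    def to_hf(state_dict: dict[str, Any]) -> dict[str, Any]:
        return {f"model.{k}": v for k, v in state_dict.items()}

    @staticmethod
    def from_hf(hf_state_dict: dict[str, Any]) -> dict[str, Any]:
        return {k.removeprefix("model."): v for k, v in hf_state_dict.items()}
```

Because both methods are static and purely key/tensor transforms, the adapter can be attached to `TrainSpec` as a class rather than an instance.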
I'm a little confused about what "Any" represents here; it might be too idealistic to keep the whole state_dict with Tensors in memory. For a large model like DeepSeek-V3, should we keep only the state_dict keys here and leave the "Any" values as "None"?
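An illustrative sketch of this suggestion (the helper name is made up): keep only the key mapping and defer tensor materialization, instead of holding every tensor in memory at once.

```python
# Hypothetical helper for the reviewer's suggestion: for very large
# models (e.g. DeepSeek-V3), strip tensor payloads and keep only keys;
# downstream code can then fill in values shard by shard.
from typing import Any


def keys_only(state_dict: dict[str, Any]) -> dict[str, None]:
    # Preserve key order and names, drop the (potentially huge) values.
    return {k: None for k in state_dict}
```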
@tianyu-l Thanks for getting HF checkpoint conversion working.
```diff
@@ -605,8 +624,8 @@ def _find_checkpoint_type(self, checkpoint_id: str) -> CheckpointType:
     for filename in os.listdir(checkpoint_id):
         if filename == "model.safetensors.index.json":
```
Quick note: this function is not accurate. Smaller models with only one safetensors file (e.g., https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main) don't have a "model.safetensors.index.json" file; it only exists when the model weights are split across several safetensors files.
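One way the check could be made robust, sketched with an assumed function name (`is_hf_checkpoint`; the PR's `_find_checkpoint_type` returns an enum instead): accept either the index file (sharded weights) or any `.safetensors` file (single-file models such as Llama-3.2-1B).

```python
# Illustrative detection covering both sharded and single-file HF
# checkpoints. Names are assumptions, not the PR's final code.
import os


def is_hf_checkpoint(checkpoint_id: str) -> bool:
    for filename in os.listdir(checkpoint_id):
        if filename == "model.safetensors.index.json":
            return True  # sharded HF checkpoint
        if filename.endswith(".safetensors"):
            return True  # single-file HF checkpoint
    return False
```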
There was a problem hiding this comment.
Good point. Let me address this in the next PR, as CI is broken and I don't want to submit another commit and get blocked.
This PR adds a new field `StateDictAdapter` in `TrainSpec`:
- Currently it only contains a pair of static methods, `to_hf` and `from_hf`. Later we could add other pairs like `to_meta` / `from_meta`, `to_vllm` / `from_vllm`, etc.
- It is passed to `CheckpointManager` to convert between the torchtitan model and the HF model during checkpoint save / load.
- It could also potentially be used by downstream inference engines which only support HF models.

In order to save / load in HF format, a model is required to have a corresponding `StateDictAdapter` subclass implementation. For Llama3, I created a placeholder `Llama3StateDictAdapter` to be implemented. cc @wesleytruong

This PR also renames the checkpoint config options `initial_load_model_weights_only` and `last_save_model_weights_only` to simply `initial_load_model_only` and `last_save_model_only`, respectively. It seems the original names were chosen to correspond to `torch.load(..., weights_only=True)`. As long as we document & test clearly when this correspondence holds, I prefer the names in torchtitan to be simple and less ambiguous.