
handle unable to load ft checkpoint#1729

Closed
tushar00jain wants to merge 1 commit into pytorch:main from tushar00jain:pr1729

tushar00jain commented Sep 19, 2025

Summary:

  • failing to load the ft checkpoint crashes the trainer
  • avoid loading the ft checkpoint for now so that training can continue

meta-cla bot added the CLA Signed label Sep 19, 2025
tushar00jain marked this pull request as draft September 19, 2025 20:06
tushar00jain force-pushed the pr1729 branch 3 times, most recently from aa92478 to 2abfb3c September 20, 2025 00:54
tushar00jain marked this pull request as ready for review September 20, 2025 00:54
tushar00jain requested a review from d4l3k September 20, 2025 00:54

fegin left a comment

This seems very dangerous. Do we know why the checkpoint isn't available? _find_load_step already checks for the .metadata file, so it seems to me that the checkpoint should exist. Could this be a storage consistency issue? If so, TorchTitan shouldn't be responsible for performing such a check.

tushar00jain commented Sep 22, 2025

@fegin titan isn't doing the check, but if the underlying layer returns an error, e.g. because the checkpoint is corrupt, titan shouldn't crash. The metadata check seems to look only for the metadata file; it doesn't check the contents of the file or whether they are intact. So this change makes reading the actual checkpoint safe.
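
To make the distinction concrete, here is a minimal sketch of the idea. It is not the PR's actual diff and does not use TorchTitan's real API. The names `metadata_exists`, `read_checkpoint`, and `try_load_ft_checkpoint` are hypothetical, and a JSON file stands in for the real checkpoint format purely for illustration. It shows a presence-only check passing on a corrupt file, while a guarded read reports the failure without raising.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def metadata_exists(checkpoint_dir):
    # Presence-only check (hypothetical): passes even if the contents are garbage.
    return os.path.isfile(os.path.join(checkpoint_dir, ".metadata"))


def read_checkpoint(checkpoint_dir):
    # Hypothetical stand-in for the real loader: it parses the file,
    # so it raises when the contents are corrupt or the file is missing.
    path = os.path.join(checkpoint_dir, ".metadata")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def try_load_ft_checkpoint(checkpoint_dir, load_fn=read_checkpoint):
    # Guarded read: log the failure and return False instead of crashing.
    try:
        load_fn(checkpoint_dir)
    except Exception:
        logger.warning(
            "Could not load ft checkpoint from %s; continuing without it",
            checkpoint_dir,
            exc_info=True,
        )
        return False
    return True


def demo():
    with tempfile.TemporaryDirectory() as ckpt_dir:
        path = os.path.join(ckpt_dir, ".metadata")

        # Simulate a corrupt checkpoint: the file exists but cannot be parsed.
        with open(path, "w", encoding="utf-8") as f:
            f.write("{truncated")
        exists = metadata_exists(ckpt_dir)
        loaded_corrupt = try_load_ft_checkpoint(ckpt_dir)

        # Overwrite it with intact contents and try again.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"step": 10}, f)
        loaded_valid = try_load_ft_checkpoint(ckpt_dir)

    return exists, loaded_corrupt, loaded_valid
```

Catching a broad `Exception` keeps the trainer alive, but it also hides genuine storage problems behind a warning, which is the risk raised in the review above. A narrower `except` clause or a stricter policy may be preferable in practice.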
