[ft] Skip extra quorum when using semi-sync training #1221
H-Huang merged 1 commit into pytorch:main
torchtitan/components/optimizer.py (outdated)
      self._ft_optimizer = ft.Optimizer(ft_manager, self)
    - self._call_from_ft: bool = False
    + # Originally this is False, True means we just call the step() as normally
    + self._call_from_ft: bool = True
Why change this to True? step() manually toggles _call_from_ft so that the call is routed through ft.Optimizer.step() and then OptimizersContainer.step(). If we set it to True, the call will only go through OptimizersContainer.step(), not ft.Optimizer.step().
Yeah, I had originally hardcoded this to get the semi-sync training path working. I have updated it to be set by a constructor argument.
When doing LocalSGD/DiLoCo, we only want to go through OptimizersContainer.step(), not ft.Optimizer.step().
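To make the routing discussed above concrete, here is a minimal sketch of the dispatch. It is not the torchtitan code: `FakeFTOptimizer` is a hypothetical stand-in for `ft.Optimizer`, the constructor argument name `skip_ft_step` is an assumption, and both containers only record calls instead of updating parameters.

```python
class OptimizersContainer:
    """Stand-in base container: records calls instead of updating parameters."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def step(self) -> None:
        self.calls.append("OptimizersContainer.step")


class FakeFTOptimizer:
    """Hypothetical stand-in for ft.Optimizer: records its own call,
    then calls back into the wrapped optimizer's step()."""

    def __init__(self, optim: "FTOptimizersContainer") -> None:
        self.optim = optim

    def step(self) -> None:
        self.optim.calls.append("ft.Optimizer.step")
        self.optim.step()


class FTOptimizersContainer(OptimizersContainer):
    def __init__(self, skip_ft_step: bool = False) -> None:
        super().__init__()
        self._ft_optimizer = FakeFTOptimizer(self)
        # Assumed argument: True for semi-sync training (LocalSGD/DiLoCo),
        # in which case the flag stays True and the wrapper is never used.
        self._call_from_ft: bool = skip_ft_step

    def step(self) -> None:
        if self._call_from_ft:
            # Either re-entered from the wrapper, or the wrapper is skipped.
            super().step()
        else:
            # Route through the wrapper, which calls back into this method.
            self._call_from_ft = True
            try:
                self._ft_optimizer.step()
            finally:
                self._call_from_ft = False


sync = FTOptimizersContainer(skip_ft_step=False)
sync.step()
print(sync.calls)

semi = FTOptimizersContainer(skip_ft_step=True)
semi.step()
print(semi.calls)
```

With the flag initialized to False, a single `step()` passes through the wrapper first and then the container. With it initialized to True, only the container's `step()` runs, which is the behavior wanted for the semi-sync path.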
fegin left a comment
LGTM. I think the fault tolerance logic in train.py is becoming large enough to be moved into ft.py. We can do the refactor once semi-sync training is more stable.
Missed that we were using the ftOptimizer when doing fault-tolerant communication for HSDP. We should skip this when semi-sync training is enabled and only do quorum when the replica groups sync.
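As a rough illustration of "only do quorum when the replica groups sync", the sketch below counts quorum calls in a toy training loop. Everything in it is assumed for illustration: `FakeManager` is a stand-in for the fault tolerance manager, and `sync_every` and `semi_sync` are hypothetical parameter names, not options from this PR.

```python
class FakeManager:
    """Hypothetical stand-in for the fault tolerance manager: counts quorums."""

    def __init__(self) -> None:
        self.quorums = 0

    def start_quorum(self) -> None:
        self.quorums += 1


def count_quorums(num_steps: int, sync_every: int, semi_sync: bool) -> int:
    """Run a toy loop and return how many quorums were started."""
    manager = FakeManager()
    for step in range(1, num_steps + 1):
        if not semi_sync:
            # Fully synchronous path: a quorum on every optimizer step.
            manager.start_quorum()
        elif step % sync_every == 0:
            # Semi-sync path: a quorum only when replica groups sync.
            manager.start_quorum()
    return manager.quorums


print(count_quorums(num_steps=8, sync_every=4, semi_sync=False))
print(count_quorums(num_steps=8, sync_every=4, semi_sync=True))
```

Under these assumptions, eight steps with a sync interval of four start eight quorums on the synchronous path and two on the semi-sync path.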