Conversation
Here's the config file I was using. I used the same config file for both the logs in the description:

```toml
[job]
dump_folder = "./outputs"
description = "Gpt-oss debug training"
print_config = false

[profiling]
[metrics]
[model]
[optimizer]
[lr_scheduler]
[training]
[training.dataloader]
[parallelism]
[checkpoint]
[activation_checkpoint]
[compile]
[validation]
```
|
|
Thx @chelsea0x3b I think the problem is we don't know what is "correct" for [bf16 or fp32] reduce. In DeepEP it seems to be fp32 reduce. In Megatron the probs are actually cast back to bf16 right after the sigmoid (so the gather can also be at least 2x faster). In downstream inference libs, e.g. SGLang, both bf16 and fp32 exist (at some point there is something about fp8 training/inference consistency, sglang for inference and Megatron for training). So it's hard to tell what is correct, especially when scaling up. |
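For concreteness, here's a minimal sketch of the two conventions being contrasted (tensor names and shapes are made up for illustration, not torchtitan's or Megatron's actual code): keeping the routing probs in fp32 vs. Megatron-style casting back to bf16 right after the sigmoid.

```python
import torch

tokens, hidden, experts = 8, 16, 4
x = torch.randn(tokens, hidden, dtype=torch.bfloat16)
gate_weight = torch.randn(experts, hidden, dtype=torch.bfloat16)

# "fp32 reduce" convention (DeepEP-style): upcast, keep probs in fp32.
logits_fp32 = x.float() @ gate_weight.float().t()
probs_fp32 = torch.sigmoid(logits_fp32)

# Megatron-style: sigmoid still computed in fp32, but probs cast right
# back to bf16, so the downstream gather moves half the bytes.
probs_bf16 = probs_fp32.to(torch.bfloat16)

print(probs_fp32.dtype, probs_bf16.dtype)  # torch.float32 torch.bfloat16
```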
Yeah, agreed that's the real issue.
I found that Megatron does have a recommended
IMO making this configurable is reasonable. |
|
IIUC in Megatron |
|
oh i see your point, yes, you are right. if we force logits to be fp32 then the rest should also stay in fp32. |
|
Just caught up on this conversation - so should I add some configuration alongside this? |
|
I suggest making it configurable and keeping the fp32 path as the default for BC |
|
actually i did another check on megatron. if we set it up that way it should be equivalent to what we do. TBH i don't see any reason not to do this (by default for the bf16 path). the rest of the code then seems to make sense to keep in fp32? |
|
@rakkit understanding check: you're saying we should by default have something like

```diff
 class TokenChoiceTopKRouter(nn.Module):
     def forward(
         self, x: torch.Tensor, expert_bias: torch.Tensor | None = None
     ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
-        scores = self.gate(x)
+        scores = torch.mm(x, self.gate.weight.t(), out_dtype=torch.float32)
```

? Since we immediately cast |
|
yes @garrett361. we need a "fix" for torchtitan first, then optionally offer a bf16 and/or fully fp32 path. |
this is about the backwards slowness here? Whatever solution is fine to me. Either fixing bmm or moving to elementwise-prod-then-sum. Seems like the former should be strictly faster, though, being a single op. Which is why I introduced the bmm call. |
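The two candidate combine paths mentioned here can be sketched like this (shapes are illustrative, not torchtitan's actual router code); both compute the same weighted sum of top-k expert outputs.

```python
import torch

tokens, topk, hidden = 8, 2, 16
expert_outs = torch.randn(tokens, topk, hidden)
scores = torch.rand(tokens, topk)

# Single-op path: one batched matmul, [1, topk] @ [topk, hidden] per token.
combined_bmm = torch.bmm(scores.unsqueeze(1), expert_outs).squeeze(1)

# Elementwise-prod-then-sum path: broadcast multiply, reduce over topk.
combined_sum = (expert_outs * scores.unsqueeze(-1)).sum(dim=1)

print(torch.allclose(combined_bmm, combined_sum, atol=1e-5))  # True
```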
|
@garrett361 yes, BMM is faster (even with those weird SM80 kernels). IDK if it's a bug we need to fix or if it's expected to be like that. |
Oh ok, I misunderstood the other thread; thought bmm was slower, somehow. Which was confusing me because I thought I'd tested it 😅 So is this an accurate summary?
|
@garrett361 yes, and it more or less aligns with Megatron. (it still differs from deepseek's HF inference code; we can comment in the code in case someone wants something like an "ultimate fp32" version) |
|
Another piece of data: openai's official gpt oss implementation doesn't use f32 at all: https://github.com/openai/gpt-oss/blob/main/gpt_oss/torch/model.py#L316 |
|
And llama4 moe router casts right back to input dtype: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama4/modeling_llama4.py#L152 |
|
Yeah I saw that gpt-oss code as well @chelsea0x3b. Hadn't looked at the HF llama4 code. The meta llama4 code doesn't do any pre-sigmoid upcast at all. Not much consensus. |
|
IMO since there isn't really consensus I'd lean towards casting back to bfloat16 immediately (as in the PR) because it reduces memory by a decent amount (5% is like 9GB on b200, which is substantial for users not on large GPUs). |
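For a rough sense of where savings like that come from, a back-of-the-envelope sketch (all numbers below are illustrative, not the PR's actual config): any [tokens, num_experts]-sized activation kept around for backward costs twice as much in fp32 as in bf16, once per MoE layer.

```python
# Hypothetical sizes, just to show the arithmetic; not the real run.
tokens = 8 * 8192      # batch * seq_len
num_experts = 128
layers = 36

bytes_fp32 = tokens * num_experts * 4 * layers  # 4 bytes/elem in fp32
bytes_bf16 = tokens * num_experts * 2 * layers  # 2 bytes/elem in bf16
saved_gib = (bytes_fp32 - bytes_bf16) / 2**30
print(f"{saved_gib:.4f} GiB saved")  # 0.5625 GiB for these toy numbers
```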
|
@tianyu-l do you have a strong opinion here? |
|
@garrett361 #2448 would something like this work? the test no longer seems to have the same issue (failing on inductor) |
|
@acisseJZhong i had this version ages ago and compile worked fine |
|
lol @rakkit what prevents you from landing that change? shall we revive or maybe @chelsea0x3b could just fix this and land! |
|
@acisseJZhong some time ago i had that, and from my ablation (7b-a1b MoE, 64 experts) i didn't see a significant diff in performance. in mid-2025 we needed to train something on 40GB A100s, so i decided to keep bf16 (we trained another 7b x 1T tokens fully bf16 and it still kind of works). but TBH i never thought about TP and mixed precision stuff at the beginning |
|
sorry, so we want to go with the autocast? just getting a lot of mixed messages and i don't know who to listen to lol. about to go out of town for a week, would love to get this PR merged |
|
@chelsea0x3b can you help test numerics with the autocast approach (#2448) in your PR? sorry for all the back and forth 🤣 hope we can land soon. |
|
@acisseJZhong numerics for autocast look good with DP/EP/ETP, PR should be good to go |
torchtitan/models/common/moe/moe.py
i think you can remove the cast to float32? as scores is already fp32. Maybe add a comment saying scores is by default fp32 already. Thanks!
|
@chelsea0x3b pls remove the cast to fp32 for scores, and then feel free to land it! Would appreciate it if you pasted the testing result (numerics don't change) in the PR description! |
|
I'm away from my laptop so I don't have access to the logs anymore, but I updated the description and removed the redundant `.to` call |
|
In the PR summary, why does AC=none result in more memory in bf16 than in fp32? Could you include the parallelism config you are using? If using EP, we should turn on load-balanced routing to factor out the imbalance https://github.com/pytorch/torchtitan/blob/main/torchtitan/config/configs.py#L387
I don't think the numerics should be the same even if we fix the random seed and turn on deterministic mode, because an fp32 gate matmul should give different results than a bf16 gate matmul. |
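A quick illustration of that point (toy shapes, not the PR's model): the two gate matmuls generally do not agree bit-for-bit, because the bf16 path rounds its output to bf16 before any comparison.

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 128, dtype=torch.bfloat16)
w = torch.randn(8, 128, dtype=torch.bfloat16)

scores_bf16 = (x @ w.t()).float()        # matmul emitted in bf16, then upcast
scores_fp32 = x.float() @ w.float().t()  # upcast first, matmul in fp32

max_diff = (scores_bf16 - scores_fp32).abs().max()
print(max_diff.item() > 0)  # True: the results differ
```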
|
@tianyu-l because it kept hitting a ton of cuda memory reallocations. check out the full log i posted for that case: the memory usage varies a lot between steps and the number of reallocations increased each step. i was just taking the mem usage from a single step, so it was hard to pick which step for that one because there wasn't a good "average"; the other cases were all much more consistent. and yes, the fp32 run produced different numbers, especially across the different cases, but the losses all followed the same pattern and were very close (within about 0.1 of each other or closer) |
|
@chelsea0x3b Regarding cuda memory reallocation, I guess you could use a smaller model, e.g. even the debugmodel, to showcase this. Please fix the lint error. |
Original discussion #2225.
Per the comments, this PR now changes the gate computation to happen in fp32.
Run on 8xb200.
output from runs
Full AC with float32 gate (this PR)
Full AC with bfloat16 gate (main branch)
No AC with bfloat16 gate (main branch)
No AC with float32 gate
Hope this helps!
The numerics don't change with this.