Execute moe.gate in float32 #2389

Open
chelsea0x3b wants to merge 14 commits into pytorch:main from chelsea0x3b:2225-router-dtype

chelsea0x3b commented Feb 17, 2026

Original discussion #2225.

Per the review comments, this PR now runs the gate computation in float32.
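
For readers without the context of #2225, here is a rough sketch of why the gate dtype can matter. It is not the code from this PR: `to_bfloat16` and `top1` are hypothetical helpers, the scores are made up, and it only emulates rounding of the gate output to bfloat16 in pure Python (finite values, round-to-nearest-even). The assumption being illustrated is that two close router scores can collapse to the same bfloat16 value, so expert selection becomes a tie-break.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (hypothetical helper).

    bfloat16 keeps the top 16 bits of the float32 bit pattern, so we round
    the float32 bits to nearest-even at bit 16 and clear the low 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def top1(scores):
    """Index of the largest score; ties resolve to the lowest index."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up router scores for four experts; experts 0 and 1 are nearly tied.
scores = [1.000, 1.002, 0.5, 0.25]
scores_bf16 = [to_bfloat16(s) for s in scores]

print(top1(scores))       # full precision picks expert 1
print(scores_bf16[:2])    # both round to 1.0 in bfloat16
print(top1(scores_bf16))  # the tie now resolves to expert 0
```

Near 1.0 the spacing between adjacent bfloat16 values is 2^-7 (about 0.0078), so a gap of 0.002 between two scores is lost after rounding.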

Benchmarked on 8x B200.

| Gate dtype | AC | memory | tps | tflops | mfu |
|---|---|---|---|---|---|
| bfloat16 (main branch) | none | 175.60GiB (98.45%) | 12,437 | 389.45 | 17.31% |
| bfloat16 (main branch) | full | 66.91GiB (37.51%) | 10,050 | 314.69 | 13.99% |
| float32 (this PR) | none | 171.76GiB (96.30%) | 12,393 | 388.07 | 17.25% |
| float32 (this PR) | full | 66.91GiB (37.51%) | 9,990 | 312.83 | 13.90% |
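
To put the headline numbers above in relative terms, here is a small sketch that computes the throughput and memory deltas between the two gate dtypes. The values are copied from the summary; the dictionary layout and the `rel_change` helper are just for illustration.

```python
# Summary numbers copied from the comparison above (tps and memory in GiB).
runs = {
    "AC=none": {
        "bfloat16": {"tps": 12437, "mem_gib": 175.60},
        "float32": {"tps": 12393, "mem_gib": 171.76},
    },
    "AC=full": {
        "bfloat16": {"tps": 10050, "mem_gib": 66.91},
        "float32": {"tps": 9990, "mem_gib": 66.91},
    },
}


def rel_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for ac, r in runs.items():
    d_tps = rel_change(r["float32"]["tps"], r["bfloat16"]["tps"])
    d_mem = r["float32"]["mem_gib"] - r["bfloat16"]["mem_gib"]
    print(f"{ac}: tps {d_tps:+.2f}%, memory {d_mem:+.2f} GiB")

# AC=none: tps -0.35%, memory -3.84 GiB
# AC=full: tps -0.60%, memory +0.00 GiB
```

Both throughput deltas are under 1%. For comparison, the per-step tps within the float32 full-AC run in the logs ranges from 9,865 to 9,994 after warmup, so differences of this size may be within step-to-step variation.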

output from runs

Full AC with float32 gate (this PR)
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,075  tflops: 64.97  mfu: 2.89%
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,072  tflops: 64.87  mfu: 2.88%
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,106  tflops: 65.95  mfu: 2.93%
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,062  tflops: 64.58  mfu: 2.87%
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,075  tflops: 64.96  mfu: 2.89%
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,075  tflops: 64.97  mfu: 2.89%
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,059  tflops: 64.48  mfu: 2.87%
[titan] 2026-02-24 19:48:29,513 - root - INFO - step:  1  loss: 12.69021  grad_norm: 10.0625  memory: 66.91GiB(37.51%)  tps: 2,073  tflops: 64.90  mfu: 2.88%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.71  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.71  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.70  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.71  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.70  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.70  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.70  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,987  tflops: 312.72  mfu: 13.90%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,865  tflops: 308.92  mfu: 13.73%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,865  tflops: 308.91  mfu: 13.73%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,865  tflops: 308.91  mfu: 13.73%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,865  tflops: 308.91  mfu: 13.73%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,866  tflops: 308.94  mfu: 13.73%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,865  tflops: 308.92  mfu: 13.73%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,865  tflops: 308.91  mfu: 13.73%
[titan] 2026-02-24 19:48:32,815 - root - INFO - step:  3  loss:  8.92008  grad_norm: 44.7500  memory: 66.91GiB(37.51%)  tps: 9,866  tflops: 308.93  mfu: 13.73%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,993  tflops: 312.93  mfu: 13.91%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,993  tflops: 312.92  mfu: 13.91%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,993  tflops: 312.92  mfu: 13.91%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,994  tflops: 312.94  mfu: 13.91%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,993  tflops: 312.92  mfu: 13.91%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,993  tflops: 312.91  mfu: 13.91%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,993  tflops: 312.91  mfu: 13.91%
[titan] 2026-02-24 19:48:34,455 - root - INFO - step:  4  loss:  7.49564  grad_norm: 23.3750  memory: 66.91GiB(37.51%)  tps: 9,993  tflops: 312.93  mfu: 13.91%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.07  mfu: 13.87%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.06  mfu: 13.87%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.07  mfu: 13.87%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.07  mfu: 13.87%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.07  mfu: 13.87%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.08  mfu: 13.87%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.08  mfu: 13.87%
[titan] 2026-02-24 19:48:36,099 - root - INFO - step:  5  loss:  6.92255  grad_norm: 45.0000  memory: 66.91GiB(37.51%)  tps: 9,966  tflops: 312.08  mfu: 13.87%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.41  mfu: 13.80%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.41  mfu: 13.80%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.40  mfu: 13.80%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.41  mfu: 13.80%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.41  mfu: 13.80%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.40  mfu: 13.80%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.40  mfu: 13.80%
[titan] 2026-02-24 19:48:37,752 - root - INFO - step:  6  loss:  6.08224  grad_norm: 23.0000  memory: 66.91GiB(37.51%)  tps: 9,913  tflops: 310.41  mfu: 13.80%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,946  tflops: 311.43  mfu: 13.84%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,945  tflops: 311.42  mfu: 13.84%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,945  tflops: 311.42  mfu: 13.84%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,946  tflops: 311.43  mfu: 13.84%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,945  tflops: 311.42  mfu: 13.84%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,946  tflops: 311.43  mfu: 13.84%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,946  tflops: 311.43  mfu: 13.84%
[titan] 2026-02-24 19:48:39,400 - root - INFO - step:  7  loss:  5.30016  grad_norm: 15.7500  memory: 66.91GiB(37.51%)  tps: 9,946  tflops: 311.43  mfu: 13.84%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,934  tflops: 311.06  mfu: 13.83%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,933  tflops: 311.05  mfu: 13.82%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,934  tflops: 311.06  mfu: 13.82%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,934  tflops: 311.07  mfu: 13.83%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,934  tflops: 311.07  mfu: 13.83%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,934  tflops: 311.05  mfu: 13.82%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,933  tflops: 311.05  mfu: 13.82%
[titan] 2026-02-24 19:48:41,050 - root - INFO - step:  8  loss:  4.53393  grad_norm: 11.6875  memory: 66.91GiB(37.51%)  tps: 9,934  tflops: 311.06  mfu: 13.82%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,990  tflops: 312.83  mfu: 13.90%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,990  tflops: 312.83  mfu: 13.90%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,990  tflops: 312.84  mfu: 13.90%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,990  tflops: 312.83  mfu: 13.90%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,990  tflops: 312.83  mfu: 13.90%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,990  tflops: 312.83  mfu: 13.90%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,991  tflops: 312.84  mfu: 13.90%
[titan] 2026-02-24 19:48:42,690 - root - INFO - step:  9  loss:  4.06968  grad_norm:  8.0625  memory: 66.91GiB(37.51%)  tps: 9,991  tflops: 312.84  mfu: 13.90%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,961  tflops: 311.92  mfu: 13.86%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,961  tflops: 311.93  mfu: 13.86%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,962  tflops: 311.93  mfu: 13.86%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,962  tflops: 311.93  mfu: 13.86%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,961  tflops: 311.93  mfu: 13.86%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,962  tflops: 311.96  mfu: 13.86%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,961  tflops: 311.92  mfu: 13.86%
[titan] 2026-02-24 19:48:44,335 - root - INFO - step: 10  loss:  3.88978  grad_norm: 12.0000  memory: 66.91GiB(37.51%)  tps: 9,962  tflops: 311.94  mfu: 13.86%
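
Since each step is logged once per rank, a quick way to compare runs is to average tps per step. The following is a minimal sketch that assumes the line format shown in these logs; `LINE_RE` and `mean_tps_by_step` are hypothetical names, not part of torchtitan.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches the "step: N ... tps: N,NNN" fields of an INFO line as shown above.
LINE_RE = re.compile(r"step:\s*(\d+)\s+loss:.*?tps:\s*([\d,]+)")


def mean_tps_by_step(log_text: str) -> dict:
    """Average the per-rank tps values for each step.

    Lines without step/tps fields (for example the WARNING lines about
    CUDA memory allocation retries) are skipped.
    """
    per_step = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        step = int(m.group(1))
        tps = int(m.group(2).replace(",", ""))
        per_step[step].append(tps)
    return {step: mean(vals) for step, vals in sorted(per_step.items())}


sample = """\
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,986  tflops: 312.70  mfu: 13.90%
[titan] 2026-02-24 19:48:31,154 - root - INFO - step:  2  loss: 10.07672  grad_norm: 25.6250  memory: 66.91GiB(37.51%)  tps: 9,987  tflops: 312.72  mfu: 13.90%
[titan] 2026-02-24 19:44:31,995 - root - WARNING - 1 CUDA memory allocation retries.
"""

print(mean_tps_by_step(sample))
```

Step 1 includes warmup, so it is probably best excluded when comparing steady-state throughput.
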
Full AC with bfloat16 gate (main branch)
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,861  tflops: 58.29  mfu: 2.59%
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,865  tflops: 58.41  mfu: 2.60%
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,940  tflops: 60.73  mfu: 2.70%
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,986  tflops: 62.19  mfu: 2.76%
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,923  tflops: 60.22  mfu: 2.68%
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,927  tflops: 60.33  mfu: 2.68%
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,921  tflops: 60.14  mfu: 2.67%
[titan] 2026-02-24 19:46:12,577 - root - INFO - step:  1  loss: 12.79842  grad_norm:  9.8750  memory: 66.91GiB(37.51%)  tps: 1,865  tflops: 58.41  mfu: 2.60%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,907  tflops: 310.22  mfu: 13.79%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,907  tflops: 310.22  mfu: 13.79%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,907  tflops: 310.21  mfu: 13.79%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,907  tflops: 310.21  mfu: 13.79%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,907  tflops: 310.23  mfu: 13.79%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,908  tflops: 310.24  mfu: 13.79%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,907  tflops: 310.21  mfu: 13.79%
[titan] 2026-02-24 19:46:14,231 - root - INFO - step:  2  loss:  9.75310  grad_norm: 26.0000  memory: 66.91GiB(37.51%)  tps: 9,907  tflops: 310.22  mfu: 13.79%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,956  tflops: 311.75  mfu: 13.86%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,956  tflops: 311.77  mfu: 13.86%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,956  tflops: 311.75  mfu: 13.86%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,956  tflops: 311.76  mfu: 13.86%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,956  tflops: 311.75  mfu: 13.86%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,956  tflops: 311.76  mfu: 13.86%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,956  tflops: 311.75  mfu: 13.86%
[titan] 2026-02-24 19:46:15,877 - root - INFO - step:  3  loss:  9.27740  grad_norm: 32.2500  memory: 66.91GiB(37.51%)  tps: 9,955  tflops: 311.73  mfu: 13.85%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.76  mfu: 13.99%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,050  tflops: 314.71  mfu: 13.99%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,050  tflops: 314.69  mfu: 13.99%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,050  tflops: 314.69  mfu: 13.99%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,050  tflops: 314.69  mfu: 13.99%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,049  tflops: 314.68  mfu: 13.99%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,050  tflops: 314.69  mfu: 13.99%
[titan] 2026-02-24 19:46:17,508 - root - INFO - step:  4  loss:  7.62585  grad_norm: 26.6250  memory: 66.91GiB(37.51%)  tps: 10,049  tflops: 314.68  mfu: 13.99%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,007  tflops: 313.36  mfu: 13.93%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,008  tflops: 313.37  mfu: 13.93%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,007  tflops: 313.35  mfu: 13.93%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,007  tflops: 313.36  mfu: 13.93%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,007  tflops: 313.35  mfu: 13.93%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,008  tflops: 313.37  mfu: 13.93%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,007  tflops: 313.36  mfu: 13.93%
[titan] 2026-02-24 19:46:19,145 - root - INFO - step:  5  loss:  7.07623  grad_norm: 46.5000  memory: 66.91GiB(37.51%)  tps: 10,007  tflops: 313.35  mfu: 13.93%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,000  tflops: 313.13  mfu: 13.92%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,000  tflops: 313.14  mfu: 13.92%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,000  tflops: 313.15  mfu: 13.92%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,001  tflops: 313.17  mfu: 13.92%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,000  tflops: 313.14  mfu: 13.92%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,001  tflops: 313.16  mfu: 13.92%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,001  tflops: 313.15  mfu: 13.92%
[titan] 2026-02-24 19:46:20,784 - root - INFO - step:  6  loss:  6.00625  grad_norm: 31.0000  memory: 66.91GiB(37.51%)  tps: 10,001  tflops: 313.15  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,003  tflops: 313.23  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,003  tflops: 313.24  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,004  tflops: 313.25  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,003  tflops: 313.24  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,003  tflops: 313.24  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,003  tflops: 313.24  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,003  tflops: 313.24  mfu: 13.92%
[titan] 2026-02-24 19:46:22,422 - root - INFO - step:  7  loss:  4.91313  grad_norm: 17.7500  memory: 66.91GiB(37.51%)  tps: 10,003  tflops: 313.23  mfu: 13.92%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,037  tflops: 314.31  mfu: 13.97%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,037  tflops: 314.30  mfu: 13.97%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,037  tflops: 314.30  mfu: 13.97%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,038  tflops: 314.31  mfu: 13.97%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,039  tflops: 314.35  mfu: 13.97%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,038  tflops: 314.33  mfu: 13.97%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,039  tflops: 314.35  mfu: 13.97%
[titan] 2026-02-24 19:46:24,055 - root - INFO - step:  8  loss:  4.16736  grad_norm: 12.8125  memory: 66.91GiB(37.51%)  tps: 10,038  tflops: 314.31  mfu: 13.97%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,053  tflops: 314.78  mfu: 13.99%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.76  mfu: 13.99%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.75  mfu: 13.99%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.76  mfu: 13.99%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.77  mfu: 13.99%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.76  mfu: 13.99%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.76  mfu: 13.99%
[titan] 2026-02-24 19:46:25,685 - root - INFO - step:  9  loss:  3.80604  grad_norm:  9.1250  memory: 66.91GiB(37.51%)  tps: 10,052  tflops: 314.77  mfu: 13.99%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.43  mfu: 14.02%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.42  mfu: 14.02%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.41  mfu: 14.02%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.42  mfu: 14.02%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.41  mfu: 14.02%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.41  mfu: 14.02%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.41  mfu: 14.02%
[titan] 2026-02-24 19:46:27,312 - root - INFO - step: 10  loss:  3.59465  grad_norm:  8.3125  memory: 66.91GiB(37.51%)  tps: 10,073  tflops: 315.41  mfu: 14.02%
No AC with bfloat16 gate (main branch)
[titan] 2026-02-24 19:44:29,792 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 166.34GiB(93.26%)  tps: 2,092  tflops: 65.50  mfu: 2.91%
[titan] 2026-02-24 19:44:29,792 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 166.34GiB(93.26%)  tps: 2,081  tflops: 65.16  mfu: 2.90%
[titan] 2026-02-24 19:44:29,792 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 165.98GiB(93.06%)  tps: 2,091  tflops: 65.48  mfu: 2.91%
[titan] 2026-02-24 19:44:29,792 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 166.34GiB(93.26%)  tps: 2,098  tflops: 65.69  mfu: 2.92%
[titan] 2026-02-24 19:44:29,792 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 165.98GiB(93.06%)  tps: 2,094  tflops: 65.57  mfu: 2.91%
[titan] 2026-02-24 19:44:29,792 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 166.34GiB(93.26%)  tps: 2,096  tflops: 65.63  mfu: 2.92%
[titan] 2026-02-24 19:44:29,793 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 166.34GiB(93.26%)  tps: 2,089  tflops: 65.41  mfu: 2.91%
[titan] 2026-02-24 19:44:29,793 - root - INFO - step:  1  loss: 12.55633  grad_norm:  9.9375  memory: 166.34GiB(93.26%)  tps: 2,094  tflops: 65.57  mfu: 2.91%
[titan] 2026-02-24 19:44:31,995 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,995 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,995 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,995 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,996 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,995 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.18GiB(98.22%)  tps: 7,439  tflops: 232.94  mfu: 10.35%
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.54GiB(98.42%)  tps: 7,439  tflops: 232.94  mfu: 10.35%
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.54GiB(98.42%)  tps: 7,439  tflops: 232.94  mfu: 10.35%
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.54GiB(98.42%)  tps: 7,439  tflops: 232.95  mfu: 10.35%
[titan] 2026-02-24 19:44:31,996 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.18GiB(98.22%)  tps: 7,439  tflops: 232.93  mfu: 10.35%
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.54GiB(98.42%)  tps: 7,439  tflops: 232.94  mfu: 10.35%
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.54GiB(98.42%)  tps: 7,440  tflops: 232.96  mfu: 10.35%
[titan] 2026-02-24 19:44:31,996 - root - WARNING - 1 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:31,996 - root - INFO - step:  2  loss:  9.69482  grad_norm: 26.5000  memory: 175.54GiB(98.42%)  tps: 7,439  tflops: 232.94  mfu: 10.35%
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.60GiB(98.45%)  tps: 6,672  tflops: 208.93  mfu: 9.29%
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.62GiB(98.46%)  tps: 6,672  tflops: 208.93  mfu: 9.29%
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.60GiB(98.45%)  tps: 6,672  tflops: 208.92  mfu: 9.29%
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.62GiB(98.46%)  tps: 6,672  tflops: 208.93  mfu: 9.29%
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.62GiB(98.46%)  tps: 6,672  tflops: 208.92  mfu: 9.29%
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.60GiB(98.45%)  tps: 6,672  tflops: 208.92  mfu: 9.29%
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.58GiB(98.44%)  tps: 6,672  tflops: 208.94  mfu: 9.29%
[titan] 2026-02-24 19:44:34,452 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:34,452 - root - INFO - step:  3  loss: 10.47314  grad_norm: 24.7500  memory: 175.62GiB(98.46%)  tps: 6,672  tflops: 208.92  mfu: 9.29%
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,768 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.60GiB(98.45%)  tps: 12,452  tflops: 389.91  mfu: 17.33%
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,768 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.62GiB(98.46%)  tps: 12,452  tflops: 389.91  mfu: 17.33%
[titan] 2026-02-24 19:44:35,768 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.62GiB(98.46%)  tps: 12,452  tflops: 389.90  mfu: 17.33%
[titan] 2026-02-24 19:44:35,768 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.60GiB(98.45%)  tps: 12,451  tflops: 389.90  mfu: 17.33%
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,768 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.62GiB(98.46%)  tps: 12,453  tflops: 389.96  mfu: 17.33%
[titan] 2026-02-24 19:44:35,768 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.60GiB(98.45%)  tps: 12,453  tflops: 389.93  mfu: 17.33%
[titan] 2026-02-24 19:44:35,768 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.62GiB(98.46%)  tps: 12,452  tflops: 389.91  mfu: 17.33%
[titan] 2026-02-24 19:44:35,768 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:35,769 - root - INFO - step:  4  loss:  9.69192  grad_norm: 50.7500  memory: 175.58GiB(98.44%)  tps: 12,451  tflops: 389.90  mfu: 17.33%
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,088 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.60GiB(98.45%)  tps: 12,414  tflops: 388.71  mfu: 17.28%
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,088 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.62GiB(98.46%)  tps: 12,414  tflops: 388.72  mfu: 17.28%
[titan] 2026-02-24 19:44:37,088 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.62GiB(98.46%)  tps: 12,413  tflops: 388.69  mfu: 17.28%
[titan] 2026-02-24 19:44:37,088 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.60GiB(98.45%)  tps: 12,413  tflops: 388.69  mfu: 17.28%
[titan] 2026-02-24 19:44:37,088 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.62GiB(98.46%)  tps: 12,413  tflops: 388.70  mfu: 17.28%
[titan] 2026-02-24 19:44:37,088 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.62GiB(98.46%)  tps: 12,413  tflops: 388.70  mfu: 17.28%
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,088 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.60GiB(98.45%)  tps: 12,413  tflops: 388.70  mfu: 17.28%
[titan] 2026-02-24 19:44:37,088 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:37,089 - root - INFO - step:  5  loss:  7.66977  grad_norm: 47.7500  memory: 175.58GiB(98.44%)  tps: 12,418  tflops: 388.85  mfu: 17.28%
[titan] 2026-02-24 19:44:38,406 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,406 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,406 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,406 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.62GiB(98.46%)  tps: 12,437  tflops: 389.46  mfu: 17.31%
[titan] 2026-02-24 19:44:38,406 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,406 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.58GiB(98.44%)  tps: 12,440  tflops: 389.55  mfu: 17.31%
[titan] 2026-02-24 19:44:38,406 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,406 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.62GiB(98.46%)  tps: 12,437  tflops: 389.45  mfu: 17.31%
[titan] 2026-02-24 19:44:38,406 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,406 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.60GiB(98.45%)  tps: 12,437  tflops: 389.45  mfu: 17.31%
[titan] 2026-02-24 19:44:38,406 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.60GiB(98.45%)  tps: 12,436  tflops: 389.42  mfu: 17.31%
[titan] 2026-02-24 19:44:38,406 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.62GiB(98.46%)  tps: 12,437  tflops: 389.44  mfu: 17.31%
[titan] 2026-02-24 19:44:38,406 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,406 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.62GiB(98.46%)  tps: 12,437  tflops: 389.43  mfu: 17.31%
[titan] 2026-02-24 19:44:38,407 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:38,407 - root - INFO - step:  6  loss:  5.94582  grad_norm: 24.5000  memory: 175.60GiB(98.45%)  tps: 12,437  tflops: 389.43  mfu: 17.31%
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.62GiB(98.46%)  tps: 12,408  tflops: 388.53  mfu: 17.27%
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.60GiB(98.45%)  tps: 12,409  tflops: 388.57  mfu: 17.27%
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.62GiB(98.46%)  tps: 12,409  tflops: 388.57  mfu: 17.27%
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.58GiB(98.44%)  tps: 12,408  tflops: 388.54  mfu: 17.27%
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.60GiB(98.45%)  tps: 12,409  tflops: 388.57  mfu: 17.27%
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.62GiB(98.46%)  tps: 12,408  tflops: 388.55  mfu: 17.27%
[titan] 2026-02-24 19:44:39,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.60GiB(98.45%)  tps: 12,417  tflops: 388.82  mfu: 17.28%
[titan] 2026-02-24 19:44:39,727 - root - INFO - step:  7  loss:  5.67546  grad_norm: 21.8750  memory: 175.62GiB(98.46%)  tps: 12,410  tflops: 388.60  mfu: 17.27%
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.58GiB(98.44%)  tps: 12,464  tflops: 390.31  mfu: 17.35%
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.62GiB(98.46%)  tps: 12,466  tflops: 390.34  mfu: 17.35%
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.62GiB(98.46%)  tps: 12,465  tflops: 390.32  mfu: 17.35%
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.62GiB(98.46%)  tps: 12,462  tflops: 390.24  mfu: 17.34%
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.62GiB(98.46%)  tps: 12,463  tflops: 390.27  mfu: 17.35%
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.60GiB(98.45%)  tps: 12,465  tflops: 390.32  mfu: 17.35%
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.60GiB(98.45%)  tps: 12,464  tflops: 390.28  mfu: 17.35%
[titan] 2026-02-24 19:44:41,042 - root - INFO - step:  8  loss:  5.03076  grad_norm: 13.8125  memory: 175.60GiB(98.45%)  tps: 12,465  tflops: 390.33  mfu: 17.35%
[titan] 2026-02-24 19:44:42,355 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,355 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,355 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.62GiB(98.46%)  tps: 12,476  tflops: 390.66  mfu: 17.36%
[titan] 2026-02-24 19:44:42,355 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,355 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,355 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,356 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.62GiB(98.46%)  tps: 12,476  tflops: 390.68  mfu: 17.36%
[titan] 2026-02-24 19:44:42,356 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.60GiB(98.45%)  tps: 12,477  tflops: 390.70  mfu: 17.36%
[titan] 2026-02-24 19:44:42,356 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.62GiB(98.46%)  tps: 12,476  tflops: 390.68  mfu: 17.36%
[titan] 2026-02-24 19:44:42,356 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,356 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.62GiB(98.46%)  tps: 12,477  tflops: 390.68  mfu: 17.36%
[titan] 2026-02-24 19:44:42,356 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.60GiB(98.45%)  tps: 12,477  tflops: 390.71  mfu: 17.36%
[titan] 2026-02-24 19:44:42,356 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,356 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.58GiB(98.44%)  tps: 12,474  tflops: 390.61  mfu: 17.36%
[titan] 2026-02-24 19:44:42,356 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:42,356 - root - INFO - step:  9  loss:  4.40252  grad_norm: 12.0625  memory: 175.60GiB(98.45%)  tps: 12,476  tflops: 390.68  mfu: 17.36%
[titan] 2026-02-24 19:44:43,671 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,671 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,671 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,671 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,671 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.62GiB(98.46%)  tps: 12,459  tflops: 390.13  mfu: 17.34%
[titan] 2026-02-24 19:44:43,671 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.60GiB(98.45%)  tps: 12,460  tflops: 390.15  mfu: 17.34%
[titan] 2026-02-24 19:44:43,671 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,671 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.58GiB(98.44%)  tps: 12,460  tflops: 390.16  mfu: 17.34%
[titan] 2026-02-24 19:44:43,671 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.62GiB(98.46%)  tps: 12,459  tflops: 390.14  mfu: 17.34%
[titan] 2026-02-24 19:44:43,671 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,671 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.62GiB(98.46%)  tps: 12,458  tflops: 390.10  mfu: 17.34%
[titan] 2026-02-24 19:44:43,671 - root - INFO - Sleeping 2 seconds for other ranks to complete
[titan] 2026-02-24 19:44:43,671 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.62GiB(98.46%)  tps: 12,458  tflops: 390.11  mfu: 17.34%
[titan] 2026-02-24 19:44:43,671 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,672 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.60GiB(98.45%)  tps: 12,463  tflops: 390.27  mfu: 17.35%
[titan] 2026-02-24 19:44:43,672 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:44:43,672 - root - INFO - step: 10  loss:  3.89592  grad_norm:  8.7500  memory: 175.60GiB(98.45%)  tps: 12,459  tflops: 390.12  mfu: 17.34%
No AC with float32 gate
[titan] 2026-02-24 19:40:49,876 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.64GiB(94.55%)  tps: 2,146  tflops: 67.20  mfu: 2.99%
[titan] 2026-02-24 19:40:49,876 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.64GiB(94.55%)  tps: 2,144  tflops: 67.15  mfu: 2.98%
[titan] 2026-02-24 19:40:49,876 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.64GiB(94.55%)  tps: 2,139  tflops: 66.99  mfu: 2.98%
[titan] 2026-02-24 19:40:49,876 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.64GiB(94.55%)  tps: 2,145  tflops: 67.16  mfu: 2.98%
[titan] 2026-02-24 19:40:49,876 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.64GiB(94.55%)  tps: 2,157  tflops: 67.55  mfu: 3.00%
[titan] 2026-02-24 19:40:49,876 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.29GiB(94.35%)  tps: 2,147  tflops: 67.22  mfu: 2.99%
[titan] 2026-02-24 19:40:49,876 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.64GiB(94.55%)  tps: 2,137  tflops: 66.92  mfu: 2.97%
[titan] 2026-02-24 19:40:49,877 - root - INFO - step:  1  loss: 12.66745  grad_norm: 10.0000  memory: 168.29GiB(94.35%)  tps: 2,146  tflops: 67.19  mfu: 2.99%
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.38GiB(98.89%)  tps: 5,749  tflops: 180.02  mfu: 8.00%
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.35GiB(98.87%)  tps: 5,749  tflops: 180.01  mfu: 8.00%
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.38GiB(98.89%)  tps: 5,749  tflops: 180.01  mfu: 8.00%
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.35GiB(98.87%)  tps: 5,748  tflops: 180.00  mfu: 8.00%
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.35GiB(98.87%)  tps: 5,748  tflops: 180.00  mfu: 8.00%
[titan] 2026-02-24 19:40:52,727 - root - WARNING - 2 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.35GiB(98.87%)  tps: 5,749  tflops: 180.01  mfu: 8.00%
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.35GiB(98.87%)  tps: 5,749  tflops: 180.01  mfu: 8.00%
[titan] 2026-02-24 19:40:52,727 - root - INFO - step:  2  loss:  9.42829  grad_norm: 26.2500  memory: 176.35GiB(98.87%)  tps: 5,749  tflops: 180.01  mfu: 8.00%
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.78GiB(94.63%)  tps: 10,116  tflops: 316.78  mfu: 14.08%
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.78GiB(94.63%)  tps: 10,117  tflops: 316.78  mfu: 14.08%
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.78GiB(94.63%)  tps: 10,117  tflops: 316.81  mfu: 14.08%
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.78GiB(94.63%)  tps: 10,117  tflops: 316.80  mfu: 14.08%
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.79GiB(94.63%)  tps: 10,116  tflops: 316.78  mfu: 14.08%
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - WARNING - 3 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.78GiB(94.63%)  tps: 10,116  tflops: 316.77  mfu: 14.08%
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.79GiB(94.63%)  tps: 10,116  tflops: 316.76  mfu: 14.08%
[titan] 2026-02-24 19:40:54,347 - root - INFO - step:  3  loss:  9.71232  grad_norm: 43.2500  memory: 168.78GiB(94.63%)  tps: 10,116  tflops: 316.78  mfu: 14.08%
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.47GiB(98.94%)  tps: 8,629  tflops: 270.21  mfu: 12.01%
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.47GiB(98.94%)  tps: 8,630  tflops: 270.23  mfu: 12.01%
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.47GiB(98.94%)  tps: 8,629  tflops: 270.21  mfu: 12.01%
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.47GiB(98.94%)  tps: 8,629  tflops: 270.21  mfu: 12.01%
[titan] 2026-02-24 19:40:56,246 - root - WARNING - 5 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.48GiB(98.94%)  tps: 8,630  tflops: 270.23  mfu: 12.01%
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.47GiB(98.94%)  tps: 8,629  tflops: 270.19  mfu: 12.01%
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.47GiB(98.94%)  tps: 8,630  tflops: 270.22  mfu: 12.01%
[titan] 2026-02-24 19:40:56,246 - root - INFO - step:  4  loss:  7.86874  grad_norm: 24.3750  memory: 176.48GiB(98.94%)  tps: 8,629  tflops: 270.21  mfu: 12.01%
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.13GiB(94.83%)  tps: 10,378  tflops: 324.98  mfu: 14.44%
[titan] 2026-02-24 19:40:57,825 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.14GiB(94.83%)  tps: 10,379  tflops: 325.01  mfu: 14.44%
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.14GiB(94.83%)  tps: 10,378  tflops: 324.99  mfu: 14.44%
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - WARNING - 6 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:57,825 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.14GiB(94.83%)  tps: 10,379  tflops: 325.01  mfu: 14.44%
[titan] 2026-02-24 19:40:57,826 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.14GiB(94.83%)  tps: 10,379  tflops: 325.00  mfu: 14.44%
[titan] 2026-02-24 19:40:57,826 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.14GiB(94.83%)  tps: 10,379  tflops: 325.02  mfu: 14.45%
[titan] 2026-02-24 19:40:57,826 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.14GiB(94.83%)  tps: 10,379  tflops: 325.02  mfu: 14.45%
[titan] 2026-02-24 19:40:57,826 - root - INFO - step:  5  loss:  6.75186  grad_norm: 34.0000  memory: 169.13GiB(94.83%)  tps: 10,379  tflops: 325.02  mfu: 14.45%
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.44GiB(98.92%)  tps: 9,613  tflops: 301.03  mfu: 13.38%
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.45GiB(98.93%)  tps: 9,614  tflops: 301.05  mfu: 13.38%
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.44GiB(98.92%)  tps: 9,614  tflops: 301.05  mfu: 13.38%
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.44GiB(98.92%)  tps: 9,613  tflops: 301.03  mfu: 13.38%
[titan] 2026-02-24 19:40:59,530 - root - WARNING - 8 CUDA memory allocation retries.
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.44GiB(98.92%)  tps: 9,614  tflops: 301.05  mfu: 13.38%
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.45GiB(98.93%)  tps: 9,614  tflops: 301.05  mfu: 13.38%
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.44GiB(98.92%)  tps: 9,613  tflops: 301.02  mfu: 13.38%
[titan] 2026-02-24 19:40:59,530 - root - INFO - step:  6  loss:  5.99373  grad_norm: 25.8750  memory: 176.44GiB(98.92%)  tps: 9,614  tflops: 301.06  mfu: 13.38%
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,031 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.22GiB(95.43%)  tps: 10,918  tflops: 341.88  mfu: 15.19%
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,031 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.22GiB(95.43%)  tps: 10,918  tflops: 341.87  mfu: 15.19%
[titan] 2026-02-24 19:41:01,031 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.22GiB(95.44%)  tps: 10,918  tflops: 341.89  mfu: 15.19%
[titan] 2026-02-24 19:41:01,031 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.22GiB(95.43%)  tps: 10,917  tflops: 341.84  mfu: 15.19%
[titan] 2026-02-24 19:41:01,031 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.21GiB(95.43%)  tps: 10,918  tflops: 341.88  mfu: 15.19%
[titan] 2026-02-24 19:41:01,031 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.22GiB(95.44%)  tps: 10,917  tflops: 341.86  mfu: 15.19%
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,031 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.21GiB(95.43%)  tps: 10,918  tflops: 341.89  mfu: 15.20%
[titan] 2026-02-24 19:41:01,031 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:01,032 - root - INFO - step:  7  loss:  4.84441  grad_norm: 17.2500  memory: 170.22GiB(95.43%)  tps: 10,918  tflops: 341.87  mfu: 15.19%
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.75GiB(96.29%)  tps: 12,401  tflops: 388.32  mfu: 17.26%
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.75GiB(96.29%)  tps: 12,401  tflops: 388.31  mfu: 17.26%
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.75GiB(96.29%)  tps: 12,403  tflops: 388.39  mfu: 17.26%
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.75GiB(96.29%)  tps: 12,402  tflops: 388.33  mfu: 17.26%
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.76GiB(96.30%)  tps: 12,401  tflops: 388.32  mfu: 17.26%
[titan] 2026-02-24 19:41:02,353 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.76GiB(96.30%)  tps: 12,401  tflops: 388.33  mfu: 17.26%
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.75GiB(96.29%)  tps: 12,401  tflops: 388.32  mfu: 17.26%
[titan] 2026-02-24 19:41:02,353 - root - INFO - step:  8  loss:  4.34809  grad_norm: 13.0625  memory: 171.75GiB(96.29%)  tps: 12,402  tflops: 388.36  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.75GiB(96.29%)  tps: 12,399  tflops: 388.27  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.76GiB(96.30%)  tps: 12,400  tflops: 388.30  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.75GiB(96.29%)  tps: 12,399  tflops: 388.25  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.75GiB(96.29%)  tps: 12,399  tflops: 388.26  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.75GiB(96.29%)  tps: 12,401  tflops: 388.31  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.75GiB(96.29%)  tps: 12,399  tflops: 388.26  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.76GiB(96.30%)  tps: 12,399  tflops: 388.26  mfu: 17.26%
[titan] 2026-02-24 19:41:03,675 - root - INFO - step:  9  loss:  4.10388  grad_norm:  9.1875  memory: 171.75GiB(96.29%)  tps: 12,400  tflops: 388.30  mfu: 17.26%
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.76GiB(96.30%)  tps: 12,393  tflops: 388.07  mfu: 17.25%
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.75GiB(96.29%)  tps: 12,393  tflops: 388.07  mfu: 17.25%
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.75GiB(96.29%)  tps: 12,394  tflops: 388.09  mfu: 17.25%
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.75GiB(96.29%)  tps: 12,393  tflops: 388.06  mfu: 17.25%
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.75GiB(96.29%)  tps: 12,395  tflops: 388.13  mfu: 17.25%
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.76GiB(96.30%)  tps: 12,394  tflops: 388.11  mfu: 17.25%
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - WARNING - 9 CUDA memory allocation retries.
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.75GiB(96.29%)  tps: 12,393  tflops: 388.07  mfu: 17.25%
[titan] 2026-02-24 19:41:04,997 - root - INFO - step: 10  loss:  3.88823  grad_norm:  9.6250  memory: 171.75GiB(96.29%)  tps: 12,393  tflops: 388.07  mfu: 17.25%

Hope this helps!

The numerics don't change with this.

@meta-cla meta-cla bot added the CLA Signed label Feb 17, 2026
@chelsea0x3b
Contributor Author

chelsea0x3b commented Feb 17, 2026

Here's the config file I was using. I used the same config file for both of the logs in the description.

```toml
[job]
dump_folder = "./outputs"
description = "Gpt-oss debug training"
print_config = false

[profiling]
enable_profiling = false
profile_freq = 5

[metrics]
log_freq = 1
disable_color_printing = false
enable_tensorboard = false
enable_wandb = false

[model]
name = "gpt_oss"
flavor = "20b"
hf_assets_path = "./tests/assets/tokenizer"

[optimizer]
name = "AdamW"
lr = 8e-4
eps = 1e-8
implementation = "fused"

[lr_scheduler]
warmup_steps = 2 # lr scheduler warm up, normally 20% of the train steps
decay_ratio = 0.8 # lr scheduler decay ratio, 80% of the train steps
decay_type = "linear"
min_lr_factor = 0.0

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0 # grad norm clipping
steps = 10
dataset = "c4" # supported datasets: c4_test (2K), c4 (177M)
dtype = "bfloat16"

[training.dataloader]
num_workers = 4
pin_memory = true
persistent_workers = true
prefetch_factor = 2

[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
tensor_parallel_degree = 1
enable_async_tensor_parallel = true
expert_parallel_degree = 1
expert_tensor_parallel_degree = 1
expert_parallel_comm_backend = "standard" # or "deepep"

[checkpoint]
enable = false
folder = "checkpoint"
interval = 10
last_save_model_only = false
export_dtype = "float32"
async_mode = "disabled" # ["disabled", "async", "async_with_pinned_mem"]

[activation_checkpoint]
mode = "none" # ["none", "selective", "full"]
selective_ac_option = '2' # 'int' = ac every positive int layer or 'op', ac based on ops policy

[compile]
enable = true
components = ["model", "loss"]

[validation]
enable = false
dataset = "c4_validation"
freq = 5
steps = 10
```

</details>

@rakkit
Contributor

rakkit commented Feb 18, 2026

Thx @chelsea0x3b

I think the problem is we don't know what is "correct" for the reduce, BF16 or FP32.

In DeepEP it seems to be an FP32 reduce.

In Megatron the probs are actually cast back to BF16 right after the sigmoid (so the gather can also be at least 2x faster):
scores = torch.sigmoid(logits.float()).type_as(logits)

In downstream inference libs, e.g. SGLang, both BF16 and FP32 exist. (At some point there was something about FP8 training/inference consistency, with SGLang for inference and Megatron for training.)

So it's hard to tell what is correct, especially when scaling up.
If we decide to keep BF16, we could do it the Megatron way to make it even faster, or do only the reduce in BF16 but keep topk in FP32 for stability.
Or we keep the FP32 reduce (more DeepSeek style) via either bmm or .sum(dim=1). [I think the bmm being broken is more of a PyTorch-side issue?]
Or we add an option and let the user decide which way to go.
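
To illustrate what is at stake when the probabilities are cast back to BF16, here is a minimal sketch (not torchtitan or Megatron code; the logit values are made up for illustration). It applies the Megatron-style line quoted above to two distinct BF16 logits. In FP32 the sigmoid scores stay distinct, but after the cast back to BF16 they collapse to the same value, so a downstream top-k can no longer tell the two experts apart.

```python
import torch

# Two expert logits that are distinct and exactly representable in bf16
# (illustrative values, not taken from a real run).
logits = torch.tensor([5.0, 5.03125], dtype=torch.bfloat16)

# Keep the scores in fp32: the two experts remain distinguishable.
scores_fp32 = torch.sigmoid(logits.float())

# Megatron-style: compute the sigmoid in fp32, then cast back to the logits' dtype.
scores_bf16 = torch.sigmoid(logits.float()).type_as(logits)

print(scores_fp32.dtype, bool(scores_fp32[0] < scores_fp32[1]))
print(scores_bf16.dtype, bool(scores_bf16[0] == scores_bf16[1]))
```

Whether such ties matter in practice is exactly the open question in this thread; the sketch only shows that the two choices are not numerically equivalent.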

@garrett361
Contributor

I think the problem is we don't know what is "correct" for [bf16 or FP32] reduce.

Yeah, agreed that's the real issue.

In Megatron the probs are actually cast back to BF16 right after sigmoid

I found that Megatron does have a recommended --moe-router-dtype flag which forces higher-dtype computations, though.

or we make options to let user decide what to go

IMO making this configurable is reasonable.

@rakkit
Contributor

rakkit commented Feb 18, 2026

--moe-router-dtype in Megatron is for the gate's linear:
logits = router_gating_linear(input, self.weight, self.bias, router_dtype)
In torchtitan we do a BF16 gemm -> cast to fp32.

@garrett361
Contributor

IIUC in Megatron --moe-router-dtype (when provided) controls the gate output dtype, which then also determines the dtype that the router-weight * activation computation is done in. Do we disagree @rakkit ? LMK if I'm wrong or misunderstanding you.

@rakkit
Contributor

rakkit commented Feb 18, 2026

Oh, I see your point. Yes, you are right: if we force the logits to be fp32 then the rest should also stay in fp32.

@chelsea0x3b
Contributor Author

Just caught up on this conversation - so should I add some configuration alongside this?

@garrett361
Contributor

I suggest making it configurable and keeping the fp32 path as the default for BC

@rakkit
Contributor

rakkit commented Feb 19, 2026

Actually, I did another check on Megatron. If we set --moe-router-dtype = fp32 it seems to go to "BF16@BF16 -> FP32", so the actual gemm on the router's gate is BF16. Reference here, @garrett361.

That should be equivalent to us doing logits = torch.mm(x, self.gate.weight.t(), out_dtype=torch.float32)

TBH I don't see any reason not to do this (by default for the BF16 path). It then seems to make sense to keep the rest of the code in fp32?

@garrett361
Contributor

@rakkit understanding check: you're saying we should by default have something like

class TokenChoiceTopKRouter(nn.Module):
    def forward(
        self, x: torch.Tensor, expert_bias: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
-       scores = self.gate(x)
+       scores = torch.mm(x, self.gate.weight.t(), out_dtype=torch.float32)

?

Since we immediately cast scores to fp32 before the sigmoid/softmax anyway? Makes sense to me, if so, but a slightly different concern from the current PR.
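
A minimal sketch of the difference being discussed, assuming a bias-free `nn.Linear` gate (the dimensions and variable names are illustrative, not torchtitan's). As far as I know, `out_dtype` on `torch.mm` is limited to CUDA with half-precision inputs, so the sketch emulates "BF16 operands, FP32 output" by upcasting the BF16 operands, which are exactly representable in FP32. It compares that with the current path, where the logits are rounded to BF16 before being upcast.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, num_experts, num_tokens = 64, 4, 8

gate = nn.Linear(dim, num_experts, bias=False).to(torch.bfloat16)
x = torch.randn(num_tokens, dim).to(torch.bfloat16)

# Current path: bf16 GEMM with bf16 output, upcast afterwards.
# The logits have already been rounded to bf16 by the time they reach fp32.
logits_roundtrip = gate(x).float()

# Proposed path: bf16 operands, fp32 output. On CUDA this could be written as
#   torch.mm(x, gate.weight.t(), out_dtype=torch.float32)
# Here the operands are upcast instead so the sketch also runs on CPU.
logits_fp32 = torch.mm(x.float(), gate.weight.float().t())

scores = torch.sigmoid(logits_fp32)  # already fp32, no extra cast needed
print(logits_roundtrip.dtype, logits_fp32.dtype, scores.dtype)
print((logits_fp32 - logits_roundtrip).abs().max())
```

The printed maximum difference is the precision dropped by the intermediate BF16 rounding; the operands and the GEMM cost class are the same in both paths.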

@rakkit
Contributor

rakkit commented Feb 19, 2026

yes @garrett361.

We need a "fix" for torchtitan:
By default, a BF16 gate GEMM (with direct FP32 output instead of the current FP32->BF16->FP32 cast), and by default (current behavior, but the BMM needs fixing) an FP32 reduce (i.e. score_before_experts and score_after_experts).

Then optionally offer a BF16 and/or fully FP32 path.

@garrett361
Contributor

but need fix BMM

this is about the backwards slowness here?

Whatever solution is fine to me. Either fixing bmm or moving to elementwise-prod-then-sum. Seems like the former should be strictly faster, though, being a single op. Which is why I introduced the bmm call.

@rakkit
Contributor

rakkit commented Feb 19, 2026

@garrett361 yes, BMM is faster (even with these weird SM80 kernels). IDK whether it's a bug we need to fix or it's expected to be like that.

@garrett361
Contributor

yes, BMM is faster (even with these weird SM80 kernels)

Oh ok, I misunderstood the other thread; thought bmm was slower, somehow. Which was confusing me because I thought I tested 😅

So is this an accurate summary?

  1. Change self.gate(x) to store outputs fp32 and (no change) do the subsequent sigmoid/softmax in fp32 as well
  2. Make the score*outputs op dtype configurable
  3. Do the scores * outputs op in the configured dtype (default fp32), and always immediately cast the result back to outputs.dtype
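
The three points above might look roughly like the following sketch. The function name `combine_expert_outputs`, the `reduce_dtype` and `use_bmm` parameters, and the tensor shapes are assumptions made for illustration, not torchtitan's actual API. It also shows the two reductions discussed earlier in the thread, a single bmm versus an elementwise product followed by `.sum(dim=1)`, which should agree up to rounding.

```python
import torch


def combine_expert_outputs(
    outputs: torch.Tensor,  # (num_tokens, top_k, dim), e.g. bf16 expert outputs
    scores: torch.Tensor,   # (num_tokens, top_k), fp32 router scores
    reduce_dtype: torch.dtype = torch.float32,
    use_bmm: bool = True,
) -> torch.Tensor:
    """Weight each token's top-k expert outputs by its scores and sum them."""
    out = outputs.to(reduce_dtype)
    w = scores.to(reduce_dtype)
    if use_bmm:
        # (T, 1, K) @ (T, K, D) -> (T, 1, D)
        combined = torch.bmm(w.unsqueeze(1), out).squeeze(1)
    else:
        # Elementwise product, then sum over the top_k axis.
        combined = (out * w.unsqueeze(-1)).sum(dim=1)
    # Always hand the result back in the dtype of the expert outputs.
    return combined.to(outputs.dtype)


torch.manual_seed(0)
num_tokens, top_k, dim = 6, 2, 16
outputs = torch.randn(num_tokens, top_k, dim).to(torch.bfloat16)
logits = torch.randn(num_tokens, top_k)   # stands in for the fp32 gate output
scores = torch.softmax(logits, dim=-1)    # softmax stays in fp32

via_bmm = combine_expert_outputs(outputs, scores, use_bmm=True)
via_sum = combine_expert_outputs(outputs, scores, use_bmm=False)
in_bf16 = combine_expert_outputs(outputs, scores, reduce_dtype=torch.bfloat16)
print(via_bmm.dtype, via_sum.dtype, in_bf16.dtype)
```

Keeping the cast back to `outputs.dtype` inside the helper means callers see the same output dtype whichever `reduce_dtype` is configured.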

@rakkit
Contributor

rakkit commented Feb 19, 2026

So is this an accurate summary?

Change self.gate(x) to store outputs fp32 and (no change) do the subsequent sigmoid/softmax in fp32 as well
Make the score*outputs op dtype configurable
Do the scores * outputs op in the configured dtype (default fp32), and always immediately cast the result back to outputs.dtype

@garrett361 yes, and it is more or less aligned with Megatron.

(It still differs from DeepSeek's HF inference code; we can add a comment in the code in case someone wants something like an "ultimate FP32" version.)

@chelsea0x3b
Contributor Author

Another piece of data: openai's official gpt oss implementation doesn't use f32 at all: https://github.com/openai/gpt-oss/blob/main/gpt_oss/torch/model.py#L316

@chelsea0x3b
Contributor Author

@garrett361
Contributor

Yeah I saw that gpt-oss code as well @chelsea0x3b. Hadn't looked at the HF llama4 code. The meta llama4 code doesn't do any pre-sigmoid upcast at all. Not much consensus.

@chelsea0x3b
Contributor Author

IMO since there isn't really consensus I'd lean towards casting back to bfloat16 immediately (as in the PR) because it reduces memory by a decent amount (5% is like 9GB on b200, which is substantial for users not on large GPUs).

@garrett361
Contributor

@tianyu-l do you have a strong opinion here?

@acisseJZhong
Contributor

@garrett361 would something like #2448 work? The test no longer seems to have the same issue (failing on inductor).

@rakkit
Contributor

rakkit commented Feb 27, 2026

@acisseJZhong I had this version ages ago and compile works well

@acisseJZhong
Contributor

acisseJZhong commented Feb 27, 2026

lol @rakkit what prevents you from landing that change? shall we revive or maybe @chelsea0x3b could just fix this and land!

@rakkit
Contributor

rakkit commented Feb 27, 2026

@acisseJZhong some time ago I had that, and in my ablation (7b-a1b MoE, 64 experts) I didn't see a significant difference in performance. In mid-2025 we needed to train something on 40GB A100s, so I decided to keep bf16. (We trained another 7b x 1T tokens with bf16-bf16-bf16 and it kind of still works.) But TBH I never thought about TP and mixed precision at the beginning.

@chelsea0x3b
Contributor Author

sorry so we want to go with the auto cast? just getting a lot of mixed messages and i don't know who to listen to lol. about to go out of town for a week. would love to get this PR merged

@acisseJZhong
Contributor

acisseJZhong commented Feb 27, 2026

@chelsea0x3b can you help test numerics with the autocast approach (#2448) in your PR? Sorry for all the back and forth 🤣 hope we can land soon.

@chelsea0x3b
Contributor Author

@acisseJZhong numerics for autocast look good with DP/EP/ETP, PR should be good to go

Contributor

i think you can remove the cast to float32? as scores is already fp32. Maybe add a comment saying scores is by default fp32 already. Thanks!

@acisseJZhong
Contributor

@chelsea0x3b pls remove the cast to fp32 for scores, and you can feel free to land it! Would appreciate if you paste the testing result(numerics doesn't change) in PR description!

@pytorch-bot pytorch-bot bot removed the ciflow/8gpu label Feb 27, 2026
@chelsea0x3b
Contributor Author

I'm away from my laptop so I don't have access to the logs anymore, but I updated the description and removed the redundant .to call

@tianyu-l
Contributor

In the PR summary, why does AC=none result in more memory in bf16 than in fp32? Could you include the parallelism config you are using? If using EP, we should turn on load balanced routing to factor out the imbalance https://github.com/pytorch/torchtitan/blob/main/torchtitan/config/configs.py#L387

@acisseJZhong

Would appreciate if you paste the testing result(numerics doesn't change) in PR description!

I don't think the numerics should be the same even if we fix random seed and turn on deterministic mode, because fp32 gate matmul should give us different results than bf16 gate matmul.

@chelsea0x3b
Contributor Author

chelsea0x3b commented Feb 28, 2026

@tianyu-l because it kept hitting a ton of cuda memory reallocations. check out the full log i posted for that case, the memory usage varies a lot between each step and the number of reallocations increased each step. i just was taking the mem usage from a single step so it was hard to pick which step for that one bc there wasn't a good "average". the other cases were all much more consistent.

and yes the f32 was different numbers, especially for all the different cases, but the losses all followed the same pattern and were very close (within like .1 of each other or closer)

@tianyu-l
Contributor

@chelsea0x3b Regarding cuda memory reallocation, I guess you could use a smaller model, e.g. even the debugmodel to showcase.

Please fix lint error.

Labels

CLA Signed, high priority

Projects

Status: Todo

8 participants