Conversation
tianyu-l
left a comment
I did not swap the initialization in the experiments folder, but I can easily add a commit changing those too.
any reason we don't? If not, I think we should.
GPTOSS was calling `init_weights` in its `__init__` method, which does cause some errors.
why does it cause errors? we can remove those calls if so
```python
    b=cutoff_factor * final_out_std,
)

# If weight tying is enabled, we don't need to initialize the output layer
```
Qwen3 was not initializing the output weights when `enable_weight_tying=True`. This would mean that the embedding initialization would have been used for the output weights, which would have caused the loss to skyrocket past 500.
To clarify, do you mean we should override the embedding init with this output weight init? Why is one direction of override better than the other?
The problem is that the output initialization should have a small std; otherwise we get high logits and, consequently, a high loss. If the weights are tied, we should prioritize the output layer initialization. At the moment, running

```shell
NGPU=1 CONFIG_FILE="./torchtitan/models/qwen3/train_configs/qwen3_0.6b.toml" ./run_train.sh --training.steps 100 --training.seq_len 256 --compile.no-enable --training.dtype bfloat16 --metrics.enable-tensorboard --job.dump-folder "./outputs/qwen3"
```

yields:
[rank0]:[titan] 2026-02-09 09:03:01,409 - root - INFO - step: 1 loss: 127.88081 grad_norm: 42.2500 memory: 4.48GiB(28.75%) tps: 267 tflops: 0.75 mfu: 0.24%
[rank0]:[titan] 2026-02-09 09:03:01,696 - root - INFO - step: 2 loss: 124.81187 grad_norm: 66.5000 memory: 6.21GiB(39.92%) tps: 3,576 tflops: 10.08 mfu: 3.23%
[rank0]:[titan] 2026-02-09 09:03:01,905 - root - INFO - step: 3 loss: 119.61317 grad_norm: 43.0000 memory: 6.21GiB(39.92%) tps: 4,909 tflops: 13.84 mfu: 4.44%
[rank0]:[titan] 2026-02-09 09:03:02,110 - root - INFO - step: 4 loss: 119.98587 grad_norm: 238.0000 memory: 6.21GiB(39.92%) tps: 5,006 tflops: 14.11 mfu: 4.52%
[rank0]:[titan] 2026-02-09 09:03:02,314 - root - INFO - step: 5 loss: 115.53140 grad_norm: 29.8750 memory: 6.21GiB(39.92%) tps: 5,035 tflops: 14.19 mfu: 4.55%
I see. I guess this is the reason for #1879? @wwwjn
@francesco-bertolotti could you add a comment explaining why we're overriding?
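For reference, the override direction can be illustrated with a minimal sketch (a hypothetical toy model, not the actual torchtitan code): with tied weights, the output projection and the token embedding share a single tensor, so whichever initialization runs last is the one that survives. Running the small-std output init last keeps the initial logits small.

```python
import torch
import torch.nn as nn

# Toy model (illustrative only, not torchtitan's code): with weight tying,
# the output projection and the token embedding share one tensor, so the
# initialization that runs LAST is the one that survives.
class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 100, dim: int = 32, tie_weights: bool = True):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim)
        self.output = nn.Linear(dim, vocab_size, bias=False)
        if tie_weights:
            self.output.weight = self.tok_embeddings.weight

    def init_weights(self) -> None:
        # Embedding init first: comparatively large std.
        nn.init.normal_(self.tok_embeddings.weight, std=1.0)
        # Output init last: small std, so with tying it overrides the
        # embedding init above and the initial logits stay small.
        nn.init.normal_(self.output.weight, std=0.02)

model = TinyLM(tie_weights=True)
model.init_weights()
# The shared tensor ends up with the small output std (roughly 0.02), not 1.0.
print(model.output.weight.std().item())
```

If the embedding init ran last instead, the shared tensor would carry std 1.0, the logits would be on the order of the hidden dimension, and the first-step loss would blow up as in the logs above.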
> I didn’t swap the initialization in the
No objection on my end — I just wanted to confirm before doing it.
In This only happens for If keeping
No, it's not required / expected. Let's just remove it.
In the last commits I have:
@francesco-bertolotti could you help remove this line as well?
removing weight initialization from model's init as per request from @tianyu-l in [comment](#2342 (comment))
This PR addresses #2269
Briefly, there is a numerical instability in `torch.nn.init.trunc_normal_` that causes an abnormal number of left bounds to appear in the weights. I consistently swapped all usages of `torch.nn.init.trunc_normal_` with a custom implementation of `trunc_normal_` located in `torchtitan/models/utils.py`.

I have made two other fixes that would require some attention; they do not have much to do with `trunc_normal_`, but they felt right:

- Qwen3 was not initializing the output weights when `enable_weight_tying=True`. This meant that the embedding initialization was used for the output weights, which caused the loss to skyrocket past 500.
- GPTOSS was calling `init_weights` in its `__init__` method, which does cause some errors.

I did not swap the initialization in the `experiments` folder, but I can easily add a commit changing those too.

Here I have some debug runs with associated losses:
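For context on the instability: `torch.nn.init.trunc_normal_` draws a uniform sample in CDF space and maps it back through `erfinv`, and when that transform saturates in low precision, samples can land exactly on the bounds. A resampling-based variant sidesteps this. The following is a minimal sketch of that general technique; the actual `trunc_normal_` in `torchtitan/models/utils.py` may differ.

```python
import torch

def trunc_normal_(tensor: torch.Tensor, mean: float = 0.0, std: float = 1.0,
                  a: float = -2.0, b: float = 2.0) -> torch.Tensor:
    """Fill `tensor` in place with N(mean, std) truncated to [a, b].

    Sketch of a resampling-based truncated normal (illustrative; the
    torchtitan implementation may differ). Unlike the inverse-CDF
    transform, out-of-range draws are simply redrawn, so no probability
    mass piles up exactly on the bounds.
    """
    with torch.no_grad():
        tensor.normal_(mean, std)
        out_of_range = (tensor < a) | (tensor > b)
        # Redraw only the offending entries until all lie inside [a, b].
        while out_of_range.any():
            redraw = torch.empty_like(tensor).normal_(mean, std)
            tensor[out_of_range] = redraw[out_of_range]
            out_of_range = (tensor < a) | (tensor > b)
    return tensor

w = torch.empty(1024, 64)
trunc_normal_(w, std=0.02, a=-0.04, b=0.04)  # bounds at +/- 2 std
assert float(w.min()) >= -0.04 and float(w.max()) <= 0.04
```

With bounds a couple of standard deviations from the mean, the loop typically finishes in one or two passes; for very tight truncation intervals, an inverse-CDF method computed in float64 would be the better trade-off.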