
Add emulate in float8 and relative checks#1214

Merged
mori360 merged 22 commits into pytorch:main from mori360:add_fp8_emulate
May 28, 2025

Conversation

@mori360 (Contributor) commented May 21, 2025

Add [emulate](https://github.com/pytorch/ao/blob/554cb60c750e6ef31bbcafec74bb76a4578902da/torchao/float8/config.py#L193) in float8 to enable testing on older hardware.

Update the related warnings.

Test result:
Tested locally on an 8×H100 server.

`CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.converters="float8" --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp --float8.force_recompute_fp8_weight_in_bwd`
[Screenshot 2025-05-21 at 2:38:39 PM]

`CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --model.converters="float8" --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp --float8.force_recompute_fp8_weight_in_bwd --float8.emulate`
[Screenshot 2025-05-21 at 2:39:01 PM]

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) May 21, 2025
@mori360 changed the title from "Add emulate in float 8 and relative checks" to "Add emulate in float8 and relative checks" May 21, 2025
@mori360 marked this pull request as ready for review May 22, 2025 03:09
@mori360 requested review from tianyu-l and vkuzo May 22, 2025 03:09
@tianyu-l (Contributor) left a comment

Thanks for working on this! I left some inline comments.

wandb
fsspec
tyro
torchao
Contributor:

I think the recommended way of installing torchao is still via nightly, similar to how we install PyTorch nightly for CI (https://github.com/pytorch/torchtitan/blob/main/.github/workflows/integration_test_8gpu.yaml#L39), but for torchao:

`USE_CPP=0 python -m pip install git+https://github.com/pytorch/ao.git`

"To enable support on older hardware, set `float8.emulate` to True.",
)
return
elif float8_config.emulate and job_config.training.compile:
Contributor:

I wonder if emulate+compile works on H100? Since the original comment from @vkuzo is

torch.compile with float8 dtypes is not going to work on older hardware, so the emulation can only be used in eager mode.

mori360 (Contributor, Author) commented May 22, 2025

Will run some tests on it.

Contributor Author:

Tested and it works; removed this exception.

Comment on lines 455 to 457
Whether to run on earlier hardware in CI test.
torch.compile with float8 dtypes is not going to work on older hardware, so the emulation can
only be used in eager mode.
Contributor:

Suggested change
Whether to run on earlier hardware in CI test.
torch.compile with float8 dtypes is not going to work on older hardware, so the emulation can
only be used in eager mode.
If True, emulation is used instead of hardware accelerated gemm. This is for test purpose only, as the current CI does have sm_90 capability, required by Float8.
Not compatible with torch.compile.

This is assuming torch.compile + emulate doesn't work on >= H100 either. If not, we'll need to further adjust the code and helper message.

return
elif float8_config.emulate and job_config.training.compile:
logger.warning(
"Failed to run on emulate with compile on, please disable compile to allow on emulate.",
Contributor:

We should just raise an exception if the configuration combination is not runnable.

@mori360 marked this pull request as draft May 22, 2025 17:35
@mori360 marked this pull request as ready for review May 22, 2025 18:41
@mori360 requested review from fegin and tianyu-l May 22, 2025 18:41
Comment on lines 455 to 457
If True, emulation is used instead of hardware accelerated gemm. This is for test purpose only,
as the current CI does have sm_90 capability, required by Float8.
Not compatible with torch.compile.
Contributor:

Suggested change
If True, emulation is used instead of hardware accelerated gemm. This is for test purpose only,
as the current CI does have sm_90 capability, required by Float8.
Not compatible with torch.compile.
If True, emulation is used instead of hardware accelerated gemm. This is for test purpose only,
as the current CI does not have sm_89 capability, required by Float8.

Contributor:

Could you make this update to the docstring? Otherwise it seems inaccurate.

Suggested change
If True, emulation is used instead of hardware accelerated gemm. This is for test purpose only,
as the current CI does have sm_90 capability, required by Float8.
Not compatible with torch.compile.
If True, emulation is used instead of hardware accelerated gemm. This is for test purpose only,
as the current CI does not have sm_89 capability, required by Float8.

logger.warning(
"Failed to swap to Float8Linear because float8 is only supported on SM89 or later",
"Failed to swap to Float8Linear because float8 is only supported on SM89 or later."
"To enable support on older hardware, set `float8.emulate` to True.",
Contributor:

Suggested change
"To enable support on older hardware, set `float8.emulate` to True.",
"To enable testing on older hardware, set `float8.emulate` to True in eager mode.",


float8_config: Float8 = job_config.float8
if not has_cuda_capability(8, 9):
if not has_cuda_capability(8, 9) and not float8_config.emulate:
Contributor:

On sm < 89, we can't enable torch.compile with or without emulate, right? If so, let's do:

Suggested change
if not has_cuda_capability(8, 9) and not float8_config.emulate:
if not has_cuda_capability(8, 9) and (job_config.training.compile or not float8_config.emulate):

Also, it's a bit hard to read. A better way may be:

if has_cuda_capability(8, 9) or (float8_config.emulate and not job_config.training.compile):
    pass
else:
    raise ValueError(...)
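Putting the pieces of this thread together, the guard could look roughly like the sketch below. The config classes and `has_cuda_capability` here are simplified stand-ins for torchtitan's real objects, not the actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Float8Config:
    emulate: bool = False


@dataclass
class TrainingConfig:
    compile: bool = False


def has_cuda_capability(major: int, minor: int) -> bool:
    # Stand-in: pretend we are on pre-sm89 hardware (e.g. the CPU CI runner).
    return False


def check_float8_support(float8: Float8Config, training: TrainingConfig) -> None:
    # Allowed: sm89+ hardware, or emulation in eager mode on older hardware.
    if has_cuda_capability(8, 9) or (float8.emulate and not training.compile):
        return
    raise ValueError(
        "Float8 requires SM89 or later; on older hardware set float8.emulate "
        "to True and disable torch.compile."
    )


# Emulation in eager mode passes; emulate + compile on old hardware raises.
check_float8_support(Float8Config(emulate=True), TrainingConfig(compile=False))
try:
    check_float8_support(Float8Config(emulate=True), TrainingConfig(compile=True))
except ValueError as e:
    print("rejected:", e)
```

This keeps the "happy path" condition readable and raises (rather than warns) on every unrunnable combination, which is what the review converged on.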

@mori360 marked this pull request as draft May 23, 2025 17:15
@tianyu-l (Contributor) left a comment

The CPU CI error is because we changed the warning to an exception when sm < 89.
I think we can just add the emulate flag to https://github.com/pytorch/torchtitan/blob/main/tests/unit_tests/test_model_converter.py#L42
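A minimal sketch of what that test fix might look like. The converter builder and config class below are hypothetical stand-ins for torchtitan's actual model-converter API, used only to illustrate why the flag is needed on the CPU runner:

```python
from dataclasses import dataclass


@dataclass
class Float8Config:
    emulate: bool = False


def build_float8_converter(config: Float8Config, has_sm89: bool) -> str:
    # Stand-in for the converter constructor: after this PR it raises on
    # pre-sm89 hardware unless emulation is enabled.
    if not has_sm89 and not config.emulate:
        raise ValueError("Failed to swap to Float8Linear: requires SM89 or later")
    return "float8_converter"


def test_build_model_converters_on_cpu():
    # The CPU CI runner has no sm89 GPU, so the test must set emulate=True
    # to get past the new hardware check.
    converter = build_float8_converter(Float8Config(emulate=True), has_sm89=False)
    assert converter == "float8_converter"


test_build_model_converters_on_cpu()
```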

@mori360 marked this pull request as ready for review May 28, 2025 03:08
@mori360 requested a review from tianyu-l May 28, 2025 03:09
@tianyu-l (Contributor) left a comment

LGTM. Please address final comments before merge.


pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cpu
Contributor:

Curious, can we specify `USE_CPP=0` here too?

Contributor Author:

Yeah, it works.

@mori360 merged commit 594a120 into pytorch:main May 28, 2025
6 checks passed
@mori360 deleted the add_fp8_emulate branch May 28, 2025 17:44
wwwjn pushed a commit to wwwjn/torchtitan that referenced this pull request Jun 2, 2025
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 13, 2026
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 25, 2026
Labels

CLA Signed This label is managed by the Meta Open Source bot.

4 participants