Enable F8E4M3 conversions on Nvidia GPUs with sm < 89, and fix F8E5M2 conversions #140
woct0rdho merged 21 commits into release/3.4.x-windows from …
Conversation
|
Let me try to upstream it: triton-lang#7904 |
|
Can you please prioritize this feature? I want to use model compilation with fp8 FLUX on my 3070 Ti |
|
@sinand99 Are you using torch 2.8? You can try to download the wheel here (download the artifact at the bottom of the webpage). Feel free to ask if you see any error. |
|
@woct0rdho yes, using v2.8, Python 3.13, cu129. It works with SageAttention too, but gives me this warning: |
|
You can ignore these warnings. As you're using Flux, when seeing … I'd like to collect more user feedback before merging it. This kind of modification to the core part of Triton must be made with care. |
|
Hi, great addition to the library. Just a couple questions:
I'm asking because I just installed it, and as a very first impression, I think things feel slightly slower than usual. That's completely subjective, though, with only minimal testing (as in, I installed it and ran 2-3 inferences). I'll reinstall the stable Triton release, take some measurements, and reinstall this one to compare. It's just that, intuitively, it didn't really feel like things were suddenly much faster. This is my Comfy setup: no Pinokio or standalone Python or such, just a git-cloned repo and a Conda env. |
|
@jtabox The wheels above should work with both PyTorch 2.8.0 and the recent nightly versions. If you need older versions of Triton for older versions of PyTorch, I can also build those wheels. Yes, you can just install the wheel. What do you mean by 'triton is still at 3.3.0'? If you've installed … The fp8 model should be more accurate than GGUF Q4, but it's hard to say whether it's more accurate than GGUF Q8. You can also run some tests to see which one is faster. You need to be careful when taking an objective speed test: run the same workflow and change only the Triton version. Exit ComfyUI when reinstalling Triton. After starting ComfyUI, run the workflow once to let everything load and compile, then run it again to see the real speed. |
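The warm-up-then-measure procedure described above can be sketched in plain Python (the `benchmark` helper here is illustrative, not part of ComfyUI or Triton):

```python
import time

def benchmark(fn, warmup=1, runs=3):
    """Call fn untimed a few times so loading/compilation finishes,
    then return the best wall-clock time over the timed runs."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Example: time a toy workload after one warm-up call
t = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3 runs: {t:.4f} s")
```

The warm-up run is what absorbs Triton's kernel compilation, which is why only the later runs reflect the real speed.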
Force-pushed from 5c2d2e5 to 3882b5e
|
@woct0rdho do you recommend moving to PyTorch 2.8 now? I am still on 2.7 and it works so far on everything |
|
@FurkanGozukara You can move to PyTorch 2.8, especially because the performance on Blackwell GPUs will be better. In some cases, … |
|
awesome thanks. |
|
Let me merge this and see how it rolls out. I had hoped this could be superseded by GGUF + better … This is the first feature in the 'core' part (rather than the Windows support code) that's deliberately different from official Triton. It should also work on Linux, but I'm not sure what the best way to publish Linux wheels is. I'm not an expert on PTX; help in optimizing that PTX code is welcome. |
|
@woct0rdho yesterday I generated an FP8 scaled version of the Qwen Image Edit 2509 model, and inference is about 1.5x faster compared to GGUF, with the same quality. I don't see any benefit of GGUF anymore, unless you are going for lower quants like Q5 perhaps. I am using it with native ComfyUI. |
|
I'm trying the newest release, 3.5.0-post21, and encountered this error. I have the latest nightly ComfyUI + torch 2.8.0.dev20250521+cu128 installed. It was working fine with 3.3.1.post19. Do I need to upgrade torch to 2.9? |
|
@jprsyt5 Triton 3.5 may not work with an earlier nightly version of torch 2.8.
What script are you using for the conversion to scaled? This one? https://github.com/Clybius/Learned-Rounding |
I coded it myself based on Musubi Tuner (https://github.com/kohya-ss/musubi-tuner). Kohya is a legend; he also recently analyzed and visually compared different scaling methods and upgraded his fp8 scaling method
Hi @woct0rdho, I can confirm the compile hasn't worked yet; speed hasn't changed whether I compile or not. This is my working workflow for Qwen fp8. It is also the only working workflow; otherwise ComfyUI creates a black image. Not-working workflow:
These are my ComfyUI settings: PyTorch 2.9 stable, CUDA 12.8, and a 3090 GPU. I already installed triton-windows 3.5.0.post2.1
|
|
@pivtienduc If you don't see this error, then the compile itself works. The compile does not always give a significant speedup, and it's known that things like compile + SageAttention + fast fp16 accum may result in a black image because there's too much precision loss. It's also known that in SageAttention the CUDA kernel is less likely to cause a black image than the Triton kernel. My best hope is to let Nunchaku supersede all this... For Qwen-Image, you may try to use … Also, you need a node to do block swap. By default ComfyUI does not do it, and when the model size exceeds your VRAM, Windows moves it to 'shared GPU memory', which is actually CPU memory accessed in a very slow way. |
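As a toy illustration of why fast fp16 accumulation loses precision (this simulates fp16 rounding with the stdlib; it is not SageAttention's actual kernel): once a running sum gets large enough, each small addend falls below half a ULP and is rounded away entirely.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest fp16 value (via struct's half format)."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Accumulate 10000 * 0.01 (exact answer: 100.0), rounding the
# partial sum to fp16 after every addition
s_fp16 = 0.0
for _ in range(10_000):
    s_fp16 = to_fp16(s_fp16 + 0.01)

# The sum stalls near 32: at that magnitude the spacing between
# representable fp16 values (0.03125) exceeds 2 * 0.01, so every
# further addition rounds back to the same value.
print(s_fp16)  # ~32, far from the exact 100.0
```

Accumulating in fp32 (as without the "fast accum" option) avoids this stall at the cost of some speed.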
Well, I already tried all types of compile nodes; I put the native compile node there for reference only. Compile is working, but so far no speed increase. |
|
Wait, it seems the block swap nodes in ComfyUI-TaylorSeer cannot be used alone (I haven't tried it). I know there is https://github.com/orssorbit/ComfyUI-wanBlockswap, and I guess you can make it work for Qwen-Image with minimal modification (I haven't tried that either...)
Hi @woct0rdho, I wonder if block swap is different from the Wan Tile node. I use Wan tile to transfer part of the model to RAM:
|
At a quick glance, it seems ComfyUI-WanImageToVideoTiled only divides the image into tiles in the VAE, not in the diffusion model. There are indeed some ways to divide the image into tiles in the diffusion model (such as those tiled-upscale workflows), but the whole Qwen-Image diffusion model is still too large for your 24G VRAM. You need block swap to divide the model.
No, it's not the tiled VAE decode. It divides into tiles in the encode process. I can put 14 GB into RAM, and I can even generate a 1-megapixel video with 121 frames on my 3090 alone. By the way, I use the WAN 2.2 Q8 GGUF, not the bf16 model
|
Yes, I meant the tiled VAE encode. I guess you're familiar with the tiled VAE decode, and in the … But anyway, when running Wan, you can load the 14G model into your VRAM and still have 10G left for activations (the size of the activations depends on how large your video is), so you probably don't need block swap. When running Qwen-Image, you need to load the 20G model into your VRAM, leaving only 4G for activations, so you probably do need block swap. You can see the GPU memory usage in Task Manager. If any 'shared GPU memory' is used, your workflow will be much slower. |
Are those comparisons publicly available? Curious to see them. |
Hmm, still an error for me; testing on a 3070 & 3080, both throw the same error.
Logs:
Name: torch
Name: triton-windows
0%| | 0/8 [00:00<?, ?it/s]E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Error while creating guard:
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Name: ''
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Source: shape_env
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Create Function: SHAPE_ENV
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Guard Types: ['SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV']
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Code List: ["L['x'].stride()[0] == 5120*L['x'].size()[1]", "L['x'].stride()[2] == L['x'].size()[1]", "___as_tensor(L['self']._modules['norm1'].eps).item() == 1e-06", "L['self']._modules['norm1'].eps == 1e-06", "L['freqs'].size()[1] == L['x'].size()[1]", "L['freqs'].size()[5] == L['freqs'].size()[4]", "L['freqs'].stride()[0] == L['x'].size()[1]*L['freqs'].size()[3]*L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[1] == L['freqs'].size()[3]*L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[2] == L['x'].size()[1]*L['freqs'].size()[3]*L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[3] == L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[4] == L['freqs'].size()[4]", "5120*L['x'].size()[1] <= 2147483647", "2 <= L['x'].size()[1] and L['x'].size()[1] <= 2147483647", "2 <= L['freqs'].size()[3]", "2 <= L['freqs'].size()[4]"]
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Object Weakref: None
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Guarded Class Weakref: None
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Traceback (most recent call last):
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 137, in check_compiler_exist_windows
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] subprocess.check_output([compiler, "/help"], stderr=subprocess.STDOUT)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 466, in check_output
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 548, in run
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] with Popen(*popenargs, **kwargs) as process:
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 1026, in __init__
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] self._execute_child(args, executable, preexec_fn, close_fds,
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 1538, in _execute_child
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] FileNotFoundError: [WinError 2] The system cannot find the file specified
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] The above exception was the direct cause of the following exception:
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Traceback (most recent call last):
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_guards.py", line 366, in create
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] return self.create_fn(builder, self)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 2671, in SHAPE_ENV
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] clib = CppCodeCache.load(func_str)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2839, in load
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] return cls.load_async(*args, **kwargs)()
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2705, in load_async
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] "vec_isa": pick_vec_isa(),
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 497, in pick_vec_isa
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] _valid_vec_isa_list: list[VecISA] = valid_vec_isa_list()
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in valid_vec_isa_list
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] isa_list.extend(
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in <genexpr>
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] isa_list.extend(
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 143, in __bool__
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] return self.__bool__impl(config.cpp.vec_isa_ok)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 153, in __bool__impl
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] return self.check_build(VecISA._avx_code)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 103, in check_build
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] extra=_get_isa_dry_compile_fingerprint(self._arch_flags),
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 29, in _get_isa_dry_compile_fingerprint
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] compiler_info = get_compiler_version_info(get_cpp_compiler())
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] ^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 338, in get_cpp_compiler
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] check_compiler_exist_windows(compiler)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 139, in check_compiler_exist_windows
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] raise RuntimeError(f"Compiler: {compiler} is not found.") from exc
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] RuntimeError: Compiler: cl is not found.
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] Created at:
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 773, in trace_frame
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] tracer = InstructionTranslator(
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\symbolic_convert.py", line 3847, in __init__
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] output=OutputGraph(
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\output_graph.py", line 508, in __init__
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] self.init_ambient_guards()
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\output_graph.py", line 668, in init_ambient_guards
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] self.guards.add(ShapeEnvSource().make_guard(GuardBuilder.SHAPE_ENV))
0%| | 0/8 [00:07<?, ?it/s]
!!! Exception during processing !!! RuntimeError: Compiler: cl is not found.
Traceback (most recent call last):
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 277, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\nodes.py", line 1521, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\nodes.py", line 1488, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\sample.py", line 45, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 1143, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 1033, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 1018, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 986, in outer_sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 969, in inner_sample
samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 748, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\k_diffusion\sampling.py", line 190, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 400, in __call__
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 949, in __call__
return self.predict_noise(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 952, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 380, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 206, in calc_cond_batch
return executor.execute(model, conds, x_in, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 325, in _calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\model_base.py", line 155, in apply_model
return comfy.patcher_extension.WrapperExecutor.new_class_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.wrappers[self.idx](self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy_api\torch_helpers\torch_compile.py", line 26, in apply_torch_compile_wrapper
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 104, in __call__
return new_executor.execute(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\model_base.py", line 194, in _apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\ldm\wan\model.py", line 580, in forward
return self.forward_orig(x, timestep, context, clip_fea=clip_fea, freqs=freqs, transformer_options=transformer_options, **kwargs)[:, :, :t, :h, :w]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\ldm\wan\model.py", line 550, in forward_orig
x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 414, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 832, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1874, in __call__
result = self._torchdynamo_orig_backend(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1624, in __call__
result = self._inner_convert(
^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 688, in __call__
result = _compile(
^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1494, in _compile
raise InternalTorchDynamoError(
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1433, in _compile
guarded_code, tracer_output = compile_inner(code, one_graph, hooks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_utils_internal.py", line 92, in wrapper_function
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1117, in compile_inner
return _compile_inner(code, one_graph, hooks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1251, in _compile_inner
check_fn = dynamo_output.build_guards(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 856, in build_guards
return CheckFunctionManager(
^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 3383, in __init__
builder, guard_manager = self.build_guards(
^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 3674, in build_guards
guard.create(builder)
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_guards.py", line 366, in create
return self.create_fn(builder, self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 2671, in SHAPE_ENV
clib = CppCodeCache.load(func_str)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2839, in load
return cls.load_async(*args, **kwargs)()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2705, in load_async
"vec_isa": pick_vec_isa(),
^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 497, in pick_vec_isa
_valid_vec_isa_list: list[VecISA] = valid_vec_isa_list()
^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in valid_vec_isa_list
isa_list.extend(
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in <genexpr>
isa_list.extend(
^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 143, in __bool__
return self.__bool__impl(config.cpp.vec_isa_ok)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 153, in __bool__impl
return self.check_build(VecISA._avx_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 103, in check_build
extra=_get_isa_dry_compile_fingerprint(self._arch_flags),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 29, in _get_isa_dry_compile_fingerprint
compiler_info = get_compiler_version_info(get_cpp_compiler())
^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 338, in get_cpp_compiler
check_compiler_exist_windows(compiler)
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 139, in check_compiler_exist_windows
raise RuntimeError(f"Compiler: {compiler} is not found.") from exc
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: Compiler: cl is not found.

Edit: With this recent version, I need to run vcvars64.bat in the Command Prompt before starting ComfyUI. Is it also supposed to be showing this in the logs? |
|
@jprsyt5 What's your MSVC version (you can see it when modifying the components in your Visual Studio)? What if you update it to the latest version? |
C:\Users\user>"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
** Visual Studio 2022 Developer Command Prompt v17.7.4
[vcvarsall.bat] Environment initialized for: 'x64'

C:\Users\user>cl
usage: cl [ option... ] filename... [ /link linkoption... ]
Tried checking in Visual Studio Installer, and it shows as the latest version |
|
Currently the latest version should be MSVC 14.44. Is there MSVC 14.44 in your …?

A bug like this has happened before, but I rarely see it. To help debug, you can modify … :

```python
with tempfile.TemporaryDirectory() as tmpdir:
    # Add this
    tmpdir = os.path.basename(tmpdir)
    tmpdir = rf"C:\tmp\{tmpdir}"
    os.makedirs(tmpdir, exist_ok=True)
    src_path = os.path.join(tmpdir, f"{name}.c")
```

Then run the workflow again. |
Solved! Now torch.compile works fine with FP8_e4m3fn on my Ampere GPU! What I did was reinstall the whole Visual Studio, because I was trying to upgrade the VS installer and it wouldn't let me. And yeah, my previous MSVC wasn't the latest version.

C:\Users\user>cl
Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35217 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
usage: cl [ option... ] filename... [ /link linkoption... ]

As for the performance, I don't really notice any difference. I did a quick test running text2image (just 1 frame) using WAN 2.2, and the results were similar in performance. |
My bad, I thought it was something new. Actually I already use the multi-GPU node: https://github.com/pollockjj/ComfyUI-MultiGPU



Motivation
Nvidia GPUs with sm < 89 are still widely used; see e.g. the Steam hardware survey. When running large AI models, a common usage is to store the parameters in fp8 and cast them to fp16 for computation on hardware that doesn't have native fp8 support. This reduces the memory requirement, even though it gives no speed advantage. This PR aims to enable torch.compile for this usage.

We may refer to XLA's fallback mechanism for fp8 operations, see openxla/xla#23124, although I think we only need to support the conversions rather than all arithmetic operations.
Implementation
Before triton-lang#2105, there was some PTX code for converting F8E4M3/F8E5M2 <-> F16/BF16, but it did not correctly handle denormalized values and round-to-nearest-even (RTNE). I've fixed these cases and added the code for F32 -> F8E4M3/F8E5M2.
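To make the denormal and RTNE cases concrete, here is a pure-Python reference codec for E4M3 (the fn variant with bias 7, no inf, and 0x7F/0xFF as NaN, matching torch.float8_e4m3fn). It is a model of what the fixed PTX has to compute, not the PTX itself:

```python
import math

def e4m3_to_float(b: int) -> float:
    """Decode one E4M3 (fn) byte."""
    sign = -1.0 if b & 0x80 else 1.0
    exp, man = (b >> 3) & 0xF, b & 0x7
    if exp == 0xF and man == 0x7:     # 0x7F/0xFF is NaN; this format has no inf
        return math.nan
    if exp == 0:                      # denormal: man * 2^-9
        return sign * man * 2.0 ** -9
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def float_to_e4m3(x: float) -> int:
    """Encode a float as E4M3 (fn) with round-to-nearest-even; overflow -> NaN."""
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0
    if math.isnan(x) or math.isinf(x):
        return sign | 0x7F
    a = abs(x)
    if a == 0.0:
        return sign
    _, e = math.frexp(a)              # a = m * 2^e with m in [0.5, 1)
    E = max(e - 1, -6)                # clamp into the denormal range
    # Python's round() is round-half-to-even, and the division is exact
    # because the quantum is a power of two.
    q = round(a / 2.0 ** (E - 3))
    if E == -6 and q < 8:             # stayed denormal
        bits = q
    else:                             # normal; q == 16 carries into the exponent
        bits = ((E + 7) << 3) + (q - 8)
    if bits >= 0x7F:                  # above the max finite value 448
        return sign | 0x7F
    return sign | bits
```

For example, 17.0 is halfway between the representable 16 and 18 and rounds to 16 (even mantissa), while 19.0 is halfway between 18 and 20 and rounds to 20; a naive round-half-up implementation gets both wrong in opposite directions.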
I've tested that for all 2^8 F8E4M3/F8E5M2 values, all 2^16 F16/BF16 values, and all 2^32 F32 values, the conversion results are bitwise identical to the PyTorch implementation, except for some glitches around inf and nan; see the comments. The tests in test_conversions.py pass.

I've checked that all unit tests pass on an RTX 3080 (sm86). There is no IR change for sm >= 90. For sm89, there is a minor change: previously F32 -> F8E4M3/F8E5M2 was implemented as F32 -> F16 -> F8E4M3/F8E5M2 without correct RTNE; now it's implemented directly with RTNE.