
Enable F8E4M3 conversions on Nvidia GPUs with sm < 89, and fix F8E5M2 conversions #140

Merged

woct0rdho merged 21 commits into release/3.4.x-windows from fp-emu on Oct 15, 2025
Conversation

@woct0rdho (Owner) commented Aug 11, 2025

Motivation

Nvidia GPUs with sm < 89 are still widely used; see e.g. the Steam hardware survey. When running large AI models, a common usage is to store the parameters in fp8 and cast them to fp16 for computation on hardware that has no native fp8 support. This reduces the memory requirement, even though it gives no speed advantage. This PR aims to enable torch.compile for this usage.
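As a hedged sketch of that pattern (the module and shapes are made up, not from this PR): weights live in fp8 and are upcast right before the matmul, and torch.compile is expected to fuse that cast into the generated Triton kernels:

```python
import torch

class Fp8Linear(torch.nn.Module):
    """Hypothetical example: store weights in fp8, compute in fp16."""
    def __init__(self, in_features, out_features):
        super().__init__()
        w = torch.randn(out_features, in_features, dtype=torch.float16)
        # fp8 storage halves the memory of fp16 weights
        self.weight = torch.nn.Parameter(w.to(torch.float8_e4m3fn), requires_grad=False)

    def forward(self, x):
        # F8E4M3 -> F16 cast: this is the conversion this PR enables on sm < 89
        return x @ self.weight.to(torch.float16).t()

model = torch.compile(Fp8Linear(4096, 4096).cuda())
y = model(torch.randn(8, 4096, dtype=torch.float16, device="cuda"))
```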

We may refer to XLA's fallback mechanism for fp8 operations (see openxla/xla#23124), although I think we only need to support the conversions rather than all arithmetic operations.

Implementation

Before triton-lang#2105, there was some PTX code for converting F8E4M3/F8E5M2 <-> F16/BF16, but it did not correctly handle denormalized values or rounding to nearest even (RTNE). I've fixed these cases and added code for F32 -> F8E4M3/F8E5M2.
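For intuition only, here is a pure-Python reference of the F8E4M3(fn) -> F16 direction, including the denormal handling mentioned above. This is an illustrative sketch, not the PTX in this PR; no rounding is needed in this direction because every fp8 value is exactly representable in fp16:

```python
def f8e4m3fn_to_f16_bits(b: int) -> int:
    # e4m3fn: 1 sign, 4 exponent (bias 7), 3 mantissa bits; no inf, NaN = x.1111.111
    s, e, m = (b >> 7) & 1, (b >> 3) & 0xF, b & 0x7
    if e == 0xF and m == 0x7:
        return (s << 15) | 0x7E00            # NaN
    if e != 0:
        return (s << 15) | ((e + 8) << 10) | (m << 7)  # normal: rebias 7 -> 15
    if m == 0:
        return s << 15                       # signed zero
    # fp8 denormal (value = m * 2^-9): renormalize, since fp16 holds it as a normal
    p = m.bit_length() - 1                   # position of the leading 1 bit
    return (s << 15) | ((p + 6) << 10) | ((m << (10 - p)) & 0x3FF)
```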

I've tested that for all 2^8 F8E4M3/F8E5M2 values, all 2^16 F16/BF16 values, and all 2^32 F32 values, the conversion results are bitwise identical to the PyTorch implementation, except for some glitches involving inf and nan (see the comments). The tests in test_conversions.py pass.
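A minimal sketch of such an exhaustive check for the F8E4M3 -> F16 direction (assuming a CUDA device and recent PyTorch/Triton; this kernel is a stand-in, not the PR's test code):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def cast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float16), mask=mask)

# All 256 fp8 bit patterns; PyTorch's own cast is the reference.
x8 = torch.arange(256, dtype=torch.uint8, device="cuda").view(torch.float8_e4m3fn)
y = torch.empty(256, dtype=torch.float16, device="cuda")
cast_kernel[(1,)](x8, y, 256, BLOCK=256)
ref = x8.to(torch.float16)
# Compare bit patterns, so denormal and NaN handling is checked exactly.
assert torch.equal(y.view(torch.int16), ref.view(torch.int16))
```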

I've checked that all unit tests pass on an RTX 3080 (sm86). There is no IR change for sm >= 90. For sm89 there is a minor change: previously F32 -> F8E4M3/F8E5M2 was implemented as F32 -> F16 -> F8E4M3/F8E5M2 without correct RTNE; now it's implemented directly with RTNE.

@woct0rdho woct0rdho changed the title from "[WIP] Enable FP8 conversion on sm < 89" to "[WIP] Enable F8E4M3 conversion on Nvidia GPUs with sm < 89" on Aug 12, 2025
@woct0rdho woct0rdho changed the title from "[WIP] Enable F8E4M3 conversion on Nvidia GPUs with sm < 89" to "Enable F8E4M3 conversions on Nvidia GPUs with sm < 89, and fix F8E5M2 conversions" on Aug 19, 2025
@woct0rdho woct0rdho marked this pull request as ready for review August 19, 2025 15:49
@woct0rdho (Owner, Author)

Let me try to upstream it: triton-lang#7904

@sinand99

Can you please prioritize this feature? I want to use model compilation with fp8 FLUX on my 3070 Ti.

@woct0rdho (Owner, Author)

@sinand99 Are you using torch 2.8? You can try to download the wheel here (download the artifact at the bottom of the webpage):
https://github.com/Comfy-Org/wheels/actions/runs/17170280663

Feel free to ask if you see any error.

@sinand99 commented Aug 23, 2025

@woct0rdho Yes, using v2.8, Python 3.13, cu129.
It seems to be working on ComfyUI 0.3.51 except for the errors below. When will you commit this to the main branch?

W0823 17:19:30.178000 22892 Lib\site-packages\torch\_inductor\utils.py:1436] [0/0] Not enough SMs to use max_autotune_gemm mode
[02:18<00:59,  4.94s/it]W0823 17:21:12.734000 22892 Lib\site-packages\torch\_dynamo\convert_frame.py:1016] [0/8] torch._dynamo hit config.recompile_limit (8)
W0823 17:21:12.734000 22892 Lib\site-packages\torch\_dynamo\convert_frame.py:1016] [0/8]    function: 'forward' (G:\AI\Image\ComfyUI_windows_portable\ComfyUI\comfy\ldm\flux\model.py:216)
W0823 17:21:12.734000 22892 Lib\site-packages\torch\_dynamo\convert_frame.py:1016] [0/8]    last reason: 0/7: transformer_options['current_percent'] == 0.35
W0823 17:21:12.734000 22892 Lib\site-packages\torch\_dynamo\convert_frame.py:1016] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0823 17:21:12.734000 22892 Lib\site-packages\torch\_dynamo\convert_frame.py:1016] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html

It also works with Sage attention, but gives me this warning:

Lib\site-packages\torch\_dynamo\variables\functions.py:1575: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)

@woct0rdho (Owner, Author) commented Aug 23, 2025

You can ignore these warnings. `Not enough SMs to use max_autotune_gemm mode` basically means your GPU is too small for some of the autotuning. `Dynamo detected a call to a functools.lru_cache-wrapped function` is something the PyTorch developers are working on.

As you're using Flux, when you see `torch._dynamo hit config.recompile_limit (8)`, you can use the TorchCompileModelFluxAdvancedV2 node in KJNodes and set a larger recompile_limit there.
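If you're not using that node, the same limit can be raised in code before compiling; a minimal sketch (the value 64 is arbitrary):

```python
import torch._dynamo

# Default is 8, per the warning above; raise it so the changing
# transformer_options values don't exhaust the limit.
torch._dynamo.config.recompile_limit = 64
```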

I'd like to collect more user feedback before merging it. This kind of modification to the core part of Triton must be made with care.

@jtabox commented Aug 24, 2025

Hi, great addition to the library. Just a couple of questions:

  1. This doesn't require PyTorch nightly, right? I have the latest stable (2.8.0).
  2. Are there more "installation steps" than just downloading the build you linked a couple of posts ago, unzipping, and installing the Python wheel? It's triton-windows v3.4.0+gita4ccd09a.post21.fp8; triton is still at 3.3.0, but I don't think that matters, right?
  3. Is it mostly FP8 or GGUF models that are supposed to get the most benefit?

I'm asking because I just installed it, and as a very first impression, things feel slightly slower than usual. That's completely subjective, though, with only minimal testing (as in, I installed it and ran 2-3 inferences).

I'll reinstall the stable Triton release, take some measurements, and reinstall this one to compare. It's just that intuitively it didn't really feel like things were suddenly much faster.

This is my Comfy:

Total VRAM 10240 MB, total RAM 32682 MB
pytorch version: 2.8.0+cu129
xformers version: 0.0.32.post2
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3080 : cudaMallocAsync
Using sage attention  # its version is 2.2.0+cu128torch2.8.0.post2
Python version: 3.11.12 | packaged by conda-forge
ComfyUI version: 0.3.51
ComfyUI frontend version: 1.25.10

No Pinokio or standalone Python or such; I just git cloned the repo and made a Conda env.

@woct0rdho (Owner, Author) commented Aug 25, 2025

@jtabox The wheels above should work with both PyTorch 2.8.0 and the recent nightly versions. If you need older versions of Triton for older versions of PyTorch, I can also build those wheels.

Yes, you can just install the wheel. What do you mean by 'triton is still at 3.3.0'? If you've installed triton-windows, you don't need to install another package named triton.

An fp8 model should be more accurate than GGUF Q4, but it's hard to say whether it's more accurate than GGUF Q8. You can also do some tests to see which one is faster.

You need to be careful when running an objective speed test. Run the same workflow, and change only the Triton version. Exit ComfyUI when reinstalling Triton. After starting ComfyUI, run the workflow once to let everything load and compile, then run it again to see the real speed.
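The same idea in plain PyTorch terms, as a sketch (`run_workflow` is a hypothetical stand-in for one full run of your workflow):

```python
import time
import torch

def bench(run_workflow, iters=3):
    run_workflow()               # first run loads and compiles; don't time it
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        run_workflow()
    torch.cuda.synchronize()     # wait for the GPU before reading the clock
    return (time.perf_counter() - t0) / iters
```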

@woct0rdho woct0rdho force-pushed the release/3.4.x-windows branch from 5c2d2e5 to 3882b5e on August 31, 2025 01:21
@FurkanGozukara

@woct0rdho Do you recommend moving to PyTorch 2.8 now? I'm still on 2.7 and it works so far for everything.

@woct0rdho (Owner, Author)

@FurkanGozukara You can move to PyTorch 2.8, especially because the performance on Blackwell GPUs will be better.

In some cases torch.compile does not work well in PyTorch 2.8, but it also does not work well in PyTorch 2.7, and it will be better in the next version of PyTorch.

@FurkanGozukara

Awesome, thanks.

@woct0rdho (Owner, Author) commented Oct 15, 2025

Let me merge this and see how it rolls out. I hoped this could be superseded by GGUF + better torch.compile or Nunchaku, but as of PyTorch 2.9 I realized that fp8 + the block swap in ComfyUI-WanVideoWrapper (or ComfyUI-wanBlockswap for native workflows) runs faster and causes fewer recompilations than GGUF + the block swap in ComfyUI-GGUF on my machine.

This is the first feature in the 'core' part (rather than the Windows support code) that's deliberately different from official Triton. It should also work on Linux, but I'm not sure of the best way to publish Linux wheels.

I'm not an expert on PTX. Help with optimizing that PTX code is welcome.

@woct0rdho woct0rdho merged commit 4d53687 into release/3.4.x-windows Oct 15, 2025
@woct0rdho woct0rdho deleted the fp-emu branch October 15, 2025 02:56
@FurkanGozukara commented Oct 15, 2025

@woct0rdho Yesterday I generated an FP8 scaled version of the Qwen Image Edit 2509 model, and inference is about 1.5x faster compared to GGUF, with the same quality.

I don't see any benefit of GGUF anymore, unless you're going for lower quants like Q5 perhaps.

I'm using it with native ComfyUI.

@jprsyt5 commented Oct 15, 2025

I'm trying the newest release, 3.5.0-post21, and encountered this error.

Requested to load WAN21
loaded partially 5616.674973297119 5604.770568847656 642
(RES4LYF) rk_type: res_2s
  0%|                                                                                            | 0/4 [00:00<?, ?it/s]!!! Exception during processing !!! backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\triton\compiler\compiler.py)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

Traceback (most recent call last):
  File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 315, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 289, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 277, in process_inputs
    result = f(**inputs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\nodes.py", line 1525, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\nodes.py", line 1492, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\sample.py", line 45, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 1161, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 1051, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 1036, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 1004, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 987, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 759, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\RES4LYF\beta\__init__.py", line 167, in sample_res_2s
    return rk_sampler_beta.sample_rk_beta(model, x, sigmas, None, extra_args, callback, disable, rk_type="res_2s",)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\RES4LYF\beta\rk_sampler_beta.py", line 1665, in sample_rk_beta
    eps_[row], data_[row] = RK(x_tmp, s_tmp, x_0, sigma, transformer_options={'row': row, 'x_tmp': x_tmp, 'sigma_next': sigma_next})
  File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\RES4LYF\beta\rk_method_beta.py", line 901, in __call__
    denoised = self.model_denoised(x.to(self.model_device), sub_sigma.to(self.model_device), **self.extra_args).to(sigma.device)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\RES4LYF\beta\rk_method_beta.py", line 241, in model_denoised
    denoised = self.model(x, sigma * s_in, **extra_args)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 408, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 960, in __call__
    return self.outer_predict_noise(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 967, in outer_predict_noise
    ).execute(x, timestep, model_options, seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 970, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 388, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 206, in calc_cond_batch
    return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 214, in _calc_cond_batch_outer
    return executor.execute(model, conds, x_in, timestep, model_options)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\samplers.py", line 333, in _calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\model_base.py", line 161, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 113, in execute
    return self.wrappers[self.idx](self, *args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy_api\torch_helpers\torch_compile.py", line 26, in apply_torch_compile_wrapper
    return executor(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 105, in __call__
    return new_executor.execute(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\model_base.py", line 200, in _apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\ldm\wan\model.py", line 614, in forward
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\ldm\wan\model.py", line 634, in _forward
    return self.forward_orig(x, timestep, context, clip_fea=clip_fea, freqs=freqs, transformer_options=transformer_options, **kwargs)[:, :, :t, :h, :w]
  File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\ldm\wan\model.py", line 579, in forward_orig
    x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len, transformer_options=transformer_options)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_dynamo\eval_frame.py", line 372, in __call__
    return super().__call__(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_dynamo\eval_frame.py", line 712, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_dynamo\output_graph.py", line 1636, in _call_user_compiler
    raise BackendCompilerFailed(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_dynamo\output_graph.py", line 1611, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_dynamo\repro\after_dynamo.py", line 150, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\__init__.py", line 2364, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\compile_fx.py", line 2358, in compile_fx
    return aot_autograd(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_dynamo\backends\common.py", line 106, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_functorch\aot_autograd.py", line 1189, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_functorch\_aot_autograd\autograd_cache.py", line 923, in load
    compiled_fn = dispatch_and_compile()
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_functorch\aot_autograd.py", line 1174, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_functorch\aot_autograd.py", line 576, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_functorch\aot_autograd.py", line 836, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_functorch\_aot_autograd\jit_compile_runtime_wrappers.py", line 243, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_functorch\aot_autograd.py", line 483, in __call__
    return self.compiler_fn(gm, example_inputs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\compile_fx.py", line 2190, in fw_compiler_base
    return inner_compile(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\compile_fx.py", line 716, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_dynamo\repro\after_aot.py", line 124, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\compile_fx.py", line 823, in _compile_fx_inner
    (key_info, cache_info) = FxGraphCache.prepare_key(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\codecache.py", line 1399, in prepare_key
    key, debug_lines = compiled_fx_graph_hash(
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\codecache.py", line 905, in compiled_fx_graph_hash
    details = FxGraphHashDetails(gm, example_inputs, fx_kwargs, inputs_to_check)
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\codecache.py", line 877, in __init__
    self.system_info = CacheBase.get_system()
  File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\torch\_inductor\codecache.py", line 188, in get_system
    from triton.compiler.compiler import triton_key
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\triton\compiler\compiler.py)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (F:\AI\ComfyUI-Nightly\ComfyUI\venv\lib\site-packages\triton\compiler\compiler.py)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

I have the latest nightly Comfy + torch 2.8.0.dev20250521+cu128 installed. It was working fine with 3.3.1.post19.

Do I need to upgrade torch to 2.9?

@woct0rdho (Owner, Author) commented Oct 15, 2025

@jprsyt5 Triton 3.5 may not work with an earlier nightly version of torch 2.8. I only tested that it works with the stable version of torch 2.8 and the release candidate of torch 2.9 (downloaded from the 'test' channel; it should be almost identical to the last nightly and the stable version). Let's assume it only works with the stable version of torch 2.9.

@jtabox commented Oct 15, 2025

> @woct0rdho Yesterday I generated an FP8 scaled version of the Qwen Image Edit 2509 model, and inference is about 1.5x faster compared to GGUF, with the same quality.
> I don't see any benefit of GGUF anymore, unless you're going for lower quants like Q5 perhaps.
> I'm using it with native ComfyUI.

What script are you using for the conversion to scaled? This one? https://github.com/Clybius/Learned-Rounding

@FurkanGozukara

> > @woct0rdho Yesterday I generated an FP8 scaled version of the Qwen Image Edit 2509 model, and inference is about 1.5x faster compared to GGUF, with the same quality.
> > I don't see any benefit of GGUF anymore, unless you're going for lower quants like Q5 perhaps.
> > I'm using it with native ComfyUI.
>
> What script are you using for the conversion to scaled? This one? https://github.com/Clybius/Learned-Rounding

I coded it myself based on Musubi Tuner: https://github.com/kohya-ss/musubi-tuner

Kohya is a legend. He also recently analyzed and visually compared different scaling methods and upgraded his fp8 scaling method.

@pivtienduc

> @jprsyt5 Triton 3.5 may not work with an earlier nightly version of torch 2.8. I only tested that it works with the stable version of torch 2.8 and the release candidate of torch 2.9 (downloaded from the 'test' channel; it should be almost identical to the last nightly and the stable version).

Hi @woct0rdho

I can confirm the compile hasn't made a difference yet; the speed is the same whether I compile or not. This is my working workflow for Qwen fp8. It is also the only working workflow; otherwise ComfyUI creates a black image.

[screenshot: working workflow]

Not working workflow:

[screenshot: non-working workflow]

These are my ComfyUI settings. I use the stable PyTorch 2.9, CUDA 12.8, and a 3090 GPU; I have already installed triton-windows 3.5.0.post2.1.

[screenshot: ComfyUI settings]

@woct0rdho (Owner, Author) commented Oct 16, 2025

@pivtienduc If you don't see this error, then the compile itself works.

The compile does not always give a significant speedup, and it's known that things like compile + SageAttention + fast fp16 accum may result in a black image because there's too much precision loss. It's also known that in SageAttention the CUDA kernel is less likely to cause a black image than the Triton kernel. My best hope is to let Nunchaku supersede all this...

For Qwen-Image, you may try TorchCompileModeQwenImage in KJNodes rather than the original compile node; it compiles only the heaviest part rather than the whole pipeline, and it may help reduce the compilation time.

Also, you need a node to do block swap. By default ComfyUI does not do it, and when the model size exceeds your VRAM, Windows will move it to 'shared GPU memory', which is actually CPU memory accessed in a very slow way. There are some block swap nodes in https://github.com/philipy1219/ComfyUI-TaylorSeer

@pivtienduc

> @pivtienduc If you don't see this error, then the compile itself works.
>
> The compile does not always give a significant speedup, and it's known that things like compile + SageAttention + fast fp16 accum may result in a black image because there's too much precision loss. It's also known that in SageAttention the CUDA kernel is less likely to cause a black image than the Triton kernel. My best hope is to let Nunchaku supersede all this...
>
> For Qwen-Image, you may try TorchCompileModeQwenImage in KJNodes rather than the original compile node; it compiles only the heaviest part rather than the whole pipeline, and it may help reduce the compilation time.

Well, I already tried all types of compile nodes; I put the native compile node there for your reference only. Compile works, but so far there is no speed increase.
I will try https://github.com/philipy1219/ComfyUI-TaylorSeer. Thank you so much for your support.

@woct0rdho (Owner, Author)

Wait, it seems the block swap nodes in ComfyUI-TaylorSeer cannot be used alone (I haven't tried it).

I know there is https://github.com/orssorbit/ComfyUI-wanBlockswap, and I guess you can make it work for Qwen-Image with minimal modification (I haven't tried that either...)

@pivtienduc

> Wait, it seems the block swap nodes in ComfyUI-TaylorSeer cannot be used alone (I haven't tried it).
>
> I know there is https://github.com/orssorbit/ComfyUI-wanBlockswap, and I guess you can make it work for Qwen-Image with minimal modification (I haven't tried that either...)

Hi @woct0rdho

I wonder whether block swap is different from the Wan Tile node. I use Wan Tile to transfer part of the model to RAM:
https://github.com/stduhpf/ComfyUI--WanImageToVideoTiled

@woct0rdho (Owner, Author) commented Oct 16, 2025

At a quick glance, it seems ComfyUI--WanImageToVideoTiled only divides the image into tiles in the VAE, not in the diffusion model.

There are indeed ways to divide the image into tiles in the diffusion model (such as those tiled upscale workflows), but the whole Qwen-Image diffusion model is still too large for your 24G of VRAM. You need block swap to divide the model.

@pivtienduc commented Oct 16, 2025

> It seems ComfyUI--WanImageToVideoTiled only divides the image into tiles in the VAE, not in the diffusion model.
>
> There are indeed ways to divide the image into tiles in the diffusion model (such as those tiled upscale workflows), but the whole Qwen-Image diffusion model is still too large for your 24G of VRAM. You need block swap to divide the model.

No, it's not the tiled VAE decode; it divides the encode process into tiles. I can put 14 GB into RAM, and I can even generate a 1-megapixel video with 121 frames on my 3090 alone. By the way, I use the WAN 2.2 Q8 GGUF, not the bf16 model.

@woct0rdho (Owner, Author) commented Oct 16, 2025

Yes, I meant the tiled VAE encode. I guess you're familiar with the tiled VAE decode; in WanImageToVideo there is a similar procedure to encode your input image using the VAE, which can be sped up by tiling. However, that's not the KSampler node, which runs the diffusion model.

But anyway, when running Wan, you can load the 14G model into your VRAM and still have 10G for activations (the size of the activations depends on how large your video is), so you probably don't need block swap.

Whereas when running Qwen-Image, you need to load the 20G model into your VRAM, which leaves only 4G for activations, so you probably do need block swap.

You can see the GPU memory usage in Task Manager. If any 'shared GPU memory' is used, then your workflow will be much slower.

@jtabox commented Oct 16, 2025

> > > @woct0rdho Yesterday I generated an FP8 scaled version of the Qwen Image Edit 2509 model, and inference is about 1.5x faster compared to GGUF, with the same quality.
> > > I don't see any benefit of GGUF anymore, unless you're going for lower quants like Q5 perhaps.
> > > I'm using it with native ComfyUI.
> >
> > What script are you using for the conversion to scaled? This one? Clybius/Learned-Rounding
>
> I coded it myself based on Musubi Tuner: kohya-ss/musubi-tuner
>
> Kohya is a legend. He also recently analyzed and visually compared different scaling methods and upgraded his fp8 scaling method.

Are those comparisons publicly available? Curious to see them.

@jprsyt5 commented Oct 16, 2025

> @jprsyt5 Triton 3.5 may not work with an earlier nightly version of torch 2.8. I only tested that it works with the stable version of torch 2.8 and the release candidate of torch 2.9 (downloaded from the 'test' channel; it should be almost identical to the last nightly and the stable version).

Hmm, still an error for me. I'm testing it on a 3070 & a 3080; both throw the same error.

Logs

Name: torch
Version: 2.9.0+cu128

Name: triton-windows
Version: 3.5.0.post21

    0%|                                                                                            | 0/8 [00:00<?, ?it/s]E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Error while creating guard:
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Name: ''
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     Source: shape_env
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     Create Function: SHAPE_ENV
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     Guard Types: ['SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV', 'SHAPE_ENV']
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     Code List: ["L['x'].stride()[0] == 5120*L['x'].size()[1]", "L['x'].stride()[2] == L['x'].size()[1]", "___as_tensor(L['self']._modules['norm1'].eps).item() == 1e-06", "L['self']._modules['norm1'].eps == 1e-06", "L['freqs'].size()[1] == L['x'].size()[1]", "L['freqs'].size()[5] == L['freqs'].size()[4]", "L['freqs'].stride()[0] == L['x'].size()[1]*L['freqs'].size()[3]*L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[1] == L['freqs'].size()[3]*L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[2] == L['x'].size()[1]*L['freqs'].size()[3]*L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[3] == L['freqs'].size()[4]*L['freqs'].size()[4]", "L['freqs'].stride()[4] == L['freqs'].size()[4]", "5120*L['x'].size()[1] <= 2147483647", "2 <= L['x'].size()[1] and L['x'].size()[1] <= 2147483647", "2 <= L['freqs'].size()[3]", "2 <= L['freqs'].size()[4]"]
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     Object Weakref: None
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     Guarded Class Weakref: None
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Traceback (most recent call last):
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 137, in check_compiler_exist_windows
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     subprocess.check_output([compiler, "/help"], stderr=subprocess.STDOUT)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 466, in check_output
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 548, in run
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     with Popen(*popenargs, **kwargs) as process:
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 1026, in __init__
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     self._execute_child(args, executable, preexec_fn, close_fds,
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 1538, in _execute_child
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] FileNotFoundError: [WinError 2] The system cannot find the file specified
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] The above exception was the direct cause of the following exception:
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] Traceback (most recent call last):
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_guards.py", line 366, in create
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     return self.create_fn(builder, self)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 2671, in SHAPE_ENV
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     clib = CppCodeCache.load(func_str)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2839, in load
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     return cls.load_async(*args, **kwargs)()
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2705, in load_async
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     "vec_isa": pick_vec_isa(),
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]                ^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 497, in pick_vec_isa
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     _valid_vec_isa_list: list[VecISA] = valid_vec_isa_list()
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]                                         ^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in valid_vec_isa_list
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     isa_list.extend(
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in <genexpr>
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     isa_list.extend(
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]                    ^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 143, in __bool__
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     return self.__bool__impl(config.cpp.vec_isa_ok)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 153, in __bool__impl
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     return self.check_build(VecISA._avx_code)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 103, in check_build
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     extra=_get_isa_dry_compile_fingerprint(self._arch_flags),
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 29, in _get_isa_dry_compile_fingerprint
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     compiler_info = get_compiler_version_info(get_cpp_compiler())
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]                                               ^^^^^^^^^^^^^^^^^^
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 338, in get_cpp_compiler
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     check_compiler_exist_windows(compiler)
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 139, in check_compiler_exist_windows
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2]     raise RuntimeError(f"Compiler: {compiler} is not found.") from exc
E1016 17:06:52.875000 25680 Lib\site-packages\torch\_guards.py:368] [0/0_2] RuntimeError: Compiler: cl is not found.
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2] Created at:
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 773, in trace_frame
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]     tracer = InstructionTranslator(
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\symbolic_convert.py", line 3847, in __init__
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]     output=OutputGraph(
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\output_graph.py", line 508, in __init__
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]     self.init_ambient_guards()
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]   File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\output_graph.py", line 668, in init_ambient_guards
E1016 17:06:52.892000 25680 Lib\site-packages\torch\_guards.py:370] [0/0_2]     self.guards.add(ShapeEnvSource().make_guard(GuardBuilder.SHAPE_ENV))
0%|                                                                                            | 0/8 [00:07<?, ?it/s]
!!! Exception during processing !!! RuntimeError: Compiler: cl is not found.

Traceback (most recent call last):
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\execution.py", line 277, in process_inputs
result = f(**inputs)
         ^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\nodes.py", line 1521, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\nodes.py", line 1488, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\sample.py", line 45, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 1143, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 1033, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 1018, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 986, in outer_sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 969, in inner_sample
samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 748, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\k_diffusion\sampling.py", line 190, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 400, in __call__
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 949, in __call__
return self.predict_noise(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 952, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 380, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 206, in calc_cond_batch
return executor.execute(model, conds, x_in, timestep, model_options)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\samplers.py", line 325, in _calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\model_base.py", line 155, in apply_model
return comfy.patcher_extension.WrapperExecutor.new_class_executor(
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.wrappers[self.idx](self, *args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy_api\torch_helpers\torch_compile.py", line 26, in apply_torch_compile_wrapper
return executor(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 104, in __call__
return new_executor.execute(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\patcher_extension.py", line 111, in execute
return self.original(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\model_base.py", line 194, in _apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\ldm\wan\model.py", line 580, in forward
return self.forward_orig(x, timestep, context, clip_fea=clip_fea, freqs=freqs, transformer_options=transformer_options, **kwargs)[:, :, :t, :h, :w]
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\comfy\ldm\wan\model.py", line 550, in forward_orig
x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 414, in __call__
return super().__call__(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 832, in compile_wrapper
return fn(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1874, in __call__
result = self._torchdynamo_orig_backend(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1624, in __call__
result = self._inner_convert(
         ^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 688, in __call__
result = _compile(
         ^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1494, in _compile
raise InternalTorchDynamoError(
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1433, in _compile
guarded_code, tracer_output = compile_inner(code, one_graph, hooks)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_utils_internal.py", line 92, in wrapper_function
return function(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1117, in compile_inner
return _compile_inner(code, one_graph, hooks)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 1251, in _compile_inner
check_fn = dynamo_output.build_guards(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\convert_frame.py", line 856, in build_guards
return CheckFunctionManager(
       ^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 3383, in __init__
builder, guard_manager = self.build_guards(
                         ^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 3674, in build_guards
guard.create(builder)
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_guards.py", line 366, in create
return self.create_fn(builder, self)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_dynamo\guards.py", line 2671, in SHAPE_ENV
clib = CppCodeCache.load(func_str)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2839, in load
return cls.load_async(*args, **kwargs)()
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\codecache.py", line 2705, in load_async
"vec_isa": pick_vec_isa(),
           ^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 497, in pick_vec_isa
_valid_vec_isa_list: list[VecISA] = valid_vec_isa_list()
                                    ^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in valid_vec_isa_list
isa_list.extend(
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 484, in <genexpr>
isa_list.extend(
               ^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 143, in __bool__
return self.__bool__impl(config.cpp.vec_isa_ok)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 153, in __bool__impl
return self.check_build(VecISA._avx_code)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 103, in check_build
extra=_get_isa_dry_compile_fingerprint(self._arch_flags),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpu_vec_isa.py", line 29, in _get_isa_dry_compile_fingerprint
compiler_info = get_compiler_version_info(get_cpp_compiler())
                                          ^^^^^^^^^^^^^^^^^^
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 338, in get_cpp_compiler
check_compiler_exist_windows(compiler)
File "D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\torch\_inductor\cpp_builder.py", line 139, in check_compiler_exist_windows
raise RuntimeError(f"Compiler: {compiler} is not found.") from exc
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: Compiler: cl is not found.

Edit: With this recent version, I need to run vcvars64.bat in the Command Prompt before starting ComfyUI.
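For reference, a quick way to check whether the current Python process can actually see cl.exe is to call the same internal torch helpers that appear in the traceback above. These are private APIs, so treat this as a diagnostic sketch for this torch build only:

    # Diagnostic sketch using torch-internal helpers (private API; the
    # import path is taken from the traceback above, so it matches this
    # torch build). get_cpp_compiler() raises
    # RuntimeError("Compiler: cl is not found.") on Windows when cl.exe
    # is not on PATH.
    from torch._inductor.cpp_builder import get_cpp_compiler, get_compiler_version_info

    compiler = get_cpp_compiler()
    print(compiler, get_compiler_version_info(compiler))

Running vcvars64.bat first, as in the edit above, is what puts cl.exe on PATH for the process.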

Is it also supposed to be showing this in the logs?

Error running sage attention: Command '['C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.37.32822\\bin\\Hostx64\\x64\\cl.exe', 'C:\\Users\\user\\AppData\\Local\\Temp\\tmp8qn5ysbd\\__triton_launcher.c', '/nologo', '/O2', '/LD', '/std:c11', '/wd4819', '/ID:\\AI\\ComfyUI-WAN-Only\\ComfyUI\\venv\\Lib\\site-packages\\triton\\backends\\nvidia\\include', '/IC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.9\\include', '/IC:\\Users\\user\\AppData\\Local\\Temp\\tmp8qn5ysbd', '/IC:\\Users\\user\\AppData\\Local\\Programs\\Python\\Python311\\Include', '/IC:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.37.32822\\include', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\shared', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\ucrt', '/IC:\\Program Files (x86)\\Windows Kits\\10\\Include\\10.0.22621.0\\um', '/FoC:\\Users\\user\\AppData\\Local\\Temp\\tmp8qn5ysbd\\__triton_launcher.cp311-win_amd64.obj', '/link', '/LIBPATH:D:\\AI\\ComfyUI-WAN-Only\\ComfyUI\\venv\\Lib\\site-packages\\triton\\backends\\nvidia\\lib', '/LIBPATH:C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.9\\lib\\x64', '/LIBPATH:C:\\Users\\user\\AppData\\Local\\Programs\\Python\\Python311\\libs', '/LIBPATH:C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.37.32822\\lib\\x64', '/LIBPATH:C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\ucrt\\x64', '/LIBPATH:C:\\Program Files (x86)\\Windows Kits\\10\\Lib\\10.0.22621.0\\um\\x64', 'cuda.lib', 'python311.lib', '/OUT:C:\\Users\\user\\AppData\\Local\\Temp\\tmp8qn5ysbd\\__triton_launcher.cp311-win_amd64.pyd', '/IMPLIB:C:\\Users\\user\\AppData\\Local\\Temp\\tmp8qn5ysbd\\__triton_launcher.cp311-win_amd64.lib', '/PDB:C:\\Users\\user\\AppData\\Local\\Temp\\tmp8qn5ysbd\\__triton_launcher.cp311-win_amd64.pdb']' returned non-zero exit status 2., using pytorch attention instead.
__triton_launcher.c
C:\Users\user\AppData\Local\Temp\tmpyxbou8dl\__triton_launcher.c(123): error C2059: syntax error: '}'
C:\Users\user\AppData\Local\Temp\tmpyxbou8dl\__triton_launcher.c(131): error C2059: syntax error: '}'

@woct0rdho
Copy link
Copy Markdown
Owner Author

@jprsyt5 What's your MSVC version? (You can see it in the Visual Studio Installer when you modify the components.) What if you update it to the latest version?

@jprsyt5
Copy link
Copy Markdown

jprsyt5 commented Oct 16, 2025

@jprsyt5 What's your MSVC version? (You can see it in the Visual Studio Installer when you modify the components.) What if you update it to the latest version?

C:\Users\user>"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"


** Visual Studio 2022 Developer Command Prompt v17.7.4
** Copyright (c) 2022 Microsoft Corporation


[vcvarsall.bat] Environment initialized for: 'x64'

C:\Users\user>cl
Microsoft (R) C/C++ Optimizing Compiler Version 19.37.32824 for x64
Copyright (C) Microsoft Corporation. All rights reserved.

usage: cl [ option... ] filename... [ /link linkoption... ]

(screenshot: Visual Studio Installer showing the installed version)

Tried checking in Visual Studio Installer, and it shows as the latest version

@woct0rdho
Copy link
Copy Markdown
Owner Author

woct0rdho commented Oct 16, 2025

Currently the latest version should be MSVC 14.44. Is there MSVC 14.44 in your C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\?

A bug like this has happened before, but I rarely see it. To help debug, you can modify D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\triton\runtime\build.py, around line 160:

    with tempfile.TemporaryDirectory() as tmpdir:

        # Add this: redirect the build dir to a persistent C:\tmp\...
        # so the generated .c file survives the TemporaryDirectory cleanup
        tmpdir = os.path.basename(tmpdir)
        tmpdir = rf"C:\tmp\{tmpdir}"
        os.makedirs(tmpdir, exist_ok=True)

        src_path = os.path.join(tmpdir, f"{name}.c")

Then run the workflow again. __triton_launcher.c should be saved in C:\tmp\. Send it here.

@jprsyt5
Copy link
Copy Markdown

jprsyt5 commented Oct 16, 2025

Currently the latest version should be MSVC 14.44. Is there MSVC 14.44 in your C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\?

A bug like this has happened before, but I rarely see it. To help debug, you can modify D:\AI\ComfyUI-WAN-Only\ComfyUI\venv\Lib\site-packages\triton\runtime\build.py, around line 160:

    with tempfile.TemporaryDirectory() as tmpdir:

        # Add this: redirect the build dir to a persistent C:\tmp\...
        # so the generated .c file survives the TemporaryDirectory cleanup
        tmpdir = os.path.basename(tmpdir)
        tmpdir = rf"C:\tmp\{tmpdir}"
        os.makedirs(tmpdir, exist_ok=True)

        src_path = os.path.join(tmpdir, f"{name}.c")

Then run the workflow again. __triton_launcher.c should be saved in C:\tmp\. Send it here.

Solved! Now torch.compile works fine with FP8_e4m3fn on my Ampere GPU!

What I did was reinstall the whole Visual Studio: I was trying to upgrade it through the VS Installer, but it wouldn't let me.

And yeah, my previous MSVC wasn’t the latest version.

C:\Users\user>cl
Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35217 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

usage: cl [ option... ] filename... [ /link linkoption... ]

As for performance, I don’t really notice any difference. I did a quick test running text2image (just 1 frame) with WAN 2.2, and the speed was about the same.
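For context, the usage pattern that now compiles is roughly the following. This is a minimal sketch, not code from this thread; the shapes and the matmul are illustrative:

    import torch

    # Store the weight in fp8 to halve its memory footprint vs. fp16.
    w = torch.randn(4096, 4096, device="cuda")
    w8 = w.to(torch.float8_e4m3fn)

    @torch.compile
    def forward(x, w8):
        # The F8E4M3 -> F16 cast is the operation enabled on sm < 89;
        # the matmul itself still runs in fp16.
        return x @ w8.to(torch.float16)

    x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
    y = forward(x, w8)

Since the math still happens in fp16, matching fp16 speed is the expected result; the win is memory, not throughput.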

@pivtienduc
Copy link
Copy Markdown

Yes, I meant the tiled VAE encode. I guess you're familiar with the tiled VAE decode; in WanImageToVideo there is a similar procedure that encodes your input image using the VAE, which can be sped up by tiling. However, that's not the KSampler node, which runs the diffusion model.

But anyway, when running Wan, you can load the 14G model into your VRAM and still have 10G left for activations (how large the activations are depends on how large your video is), so you probably don't need block swap.

When running Qwen-Image, however, you need to load the 20G model into your VRAM, leaving only 4G for activations, so you probably do need block swap.

You can see the GPU memory usage in Task Manager. If any 'shared GPU memory' is used, then your workflow will be much slower.

My bad, I thought it was something new. Actually I already use the multi-GPU node: https://github.com/pollockjj/ComfyUI-MultiGPU
which does the same thing as block swap but a bit better, since I can set how many GB of RAM replace VRAM rather than just a number of blocks. Thanks for introducing me to the Wan block swap node anyway.
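As an aside to the memory discussion above: instead of eyeballing Task Manager, you can query the headroom from Python. A minimal sketch using torch.cuda.mem_get_info, which reports free and total bytes for the current CUDA device:

    import torch

    # Whatever is left after the model weights must hold the activations;
    # if "free" is small, block swap (or tiling) is likely needed.
    free, total = torch.cuda.mem_get_info()
    print(f"free: {free / 2**30:.1f} GiB / total: {total / 2**30:.1f} GiB")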
