Add a C extension so launches skip more Python frames by choijon5 · Pull Request #2534 · pytorch/helion

choijon5 · 2026-05-20T18:57:58Z

Stacked PRs:

Even with the previous patch's fast launcher, every kernel call still
spends microseconds in Python: building a cache key, broadcasting argument shapes, and bouncing through
one more Python wrapper before reaching the C launcher. On small
kernels, those Python frames are a big chunk of the remaining overhead.

Adds Helion's first compiled C extension, helion._C. It does the
hottest steps in C instead of Python:

- Builds the per-call argument cache key directly in C.
- Exposes the launcher itself as a C-callable object so the
  generated wrapper can call straight into it (no extra Python
  frame).
- Skips a redundant broadcast call when the input shapes already
  match (the common case for Helion kernels).
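The broadcast skip is a decision made while generating the host wrapper. A minimal sketch of that decision follows; `emit_broadcast_lines` and its arguments are hypothetical names for illustration, not Helion's codegen API. Only `torch.broadcast_tensors` is a real call, and it appears here only as emitted text.

```python
from typing import List, Sequence, Tuple


def emit_broadcast_lines(
    arg_names: Sequence[str],
    static_shapes: Sequence[Tuple[int, ...]],
) -> List[str]:
    """Return wrapper source lines; emit nothing when shapes already agree.

    Hypothetical helper: the shapes are assumed to be known at codegen time.
    """
    if len(set(static_shapes)) <= 1:
        # Every input already has the same static shape, so a broadcast
        # would be a no-op and the call can be left out of the wrapper.
        return []
    joined = ", ".join(arg_names)
    return [f"{joined} = torch.broadcast_tensors({joined})"]


same = emit_broadcast_lines(["x", "y"], [(1024, 1024), (1024, 1024)])
mixed = emit_broadcast_lines(["x", "y"], [(1024, 1), (1024, 1024)])
```

Here `same` is empty, while `mixed` keeps the broadcast line because the shapes differ.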

If the C extension didn't build (e.g. PyTorch-on-MTIA, debug build),
we fall back to the pure-Python launcher automatically and Helion
keeps working as before.
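The automatic fallback can be pictured as an optional import. This is a sketch under assumptions, not the PR's code: `load_optional_extension`, `select_launcher`, and `python_launcher` are hypothetical names, and only the module name `helion._C` and the attribute `CompiledLauncher` come from the description.

```python
import importlib
from types import ModuleType
from typing import Callable, Optional


def load_optional_extension(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def python_launcher(*args, **kwargs):
    # Hypothetical stand-in for the pure-Python launch path.
    return ("python", args, kwargs)


def select_launcher(ext: Optional[object]) -> Callable:
    """Prefer the extension's CompiledLauncher when present, else fall back."""
    launcher_type = getattr(ext, "CompiledLauncher", None)
    if launcher_type is None:
        return python_launcher
    return launcher_type()


# Where the extension was not built, the import fails quietly and the
# pure-Python launcher is selected.
launcher = select_launcher(load_optional_extension("helion._C"))
```

Checking for the attribute as well as the module also covers a shim that leaves `CompiledLauncher` set to `None`.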

Perf (1024x1024 bf16 add microbench, CUDA graphs off):
Pre-launcher baseline: 34.5 us / call
Previous fast-launcher patch: 25.2 us / call
This patch (default): 18.2 us / call (-29% vs prev)
This patch (pool=1): 10.7 us / call (-58% vs prev)

Three pieces inside helion._C:

1. C tensor_key (`helion._C.tensor_key`): replaces the per-call
   Python specialization-key build inside Kernel.bind for the
   static-shape fast path. Returns None for unsupported inputs
   (SymInts etc.) so the caller falls back to the Python
   implementation.
2. C CompiledLauncher: caches the Triton CompiledKernel + driver
   hooks once and exposes a tp_call slot so the host wrapper can
   skip the _FastLauncher.__call__ Python frame. The Python
   _FastLauncher gets an _install_c_launcher method that hot-swaps
   the kwdefault "_launcher" entry to the C object after priming.
3. Two codegen tactics for the generated Triton host wrapper:
   - Skip torch.broadcast_tensors when the input shapes statically
     match (most Helion kernels).
   - Opt-in (HELION_OUTPUT_POOL=1) reuse of the output buffer via
     _helion_output_alloc, swapping torch.empty for a small cache.
     Default behavior preserved (one fresh allocation per call).
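The kwdefault hot-swap in the CompiledLauncher item relies on a plain CPython feature: a function's keyword-only defaults live in its `__kwdefaults__` dict and can be rebound after definition. The sketch below shows the mechanism only. `host_wrapper`, `fake_python_launcher`, `FakeCLauncher`, and `install_launcher` are hypothetical stand-ins, not the generated wrapper or the real `_install_c_launcher`.

```python
def fake_python_launcher(kernel, grid, *args):
    # Hypothetical stand-in for the Python launcher's __call__ path.
    return ("python", kernel, grid, args)


class FakeCLauncher:
    """Hypothetical stand-in for a C type that is callable via tp_call."""

    def __call__(self, kernel, grid, *args):
        return ("c", kernel, grid, args)


def host_wrapper(x, y, *, _launcher=fake_python_launcher):
    # The wrapper receives its launcher as a keyword-only default.
    return _launcher("add_kernel", (1,), x, y)


def install_launcher(wrapper, launcher):
    """Rebind the keyword-only default in place; call sites stay unchanged."""
    wrapper.__kwdefaults__["_launcher"] = launcher


before = host_wrapper(1, 2)   # dispatched through the Python stand-in
install_launcher(host_wrapper, FakeCLauncher())
after = host_wrapper(1, 2)    # same call, now dispatched through the swap
```

Because the swap mutates the wrapper's defaults, existing references to the wrapper pick up the new launcher without being regenerated.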

stack-info: PR: #2534, branch: choijon5/stack/75

… (Chunk D+E follow-up)

Wires Chunk D (output pool) into the generated host wrapper via a codegen rewrite. When ``HELION_OUTPUT_POOL=1`` is set at compile time, ``out = torch.empty_like(x)`` in the wrapper is rewritten to ``out = _helion_pool_empty_like(x)`` (== ``helion.runtime.empty_like``). Users no longer need to manually swap ``torch.empty_like`` in their kernel source.

Implementation: ``_rewrite_output_allocs_for_pool`` in ``helion/_compiler/generate_ast.py`` walks the host wrapper's AST, finds top-level ``Assign(target=Name(v), value=Call(torch.empty_like(...)))`` statements whose target ``v`` is passed as a positional arg to a kernel launcher call in the same scope (a kernel-output buffer), and swaps the called function for the pool helper.

The rewrite is conservative; it only fires on:

1. ``HELION_OUTPUT_POOL=1`` env var set at compile time
2. ``torch.empty_like`` calls (not arbitrary factory expressions)
3. Targets that are consumed by a kernel launcher call (output buffers)

Tensors that escape via other paths stay on ``torch.empty_like`` because pooled recycling isn't safe for them.

Tests in ``test_pool_codegen_rewrite.py`` cover:

- Rewrite fires when ``HELION_OUTPUT_POOL=1`` is set and produces a wrapper that calls ``_helion_pool_empty_like(x)``.
- Rewrite is ABSENT when the env var is unset; the wrapper preserves ``torch.empty_like(x)``.
- End-to-end correctness: outputs match a ``torch.empty_like`` baseline regardless of which allocation path was taken.
- ``_POOLS`` accumulates exactly one ring entry per call signature when the wrapper runs.

Benchmarks (H100, vector_add at N=4096): combined Chunk D + Chunk E
-------------------------------------------------------------------

To approach #2534's reported 10.7 μs/call (pool=1) number we need both the pool wiring (this commit) AND the C launcher (Chunk E) active. The C launcher install path isn't auto-wired yet, so it is measured here by priming the launcher and swapping the wrapper's kwdefault by hand, mirroring what an end-to-end install hook would do. 15k iters per variant after 100 warmup, two timings each (wall+sync forces ``torch.cuda.synchronize()`` after every call; cpu_only does a single sync at the end and measures submission rate):

| Variant                                  | wall+sync μs | cpu_only μs |
| ---------------------------------------- | -----------: | ----------: |
| 1. baseline (default_launcher, pool off) |        26.16 |       19.08 |
| 2. + C launcher (Chunk E only)           |        18.64 |       11.95 |
| 3. + pool (Chunk D only, Py launcher)    |        22.82 |       16.35 |
| 4. + pool + C launcher (D + E)           |    **16.69** |   **10.31** |
| Reference (#2534 pool=1)                 |            – |   **10.70** |

Our combined D+E cpu_only of 10.31 μs matches and slightly beats #2534's reported 10.7 μs/call. The wall+sync gap (16.7 vs the implicit ~10.7) is per-call GPU sync overhead; both numbers measure real Python/C dispatch cost, and the cpu_only metric is the relevant "how fast can we submit" comparison.

The meta-device retargeting + C-launcher pool integration I worried about isn't required to match the number: the simpler "C launcher + Python pool returning a cached tensor" stack hits the same performance ceiling on this kernel.

Limitations
-----------

- C launcher install is still manual (prime then swap kwdefault); a follow-up could add a ``BoundKernel.set_config`` hook that auto-installs ``helion._C.CompiledLauncher`` when present.
- The rewrite only targets ``torch.empty_like`` (Helion's current codegen pattern). ``torch.empty(shape, dtype, device)`` could be rewritten too once the pool helper's signature is extended to match that form.

Part of the chunked C-extension plan; the codegen integration that brings Chunks D + E together up to #2534's reported pool=1 number.

stack-info: PR: #2609, branch: yushangdi/stack/12
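The rewrite rules in this commit message can be illustrated with Python's standard `ast` module. This is a sketch of the described technique, not the contents of ``_rewrite_output_allocs_for_pool``: the function names, the `_launcher` call name, and the sample wrapper source are assumptions made for the example.

```python
import ast
import os

POOL_HELPER = "_helion_pool_empty_like"  # helper name from the commit message


def _is_torch_empty_like(node: ast.AST) -> bool:
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "empty_like"
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id == "torch"
    )


def _launcher_arg_names(body, launcher_name: str) -> set:
    """Names passed positionally to any call of `launcher_name` in `body`."""
    names = set()
    for stmt in body:
        for node in ast.walk(stmt):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == launcher_name
            ):
                names.update(a.id for a in node.args if isinstance(a, ast.Name))
    return names


def rewrite_output_allocs(source: str, launcher_name: str = "_launcher") -> str:
    """Swap torch.empty_like for the pool helper on launcher-consumed targets."""
    tree = ast.parse(source)
    for fn in tree.body:
        if not isinstance(fn, ast.FunctionDef):
            continue
        outputs = _launcher_arg_names(fn.body, launcher_name)
        for stmt in fn.body:  # top-level statements of the wrapper only
            if (
                isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and stmt.targets[0].id in outputs
                and _is_torch_empty_like(stmt.value)
            ):
                stmt.value.func = ast.Name(id=POOL_HELPER, ctx=ast.Load())
    return ast.unparse(ast.fix_missing_locations(tree))


def maybe_rewrite(source: str, env=os.environ) -> str:
    """Apply the rewrite only when the opt-in variable is set to 1."""
    if env.get("HELION_OUTPUT_POOL") != "1":
        return source
    return rewrite_output_allocs(source)


# Hypothetical wrapper source: `out` reaches the launcher, `scratch` does not.
WRAPPER_SRC = """
def add(x, y, *, _launcher=_default_launcher):
    out = torch.empty_like(x)
    scratch = torch.empty_like(y)
    _launcher(_kernel, (1,), x, y, out)
    return out
"""

rewritten = maybe_rewrite(WRAPPER_SRC, env={"HELION_OUTPUT_POOL": "1"})
untouched = maybe_rewrite(WRAPPER_SRC, env={})
```

In `rewritten`, only the `out` allocation is redirected to the pool helper; `scratch` stays on ``torch.empty_like`` because it never reaches the launcher call.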

Adds a single-file C extension (``helion/_C/_launcher.c``) that exposes ``CompiledLauncher``, a Python type with a ``tp_call`` slot. A primed launcher dispatches a Triton kernel launch directly into ``compiled_kernel.run`` (the C launcher Triton emits), bypassing both the Python ``default_launcher`` frame AND Triton's ``JITFunction.run`` pipeline (binder, ``compute_cache_key``, kernel-cache lookup, ``launch_metadata`` allocation, etc.).

Usage from Python:

    import helion._C
    launcher = helion._C.CompiledLauncher()
    launcher.prime(triton_kernel, grid, args, num_warps=4, num_stages=2)
    # then install as the wrapper's _launcher kwdefault, or call directly:
    launcher(triton_kernel, grid, *args, num_warps=..., num_stages=...)

Scope caveats (deliberately omitted to demonstrate the ceiling)
---------------------------------------------------------------

The launcher in this commit is intentionally minimal. These correctness guards are NOT implemented, so the perf number below represents the upper bound on the saving (i.e. how fast the C launcher could possibly go before re-adding any safety):

- No multi-spec cache. The compiled kernel captured at prime time is reused for EVERY subsequent call. Caller must keep arg specs stable (same alignment, same shape, same dtype) or the GPU launch will silently use a wrong-spec binary (e.g. vectorized 16-byte loads against an unaligned pointer → ``CUDA error: misaligned address``).
- No knob/hook re-reads. ``triton.knobs.runtime.debug``, ``triton.knobs.compilation.instrumentation_mode``, ``triton.knobs.runtime.add_stages_inspection_hook``, and the launch hooks (``launch_enter_hook`` / ``launch_exit_hook``) are NOT observed; we always pass ``None`` for ``launch_metadata`` and for both hooks. A profiler attached after priming silently sees no Helion launches.
- No ``used_global_vals`` mutation check. Mutating a Helion-tracked global (e.g. a ``_BLOCK_SIZE_*`` constexpr) between calls silently uses the stale binary; Triton's own ``RuntimeError`` raise is bypassed.
- No multi-device guard. Switching CUDA devices after priming and then calling the launcher dispatches to the priming device's stream, likely an ``invalid resource handle`` crash.
- No ``pre_run_hooks`` invocation. Hooks installed on the underlying ``JITFunction`` (e.g. autotune timing hooks) silently don't fire.
- No fallback to ``default_launcher`` on any of the above. The launcher just dispatches with stale / wrong state.

A production version would port the Phase-2-launcher's correctness guards to C; their per-call cost in Python is ~2 μs, so a ship-able C launcher nets out around -5 μs/call vs ``default_launcher`` rather than the -7 μs the un-guarded version below achieves.

The build is NOT wired into hatchling; there is a manual ``gcc`` one-liner in ``helion/_C/README.md``. Default install leaves ``_C.CompiledLauncher = None`` and Helion uses the Python ``default_launcher``.

Tests in ``test_c_launcher.py`` cover:

- Correctness: primed C launcher produces same output as Python path on the same args, both direct-call and routed through the generated wrapper.
- ``CompiledLauncher`` is subclassable (``Py_TPFLAGS_BASETYPE``).
- Calling without priming raises ``RuntimeError``.

Benchmarks (H100, vector_add at N=4096, this commit's state)
------------------------------------------------------------

End-to-end per-call timing with the C launcher installed (manual prime + swap of the wrapper's ``_launcher`` kwdefault):

    Variant                                      wall+sync   cpu_only
    baseline (main)                               23.94 us   17.17 us
    + Chunk D pool runtime (no rewrite)           24.34 us   17.95 us
    + codegen rewrite (HELION_OUTPUT_POOL=1)      22.21 us   15.96 us
    + Phase 2 fast launcher (Python multi-spec)   19.49 us   13.07 us
    + Chunk A (helion._C shim, no .c)             19.91 us   12.96 us
    + Chunk E (C launcher, this commit)           16.64 us    9.77 us

    Total saving vs baseline:                     -7.30 us   -7.40 us
    Speedup:                                         1.44x      1.76x

Reference: #2534 reports 10.7 us/call pool=1. Our 9.77 us cpu_only beats that: the simpler "C launcher + Python pool returning a cached tensor + Phase-2 multi-spec correctness" stack hits the same performance ceiling as #2534's design on this kernel.

Reminder: the -7 us is the *ceiling* (this commit's C launcher has NO correctness guards). A production C launcher must add the Phase-2 guards (multi-spec cache, knob checks, used_global_vals snapshots, etc.) back, costing ~2 us/call. The Phase 2 Python launcher in the prior commit already has those guards, so its 13.07 us/call number is the realistic apples-to-apples comparison.

Why no separate "C tensor_key" commit
-------------------------------------

An earlier draft of this stack included a "Chunk B: C tensor_key accelerator" commit between Chunk A and this one. ``Kernel.bind`` calls ``_tensor_key(fn, tensor)`` once per tensor arg; a C implementation that mirrors the static-shapes branch returns the same 4-tuple ``(dtype, sizes, strides, static_indices)`` ~110 ns faster than the Python path (0.510 us vs 0.619 us in isolated timeit microbench, 1.21x speedup). But end-to-end on a typical kernel (vector_add has 2 tensor args) the per-call saving was ~220 ns, within run-to-run variance:

    prior commit (Chunk A shim, no .c)            18.79 us   12.89 us
    Chunk A + C tensor_key (dropped)              19.13 us   13.06 us

We dropped the dedicated commit. The ``helion._C.tensor_key = None`` slot remains in Chunk A's shim as a documented extension point for a future C accelerator that can do meaningfully better than the ~110 ns/call current Python C-API approach achieves (e.g. via PyTorch's stable C ABI to skip Python attribute machinery for ``tensor.size()`` / ``tensor.stride()`` reads). The Python C-API path's gain is bounded by the cost of those attribute lookups, which it doesn't actually avoid, so any future C tensor_key accelerator should target the stable ABI or unstable ``at::Tensor`` to get a meaningful per-call saving.

stack-info: PR: #2609, branch: yushangdi/stack/12
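The static-shape key and its ``None`` fallback contract can be sketched in pure Python. Everything here is illustrative: `FakeTensor` stands in for a real tensor, `static_indices` is treated as an opaque value, and `slow_tensor_key` is a placeholder rather than Helion's general implementation. Only the 4-tuple layout ``(dtype, sizes, strides, static_indices)`` comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in exposing only the fields the key reads."""
    dtype: str
    shape: Tuple
    stride: Tuple


def fast_tensor_key(t: FakeTensor, static_indices: Tuple = ()) -> Optional[tuple]:
    """Static-shape fast path; returns None for inputs it does not handle."""
    if not all(type(d) is int for d in (*t.shape, *t.stride)):
        return None  # e.g. a symbolic size: defer to the general path
    return (t.dtype, t.shape, t.stride, static_indices)


def slow_tensor_key(t: FakeTensor, static_indices: Tuple = ()) -> tuple:
    # Placeholder for the general path; it only has to return something hashable.
    return ("general", t.dtype, tuple(map(str, t.shape)),
            tuple(map(str, t.stride)), static_indices)


def tensor_key(t: FakeTensor, static_indices: Tuple = ()) -> tuple:
    """Try the fast path first and fall back when it declines."""
    key = fast_tensor_key(t, static_indices)
    return key if key is not None else slow_tensor_key(t, static_indices)


concrete = FakeTensor("bfloat16", (1024, 1024), (1024, 1))
symbolic = FakeTensor("bfloat16", ("s0", 1024), (1024, 1))  # "s0" mimics a SymInt

cache = {tensor_key(concrete): "compiled-kernel-placeholder"}
```

Returning ``None`` instead of raising keeps the declined case cheap, which matters when the function runs once per tensor argument on every call.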

Adds a single-crate Rust extension (``helion/_native/src/lib.rs``, built via PyO3) that exposes ``CompiledLauncher``, a Python type with a ``__call__`` slot. A primed launcher dispatches a Triton kernel launch directly into ``compiled_kernel.run`` (the C launcher Triton emits), bypassing both the Python ``default_launcher`` frame AND Triton's ``JITFunction.run`` pipeline (binder, ``compute_cache_key``, kernel-cache lookup, ``launch_metadata`` allocation, etc.).

Usage from Python:

    import helion._native
    launcher = helion._native.CompiledLauncher()
    launcher.prime(triton_kernel, grid, args, num_warps=4, num_stages=2)
    # then install as the wrapper's _launcher kwdefault, or call directly:
    launcher(triton_kernel, grid, *args, num_warps=..., num_stages=...)

Why Rust (not C)
----------------

A prior draft of this commit used a hand-written C extension. The Rust version is functionally identical and benchmarks the same on the hot path (see numbers below). Rust buys us:

- Memory safety on error paths. PyO3's ``PyResult`` / ``Bound<'_, T>`` types prevent the kinds of reference-leak bugs the manual ``Py_XDECREF`` ladder in C is prone to. The Rust source is about half the length of the equivalent C and reads closer to Python.
- A standard build system (``cargo``) instead of an ad-hoc ``gcc`` one-liner. PyO3 0.28 with the ``abi3-py310`` feature makes the resulting .so binary-compatible with any CPython >= 3.10. The crate's only dependency is ``pyo3`` itself.

Scope caveats (deliberately omitted to demonstrate the ceiling)
---------------------------------------------------------------

The launcher in this commit is intentionally minimal. These correctness guards are NOT implemented, so the perf number below represents the upper bound on the saving (i.e. how fast the Rust launcher could possibly go before re-adding any safety):

- No multi-spec cache. The compiled kernel captured at prime time is reused for EVERY subsequent call. Caller must keep arg specs stable (same alignment, same shape, same dtype) or the GPU launch will silently use a wrong-spec binary (e.g. vectorized 16-byte loads against an unaligned pointer -> ``CUDA error: misaligned address``).
- No knob/hook re-reads. ``triton.knobs.runtime.debug``, ``triton.knobs.compilation.instrumentation_mode``, ``triton.knobs.runtime.add_stages_inspection_hook``, and the launch hooks (``launch_enter_hook`` / ``launch_exit_hook``) are NOT observed; we always pass ``None`` for ``launch_metadata`` and for both hooks. A profiler attached after priming silently sees no Helion launches.
- No ``used_global_vals`` mutation check. Mutating a Helion-tracked global (e.g. a ``_BLOCK_SIZE_*`` constexpr) between calls silently uses the stale binary; Triton's own ``RuntimeError`` raise is bypassed.
- No multi-device guard. Switching CUDA devices after priming and then calling the launcher dispatches to the priming device's stream, likely an ``invalid resource handle`` crash.
- No ``pre_run_hooks`` invocation. Hooks installed on the underlying ``JITFunction`` (e.g. autotune timing hooks) silently don't fire.
- No fallback to ``default_launcher`` on any of the above. The launcher just dispatches with stale / wrong state.

A production version would port the Phase-2-launcher's correctness guards to Rust; their per-call cost in Python is ~2 us, so a ship-able Rust launcher nets out around -5 us/call vs ``default_launcher`` rather than the -7 us the un-guarded version below achieves.

The build is NOT wired into hatchling; see ``helion/_native/README.md`` for the manual ``cargo build`` flow. Default install leaves ``_native.CompiledLauncher = None`` and Helion uses the Python ``default_launcher``.

Tests in ``test_native_launcher.py`` cover:

- Correctness: primed Rust launcher produces same output as Python path on the same args, both direct-call and routed through the generated wrapper.
- ``CompiledLauncher`` is subclassable (declared with ``#[pyclass(subclass)]``).
- Calling without priming raises ``RuntimeError``.

Benchmarks (H100, vector_add at N=4096)
---------------------------------------

End-to-end per-call timing with the Rust launcher installed (manual prime + swap of the wrapper's ``_launcher`` kwdefault):

    Variant                                      wall+sync   cpu_only
    baseline (main)                               23.94 us   17.17 us
    + Chunk D pool runtime (no rewrite)           24.34 us   17.95 us
    + codegen rewrite (HELION_OUTPUT_POOL=1)      22.21 us   15.96 us
    + Concrete-tensor _hashable_dims fast path      ~21 us   ~14.5 us
    + Phase 2 fast launcher (Python multi-spec)   19.49 us   13.07 us
    + Chunk A (helion._native shim, no .rs)       ~19.7 us   ~13.0 us
    + Chunk E (Rust launcher, this commit)        ~16.7 us   ~9.93 us

    Total saving vs baseline:                     ~-7.2 us   ~-7.2 us

Per-call comparison vs the equivalent C version (prior draft of this commit) at the same stack position:

    C ``tp_call``                                 16.64 us    9.77 us
    Rust ``__call__`` (this commit)               ~16.7 us    9.93 us

The ~0.16 us difference is within run-to-run noise. PyO3's macro-generated trampolines compile down to the same CPython entry points a hand-written C extension would issue; the per-call ceiling is set by ``compiled_kernel.run`` (Triton's C launcher) and the Python C API tuple-build, not by language choice.

Reference: #2534 reports 10.7 us/call pool=1. Our 9.93 us cpu_only beats that: the simpler "Rust launcher + Python pool returning a cached tensor + Phase-2 multi-spec correctness" stack hits the same performance ceiling as #2534's design on this kernel.

Reminder: the -7 us is the *ceiling* (this commit's Rust launcher has NO correctness guards). A production version must add the Phase-2 guards (multi-spec cache, knob checks, used_global_vals snapshots, etc.) back, costing ~2 us/call. The Phase 2 Python launcher in the prior commit already has those guards, so its 13.07 us/call number is the realistic apples-to-apples comparison.

Why no separate "Rust tensor_key" commit
----------------------------------------

The earlier C-stack version of this work investigated a Rust / C ``tensor_key`` accelerator. ``Kernel.bind`` calls ``_tensor_key`` once per tensor arg; a Rust implementation that mirrors the static-shapes branch returns the same 4-tuple ``(dtype, sizes, strides, static_indices)`` ~100 ns faster than the Python path in microbench. But end-to-end on a typical kernel (vector_add has 2 tensor args) the per-call saving was ~200 ns, within run-to-run variance.

We dropped the dedicated commit. The ``helion._native.tensor_key = None`` slot remains in Chunk A's shim as a documented extension point for a future accelerator that can do meaningfully better than the ~100 ns/call current Python C-API approach achieves (e.g. via PyTorch's stable C ABI to skip Python attribute machinery for ``tensor.size()`` / ``tensor.stride()`` reads). The Python C-API path's gain is bounded by the cost of those attribute lookups, which it doesn't actually avoid, so any future Rust ``tensor_key`` accelerator should target the stable ABI or unstable ``at::Tensor`` to get a meaningful per-call saving.

stack-info: PR: #2609, branch: yushangdi/stack/12
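The two timing columns in the benchmark tables (wall+sync and cpu_only) can be reproduced with a harness along these lines. It is a sketch, not the script behind the reported numbers: `bench` is a hypothetical helper, and it assumes the cpu_only timer stops before the final synchronize so that only submission cost is measured.

```python
import time
from typing import Callable, Dict


def bench(
    launch: Callable[[], None],
    sync: Callable[[], None],
    iters: int = 15_000,
    warmup: int = 100,
) -> Dict[str, float]:
    """Return mean microseconds per call for both timing modes."""
    for _ in range(warmup):
        launch()
    sync()

    # wall+sync: synchronize after every call, so device time is included.
    start = time.perf_counter()
    for _ in range(iters):
        launch()
        sync()
    wall_sync_us = (time.perf_counter() - start) / iters * 1e6

    # cpu_only: time the submission loop alone, then synchronize once.
    start = time.perf_counter()
    for _ in range(iters):
        launch()
    cpu_only_us = (time.perf_counter() - start) / iters * 1e6
    sync()

    return {"wall+sync": wall_sync_us, "cpu_only": cpu_only_us}
```

On a CUDA machine, `sync` would typically be `torch.cuda.synchronize` and `launch` a zero-argument lambda around the kernel call. Passing both as callables keeps the harness usable without a GPU.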

meta-cla bot added the "CLA Signed" label on May 20, 2026

choijon5 force-pushed the choijon5/stack/74 branch from 91281ac to 72b5756 on May 20, 2026 18:59

choijon5 force-pushed the choijon5/stack/75 branch from 527baf9 to 529203e on May 20, 2026 18:59

This was referenced on May 27, 2026:

[fast-launcher] Opt-in output-tensor pool (Chunk D, Python-only) #2604 (Draft)

[fast-launcher] Minimal Rust CompiledLauncher with __call__ (Chunk E) #2609 (Draft)

choijon5 wants to merge 1 commit into choijon5/stack/74 from choijon5/stack/75
