Add a C extension so launches skip more Python frames#2534
Draft
choijon5 wants to merge 1 commit into
Even with the previous patch's fast launcher, every kernel call still
spends microseconds in Python: building a cache key, broadcasting argument shapes, and bouncing through
one more Python wrapper before reaching the C launcher. On small
kernels, those Python frames are a big chunk of the remaining overhead.
Adds Helion's first compiled C extension, helion._C. It does the
hottest steps in C instead of Python:
- Builds the per-call argument cache key directly in C.
- Exposes the launcher itself as a C-callable object so the
generated wrapper can call straight into it (no extra Python
frame).
- Skips a redundant broadcast call when the input shapes already
match (the common case for Helion kernels).
If the C extension didn't build (e.g. PyTorch-on-MTIA, debug build),
we fall back to the pure-Python launcher automatically and Helion
keeps working as before.
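A minimal sketch of the fallback described above, assuming the extension is looked up by module name at import time. `load_launcher` and `python_launcher` are hypothetical names for illustration, not functions from this patch; only the module name `helion._C` and the `CompiledLauncher` attribute come from the description.

```python
import importlib


def python_launcher(kernel, grid, *args, **kwargs):
    """Hypothetical stand-in for the pure-Python launch path."""
    return kernel(grid, *args, **kwargs)


def load_launcher(module_name="helion._C"):
    """Return (launcher, used_native).

    Falls back to python_launcher when the compiled module cannot be
    imported or does not expose a CompiledLauncher.
    """
    try:
        native = importlib.import_module(module_name)
    except ImportError:
        # Extension was never built (or failed to load): keep working in Python.
        return python_launcher, False
    launcher_type = getattr(native, "CompiledLauncher", None)
    if launcher_type is None:
        # Shim module present, but the accelerated type is not provided.
        return python_launcher, False
    return launcher_type(), True
```

Resolving the launcher once, rather than on every call, keeps the availability check itself off the hot path.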
Perf (1024x1024 bf16 add microbench, CUDA graphs off):
  Pre-launcher baseline:         34.5 us / call
  Previous fast-launcher patch:  25.2 us / call
  This patch (default):          18.2 us / call  (-28% vs prev)
  This patch (pool=1):           10.7 us / call  (-58% vs prev)
Three pieces inside helion._C:
1. C tensor_key (`helion._C.tensor_key`): replaces the per-call
   Python specialization-key build inside Kernel.bind for the
   static-shape fast path. Falls back to the Python implementation
   for unsupported inputs (SymInts, etc.) by returning None.
2. C CompiledLauncher: caches the Triton CompiledKernel + driver
   hooks once and exposes a tp_call slot so the host wrapper can
   skip the _FastLauncher.__call__ Python frame. The Python
   _FastLauncher gets an _install_c_launcher method that hot-swaps
   the kwdefault "_launcher" entry to the C object after priming.
3. Two codegen tactics for the generated Triton host wrapper:
   - Skip torch.broadcast_tensors when the input shapes statically
     match (most Helion kernels).
   - Opt-in (HELION_OUTPUT_POOL=1) reuse of the output buffer via
     _helion_output_alloc, swapping torch.empty for a small cache.
     Default behavior preserved (one fresh allocation per call).
stack-info: PR: #2534, branch: choijon5/stack/75
yushangdi added a commit that referenced this pull request on May 27, 2026
… (Chunk D+E follow-up)
Wires Chunk D (output pool) into the generated host wrapper via a
codegen rewrite. When ``HELION_OUTPUT_POOL=1`` is set at compile time,
``out = torch.empty_like(x)`` in the wrapper is rewritten to
``out = _helion_pool_empty_like(x)`` (== ``helion.runtime.empty_like``).
Users no longer need to manually swap ``torch.empty_like`` in their
kernel source.
Implementation: ``_rewrite_output_allocs_for_pool`` in
``helion/_compiler/generate_ast.py`` walks the host wrapper's AST, finds
top-level ``Assign(target=Name(v), value=Call(torch.empty_like(...)))``
statements whose target ``v`` is passed as a positional arg to a kernel
launcher call in the same scope (a kernel-output buffer), and swaps the
called function for the pool helper.
The rewrite is conservative. It only fires on:
1. ``HELION_OUTPUT_POOL=1`` env var set at compile time
2. ``torch.empty_like`` calls (not arbitrary factory expressions)
3. Targets that are consumed by a kernel launcher call (output buffers)
Tensors that escape via other paths stay on ``torch.empty_like`` because
pooled recycling isn't safe for them.
Tests in ``test_pool_codegen_rewrite.py`` cover:
- Rewrite fires when ``HELION_OUTPUT_POOL=1`` is set and produces a
  wrapper that calls ``_helion_pool_empty_like(x)``.
- Rewrite is ABSENT when env var unset: wrapper preserves
  ``torch.empty_like(x)``.
- End-to-end correctness: outputs match a ``torch.empty_like`` baseline
  regardless of which allocation path was taken.
- ``_POOLS`` accumulates exactly one ring entry per call signature when
  the wrapper runs.
Benchmarks (H100, vector_add at N=4096): combined Chunk D + Chunk E
-------------------------------------------------------------------
To approach #2534's reported 10.7 μs/call (pool=1) number we need both
the pool wiring (this commit) AND the C launcher (Chunk E) active. The
C launcher install path isn't auto-wired yet; it is measured here by
priming the launcher and swapping the wrapper's kwdefault by hand,
mirroring what an end-to-end install hook would do. 15k iters per
variant after 100 warmup, two timings each (wall+sync forces
``torch.cuda.synchronize()`` after every call; cpu_only does a single
sync at the end and measures submission rate):
| Variant                                  | wall+sync μs | cpu_only μs |
| ---------------------------------------- | -----------: | ----------: |
| 1. baseline (default_launcher, pool off) |        26.16 |       19.08 |
| 2. + C launcher (Chunk E only)           |        18.64 |       11.95 |
| 3. + pool (Chunk D only, Py launcher)    |        22.82 |       16.35 |
| 4. + pool + C launcher (D + E)           |    **16.69** |   **10.31** |
| Reference (#2534 pool=1)                 |            – |   **10.70** |
Our combined D+E cpu_only of 10.31 μs matches and slightly beats
#2534's reported 10.7 μs/call. The wall+sync gap (16.7 vs the implicit
~10.7) is per-call GPU sync overhead. Both numbers measure real
Python/C dispatch cost; the cpu_only metric is the relevant "how fast
can we submit" comparison. The meta-device retargeting + C-launcher
pool integration I worried about isn't required to match the number:
the simpler "C launcher + Python pool returning a cached tensor" stack
hits the same performance ceiling on this kernel.
Limitations
-----------
- C launcher install is still manual (prime then swap kwdefault). A
  follow-up could add a ``BoundKernel.set_config`` hook that
  auto-installs ``helion._C.CompiledLauncher`` when present.
- The rewrite only targets ``torch.empty_like`` (Helion's current
  codegen pattern). ``torch.empty(shape, dtype, device)`` could be
  rewritten too once the pool helper's signature is extended to match
  that form.
Part of the chunked C-extension plan; the codegen integration that
brings Chunks D + E together up to #2534's reported pool=1 number.
stack-info: PR: #2609, branch: yushangdi/stack/12
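The rewrite described in this commit message can be approximated with the standard `ast` module. This is a sketch under stated assumptions, not the code in ``helion/_compiler/generate_ast.py``: `rewrite_output_allocs` and its helpers are hypothetical, the launcher is assumed to be called by the bare name `_launcher`, and the ``HELION_OUTPUT_POOL`` gate is left out for brevity.

```python
import ast

POOL_HELPER = "_helion_pool_empty_like"  # helper name taken from the commit message


def _is_torch_empty_like(call):
    """True for a call spelled exactly torch.empty_like(...)."""
    func = call.func
    return (
        isinstance(func, ast.Attribute)
        and func.attr == "empty_like"
        and isinstance(func.value, ast.Name)
        and func.value.id == "torch"
    )


def _launcher_positional_names(func_def, launcher_name):
    """Names passed positionally to any launcher call inside func_def."""
    names = set()
    for node in ast.walk(func_def):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == launcher_name
        ):
            names.update(a.id for a in node.args if isinstance(a, ast.Name))
    return names


def rewrite_output_allocs(source, launcher_name="_launcher"):
    """Swap torch.empty_like for the pool helper on kernel-output buffers."""
    tree = ast.parse(source)
    for func_def in tree.body:
        if not isinstance(func_def, ast.FunctionDef):
            continue
        outputs = _launcher_positional_names(func_def, launcher_name)
        for stmt in func_def.body:  # top-level statements only
            if (
                isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and stmt.targets[0].id in outputs
                and isinstance(stmt.value, ast.Call)
                and _is_torch_empty_like(stmt.value)
            ):
                stmt.value.func = ast.Name(id=POOL_HELPER, ctx=ast.Load())
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

An allocation whose target never reaches the launcher is left on ``torch.empty_like``, which mirrors the conservative behavior the message describes for tensors that escape by other paths.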
yushangdi added a commit that referenced this pull request on May 27, 2026
Adds a single-file C extension (``helion/_C/_launcher.c``) that
exposes ``CompiledLauncher`` — a Python type with a ``tp_call`` slot.
A primed launcher dispatches a Triton kernel launch directly into
``compiled_kernel.run`` (the C launcher Triton emits), bypassing
both the Python ``default_launcher`` frame AND Triton's
``JITFunction.run`` pipeline (binder, ``compute_cache_key``,
kernel-cache lookup, ``launch_metadata`` allocation, etc.).
Usage from Python:
    import helion._C
    launcher = helion._C.CompiledLauncher()
    launcher.prime(triton_kernel, grid, args, num_warps=4, num_stages=2)
    # then install as the wrapper's _launcher kwdefault, or call directly:
    launcher(triton_kernel, grid, *args, num_warps=..., num_stages=...)
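The "install as the wrapper's _launcher kwdefault" step relies on a plain Python mechanism: keyword-only defaults live in the function's `__kwdefaults__` dict and can be replaced after definition. The sketch below illustrates that mechanism only; `host_wrapper`, `FakeCompiledLauncher`, and `install_launcher` are hypothetical stand-ins, not Helion code.

```python
def default_launcher(kernel, grid, *args):
    """Stand-in for the Python launch path."""
    return ("python", kernel(*args))


def host_wrapper(x, y, *, _launcher=default_launcher):
    """Stand-in for a generated host wrapper with a _launcher kwdefault."""
    return _launcher(lambda a, b: a + b, (1,), x, y)


class FakeCompiledLauncher:
    """Hypothetical stand-in for a primed native launcher object."""

    def __call__(self, kernel, grid, *args):
        return ("native", kernel(*args))


def install_launcher(wrapper, launcher):
    """Hot-swap the wrapper's keyword-only default for _launcher."""
    wrapper.__kwdefaults__["_launcher"] = launcher
```

After `install_launcher(host_wrapper, FakeCompiledLauncher())`, callers that never pass `_launcher` pick up the new object on the next call, with no change to the wrapper's source.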
Scope caveats (deliberately omitted to demonstrate the ceiling)
---------------------------------------------------------------
The launcher in this commit is intentionally minimal — these
correctness guards are NOT implemented, so the perf number below
represents the upper bound on the saving (i.e. how fast the C
launcher could possibly go before re-adding any safety):
- No multi-spec cache. The compiled kernel captured at prime time is
reused for EVERY subsequent call. Caller must keep arg specs stable
(same alignment, same shape, same dtype) or the GPU launch will
silently use a wrong-spec binary (e.g. vectorized 16-byte loads
against an unaligned pointer → ``CUDA error: misaligned address``).
- No knob/hook re-reads. ``triton.knobs.runtime.debug``,
``triton.knobs.compilation.instrumentation_mode``,
``triton.knobs.runtime.add_stages_inspection_hook``, and the launch
hooks (``launch_enter_hook`` / ``launch_exit_hook``) are NOT
observed — we always pass ``None`` for ``launch_metadata`` and for
both hooks. A profiler attached after priming silently sees no
Helion launches.
- No ``used_global_vals`` mutation check. Mutating a Helion-tracked
global (e.g. a ``_BLOCK_SIZE_*`` constexpr) between calls silently
uses the stale binary — Triton's own ``RuntimeError`` raise is
bypassed.
- No multi-device guard. Switching CUDA devices after priming and
then calling the launcher dispatches to the priming device's
stream — likely an ``invalid resource handle`` crash.
- No ``pre_run_hooks`` invocation. Hooks installed on the underlying
``JITFunction`` (e.g. autotune timing hooks) silently don't fire.
- No fallback to ``default_launcher`` on any of the above. The
launcher just dispatches with stale / wrong state.
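For contrast with the caveats above, here is a sketch of what two of the missing guards (a multi-spec cache and a fallback) might look like in Python. `GuardedLauncher` and `spec_of` are hypothetical; the real Phase-2 guards also cover knobs, hooks, globals, and devices, none of which are modeled here.

```python
class GuardedLauncher:
    """Dispatch to a cached binary only when the argument spec matches."""

    def __init__(self, spec_of, fallback):
        self._spec_of = spec_of    # maps call args to a hashable spec key
        self._fallback = fallback  # slow path used for unseen specs
        self._primed = {}          # spec key -> compiled callable

    def prime(self, args, compiled):
        """Record the compiled callable for the spec of these args."""
        self._primed[self._spec_of(args)] = compiled

    def __call__(self, *args):
        if not self._primed:
            raise RuntimeError("launcher called before prime()")
        compiled = self._primed.get(self._spec_of(args))
        if compiled is None:
            # Unseen spec: take the slow path instead of running a
            # binary that was specialized for different arguments.
            return self._fallback(*args)
        return compiled(*args)


def type_spec(args):
    """Toy spec: argument type names (real specs track dtype, alignment...)."""
    return tuple(type(a).__name__ for a in args)
```

The per-call cost of such a guard is one spec computation plus one dict lookup, which is the kind of overhead the ~2 μs estimate in this message refers to.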
A production version would port the Phase-2-launcher's correctness
guards to C; their per-call cost in Python is ~2 μs, so a
shippable C launcher nets out around -5 μs/call vs ``default_launcher``
rather than the -7 μs the unguarded version below achieves.
The build is NOT wired into hatchling — manual ``gcc`` one-liner
in ``helion/_C/README.md``. Default install leaves
``_C.CompiledLauncher = None`` and Helion uses the Python
``default_launcher``.
Tests in ``test_c_launcher.py`` cover:
- Correctness: primed C launcher produces same output as Python path
on the same args, both direct-call and routed through the
generated wrapper.
- ``CompiledLauncher`` is subclassable (``Py_TPFLAGS_BASETYPE``).
- Calling without priming raises ``RuntimeError``.
Benchmarks (H100, vector_add at N=4096, this commit's state)
------------------------------------------------------------
End-to-end per-call timing with the C launcher installed (manual
prime + swap of the wrapper's ``_launcher`` kwdefault):
  Variant                                      wall+sync   cpu_only
  baseline (main)                               23.94 us   17.17 us
  + Chunk D pool runtime (no rewrite)           24.34 us   17.95 us
  + codegen rewrite (HELION_OUTPUT_POOL=1)      22.21 us   15.96 us
  + Phase 2 fast launcher (Python multi-spec)   19.49 us   13.07 us
  + Chunk A (helion._C shim, no .c)             19.91 us   12.96 us
  + Chunk E (C launcher, this commit)           16.64 us    9.77 us
  Total saving vs baseline:                     -7.30 us   -7.40 us
  Speedup:                                         1.44x      1.76x
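The summary rows follow directly from the first and last measurement rows. A short check, using only the figures reported in the table (`summarize` is a hypothetical helper):

```python
# Per-call timings in microseconds, copied from the table above.
baseline = {"wall+sync": 23.94, "cpu_only": 17.17}
chunk_e = {"wall+sync": 16.64, "cpu_only": 9.77}


def summarize(before, after):
    """Return {metric: (saving_us, speedup)} rounded to two decimals."""
    return {
        metric: (
            round(after[metric] - before[metric], 2),
            round(before[metric] / after[metric], 2),
        )
        for metric in before
    }


summary = summarize(baseline, chunk_e)
print(summary)
```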
Reference: #2534 reports 10.7 us/call pool=1. Our 9.77 us cpu_only
beats that — the simpler "C launcher + Python pool returning a
cached tensor + Phase-2 multi-spec correctness" stack hits the same
performance ceiling as #2534's design on this kernel.
Reminder: the -7 us is the *ceiling* (this commit's C launcher has
NO correctness guards). A production C launcher must add the
Phase-2 guards (multi-spec cache, knob checks, used_global_vals
snapshots, etc.) back, costing ~2 us/call. The Phase 2 Python
launcher in the prior commit already has those guards, so its 13.07
us/call number is the realistic apples-to-apples comparison.
Why no separate "C tensor_key" commit
-------------------------------------
An earlier draft of this stack included a "Chunk B: C tensor_key
accelerator" commit between Chunk A and this one. ``Kernel.bind``
calls ``_tensor_key(fn, tensor)`` once per tensor arg; a C
implementation that mirrors the static-shapes branch returns the
same 4-tuple ``(dtype, sizes, strides, static_indices)`` ~110 ns
faster than the Python path (0.510 us vs 0.619 us in isolated
timeit microbench, 1.21x speedup).
But end-to-end on a typical kernel (vector_add has 2 tensor args)
the expected per-call saving of ~220 ns (2 x ~110 ns) was within
run-to-run variance:
  Variant                                wall+sync   cpu_only
  prior commit (Chunk A shim, no .c)      18.79 us   12.89 us
  Chunk A + C tensor_key (dropped)        19.13 us   13.06 us
We dropped the dedicated commit. The ``helion._C.tensor_key = None``
slot remains in Chunk A's shim as a documented extension point for
a future C accelerator that can do meaningfully better than the
~110 ns/call current Python C-API approach achieves (e.g. via
PyTorch's stable C ABI to skip Python attribute machinery for
``tensor.size()`` / ``tensor.stride()`` reads). The Python C-API
path's gain is bounded by the cost of those attribute lookups,
which it doesn't actually avoid — so any future C tensor_key
accelerator should target the stable ABI or unstable ``at::Tensor``
to get a meaningful per-call saving.
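The extension point described above, a slot that is ``None`` by default and an accelerator that may itself return ``None`` for inputs it does not support, can be sketched as follows. `FakeTensor`, `python_tensor_key`, and the `accelerated` parameter are hypothetical, and the empty tuple is only a placeholder for ``static_indices``, whose real contents are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Minimal stand-in for a tensor's key-relevant metadata."""
    dtype: str
    shape: tuple
    strides: tuple


def python_tensor_key(tensor):
    """Reference path: (dtype, sizes, strides, static_indices)."""
    # static_indices is a placeholder; its real contents are not modeled.
    return (tensor.dtype, tuple(tensor.shape), tuple(tensor.strides), ())


def tensor_key(tensor, accelerated=None):
    """Try the accelerator slot first; None means 'not handled, fall back'."""
    if accelerated is not None:
        key = accelerated(tensor)
        if key is not None:
            return key
    return python_tensor_key(tensor)
```

Because the fast path signals "unsupported" with a plain return value instead of an exception, the fallback adds only one comparison in the common case.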
stack-info: PR: #2609, branch: yushangdi/stack/12
yushangdi added a commit that referenced this pull request on May 27, 2026
Adds a single-crate Rust extension (``helion/_native/src/lib.rs``,
built via PyO3) that exposes ``CompiledLauncher`` — a Python type
with a ``__call__`` slot. A primed launcher dispatches a Triton
kernel launch directly into ``compiled_kernel.run`` (the C launcher
Triton emits), bypassing both the Python ``default_launcher`` frame
AND Triton's ``JITFunction.run`` pipeline (binder,
``compute_cache_key``, kernel-cache lookup, ``launch_metadata``
allocation, etc.).
Usage from Python:
    import helion._native
    launcher = helion._native.CompiledLauncher()
    launcher.prime(triton_kernel, grid, args, num_warps=4, num_stages=2)
    # then install as the wrapper's _launcher kwdefault, or call directly:
    launcher(triton_kernel, grid, *args, num_warps=..., num_stages=...)
Why Rust (not C)
----------------
A prior draft of this commit used a hand-written C extension. The
Rust version is functionally identical and benchmarks the same on
the hot path (see numbers below). Rust buys us:
- Memory safety on error paths. PyO3's ``PyResult`` / ``Bound<'_, T>``
types prevent the kinds of reference-leak bugs the manual
``Py_XDECREF`` ladder in C is prone to. The Rust source is about
half the length of the equivalent C and reads closer to Python.
- A standard build system (``cargo``) instead of an ad-hoc ``gcc``
one-liner.
PyO3 0.28 with the ``abi3-py310`` feature makes the resulting .so
binary-compatible with any CPython >= 3.10. The crate's only
dependency is ``pyo3`` itself.
Scope caveats (deliberately omitted to demonstrate the ceiling)
---------------------------------------------------------------
The launcher in this commit is intentionally minimal — these
correctness guards are NOT implemented, so the perf number below
represents the upper bound on the saving (i.e. how fast the Rust
launcher could possibly go before re-adding any safety):
- No multi-spec cache. The compiled kernel captured at prime time is
reused for EVERY subsequent call. Caller must keep arg specs stable
(same alignment, same shape, same dtype) or the GPU launch will
silently use a wrong-spec binary (e.g. vectorized 16-byte loads
against an unaligned pointer -> ``CUDA error: misaligned address``).
- No knob/hook re-reads. ``triton.knobs.runtime.debug``,
``triton.knobs.compilation.instrumentation_mode``,
``triton.knobs.runtime.add_stages_inspection_hook``, and the launch
hooks (``launch_enter_hook`` / ``launch_exit_hook``) are NOT
observed — we always pass ``None`` for ``launch_metadata`` and for
both hooks. A profiler attached after priming silently sees no
Helion launches.
- No ``used_global_vals`` mutation check. Mutating a Helion-tracked
global (e.g. a ``_BLOCK_SIZE_*`` constexpr) between calls silently
uses the stale binary — Triton's own ``RuntimeError`` raise is
bypassed.
- No multi-device guard. Switching CUDA devices after priming and
then calling the launcher dispatches to the priming device's
stream — likely an ``invalid resource handle`` crash.
- No ``pre_run_hooks`` invocation. Hooks installed on the underlying
``JITFunction`` (e.g. autotune timing hooks) silently don't fire.
- No fallback to ``default_launcher`` on any of the above. The
launcher just dispatches with stale / wrong state.
A production version would port the Phase-2-launcher's correctness
guards to Rust; their per-call cost in Python is ~2 us, so a
shippable Rust launcher nets out around -5 us/call vs
``default_launcher`` rather than the -7 us the unguarded version
below achieves.
The build is NOT wired into hatchling — see
``helion/_native/README.md`` for the manual ``cargo build`` flow.
Default install leaves ``_native.CompiledLauncher = None`` and
Helion uses the Python ``default_launcher``.
Tests in ``test_native_launcher.py`` cover:
- Correctness: primed Rust launcher produces same output as Python
path on the same args, both direct-call and routed through the
generated wrapper.
- ``CompiledLauncher`` is subclassable (declared with
``#[pyclass(subclass)]``).
- Calling without priming raises ``RuntimeError``.
Benchmarks (H100, vector_add at N=4096)
---------------------------------------
End-to-end per-call timing with the Rust launcher installed
(manual prime + swap of the wrapper's ``_launcher`` kwdefault):
  Variant                                      wall+sync   cpu_only
  baseline (main)                               23.94 us   17.17 us
  + Chunk D pool runtime (no rewrite)           24.34 us   17.95 us
  + codegen rewrite (HELION_OUTPUT_POOL=1)      22.21 us   15.96 us
  + Concrete-tensor _hashable_dims fast path      ~21 us   ~14.5 us
  + Phase 2 fast launcher (Python multi-spec)   19.49 us   13.07 us
  + Chunk A (helion._native shim, no .rs)       ~19.7 us   ~13.0 us
  + Chunk E (Rust launcher, this commit)        ~16.7 us   ~9.93 us
  Total saving vs baseline:                     ~-7.2 us   ~-7.2 us
Per-call comparison vs the equivalent C version (prior draft of
this commit) at the same stack position:
  C ``tp_call``                                 16.64 us    9.77 us
  Rust ``__call__`` (this commit)               ~16.7 us    9.93 us
The ~0.16 us difference is within run-to-run noise. PyO3's
macro-generated trampolines compile down to the same CPython entry
points a hand-written C extension would issue; the per-call ceiling
is set by ``compiled_kernel.run`` (Triton's C launcher) and the
Python C API tuple-build, not by language choice.
Reference: #2534 reports 10.7 us/call pool=1. Our 9.93 us cpu_only
beats that — the simpler "Rust launcher + Python pool returning a
cached tensor + Phase-2 multi-spec correctness" stack hits the same
performance ceiling as #2534's design on this kernel.
Reminder: the -7 us is the *ceiling* (this commit's Rust launcher
has NO correctness guards). A production version must add the
Phase-2 guards (multi-spec cache, knob checks, used_global_vals
snapshots, etc.) back, costing ~2 us/call. The Phase 2 Python
launcher in the prior commit already has those guards, so its 13.07
us/call number is the realistic apples-to-apples comparison.
Why no separate "Rust tensor_key" commit
----------------------------------------
The earlier C-stack version of this work investigated a Rust /
C ``tensor_key`` accelerator. ``Kernel.bind`` calls ``_tensor_key``
once per tensor arg; a Rust implementation that mirrors the
static-shapes branch returns the same 4-tuple ``(dtype, sizes,
strides, static_indices)`` ~100 ns faster than the Python path in
microbench. But end-to-end on a typical kernel (vector_add has 2
tensor args) the per-call saving was ~200 ns — within run-to-run
variance.
We dropped the dedicated commit. The ``helion._native.tensor_key = None``
slot remains in Chunk A's shim as a documented extension point for a
future accelerator that can do meaningfully better than the ~100
ns/call current Python C-API approach achieves (e.g. via PyTorch's
stable C ABI to skip Python attribute machinery for
``tensor.size()`` / ``tensor.stride()`` reads). The Python C-API
path's gain is bounded by the cost of those attribute lookups,
which it doesn't actually avoid — so any future Rust ``tensor_key``
accelerator should target the stable ABI or unstable ``at::Tensor``
to get a meaningful per-call saving.
stack-info: PR: #2609, branch: yushangdi/stack/12