Add a C extension so launches skip more Python frames #2534

Draft
choijon5 wants to merge 1 commit into choijon5/stack/74 from choijon5/stack/75

@choijon5 choijon5 commented May 20, 2026

Even with the previous patch's fast launcher, every kernel call still
spends microseconds in Python: building a cache key, broadcasting
argument shapes, and bouncing through one more Python wrapper before
reaching the C launcher. On small kernels, those Python frames are a
big chunk of the remaining overhead.

Adds Helion's first compiled C extension, helion._C. It does the
hottest steps in C instead of Python:

  • Builds the per-call argument cache key directly in C.
  • Exposes the launcher itself as a C-callable object so the
    generated wrapper can call straight into it (no extra Python
    frame).
  • Skips a redundant broadcast call when the input shapes already
    match (the common case for Helion kernels).
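The broadcast skip in the last bullet can be pictured as a codegen-time decision. The following is a minimal sketch, not the PR's actual codegen: `needs_broadcast` and `emit_prologue` are hypothetical names, and it assumes the input shapes are known statically when the wrapper is generated.

```python
def needs_broadcast(shapes):
    """True if any statically known shape differs from the first one."""
    first = tuple(shapes[0])
    return any(tuple(s) != first for s in shapes[1:])


def emit_prologue(arg_names, shapes):
    """Return the wrapper source lines to emit before the kernel launch.

    When all shapes already match, no broadcast line is emitted at all,
    so the generated wrapper never pays for the call.
    """
    if not needs_broadcast(shapes):
        return []
    joined = ", ".join(arg_names)
    return [f"{joined} = torch.broadcast_tensors({joined})"]


same = emit_prologue(["x", "y"], [(1024, 1024), (1024, 1024)])
mixed = emit_prologue(["x", "y"], [(1024, 1024), (1024,)])
```

Here `same` is empty while `mixed` contains a single `torch.broadcast_tensors` line, so only kernels with mismatched input shapes keep the extra call.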

If the C extension didn't build (e.g. PyTorch-on-MTIA, debug build),
we fall back to the pure-Python launcher automatically and Helion
keeps working as before.
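A fallback of this kind is commonly written as a guarded import. The sketch below is an assumption about the shape of that logic, not the PR's code: `python_launcher` and `load_launcher` are hypothetical names, and only the module name `helion._C` and the type name `CompiledLauncher` come from the description.

```python
import importlib


def python_launcher(kernel, grid, *args, **kwargs):
    """Stand-in for the pure-Python launch path (hypothetical)."""
    return ("python", kernel, grid, args)


def load_launcher(module_name="helion._C"):
    """Return a compiled launcher if the extension imports, else the Python one."""
    try:
        ext = importlib.import_module(module_name)
    except ImportError:
        # Extension was never built or cannot be loaded on this platform.
        return python_launcher
    # A shim module may import fine but leave the slot unset.
    launcher_type = getattr(ext, "CompiledLauncher", None)
    if launcher_type is None:
        return python_launcher
    return launcher_type()


launcher = load_launcher("no_such_package_for_demo._C")
```

Because the decision is made once at load time, the per-call path does not need to re-check whether the extension exists.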

Perf (1024x1024 bf16 add microbench, CUDA graphs off):
Pre-launcher baseline: 34.5 us / call
Previous fast-launcher patch: 25.2 us / call
This patch (default): 18.2 us / call (-29% vs prev)
This patch (pool=1): 10.7 us / call (-58% vs prev)

Three pieces inside helion._C:

  1. C tensor_key (helion._C.tensor_key): replaces the per-call
    Python specialization-key build inside Kernel.bind for the
    static-shape fast path. Falls back to the Python implementation
    for unsupported inputs (SymInts, etc.) by returning None.

  2. C CompiledLauncher: caches the Triton CompiledKernel + driver
    hooks once and exposes a tp_call slot so the host wrapper can
    skip the _FastLauncher.__call__ Python frame. The Python
    _FastLauncher gets an _install_c_launcher method that hot-swaps
    the kwdefault "_launcher" entry to the C object after priming.

  3. Two codegen tactics for the generated Triton host wrapper:

    • Skip torch.broadcast_tensors when the input shapes statically
      match (most Helion kernels).
    • Opt-in (HELION_OUTPUT_POOL=1) reuse of the output buffer via
      _helion_output_alloc, swapping torch.empty for a small cache.
      Default behavior preserved (one fresh allocation per call).
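The kwdefault hot-swap in item 2 relies on a standard CPython feature: keyword-only defaults live in the function's mutable `__kwdefaults__` dict and are read at call time. The sketch below illustrates only that mechanism. `host_wrapper`, `python_launcher`, and `CLauncherStandIn` are hypothetical stand-ins, not the generated wrapper or the real C type.

```python
def python_launcher(kernel, grid, *args):
    # Stand-in for the Python launcher frame.
    return ("python", kernel, grid, args)


class CLauncherStandIn:
    """Hypothetical placeholder for a C type that implements tp_call."""

    def __call__(self, kernel, grid, *args):
        return ("c", kernel, grid, args)


def host_wrapper(x, y, *, _launcher=python_launcher):
    # The wrapper receives its launcher as a keyword-only default.
    return _launcher("add_kernel", (1,), x, y)


before = host_wrapper(1, 2)
# Hot-swap: later calls pick up the new default without regenerating the wrapper.
host_wrapper.__kwdefaults__["_launcher"] = CLauncherStandIn()
after = host_wrapper(1, 2)
```

After the swap, callers that do not pass `_launcher` explicitly go straight to the replacement object, which is how a Python frame can drop out of the call path.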

@meta-cla meta-cla Bot added the CLA Signed label May 20, 2026
stack-info: PR: #2534, branch: choijon5/stack/75
@choijon5 choijon5 force-pushed the choijon5/stack/74 branch from 91281ac to 72b5756 Compare May 20, 2026 18:59
@choijon5 choijon5 force-pushed the choijon5/stack/75 branch from 527baf9 to 529203e Compare May 20, 2026 18:59
yushangdi added a commit that referenced this pull request May 27, 2026
… (Chunk D+E follow-up)

Wires Chunk D (output pool) into the generated host wrapper via a
codegen rewrite. When ``HELION_OUTPUT_POOL=1`` is set at compile
time, ``out = torch.empty_like(x)`` in the wrapper is rewritten to
``out = _helion_pool_empty_like(x)`` (== ``helion.runtime.empty_like``).
Users no longer need to manually swap ``torch.empty_like`` in their
kernel source.

Implementation: ``_rewrite_output_allocs_for_pool`` in
``helion/_compiler/generate_ast.py`` walks the host wrapper's AST,
finds top-level ``Assign(target=Name(v), value=Call(torch.empty_like(...)))``
statements whose target ``v`` is passed as a positional arg to a
kernel launcher call in the same scope (a kernel-output buffer), and
swaps the called function for the pool helper.

The rewrite is conservative — it only fires on:
1. ``HELION_OUTPUT_POOL=1`` env var set at compile time
2. ``torch.empty_like`` calls (not arbitrary factory expressions)
3. Targets that are consumed by a kernel launcher call (output buffers)

Tensors that escape via other paths stay on ``torch.empty_like``
because pooled recycling isn't safe for them.
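The rewrite described above can be approximated with the standard `ast` module. This is a simplified sketch under stated assumptions, not `_rewrite_output_allocs_for_pool` itself: it assumes the launcher is invoked through a name called `_launcher`, takes the enable flag as a parameter instead of reading `HELION_OUTPUT_POOL`, and needs Python 3.9+ for `ast.unparse`.

```python
import ast

POOL_HELPER = "_helion_pool_empty_like"


def is_torch_empty_like(node):
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "empty_like"
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id == "torch"
    )


def launcher_arg_names(func_def):
    """Names passed positionally to any `_launcher(...)` call in the function."""
    names = set()
    for node in ast.walk(func_def):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id == "_launcher"):
            names.update(a.id for a in node.args if isinstance(a, ast.Name))
    return names


def rewrite_output_allocs(source, enabled):
    if not enabled:
        return source  # default behavior: leave torch.empty_like untouched
    tree = ast.parse(source)
    for func_def in tree.body:
        if not isinstance(func_def, ast.FunctionDef):
            continue
        outputs = launcher_arg_names(func_def)
        for stmt in func_def.body:  # top-level statements only
            if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)
                    and stmt.targets[0].id in outputs
                    and is_torch_empty_like(stmt.value)):
                stmt.value.func = ast.Name(id=POOL_HELPER, ctx=ast.Load())
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


SRC = """
def wrapper(x, y, *, _launcher=None):
    out = torch.empty_like(x)
    tmp = torch.empty_like(y)
    _launcher(kernel, grid, x, y, out)
    return out, tmp
"""
rewritten = rewrite_output_allocs(SRC, enabled=True)
```

In `rewritten`, only `out` is switched to the pool helper. `tmp` is never handed to the launcher, so it stays on `torch.empty_like`, which mirrors the conservative rule in the list above.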

Tests in ``test_pool_codegen_rewrite.py`` cover:
- Rewrite fires when ``HELION_OUTPUT_POOL=1`` is set and produces a
  wrapper that calls ``_helion_pool_empty_like(x)``.
- Rewrite is ABSENT when env var unset — wrapper preserves
  ``torch.empty_like(x)``.
- End-to-end correctness: outputs match a ``torch.empty_like``
  baseline regardless of which allocation path was taken.
- ``_POOLS`` accumulates exactly one ring entry per call signature
  when the wrapper runs.

Benchmarks (H100, vector_add at N=4096) — combined Chunk D + Chunk E
---------------------------------------------------------------------

To approach #2534's reported 10.7 μs/call (pool=1) number we need
both the pool wiring (this commit) AND the C launcher (Chunk E)
active. The C launcher install path isn't auto-wired yet — measured
here by priming the launcher and swapping the wrapper's kwdefault by
hand, mirroring what an end-to-end install hook would do.

15k iters per variant after 100 warmup, two timings each (wall+sync
forces ``torch.cuda.synchronize()`` after every call; cpu_only does a
single sync at the end and measures submission rate):

| Variant                                  | wall+sync μs | cpu_only μs |
| ---------------------------------------- | -----------: | ----------: |
| 1. baseline (default_launcher, pool off) |        26.16 |       19.08 |
| 2. + C launcher (Chunk E only)           |        18.64 |       11.95 |
| 3. + pool (Chunk D only, Py launcher)    |        22.82 |       16.35 |
| 4. + pool + C launcher (D + E)           |    **16.69** |   **10.31** |
| Reference (#2534 pool=1)                 |            – |   **10.70** |

Our combined D+E cpu_only of 10.31 μs matches and slightly beats
#2534's reported 10.7 μs/call. The wall+sync gap (16.7 vs the
implicit ~10.7) is per-call GPU sync overhead — both numbers measure
real Python/C dispatch cost; the cpu_only metric is the relevant
"how fast can we submit" comparison.
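The two timing modes differ only in where the synchronization happens. The harness below is a sketch of that methodology, not the script used for the table: `time_per_call_us` is a hypothetical helper, and `sync` is a parameter so the example runs without a GPU (on CUDA one would pass `torch.cuda.synchronize`).

```python
import time


def time_per_call_us(fn, iters, warmup, sync, sync_every_call):
    """Average microseconds per call of `fn`.

    sync_every_call=True  -> "wall+sync": wait for the device after each call.
    sync_every_call=False -> "cpu_only": one sync at the end, so the result
                             reflects how fast work can be submitted.
    """
    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
        if sync_every_call:
            sync()
    sync()  # drain outstanding work before stopping the clock
    return (time.perf_counter() - start) / iters * 1e6


sync_calls = []
wall = time_per_call_us(lambda: None, iters=100, warmup=10,
                        sync=lambda: sync_calls.append(1), sync_every_call=True)
wall_syncs = len(sync_calls)
sync_calls.clear()
cpu = time_per_call_us(lambda: None, iters=100, warmup=10,
                       sync=lambda: sync_calls.append(1), sync_every_call=False)
cpu_syncs = len(sync_calls)
```

With a no-op `sync` both numbers are close. On a real device the per-call sync adds the wait for kernel completion to every iteration, which is the gap between the two columns.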

The meta-device retargeting + C-launcher pool integration I worried
about isn't required to match the number — the simpler "C launcher
+ Python pool returning a cached tensor" stack hits the same
performance ceiling on this kernel.

Limitations
-----------

- C launcher install is still manual (prime then swap kwdefault) —
  a follow-up could add a ``BoundKernel.set_config`` hook that
  auto-installs ``helion._C.CompiledLauncher`` when present.
- The rewrite only targets ``torch.empty_like`` (Helion's current
  codegen pattern). ``torch.empty(shape, dtype, device)`` could be
  rewritten too once the pool helper's signature is extended to
  match that form.

Part of the chunked C-extension plan; the codegen integration that
brings Chunks D + E together up to #2534's reported pool=1 number.

stack-info: PR: #2609, branch: yushangdi/stack/12
yushangdi added a commit that referenced this pull request May 27, 2026
Adds a single-file C extension (``helion/_C/_launcher.c``) that
exposes ``CompiledLauncher`` — a Python type with a ``tp_call`` slot.
A primed launcher dispatches a Triton kernel launch directly into
``compiled_kernel.run`` (the C launcher Triton emits), bypassing
both the Python ``default_launcher`` frame AND Triton's
``JITFunction.run`` pipeline (binder, ``compute_cache_key``,
kernel-cache lookup, ``launch_metadata`` allocation, etc.).

Usage from Python:

    import helion._C
    launcher = helion._C.CompiledLauncher()
    launcher.prime(triton_kernel, grid, args, num_warps=4, num_stages=2)
    # then install as the wrapper's _launcher kwdefault, or call directly:
    launcher(triton_kernel, grid, *args, num_warps=..., num_stages=...)

Scope caveats (deliberately omitted to demonstrate the ceiling)
---------------------------------------------------------------

The launcher in this commit is intentionally minimal — these
correctness guards are NOT implemented, so the perf number below
represents the upper bound on the saving (i.e. how fast the C
launcher could possibly go before re-adding any safety):

- No multi-spec cache. The compiled kernel captured at prime time is
  reused for EVERY subsequent call. Caller must keep arg specs stable
  (same alignment, same shape, same dtype) or the GPU launch will
  silently use a wrong-spec binary (e.g. vectorized 16-byte loads
  against an unaligned pointer → ``CUDA error: misaligned address``).
- No knob/hook re-reads. ``triton.knobs.runtime.debug``,
  ``triton.knobs.compilation.instrumentation_mode``,
  ``triton.knobs.runtime.add_stages_inspection_hook``, and the launch
  hooks (``launch_enter_hook`` / ``launch_exit_hook``) are NOT
  observed — we always pass ``None`` for ``launch_metadata`` and for
  both hooks. A profiler attached after priming silently sees no
  Helion launches.
- No ``used_global_vals`` mutation check. Mutating a Helion-tracked
  global (e.g. a ``_BLOCK_SIZE_*`` constexpr) between calls silently
  uses the stale binary — Triton's own ``RuntimeError`` raise is
  bypassed.
- No multi-device guard. Switching CUDA devices after priming and
  then calling the launcher dispatches to the priming device's
  stream — likely an ``invalid resource handle`` crash.
- No ``pre_run_hooks`` invocation. Hooks installed on the underlying
  ``JITFunction`` (e.g. autotune timing hooks) silently don't fire.
- No fallback to ``default_launcher`` on any of the above. The
  launcher just dispatches with stale / wrong state.

A production version would port the Phase-2-launcher's correctness
guards to C; their per-call cost in Python is ~2 μs, so a
ship-able C launcher nets out around -5 μs/call vs ``default_launcher``
rather than the -7 μs the un-guarded version below achieves.
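To make the missing guards concrete, here is a small Python sketch of the shape such a guarded launcher might take. It is an illustration under loud assumptions: `GuardedLauncher`, `spec_of`, and the callback signatures are hypothetical, the "spec" is reduced to argument type names, and only two of the listed guards (multi-spec cache and multi-device fallback) are modelled.

```python
def spec_of(args):
    """Hypothetical arg spec: just the type name of each argument."""
    return tuple(type(a).__name__ for a in args)


class GuardedLauncher:
    def __init__(self, compile_fn, fallback):
        self._compile = compile_fn  # spec -> launchable binary
        self._fallback = fallback   # slow path that is always correct
        self._cache = {}            # multi-spec cache: spec -> binary
        self._device = None

    def prime(self, device, args):
        self._device = device
        spec = spec_of(args)
        self._cache[spec] = self._compile(spec)

    def __call__(self, device, *args):
        if self._device is None:
            raise RuntimeError("launcher called before prime()")
        if device != self._device:
            return self._fallback(device, *args)  # multi-device guard
        binary = self._cache.get(spec_of(args))
        if binary is None:
            return self._fallback(device, *args)  # unseen spec: do not reuse a stale binary
        return binary(*args)


launcher = GuardedLauncher(
    compile_fn=lambda spec: (lambda *a: ("fast", spec)),
    fallback=lambda device, *a: ("fallback", device),
)
launcher.prime("cuda:0", (1, 2))
fast = launcher("cuda:0", 3, 4)
other_spec = launcher("cuda:0", 3.0, 4.0)
other_device = launcher("cuda:1", 3, 4)
```

Each guard adds a comparison or dictionary lookup per call, which is the kind of cost the ~2 μs estimate above refers to.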

The build is NOT wired into hatchling — manual ``gcc`` one-liner
in ``helion/_C/README.md``. Default install leaves
``_C.CompiledLauncher = None`` and Helion uses the Python
``default_launcher``.

Tests in ``test_c_launcher.py`` cover:
- Correctness: primed C launcher produces same output as Python path
  on the same args, both direct-call and routed through the
  generated wrapper.
- ``CompiledLauncher`` is subclassable (``Py_TPFLAGS_BASETYPE``).
- Calling without priming raises ``RuntimeError``.

Benchmarks (H100, vector_add at N=4096, this commit's state)
------------------------------------------------------------

End-to-end per-call timing with the C launcher installed (manual
prime + swap of the wrapper's ``_launcher`` kwdefault):

  Variant                                       wall+sync   cpu_only
  baseline (main)                               23.94 us    17.17 us
  + Chunk D pool runtime (no rewrite)           24.34 us    17.95 us
  + codegen rewrite (HELION_OUTPUT_POOL=1)      22.21 us    15.96 us
  + Phase 2 fast launcher (Python multi-spec)   19.49 us    13.07 us
  + Chunk A (helion._C shim, no .c)             19.91 us    12.96 us
  + Chunk E (C launcher, this commit)           16.64 us     9.77 us

  Total saving vs baseline:                     -7.30 us    -7.40 us
  Speedup:                                        1.44x       1.76x

Reference: #2534 reports 10.7 us/call pool=1. Our 9.77 us cpu_only
beats that — the simpler "C launcher + Python pool returning a
cached tensor + Phase-2 multi-spec correctness" stack hits the same
performance ceiling as #2534's design on this kernel.

Reminder: the -7 us is the *ceiling* (this commit's C launcher has
NO correctness guards). A production C launcher must add the
Phase-2 guards (multi-spec cache, knob checks, used_global_vals
snapshots, etc.) back, costing ~2 us/call. The Phase 2 Python
launcher in the prior commit already has those guards, so its 13.07
us/call number is the realistic apples-to-apples comparison.

Why no separate "C tensor_key" commit
-------------------------------------

An earlier draft of this stack included a "Chunk B: C tensor_key
accelerator" commit between Chunk A and this one. ``Kernel.bind``
calls ``_tensor_key(fn, tensor)`` once per tensor arg; a C
implementation that mirrors the static-shapes branch returns the
same 4-tuple ``(dtype, sizes, strides, static_indices)`` ~110 ns
faster than the Python path (0.510 us vs 0.619 us in isolated
timeit microbench, 1.21x speedup).

But end-to-end on a typical kernel (vector_add has 2 tensor args)
the expected per-call saving is only ~220 ns (2 x ~110 ns), which is
within run-to-run variance. The measured numbers show no improvement:

  Variant                                       wall+sync   cpu_only
  prior commit (Chunk A shim, no .c)            18.79 us    12.89 us
  Chunk A + C tensor_key (dropped)              19.13 us    13.06 us

We dropped the dedicated commit. The ``helion._C.tensor_key = None``
slot remains in Chunk A's shim as a documented extension point for
a future C accelerator that can do meaningfully better than the
~110 ns/call current Python C-API approach achieves (e.g. via
PyTorch's stable C ABI to skip Python attribute machinery for
``tensor.size()`` / ``tensor.stride()`` reads). The Python C-API
path's gain is bounded by the cost of those attribute lookups,
which it doesn't actually avoid — so any future C tensor_key
accelerator should target the stable ABI or unstable ``at::Tensor``
to get a meaningful per-call saving.
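The extension-point contract described here (an optional accelerator that may decline an input by returning `None`) can be sketched as follows. `FakeTensor`, `py_tensor_key`, and the `accel` parameter are hypothetical stand-ins. `static_indices` is treated as an opaque value because its exact meaning is not spelled out in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Minimal stand-in for a tensor; not a torch.Tensor."""
    dtype: str
    shape: tuple
    strides: tuple
    static_indices: tuple = ()


def py_tensor_key(t):
    # Same 4-tuple layout as described above.
    return (t.dtype, tuple(t.shape), tuple(t.strides), t.static_indices)


def tensor_key(t, accel=None):
    """Use the accelerator when present; fall back when it is absent or declines."""
    if accel is not None:
        key = accel(t)
        if key is not None:  # None means "unsupported input, use Python"
            return key
    return py_tensor_key(t)


t = FakeTensor("bfloat16", (1024, 1024), (1024, 1))
no_accel = tensor_key(t)
declined = tensor_key(t, accel=lambda _t: None)
accelerated = tensor_key(t, accel=py_tensor_key)
```

All three results are identical, which is the property an accelerator has to preserve: it may only change how fast the key is built, never the key itself.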

stack-info: PR: #2609, branch: yushangdi/stack/12
yushangdi added a commit that referenced this pull request May 27, 2026
Adds a single-crate Rust extension (``helion/_native/src/lib.rs``,
built via PyO3) that exposes ``CompiledLauncher`` — a Python type
with a ``__call__`` slot. A primed launcher dispatches a Triton
kernel launch directly into ``compiled_kernel.run`` (the C launcher
Triton emits), bypassing both the Python ``default_launcher`` frame
AND Triton's ``JITFunction.run`` pipeline (binder,
``compute_cache_key``, kernel-cache lookup, ``launch_metadata``
allocation, etc.).

Usage from Python:

    import helion._native
    launcher = helion._native.CompiledLauncher()
    launcher.prime(triton_kernel, grid, args, num_warps=4, num_stages=2)
    # then install as the wrapper's _launcher kwdefault, or call directly:
    launcher(triton_kernel, grid, *args, num_warps=..., num_stages=...)

Why Rust (not C)
----------------

A prior draft of this commit used a hand-written C extension. The
Rust version is functionally identical and benchmarks the same on
the hot path (see numbers below). Rust buys us:

- Memory safety on error paths. PyO3's ``PyResult`` / ``Bound<'_, T>``
  types prevent the kinds of reference-leak bugs the manual
  ``Py_XDECREF`` ladder in C is prone to. The Rust source is about
  half the length of the equivalent C and reads closer to Python.
- A standard build system (``cargo``) instead of an ad-hoc ``gcc``
  one-liner.

PyO3 0.28 with the ``abi3-py310`` feature makes the resulting .so
binary-compatible with any CPython >= 3.10. The crate's only
dependency is ``pyo3`` itself.

Scope caveats (deliberately omitted to demonstrate the ceiling)
---------------------------------------------------------------

The launcher in this commit is intentionally minimal — these
correctness guards are NOT implemented, so the perf number below
represents the upper bound on the saving (i.e. how fast the Rust
launcher could possibly go before re-adding any safety):

- No multi-spec cache. The compiled kernel captured at prime time is
  reused for EVERY subsequent call. Caller must keep arg specs stable
  (same alignment, same shape, same dtype) or the GPU launch will
  silently use a wrong-spec binary (e.g. vectorized 16-byte loads
  against an unaligned pointer -> ``CUDA error: misaligned address``).
- No knob/hook re-reads. ``triton.knobs.runtime.debug``,
  ``triton.knobs.compilation.instrumentation_mode``,
  ``triton.knobs.runtime.add_stages_inspection_hook``, and the launch
  hooks (``launch_enter_hook`` / ``launch_exit_hook``) are NOT
  observed — we always pass ``None`` for ``launch_metadata`` and for
  both hooks. A profiler attached after priming silently sees no
  Helion launches.
- No ``used_global_vals`` mutation check. Mutating a Helion-tracked
  global (e.g. a ``_BLOCK_SIZE_*`` constexpr) between calls silently
  uses the stale binary — Triton's own ``RuntimeError`` raise is
  bypassed.
- No multi-device guard. Switching CUDA devices after priming and
  then calling the launcher dispatches to the priming device's
  stream — likely an ``invalid resource handle`` crash.
- No ``pre_run_hooks`` invocation. Hooks installed on the underlying
  ``JITFunction`` (e.g. autotune timing hooks) silently don't fire.
- No fallback to ``default_launcher`` on any of the above. The
  launcher just dispatches with stale / wrong state.

A production version would port the Phase-2-launcher's correctness
guards to Rust; their per-call cost in Python is ~2 us, so a
ship-able Rust launcher nets out around -5 us/call vs
``default_launcher`` rather than the -7 us the un-guarded version
below achieves.

The build is NOT wired into hatchling — see
``helion/_native/README.md`` for the manual ``cargo build`` flow.
Default install leaves ``_native.CompiledLauncher = None`` and
Helion uses the Python ``default_launcher``.

Tests in ``test_native_launcher.py`` cover:
- Correctness: primed Rust launcher produces same output as Python
  path on the same args, both direct-call and routed through the
  generated wrapper.
- ``CompiledLauncher`` is subclassable (declared with
  ``#[pyclass(subclass)]``).
- Calling without priming raises ``RuntimeError``.

Benchmarks (H100, vector_add at N=4096)
---------------------------------------

End-to-end per-call timing with the Rust launcher installed
(manual prime + swap of the wrapper's ``_launcher`` kwdefault):

  Variant                                       wall+sync   cpu_only
  baseline (main)                               23.94 us    17.17 us
  + Chunk D pool runtime (no rewrite)           24.34 us    17.95 us
  + codegen rewrite (HELION_OUTPUT_POOL=1)      22.21 us    15.96 us
  + Concrete-tensor _hashable_dims fast path    ~21 us      ~14.5 us
  + Phase 2 fast launcher (Python multi-spec)   19.49 us    13.07 us
  + Chunk A (helion._native shim, no .rs)       ~19.7 us    ~13.0 us
  + Chunk E (Rust launcher, this commit)        ~16.7 us    ~9.93 us

  Total saving vs baseline:                     ~-7.2 us    ~-7.2 us

Per-call comparison vs the equivalent C version (prior draft of
this commit) at the same stack position:

  C ``tp_call``                                 16.64 us    9.77 us
  Rust ``__call__`` (this commit)               ~16.7 us    9.93 us

The ~0.16 us difference is within run-to-run noise. PyO3's
macro-generated trampolines compile down to the same CPython entry
points a hand-written C extension would issue; the per-call ceiling
is set by ``compiled_kernel.run`` (Triton's C launcher) and the
Python C API tuple-build, not by language choice.

Reference: #2534 reports 10.7 us/call pool=1. Our 9.93 us cpu_only
beats that — the simpler "Rust launcher + Python pool returning a
cached tensor + Phase-2 multi-spec correctness" stack hits the same
performance ceiling as #2534's design on this kernel.

Reminder: the -7 us is the *ceiling* (this commit's Rust launcher
has NO correctness guards). A production version must add the
Phase-2 guards (multi-spec cache, knob checks, used_global_vals
snapshots, etc.) back, costing ~2 us/call. The Phase 2 Python
launcher in the prior commit already has those guards, so its 13.07
us/call number is the realistic apples-to-apples comparison.

Why no separate "Rust tensor_key" commit
----------------------------------------

The earlier C-stack version of this work investigated a Rust /
C ``tensor_key`` accelerator. ``Kernel.bind`` calls ``_tensor_key``
once per tensor arg; a Rust implementation that mirrors the
static-shapes branch returns the same 4-tuple ``(dtype, sizes,
strides, static_indices)`` ~100 ns faster than the Python path in
microbench. But end-to-end on a typical kernel (vector_add has 2
tensor args) the per-call saving was ~200 ns — within run-to-run
variance.

We dropped the dedicated commit. The ``helion._native.tensor_key = None``
slot remains in Chunk A's shim as a documented extension point for a
future accelerator that can do meaningfully better than the ~100
ns/call current Python C-API approach achieves (e.g. via PyTorch's
stable C ABI to skip Python attribute machinery for
``tensor.size()`` / ``tensor.stride()`` reads). The Python C-API
path's gain is bounded by the cost of those attribute lookups,
which it doesn't actually avoid — so any future Rust ``tensor_key``
accelerator should target the stable ABI or unstable ``at::Tensor``
to get a meaningful per-call saving.

stack-info: PR: #2609, branch: yushangdi/stack/12