[fast-launcher] Concrete-tensor fast path skips _hashable_dims (Python-only) by yushangdi · Pull Request #2611 · pytorch/helion

yushangdi · 2026-05-27T18:38:44Z

[fast-launcher] Concrete-tensor fast path skips _hashable_dims (Python-only)

Splits the per-tensor specialization-key extractor in two:

- _concrete_tensor_key (new) handles torch.Tensor and
  torch.nn.Parameter. Both always have concrete-int sizes/strides,
  so the cache-key components are just obj.size() and
  obj.stride() used directly. torch.Size is a tuple subclass and
  obj.stride() returns a plain tuple, and both hash and compare equal
  to a plain tuple of the same ints, so the fast path produces an
  equal hashable key without rebuilding it.
- _tensor_key (existing, unchanged) stays as the
  FakeTensor-dispatch entry. FakeTensors have SymInt sizes during
  tracing, and _hashable_dims is the normalization that maps each
  SymInt to (id(shape_env), expr) so two SymInts from different
  shape envs don't accidentally collide.
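
As a rough illustration of the split described above, the following pure-Python sketch mimics the two extractors without importing torch. Every name in it (SizeLike, SymDim, hashable_dims, concrete_key, wrapped_key) is a hypothetical stand-in, not Helion's actual code; it only assumes that the size object is a tuple subclass and that symbolic dims carry a shape environment and an expression.

```python
class SizeLike(tuple):
    """Hypothetical stand-in for torch.Size: a tuple subclass with tuple's hash."""


class SymDim:
    """Hypothetical stand-in for a SymInt tied to a shape environment."""

    def __init__(self, shape_env, expr):
        self.shape_env = shape_env
        self.expr = expr


def hashable_dims(dims):
    """Slow path: ints pass through, symbolic dims become (id(env), expr)."""
    return tuple(
        (id(d.shape_env), d.expr) if isinstance(d, SymDim) else d
        for d in dims
    )


def wrapped_key(size, stride):
    """Old-style key: always rebuilds both tuples through hashable_dims."""
    return (hashable_dims(size), hashable_dims(stride))


def concrete_key(size, stride):
    """Fast-path key: dims are plain ints, so the objects are used as-is."""
    return (size, stride)


size, stride = SizeLike((4096,)), (1,)
fast = concrete_key(size, stride)
slow = wrapped_key(size, stride)

# Two symbolic dims with the same expression but different shape envs.
env_a, env_b = object(), object()
sym_a = hashable_dims([SymDim(env_a, "s0")])
sym_b = hashable_dims([SymDim(env_b, "s0")])
```

In this sketch `fast` keeps the original size object while `slow` holds a rebuilt plain tuple, yet the two keys compare and hash equal; `sym_a` and `sym_b` differ because the environment identity is part of each normalized entry.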

The split happens at the _specialization_extractors table, so the
right extractor is selected by a single dict lookup with no per-call
type check on the hot path.
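
A minimal sketch of type-keyed dispatch, assuming the table maps exact types to extractor functions. The classes and the EXTRACTORS dict below are hypothetical stand-ins and are not taken from the Helion source.

```python
class Tensor:
    """Hypothetical stand-in for torch.Tensor."""


class Parameter(Tensor):
    """Hypothetical stand-in for torch.nn.Parameter."""


class FakeTensor(Tensor):
    """Hypothetical stand-in for FakeTensor."""


def concrete_tensor_key(obj):
    return ("fast", type(obj).__name__)


def tensor_key(obj):
    return ("symbolic", type(obj).__name__)


# Keyed on the exact type, so each subclass needs its own entry.
EXTRACTORS = {
    Tensor: concrete_tensor_key,
    Parameter: concrete_tensor_key,
    FakeTensor: tensor_key,
}


def specialization_key(obj):
    # One dict lookup picks the extractor; no isinstance chain per call.
    return EXTRACTORS[type(obj)](obj)
```

Because the lookup uses `type(obj)` rather than `isinstance`, a subclass is never routed through its parent's entry by accident, which is why this sketch registers Parameter and FakeTensor explicitly.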

Hash equality between the two extractors is verified by a test (see
test_fast_path_key_hash_matches_wrapped) so existing on-disk
LooseAutotuneCacheKey entries built before this change keep
matching.
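
The compatibility property can be illustrated with an in-memory dict standing in for the cache. This is only a sketch: SizeLike is a hypothetical substitute for torch.Size, and how the real on-disk cache serializes its keys is not shown in this description.

```python
class SizeLike(tuple):
    """Hypothetical stand-in for torch.Size (a tuple subclass)."""


# Key shaped like the old extractor's output: dims rebuilt as plain tuples.
old_key = ("float32", (4096,), (1,))
cache = {old_key: "cached-config"}

# Key shaped like the fast path's output: the size object is used directly.
new_key = ("float32", SizeLike((4096,)), (1,))

# Equal hash and equal comparison mean the old entry is still found.
hit = cache.get(new_key)
```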

Benchmarks (H100, vector_add at N=4096, cycling through 8 different
x/y tensors so the cache hits but never on the same args twice)

_base_specialization_key in isolation (the dominant component of
Kernel.bind() per-call cost):

before: 2.76 us
after:  2.14 us
saving: 0.62 us (22%)

End-to-end per-call timing at this stack position (Chunk D pool +
codegen rewrite applied, Phase 2 fast launcher not yet applied):

before (parent commit, debc08a): cpu_only=~16.0 us   wall+sync=~16.0 us
after (this commit):             cpu_only= 14.87 us  wall+sync= 14.87 us
saving: ~1.1 us/call
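
The cycling methodology used for these numbers can be sketched with a small timing harness. This is not the script that produced the figures above; time_per_call, key_fn, and the argument sets are hypothetical, and the measured time includes the overhead of advancing the iterator.

```python
import itertools
import time


def time_per_call(fn, arg_sets, iters=10_000):
    """Return mean microseconds per call while rotating through arg_sets."""
    cycle = itertools.cycle(arg_sets)
    # Warm up once per argument set so later calls hit any cache.
    for _ in range(len(arg_sets)):
        fn(*next(cycle))
    start = time.perf_counter()
    for _ in range(iters):
        # Consecutive calls never see the same argument set.
        fn(*next(cycle))
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6


def key_fn(x, y):
    """Hypothetical stand-in for the function under test."""
    return (x, y)


# Eight distinct argument pairs, mirroring the eight x/y tensors.
arg_sets = [((n,), (1,)) for n in range(1, 9)]
us_per_call = time_per_call(key_fn, arg_sets)
```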

Tests

test/test_tensor_key_fast_path.py covers:

- Dispatch table routes concrete tensors to the fast path and
  FakeTensor to the old path.
- The fast-path key hashes and compares equal to the old wrapped key
  (cache-compat invariant).
- The fast-path key contains a torch.Size (not a wrapped plain
  tuple) when the kernel uses static_shapes=True.
- bind() still returns the same BoundKernel for different
  tensor objects with matching dtype/shape/stride, and distinguishes
  different dtypes and shapes.


yushangdi force-pushed the yushangdi/stack/13 branch from a147793 to 0068448 May 27, 2026 18:38

meta-cla Bot added the CLA Signed label May 27, 2026

yushangdi changed the base branch from yushangdi/stack/9 to main May 27, 2026 19:31

yushangdi changed the base branch from main to yushangdi/stack/9 May 27, 2026 19:32

yushangdi wants to merge 1 commit into yushangdi/stack/9 from yushangdi/stack/13
