[fast-launcher] Concrete-tensor fast path skips _hashable_dims (Python-only) #2611

Draft: yushangdi wants to merge 1 commit into yushangdi/stack/9 from yushangdi/stack/13

yushangdi commented May 27, 2026

Stacked PRs:


[fast-launcher] Concrete-tensor fast path skips _hashable_dims (Python-only)

Splits the per-tensor specialization-key extractor in two:

  • _concrete_tensor_key (new) handles torch.Tensor and
    torch.nn.Parameter. Both always have concrete-int sizes/strides,
    so the cache-key components are just obj.size() and
    obj.stride() directly. torch.Size is a tuple subclass and
    stride() returns a plain tuple; both hash and compare equal to a
    plain tuple of the same ints, so the fast path produces an
    identical hashable key without rebuilding it.
  • _tensor_key (existing, unchanged) stays as the
    FakeTensor-dispatch entry. FakeTensors have SymInt sizes during
    tracing, and _hashable_dims is the normalization that maps each
    SymInt to (id(shape_env), expr) so two SymInts from different
    shape envs don't accidentally collide.
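
A minimal sketch of the two extractors, using stand-in classes rather than real torch objects. Size, SymDim, ConcreteTensor, and all function bodies here are illustrative assumptions, not the code in this PR; only the names _concrete_tensor_key, _tensor_key, and _hashable_dims come from the description above.

```python
from dataclasses import dataclass


class Size(tuple):
    """Stand-in for torch.Size: a tuple subclass that keeps tuple's __hash__."""


@dataclass(frozen=True)
class SymDim:
    """Stand-in for a SymInt: a symbolic expression owned by a shape env."""
    shape_env: object
    expr: str


class ConcreteTensor:
    """Stand-in for an eager tensor with concrete int sizes and strides."""

    def __init__(self, dtype, size, stride):
        self.dtype = dtype
        self._size = Size(size)
        self._stride = tuple(stride)

    def size(self):
        return self._size

    def stride(self):
        return self._stride


def _hashable_dims(dims):
    # Hypothetical normalization: ints pass through, symbolic dims become
    # (id(shape_env), expr) so dims from different shape envs stay distinct.
    return tuple(
        (id(d.shape_env), d.expr) if isinstance(d, SymDim) else d for d in dims
    )


def _tensor_key(obj):
    # Slow path: rebuild both components through the normalization.
    return (obj.dtype, _hashable_dims(obj.size()), _hashable_dims(obj.stride()))


def _concrete_tensor_key(obj):
    # Fast path: sizes and strides are already hashable, so use them as-is.
    return (obj.dtype, obj.size(), obj.stride())


t = ConcreteTensor("float32", (4, 8), (8, 1))
fast, slow = _concrete_tensor_key(t), _tensor_key(t)
same_key = fast == slow and hash(fast) == hash(slow)

# The same expression in two different shape envs must not collide.
env_a, env_b = object(), object()
dims_a = _hashable_dims((SymDim(env_a, "s0"), 8))
dims_b = _hashable_dims((SymDim(env_b, "s0"), 8))
envs_distinct = dims_a != dims_b
```

In this sketch the fast key still carries the Size object itself, which is the property the third test bullet below checks for torch.Size.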

The split happens at the _specialization_extractors table, so the
right extractor is selected by a single dict lookup with no per-call
type check on the hot path.
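
As a rough illustration of that dispatch, the sketch below keys a table by exact type. The classes, the extractor bodies, and the assumption that the real _specialization_extractors table is keyed by type(obj) are all hypothetical stand-ins.

```python
class Tensor:
    """Stand-in for torch.Tensor."""


class Parameter(Tensor):
    """Stand-in for torch.nn.Parameter (a Tensor subclass)."""


class FakeTensor(Tensor):
    """Stand-in for FakeTensor (also a Tensor subclass)."""


def concrete_key(obj):
    return ("concrete", type(obj).__name__)


def symbolic_key(obj):
    return ("symbolic", type(obj).__name__)


# Keyed by exact type: subclasses need their own entries because the lookup
# uses type(obj), not isinstance().
extractors = {
    Tensor: concrete_key,
    Parameter: concrete_key,
    FakeTensor: symbolic_key,
}


def specialization_key(obj):
    # One dict lookup selects the extractor; no isinstance chain per call.
    return extractors[type(obj)](obj)


routes = [specialization_key(cls())[0] for cls in (Tensor, Parameter, FakeTensor)]
```

Because the choice is made once per type in the table, the concrete extractor itself never has to ask whether a dimension might be symbolic.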

Hash equality between the two extractors is verified by a test (see
test_fast_path_key_hash_matches_wrapped) so existing on-disk
LooseAutotuneCacheKey entries built before this change keep
matching.
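
The invariant depends on the tuple subclass inheriting tuple's hashing and equality. The sketch below uses an in-memory dict as a stand-in for a cache (how LooseAutotuneCacheKey is actually stored is not shown here), with hypothetical Size and TaggedSize classes: the first still finds an entry stored under a plain-tuple key, the second does not.

```python
class Size(tuple):
    """Tuple subclass that inherits tuple.__hash__ and tuple.__eq__."""


class TaggedSize(tuple):
    """Counter-example: a subclass that changes hashing breaks lookups."""

    def __hash__(self):
        # Deliberately different from the hash of the equivalent plain tuple.
        return ~tuple.__hash__(self)


# Entry stored under an old-style key built from plain tuples.
cache = {("float32", (1024,), (1,)): "tuned-config"}

hit = cache.get(("float32", Size((1024,)), (1,)))
miss = cache.get(("float32", TaggedSize((1024,)), (1,)))
```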

Benchmarks (H100, vector_add at N=4096, cycling through 8 different
x/y tensors so the cache hits but never on the same args twice)

_base_specialization_key in isolation (the dominant component of
Kernel.bind() per-call cost):

  before:   2.76 us
  after:    2.14 us
  saving:  -0.62 us (-22%)

End-to-end per-call timing at this stack position (Chunk D pool +
codegen rewrite applied, Phase 2 fast launcher not yet applied):

  before (parent commit, debc08a):  cpu_only=~16.0 us   wall+sync=~16.0 us
  after (this commit):              cpu_only= 14.87 us  wall+sync= 14.87 us
  saving:                           ~-1.1 us/call
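
For readers who want to try a similar measurement, here is a hedged sketch of a per-call timing loop that rotates through eight distinct argument sets. It is not the harness used for the numbers above: per_call_us, key_wrapped, and key_direct are hypothetical, it times plain-Python key construction only, and its results depend on the machine.

```python
import itertools
import time


def per_call_us(fn, arg_sets, iters=20_000):
    """Average microseconds per call, rotating through arg_sets."""
    cycle = itertools.cycle(arg_sets)
    batch = [next(cycle) for _ in range(iters)]  # pre-build to keep the loop lean
    start = time.perf_counter_ns()
    for args in batch:
        fn(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / iters / 1_000


def key_wrapped(size, stride):
    # Rebuilds both components, as a normalizing extractor would.
    return (tuple(size), tuple(stride))


def key_direct(size, stride):
    # Reuses the already-hashable components.
    return (size, stride)


# Eight distinct objects with equal values, mirroring the
# "8 different x/y tensors" setup: same key, never the same args twice in a row.
arg_sets = [(tuple([4096]), tuple([1])) for _ in range(8)]

wrapped_us = per_call_us(key_wrapped, arg_sets)
direct_us = per_call_us(key_direct, arg_sets)
print(f"wrapped: {wrapped_us:.3f} us  direct: {direct_us:.3f} us")
```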

Tests

test/test_tensor_key_fast_path.py covers:

  • Dispatch table routes concrete tensors to the fast path and
    FakeTensor to the old path.
  • The fast-path key hashes and compares equal to the old wrapped key
    (cache-compat invariant).
  • The fast-path key contains a torch.Size (not a wrapped plain
    tuple) when the kernel uses static_shapes=True.
  • bind() still returns the same BoundKernel for different
    tensor objects with matching dtype/shape/stride, and distinguishes
    different dtypes and shapes.
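
The last bullet can be pictured with a toy cache. Tensor and Kernel below are stand-ins invented for illustration, and the key layout (dtype, size, stride) is an assumption based on the bullet's wording rather than the real Kernel.bind() implementation.

```python
class Tensor:
    """Stand-in tensor carrying only the fields that feed the key."""

    def __init__(self, dtype, size, stride):
        self.dtype, self.size, self.stride = dtype, tuple(size), tuple(stride)


class Kernel:
    """Toy kernel that caches one bound object per specialization key."""

    def __init__(self):
        self._bound = {}

    def bind(self, *tensors):
        key = tuple((t.dtype, t.size, t.stride) for t in tensors)
        try:
            return self._bound[key]
        except KeyError:
            bound = self._bound[key] = object()  # stand-in for a BoundKernel
            return bound


kernel = Kernel()
x1 = Tensor("float32", (4096,), (1,))
x2 = Tensor("float32", (4096,), (1,))  # different object, same key
x_half = Tensor("float16", (4096,), (1,))
x_big = Tensor("float32", (8192,), (1,))

same = kernel.bind(x1) is kernel.bind(x2)
dtype_differs = kernel.bind(x1) is not kernel.bind(x_half)
shape_differs = kernel.bind(x1) is not kernel.bind(x_big)
```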

stack-info: PR: #2611, branch: yushangdi/stack/13
