feat: add little_kernel #159
Merged
KnowingNothing merged 7 commits into ByteDance-Seed:main on Feb 13, 2026
Conversation
* fix
* clean & fix gemm
* code format
* feat: qkv proj + a2a overlap support

See merge request: !314

* feat
* feat: ep moe e2e

See merge request: !315

* ci: use cached llvm lib
* refactor: overflow_factor
* fix: add __del__ to ctx class
* refactor: use triton_dist wrapper
* chore: code format and fix Aime's comments
* Revert "Revert "feat: amd upgrade docker rocm7 & amd ag gemm autotune fix & intra-kernel profile & fix oom""
* fix: IMA bug when token_len=1
* v7 kernel: using BS_M and binary workload search
* Revert "feat: amd upgrade docker rocm7 & amd ag gemm autotune fix & intra-kernel profile & fix oom"
* fix: optimize D2D mem copy
* fix: symm buffer size when reused as combine op
* CI: suite pattern for multinode kernels
* fix send_token_num and recv_token_num error
* code format and add ci test
* add intra-node all2all_vdev_2d v2 kernel
* chore: code format
* CI: add multinode tests
* chore: import error
* feat: add a2a_v_dev_2d for internode
* ci: add unit test
* feat: add intra-node all_to_all_vdev_2d

See merge request: !308

* fix: use stub for cuda and cudart
* fix: build dist include refactor dirs

See merge request: !317

* fix: tutorial needs fence before notify
* fix: ci
* fix: disable custom_llvm
* fix: no import cuda is not on nvgpu
* fix: add comments for cuda bindings
* fix: address comments
* fix: address comments
* fix: resolve comments of aime
* fix: format
* fix: format
* fix: format
* fix: license
* feat: add license
* fix: test location
* fix: format

See merge request: !325
LittleKernel
LittleKernel is a Python DSL for writing CUDA kernels with full PTX-level control. It generates CUDA source code from Python functions and compiles them at runtime using nvcc.

Overview
LittleKernel bridges the gap between high-level frameworks (PyTorch, Triton) and hand-written CUDA/PTX. You write kernels as decorated Python functions using explicit types and intrinsics, and LittleKernel:

* … folding, inlining),
* compiles with nvcc and wraps the result in a callable that interoperates with PyTorch tensors.
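The pipeline above (decorated Python function in, CUDA source out, nvcc at runtime) can be sketched roughly as follows. This is a hypothetical illustration only: the `@kernel` decorator name, the `"f32"` type strings, and the string-template body are assumptions for the sketch, not LittleKernel's actual API.

```python
import inspect

# Toy Python-to-CUDA type map (assumed; not LittleKernel's real type system).
_CUDA_TYPES = {"f32": "float*", "i32": "int*"}

def kernel(fn):
    """Attach generated CUDA source to `fn`. A real system would then compile
    it at runtime, e.g. `nvcc --shared -Xcompiler -fPIC kernel.cu -o kernel.so`,
    and wrap the loaded symbol in a Python callable."""
    sig = inspect.signature(fn)
    params = ", ".join(
        f"{_CUDA_TYPES[p.annotation]} {name}"
        for name, p in sig.parameters.items()
    )
    # The Python body is only a template here: calling it returns the CUDA
    # statements as a string (the arguments are unused at trace time).
    body = fn(*(None for _ in sig.parameters))
    fn.cuda_source = (
        f'extern "C" __global__ void {fn.__name__}({params}) {{\n{body}\n}}'
    )
    return fn

@kernel
def axpy(x: "f32", y: "f32"):
    # One element per thread; this string is emitted verbatim as CUDA C.
    return ("  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
            "  y[i] += x[i];")

print(axpy.cuda_source)
```

The generated source is plain CUDA C with `extern "C"` linkage, so the compiled shared library can be loaded and called without C++ name mangling getting in the way.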
Architecture
Supported Architectures