feat: add little_kernel #159

Merged
KnowingNothing merged 7 commits into ByteDance-Seed:main from KnowingNothing:zsz/little-kernel-pr
Feb 13, 2026

Conversation

@KnowingNothing
Collaborator

LittleKernel

LittleKernel is a Python DSL for writing CUDA kernels with full PTX-level
control. It generates CUDA source code from Python functions and compiles
them at runtime using nvcc.

Overview

LittleKernel bridges the gap between high-level frameworks (PyTorch, Triton)
and hand-written CUDA/PTX. You write kernels as decorated Python functions
with explicit types and intrinsics, and LittleKernel:

  1. Parses the Python AST,
  2. Runs compiler passes (type inference, memory allocation, constant
    folding, inlining),
  3. Generates CUDA C++ source with inline PTX assembly,
  4. Compiles via nvcc and wraps the result in a callable that
    interoperates with PyTorch tensors.
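The parse-then-codegen idea behind steps 1 and 3 can be sketched, very loosely, in plain Python using the standard `ast` module. This is only an illustration of the concept, not LittleKernel's actual implementation or API; the function and kernel names here are made up.

```python
import ast

def emit_cuda(py_src: str) -> str:
    """Toy version of steps 1 and 3: parse Python source, emit CUDA C++.

    A real compiler such as LittleKernel would also run typed-IR passes
    (step 2) and hand the result to nvcc (step 4); both are elided here.
    """
    fn = ast.parse(py_src).body[0]            # step 1: get the FunctionDef node
    assert isinstance(fn, ast.FunctionDef)
    params = ", ".join(f"float* {a.arg}" for a in fn.args.args)
    return (                                  # step 3: emit a CUDA skeleton
        f"__global__ void {fn.name}({params}) {{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    // ... lowered statements from the Python body go here ...\n"
        "}\n"
    )

kernel_src = emit_cuda("def axpy(x, y):\n    pass\n")
print(kernel_src)
```

In the real pipeline the function body is lowered statement by statement through a typed IR rather than pasted as text, which is what makes passes like constant folding and inlining possible.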

Architecture

python/little_kernel/
├── language/        # DSL types, decorators, intrinsics
│   └── intrin/      # Per-feature intrinsic modules
│       ├── wgmma.py     # SM90 Tensor Core (WGMMA)
│       ├── tma.py       # Tensor Memory Accelerator
│       ├── barrier.py   # MBarrier & cluster sync
│       └── ...
├── core/            # IR, passes, type system
│   └── passes/      # Compiler pipeline
├── codegen/         # CUDA code generation
├── runtime/         # nvcc compilation, TMA descriptors, kernel launch
├── atom/            # High-level building blocks (MMA, TMA, barriers)
└── benchmark/       # GPU micro-benchmarks and GEMM kernels
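The `runtime/` layer's nvcc step can be pictured with a small sketch: build an nvcc command line for a shared object and invoke it via `subprocess`. The flags, function names, and default `-arch` value below are assumptions for illustration, not LittleKernel's real runtime code.

```python
import subprocess
from pathlib import Path

def nvcc_command(src: Path, out: Path, arch: str = "sm_90a") -> list[str]:
    """Assemble an nvcc invocation for a runtime-compiled kernel module.

    Hypothetical sketch of what a runtime layer might run; the exact
    flags LittleKernel uses are an assumption here.
    """
    return [
        "nvcc",
        "-shared",              # emit a loadable shared object
        "-Xcompiler", "-fPIC",  # position-independent host code
        f"-arch={arch}",        # e.g. sm_90a for Hopper WGMMA/TMA features
        "-o", str(out),
        str(src),
    ]

def compile_kernel(src: Path, out: Path) -> Path:
    """Compile a generated .cu file into a shared object (requires nvcc)."""
    subprocess.run(nvcc_command(src, out), check=True)
    return out
```

The resulting shared object would then be loaded and wrapped in a Python callable that accepts PyTorch tensors, as the overview describes.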

Supported Architectures

  • SM90 (Hopper) -- WGMMA, TMA, MBarrier, Cluster, async pipelines

XG-zheng and others added 7 commits February 12, 2026 19:52
* fix
* clean & fix gemm
* code format
* feat: qkv proj + a2a overlap support

See merge request: !314
* feat
* feat: ep moe e2e

See merge request: !315
* ci: use cached llvm lib
* refactor: overflow_factor
* fix: add __del__ to ctx class
* refactor: use triton_dist wrapper
* chore: code format and fix Aime's comments
* Revert "Revert "feat: amd upgrade docker rocm7 & amd ag gemm autotune fix & intra-kernel profile & fix oom""
* fix: IMA bug when token_len=1
* v7 kernel: using BS_M and binary workload search
* Revert "feat: amd upgrade docker rocm7 & amd ag gemm autotune fix & intra-kernel profile & fix oom"
* fix: optimize D2D mem copy
* fix: symm buffer size when reused as combine op
* CI: suite pattern for multinode kernels
* fix send_token_num and recv_token_num error
* code format and add ci test
* add intra-node all2all_vdev_2d v2 kernel
* chore: code format
* CI: add multinode tests
* chore: import error
* feat: add a2a_v_dev_2d for internode
* ci: add unit test
* feat: add intra-node all_to_all_vdev_2d

See merge request: !308
* fix: use stub for cuda and cudart
* fix: build dist include refactor dirs

See merge request: !317
* fix: tutorial needs fence before notify
* fix: ci
* fix: disable custom_llvm
* fix: no import cuda is not on nvgpu
* fix: add comments for cuda bindings
* fix: address comments
* fix: address comments
* fix: resolve comments of aime
* fix: format
* fix: format
* fix: format
* fix: license
* feat: add license
* fix: test location
* fix: format

See merge request: !325
@KnowingNothing KnowingNothing merged commit 01555f7 into ByteDance-Seed:main Feb 13, 2026
5 checks passed
