Thanks for considering a contribution. Lucebox is a hub of self-contained optimization projects. Each one lives with its own README, benchmarks, and code, and the hub stays thin on purpose.
- Kernel improvements that preserve correctness and improve tok/s, tok/J, or memory footprint on the target hardware. Benchmark deltas required.
- Speculative decoding algorithms that improve on our current state-of-the-art performance.
- Benchmark harness work under `benchmarks/`, once that directory starts shipping code.
- Doc fixes and writeups — always welcome.
- No closed-source dependencies. Everything here has to be reproducible from public sources.
Hardware: an NVIDIA GPU with 24 GB VRAM and compute capability sm_86 or newer (RTX 3090, A10, A40, 4090), or a Jetson AGX Thor (sm_110). Thor requires CUDA 13+.
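If you're unsure whether your card clears the floor, a minimal sketch — `meets_sm86` is a hypothetical helper, and the `compute_cap` query needs a reasonably recent NVIDIA driver:

```shell
# Hypothetical helper: check a compute-capability string (as reported by
# `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`) against sm_86.
meets_sm86() {
  major=${1%%.*}
  minor=${1#*.}
  [ "$major" -gt 8 ] || { [ "$major" -eq 8 ] && [ "$minor" -ge 6 ]; }
}

meets_sm86 "8.6" && echo "sm_86: supported"
# On a real box:
#   meets_sm86 "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)"
```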
On Ubuntu 22.04 or 24.04, one script installs all system dependencies — build-essential, cmake, git, git-lfs, and the CUDA Toolkit from NVIDIA's repo:

```shell
sudo dflash/scripts/setup_system.sh
```

The script is idempotent and configures nvcc on PATH for both bash and zsh. For other distros, see the CUDA installation guide.
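As a hedged sketch of what "configures nvcc on PATH" amounts to — the `/usr/local/cuda` prefix and the `ensure_line` helper are illustrative assumptions, not the script's actual contents:

```shell
# Hypothetical sketch: append the PATH line to a shell rc file exactly once.
# The grep guard is what makes this kind of script safe to re-run.
CUDA_LINE='export PATH=/usr/local/cuda/bin:$PATH'
ensure_line() {
  grep -qxF "$CUDA_LINE" "$1" 2>/dev/null || printf '%s\n' "$CUDA_LINE" >> "$1"
}
# ensure_line "$HOME/.bashrc"
# ensure_line "$HOME/.zshrc"
```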
| Tool | Min version |
|---|---|
| GCC / G++ | 11 |
| CMake | 3.18 |
| Git | 2.x |
| git-lfs | any |
| CUDA Toolkit | 12.0+ |
| huggingface-cli | any |
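Checking the minimums by hand is straightforward with version-aware sorting; `ver_ge` is a hypothetical helper, not something the repo ships:

```shell
# Hypothetical helper: true when version $1 is at least version $2,
# using GNU sort's version ordering (sort -V).
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

ver_ge "3.27.4" "3.18" && echo "CMake 3.27.4 meets the 3.18 minimum"
# e.g. ver_ge "$(cmake --version | awk 'NR==1{print $3}')" "3.18"
```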
After setup:
```shell
git submodule update --init --recursive
cmake -B dflash/build -S dflash -DCMAKE_BUILD_TYPE=Release
cmake --build dflash/build --target test_dflash -j
```

If CMake was previously run without CUDA, wipe the build directory first (`rm -rf dflash/build`) to avoid a stale compiler cache.
- Benchmark before and after on the same hardware, at the same power limit, with the same warmup. Numbers without methodology don't get merged.
- Run the existing correctness check (`bench_pp_tg.py` for megakernel) and confirm your change doesn't regress output parity.
- One concern per PR. Kernel/algorithm changes, docs, and build config go in separate commits or separate PRs.
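The before/after discipline can be sketched as follows; the 300 W limit, `bench.sh`, and the `pct_delta` helper are all illustrative assumptions, not project tooling:

```shell
# Hypothetical helper: signed percent change from BEFORE to AFTER.
pct_delta() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.1f%%\n", (a - b) / b * 100 }'
}

# Same power limit for both runs (300 W is an example, not a recommendation):
#   sudo nvidia-smi -pl 300
# Same warmup for both runs, then compare throughput:
#   ./bench.sh --warmup 32   # once before, once after your change
pct_delta 118.4 124.9   # tok/s before vs. after
```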
Conventional commits:
```
feat(megakernel): fused QKV+RoPE path cuts per-token launch by 1 kernel
fix(dflash): clamp int8 DeltaNet state update before dequant
docs(hub): add DVFS methodology link
```
Allowed types: feat, fix, refactor, perf, docs, test, bench, chore, ci.
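The fixed list of types makes the format easy to lint locally. A minimal sketch of a `commit-msg` check — the hook itself is hypothetical; the repo may not ship one:

```shell
# Hypothetical lint: allowed type, optional (scope), colon, space, subject.
check_commit_msg() {
  printf '%s' "$1" |
    grep -qE '^(feat|fix|refactor|perf|docs|test|bench|chore|ci)(\([a-z0-9_-]+\))?: .+'
}

check_commit_msg "feat(megakernel): fused QKV+RoPE path" && echo "ok"
# As a hook: save the check as .git/hooks/commit-msg and run it on "$(cat "$1")".
```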
If you want to contribute benchmarks but don't have the hardware:
- We can run your benchmarks on our RTX 3090 (24 GB) or Ryzen 395 AI Max (128 GB). Open an issue alongside the PR.
- Apple Silicon numbers need an M-series machine running `powermetrics`, not a remote box.
- Discord — fastest feedback
- Issues — for bugs and proposals
- Mention `@Luce-Org/maintainers` on a PR when it's ready for review
By contributing you agree your work is licensed under the Apache License, Version 2.0, same as the rest of the repo (see LICENSE). Historical contributions before the relicense remain available under their original MIT terms in the git history.