SmallThinker (SmallThinker-21BA3B-Instruct and SmallThinker-4BA0.6B-Instruct) is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment, co-developed by the IPADS and School of AI at Shanghai Jiao Tong University and Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.
This inference framework is specifically optimized for sparse model inference: it leverages the router's pre-selection mechanism to achieve faster speeds and to keep inference efficient even in memory-constrained scenarios.
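The pre-selection idea: in an MoE layer, the router scores the experts before any expert computation happens, so only the few chosen experts ever need to be loaded into memory. Below is a minimal NumPy sketch of this (illustrative only, not the framework's code; the dictionary stands in for expert weights kept on flash storage):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2
router_w = rng.standard_normal((d, n_experts))
# Hypothetical store standing in for expert weights that live on disk.
expert_store = {e: (rng.standard_normal((d, 4 * d)),
                    rng.standard_normal((4 * d, d))) for e in range(n_experts)}

def moe_layer(x):
    logits = x @ router_w                 # router runs first...
    chosen = np.argsort(logits)[-top_k:]  # ...so the needed experts are known early
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                  # softmax over the selected experts
    out = np.zeros_like(x)
    for g, e in zip(gates, chosen):
        # Only the pre-selected experts are fetched; the rest are never touched.
        w_up, w_down = expert_store[e]
        out += g * (np.maximum(x @ w_up, 0.0) @ w_down)
    return out

print(moe_layer(rng.standard_normal(d)).shape)  # (64,)
```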
Demo video: demo.mp4
Decode speed (tokens/s) of the 21B model and baselines:

| Model | Memory (GiB) | i9-14900 | 1+13 8gen4 | rk3588 (16 GB) | Raspberry Pi 5 |
|---|---|---|---|---|---|
| SmallThinker 21B (sparse) | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
| SmallThinker 21B (sparse + limited memory) | 8 (limit) | 20.30 | 15.50 | 8.56 | - |
| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
| Qwen3 30B A3B (limited memory) | 8 (limit) | 10.11 | 0.18 | 6.32 | - |
| Gemma 3n E2B | 1 (theoretical) | 36.88 | 27.06 | 12.50 | 6.66 |
| Gemma 3n E4B | 2 (theoretical) | 21.93 | 16.58 | 7.37 | 4.01 |

Decode speed (tokens/s) of the 4B model and baselines:

| Model | Memory (GiB) | i9-14900 | 1+13 8gen4 | rk3588 (16 GB) | rk3576 | Raspberry Pi 5 | RDK X5 | rk3566 |
|---|---|---|---|---|---|---|---|---|
| SmallThinker 4B (sparse) | 2.24 | 108.17 | 78.99 | 39.76 | 15.10 | 28.77 | 7.23 | 6.33 |
| SmallThinker 4B (sparse + limited memory) | 1 (limit) | 29.99 | 20.91 | 15.04 | 2.60 | 0.75 | 0.67 | 0.74 |
| Qwen3 0.6B | 0.6 | 148.56 | 94.91 | 45.93 | 15.29 | 27.44 | 13.32 | 9.76 |
| Qwen3 1.7B | 1.3 | 62.24 | 41.00 | 20.29 | 6.09 | 11.08 | 6.35 | 4.15 |
| Qwen3 1.7B (limited memory) | 1 (limit) | 2.66 | 1.09 | 1.00 | 0.47 | - | - | 0.11 |
| Gemma 3n E2B | 1 (theoretical) | 36.88 | 27.06 | 12.50 | 3.80 | 6.66 | 3.46 | 2.45 |
Note:
- *sparse*: leverages the sparsity induced by the ReLU activation function to skip part of the UP/DOWN computation of each expert, based on the GATE output, and uses a predictor to compute the lm_head sparsely (see the sketch below).
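To make the note concrete, here is a minimal NumPy sketch of the GATE-driven skip inside one expert (illustrative only; the real kernels operate on quantized weight blocks, and the predictor-based lm_head path follows the same principle):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 64, 256
w_gate = rng.standard_normal((d, d_ff))
w_up   = rng.standard_normal((d, d_ff))
w_down = rng.standard_normal((d_ff, d))

def expert_ffn_sparse(x):
    gate = np.maximum(x @ w_gate, 0.0)   # ReLU: many entries are exactly zero
    active = np.nonzero(gate)[0]         # neurons that survived the ReLU
    # UP and DOWN are evaluated only for the active neurons; the inactive
    # columns of w_up and rows of w_down are skipped entirely.
    h = gate[active] * (x @ w_up[:, active])
    return h @ w_down[active, :]

x = rng.standard_normal(d)
dense = (np.maximum(x @ w_gate, 0.0) * (x @ w_up)) @ w_down
assert np.allclose(expert_ffn_sparse(x), dense)   # same result, less work
```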
To build from source:

- Initialize the submodules:

  ```bash
  git submodule update --init --recursive
  ```

- Install clang-21 and mold:

  ```bash
  sudo apt install clang-21 mold
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Enter the `smallthinker` directory before compiling:

  ```bash
  cd smallthinker
  ```

NOTE: Compilation, model conversion, and other related operations must be performed in the `smallthinker` directory.
Convert and quantize the model:

```bash
python3 convert_hf_to_gguf.py /path/to/safetensors_model --outtype f16 --outfile /path/to/gguf_fp16 --transpose-down all
./build/bin/llama-quantize --pure /path/to/gguf_fp16 /path/to/gguf_q4_0 Q4_0 8
```

Note: lm_head sparsity is not included here. If you need it, merge `model_lm_head.pt` into the safetensors file before running the commands above, or directly download the GGUF file we provide.
Configure and build `llama-cli`:

```bash
cmake -S . -B build \
-DCMAKE_C_COMPILER=clang-21 \
-DCMAKE_CXX_COMPILER=clang++-21 \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DGGML_OPENMP=OFF \
-DLLAMA_CURL=OFF \
-DBUILD_SHARED_LIBS=OFF \
-DAZ_ENABLE_PERFETTO=OFF \
-DPOWERINFER_NO_FFN_REPACK=ON \
-DPOWERINFER_WITH_TRACING=OFF \
-DGGML_CPU_AARCH64=OFF
cmake --build build --config RelWithDebInfo --target llama-cli -j32
```

For Android builds, libaio must be compiled manually and installed into the NDK:
```bash
cd powerinfer/third_part/libaio
export TOOLCHAIN=$NDK/toolchains/llvm/prebuilt/linux-x86_64
export TARGET=aarch64-linux-android
export HOST=$TARGET
export API=34
export AR=$TOOLCHAIN/bin/llvm-ar
export CC=$TOOLCHAIN/bin/$TARGET$API-clang
export AS=$CC
export CXX=$TOOLCHAIN/bin/$TARGET$API-clang++
export LD=$TOOLCHAIN/bin/ld
export RANLIB=$TOOLCHAIN/bin/llvm-ranlib
export STRIP=$TOOLCHAIN/bin/llvm-strip
make prefix=$NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr install
```

liburing is handled the same way:
```bash
cd powerinfer/third_part/liburing
export TOOLCHAIN=$NDK/toolchains/llvm/prebuilt/linux-x86_64
export TARGET=aarch64-linux-android
export HOST=$TARGET
export API=34
export AR=$TOOLCHAIN/bin/llvm-ar
export CC=$TOOLCHAIN/bin/$TARGET$API-clang
export AS=$CC
export CXX=$TOOLCHAIN/bin/$TARGET$API-clang++
export LD=$TOOLCHAIN/bin/ld
export RANLIB=$TOOLCHAIN/bin/llvm-ranlib
export STRIP=$TOOLCHAIN/bin/llvm-strip
./configure --prefix=$NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr
make install
```

Then configure and build for Android:

```bash
cmake -S . -B build_a \
-DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-34 \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_OPENMP=OFF \
-DLLAMA_CURL=OFF \
-DAZ_ENABLE_PERFETTO=ON \
-DPOWERINFER_NO_FFN_REPACK=ON \
-DDISABLE_ARM_FEATURE_CHECK=ON \
-DCMAKE_C_FLAGS="-march=armv8.6-a -D__USE_GNU -Ofast -flto" \
-DCMAKE_CXX_FLAGS="-march=armv8.6-a -D__USE_GNU -Ofast -flto"
cmake --build build_a --config RelWithDebInfo --target llama-cli -j32
```

For other platforms (such as rk3588), refer to toolchains/cross_compile.md for the cross-compilation commands.
Run the model:

```bash
./llama-cli -m /path/to/gguf_q4_0 -no-cnv --temp 0.6 --top-k 20 --top-p 0.95 --samplers "temperature;top_k;top_p" -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nCalculate the integral of f(x) = sin(x) from 0 to 3pi/4.<|im_end|>\n<|im_start|>assistant" -t 4 -n 256
```

To run the memory-efficient version:

1. Generate the expert bundle:
```bash
GENERATE_EXPERT_BUNDLE=/path/to/bundle ./llama-cli -m /path/to/gguf_q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --samplers "penalties;temperature;top_k;top_p" -t 4 -n 128 -no-cnv
```

2. Remove the MoE weights from the GGUF file (needed when running in Termux):
```bash
python get_no_moe_weights_ffn.py /path/to/gguf_q4_0 /path/to/no_moe_gguf_q4_0
```

3. Set the environment variable MAX_N_CACHED according to the desired memory limit. Recommended configurations for SmallThinker (see the cache sketch below):
   - 21B model under an 8 GiB limit: `MAX_N_CACHED=6144`
   - 4B model under a 1 GiB limit: `MAX_N_CACHED=768`
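Conceptually, MAX_N_CACHED caps how many expert matrices stay resident in RAM; anything beyond the cap is re-read from the expert bundle on demand. Here is a hypothetical Python sketch of such a cache (class and loader names are illustrative, not the framework's API, and the actual eviction policy may differ):

```python
from collections import OrderedDict

class ExpertMatrixCache:
    """Toy LRU cache holding at most max_n_cached expert matrices."""
    def __init__(self, max_n_cached, load_from_bundle):
        self.cap = max_n_cached
        self.load = load_from_bundle          # e.g. a read from EXPERT_BUNDLE_PATH
        self.cache = OrderedDict()

    def get(self, matrix_id):
        if matrix_id in self.cache:
            self.cache.move_to_end(matrix_id) # hit: mark as most recently used
            return self.cache[matrix_id]
        if len(self.cache) >= self.cap:
            self.cache.popitem(last=False)    # full: evict least recently used
        matrix = self.load(matrix_id)         # disk I/O happens only on a miss
        self.cache[matrix_id] = matrix
        return matrix

# With MAX_N_CACHED=768, at most 768 expert matrices are in RAM at any time.
cache = ExpertMatrixCache(768, load_from_bundle=lambda i: f"matrix-{i}")
print(cache.get(0), cache.get(0))  # second call is served from the cache
```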
4. Run inference under the memory limit:

```bash
MAX_N_CACHED=768 EXPERT_BUNDLE_PATH=/path/to/bundle ./llama-cli -m /path/to/no_moe_gguf_q4_0 -no-cnv --temp 0.6 --top-k 20 --top-p 0.95 --samplers "temperature;top_k;top_p" -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nCalculate the integral of f(x) = sin(x) from 0 to 3pi/4.<|im_end|>\n<|im_start|>assistant" -t 4 -n 256 -ub 4
```

Notes:
- The models use a sparse lm_head, which may cause some loss in precision (see the sketch below). To disable it, change the condition at src/llama-model.cpp:7580 to false; decoding will then be slower.
- Running the memory-efficient version in Termux may require root privileges.
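To see why the sparse lm_head can cost some precision, consider this toy sketch of predictor-guided logit computation (the predictor itself is omitted, and every name here is hypothetical, not the repository's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 64, 32000
w_head = rng.standard_normal((d, vocab))

def sparse_lm_head(h, predicted_ids):
    # Logits are computed only for the token ids proposed by the predictor;
    # all other tokens are treated as impossible. A token the predictor
    # misses can never be sampled -- that is the source of the precision loss.
    logits = np.full(vocab, -np.inf)
    logits[predicted_ids] = h @ w_head[:, predicted_ids]
    return logits

logits = sparse_lm_head(rng.standard_normal(d), np.arange(2048))
```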
We would like to thank the following projects: