SmallThinker (SmallThinker-21BA3B-Instruct and SmallThinker-4BA0.6B-Instruct) is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment, co-developed by the IPADS and School of AI at Shanghai Jiao Tong University and Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.
This inference framework is specifically optimized for sparse model inference: it leverages the router's pre-selection mechanism to achieve faster speeds and to keep inference efficient even in memory-constrained scenarios.
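The pre-selection idea: in an MoE layer, the router scores the experts before any expert computation happens, so only the few chosen experts ever need to be loaded into memory. Below is a minimal NumPy sketch of this (illustrative only, not the framework's code; the dictionary stands in for expert weights kept on flash storage):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2
router_w = rng.standard_normal((d, n_experts))
# Hypothetical store standing in for expert weights that live on disk.
expert_store = {e: (rng.standard_normal((d, 4 * d)),
                    rng.standard_normal((4 * d, d))) for e in range(n_experts)}

def moe_layer(x):
    logits = x @ router_w                 # router runs first...
    chosen = np.argsort(logits)[-top_k:]  # ...so the needed experts are known early
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                  # softmax over the selected experts
    out = np.zeros_like(x)
    for g, e in zip(gates, chosen):
        # Only the pre-selected experts are fetched; the rest are never touched.
        w_up, w_down = expert_store[e]
        out += g * (np.maximum(x @ w_up, 0.0) @ w_down)
    return out

print(moe_layer(rng.standard_normal(d)).shape)  # (64,)
```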
Demo video: demo.mp4
Decode speed (tokens/s) of the 21B model and baselines:

| Model | Memory (GiB) | i9-14900 | 1+13 8gen4 | rk3588 (16 GB) | Raspberry Pi 5 |
|---|---|---|---|---|---|
| SmallThinker 21B (sparse) | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
| SmallThinker 21B (sparse + limited memory) | 8 (limit) | 20.30 | 15.50 | 8.56 | - |
| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
| Qwen3 30B A3B (limited memory) | 8 (limit) | 10.11 | 0.18 | 6.32 | - |
| Gemma 3n E2B | 1 (theoretical) | 36.88 | 27.06 | 12.50 | 6.66 |
| Gemma 3n E4B | 2 (theoretical) | 21.93 | 16.58 | 7.37 | 4.01 |

Decode speed (tokens/s) of the 4B model and baselines:

| Model | Memory (GiB) | i9-14900 | 1+13 8gen4 | rk3588 (16 GB) | rk3576 | Raspberry Pi 5 | RDK X5 | rk3566 |
|---|---|---|---|---|---|---|---|---|
| SmallThinker 4B (sparse) | 2.24 | 108.17 | 78.99 | 39.76 | 15.10 | 28.77 | 7.23 | 6.33 |
| SmallThinker 4B (sparse + limited memory) | 1 (limit) | 29.99 | 20.91 | 15.04 | 2.60 | 0.75 | 0.67 | 0.74 |
| Qwen3 0.6B | 0.6 | 148.56 | 94.91 | 45.93 | 15.29 | 27.44 | 13.32 | 9.76 |
| Qwen3 1.7B | 1.3 | 62.24 | 41.00 | 20.29 | 6.09 | 11.08 | 6.35 | 4.15 |
| Qwen3 1.7B (limited memory) | 1 (limit) | 2.66 | 1.09 | 1.00 | 0.47 | - | - | 0.11 |
| Gemma 3n E2B | 1 (theoretical) | 36.88 | 27.06 | 12.50 | 3.80 | 6.66 | 3.46 | 2.45 |
Note:
- *sparse*: leverages the sparsity induced by the ReLU activation function to skip part of the UP/DOWN computation of each expert, based on the GATE output, and uses a predictor to compute the lm_head sparsely (see the sketch below).
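To make the note concrete, here is a minimal NumPy sketch of the GATE-driven skip inside one expert (illustrative only; the real kernels operate on quantized weight blocks, and the predictor-based lm_head path follows the same principle):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 64, 256
w_gate = rng.standard_normal((d, d_ff))
w_up   = rng.standard_normal((d, d_ff))
w_down = rng.standard_normal((d_ff, d))

def expert_ffn_sparse(x):
    gate = np.maximum(x @ w_gate, 0.0)   # ReLU: many entries are exactly zero
    active = np.nonzero(gate)[0]         # neurons that survived the ReLU
    # UP and DOWN are evaluated only for the active neurons; the inactive
    # columns of w_up and rows of w_down are skipped entirely.
    h = gate[active] * (x @ w_up[:, active])
    return h @ w_down[active, :]

x = rng.standard_normal(d)
dense = (np.maximum(x @ w_gate, 0.0) * (x @ w_up)) @ w_down
assert np.allclose(expert_ffn_sparse(x), dense)   # same result, less work
```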
To build from source:

- Initialize the submodules:

  ```bash
  git submodule update --init --recursive
  ```

- Install clang-21 and mold:

  ```bash
  sudo apt install clang-21 mold
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Enter the `smallthinker` directory before compiling:

  ```bash
  cd smallthinker
  ```

NOTE: Compilation, model conversion, and other related operations must be performed in the `smallthinker` directory.
Convert and quantize the model:

```bash
python3 convert_hf_to_gguf.py /path/to/safetensors_model --outtype f16 --outfile /path/to/gguf_fp16 --transpose-down all
./build/bin/llama-quantize --pure /path/to/gguf_fp16 /path/to/gguf_q4_0 Q4_0 8
```

Note: lm_head sparsity is not included here. If you need it, merge `model_lm_head.pt` into the safetensors file before running the commands above, or directly download the GGUF file we provide.
Configure and build `llama-cli`:

```bash
cmake -S . -B build \
-DCMAKE_C_COMPILER=clang-21 \
-DCMAKE_CXX_COMPILER=clang++-21 \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DGGML_OPENMP=OFF \
-DLLAMA_CURL=OFF \
-DBUILD_SHARED_LIBS=OFF \
-DAZ_ENABLE_PERFETTO=OFF \
-DPOWERINFER_NO_FFN_REPACK=ON \
-DPOWERINFER_WITH_TRACING=OFF \
-DGGML_CPU_AARCH64=OFF
cmake --build build --config RelWithDebInfo --target llama-cli -j32
```

For Android builds, libaio must be compiled manually and installed into the NDK:
```bash
cd powerinfer/third_part/libaio
export TOOLCHAIN=$NDK/toolchains/llvm/prebuilt/linux-x86_64
export TARGET=aarch64-linux-android
export HOST=$TARGET
export API=34
export AR=$TOOLCHAIN/bin/llvm-ar
export CC=$TOOLCHAIN/bin/$TARGET$API-clang
export AS=$CC
export CXX=$TOOLCHAIN/bin/$TARGET$API-clang++
export LD=$TOOLCHAIN/bin/ld
export RANLIB=$TOOLCHAIN/bin/llvm-ranlib
export STRIP=$TOOLCHAIN/bin/llvm-strip
make prefix=$NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr install
```

liburing is handled the same way:
```bash
cd powerinfer/third_part/liburing
export TOOLCHAIN=$NDK/toolchains/llvm/prebuilt/linux-x86_64
export TARGET=aarch64-linux-android
export HOST=$TARGET
export API=34
export AR=$TOOLCHAIN/bin/llvm-ar
export CC=$TOOLCHAIN/bin/$TARGET$API-clang
export AS=$CC
export CXX=$TOOLCHAIN/bin/$TARGET$API-clang++
export LD=$TOOLCHAIN/bin/ld
export RANLIB=$TOOLCHAIN/bin/llvm-ranlib
export STRIP=$TOOLCHAIN/bin/llvm-strip
./configure --prefix=$NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr
make install
```

Then configure and build for Android:

```bash
cmake -S . -B build_a \
-DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-34 \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_OPENMP=OFF \
-DLLAMA_CURL=OFF \
-DAZ_ENABLE_PERFETTO=ON \
-DPOWERINFER_NO_FFN_REPACK=ON \
-DDISABLE_ARM_FEATURE_CHECK=ON \
-DCMAKE_C_FLAGS="-march=armv8.6-a -D__USE_GNU -Ofast -flto" \
-DCMAKE_CXX_FLAGS="-march=armv8.6-a -D__USE_GNU -Ofast -flto"
cmake --build build_a --config RelWithDebInfo --target llama-cli -j32
```

For other platforms (such as rk3588), refer to toolchains/cross_compile.md for the cross-compilation commands.
Run the model:

```bash
./llama-cli -m /path/to/gguf_q4_0 -no-cnv --temp 0.6 --top-k 20 --top-p 0.95 --samplers "temperature;top_k;top_p" -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nCalculate the integral of f(x) = sin(x) from 0 to 3pi/4.<|im_end|>\n<|im_start|>assistant" -t 4 -n 256
```

To run the memory-efficient version:

1. Generate the expert bundle:
```bash
GENERATE_EXPERT_BUNDLE=/path/to/bundle ./llama-cli -m /path/to/gguf_q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --samplers "penalties;temperature;top_k;top_p" -t 4 -n 128 -no-cnv
```

2. Remove the MoE weights from the GGUF file (needed when running in Termux):
```bash
python get_no_moe_weights_ffn.py /path/to/gguf_q4_0 /path/to/no_moe_gguf_q4_0
```

3. Set the environment variable MAX_N_CACHED according to the desired memory limit. Recommended configurations for SmallThinker (see the cache sketch below):
   - 21B model under an 8 GiB limit: `MAX_N_CACHED=6144`
   - 4B model under a 1 GiB limit: `MAX_N_CACHED=768`
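Conceptually, MAX_N_CACHED caps how many expert matrices stay resident in RAM; anything beyond the cap is re-read from the expert bundle on demand. Here is a hypothetical Python sketch of such a cache (class and loader names are illustrative, not the framework's API, and the actual eviction policy may differ):

```python
from collections import OrderedDict

class ExpertMatrixCache:
    """Toy LRU cache holding at most max_n_cached expert matrices."""
    def __init__(self, max_n_cached, load_from_bundle):
        self.cap = max_n_cached
        self.load = load_from_bundle          # e.g. a read from EXPERT_BUNDLE_PATH
        self.cache = OrderedDict()

    def get(self, matrix_id):
        if matrix_id in self.cache:
            self.cache.move_to_end(matrix_id) # hit: mark as most recently used
            return self.cache[matrix_id]
        if len(self.cache) >= self.cap:
            self.cache.popitem(last=False)    # full: evict least recently used
        matrix = self.load(matrix_id)         # disk I/O happens only on a miss
        self.cache[matrix_id] = matrix
        return matrix

# With MAX_N_CACHED=768, at most 768 expert matrices are in RAM at any time.
cache = ExpertMatrixCache(768, load_from_bundle=lambda i: f"matrix-{i}")
print(cache.get(0), cache.get(0))  # second call is served from the cache
```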
4. Run inference under the memory limit:

```bash
MAX_N_CACHED=768 EXPERT_BUNDLE_PATH=/path/to/bundle ./llama-cli -m /path/to/no_moe_gguf_q4_0 -no-cnv --temp 0.6 --top-k 20 --top-p 0.95 --samplers "temperature;top_k;top_p" -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nCalculate the integral of f(x) = sin(x) from 0 to 3pi/4.<|im_end|>\n<|im_start|>assistant" -t 4 -n 256 -ub 4
```

Notes:
- The models use a sparse lm_head, which may cause some loss in precision (see the sketch below). To disable it, change the condition at src/llama-model.cpp:7580 to false; decoding will then be slower.
- Running the memory-efficient version in Termux may require root privileges.
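To see why the sparse lm_head can cost some precision, consider this toy sketch of predictor-guided logit computation (the predictor itself is omitted, and every name here is hypothetical, not the repository's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 64, 32000
w_head = rng.standard_normal((d, vocab))

def sparse_lm_head(h, predicted_ids):
    # Logits are computed only for the token ids proposed by the predictor;
    # all other tokens are treated as impossible. A token the predictor
    # misses can never be sampled -- that is the source of the precision loss.
    logits = np.full(vocab, -np.inf)
    logits[predicted_ids] = h @ w_head[:, predicted_ids]
    return logits

logits = sparse_lm_head(rng.standard_normal(d), np.arange(2048))
```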
We would like to thank the following projects: