
Bonsai Demo


Website  |  HuggingFace Collection  |  Whitepaper  |  GitHub  |  Discord

This demo lets you run 1-bit Bonsai language models locally on Mac (Metal), Linux/Windows (CUDA, Vulkan, ROCm), or CPU.

Other Bonsai Demos

Below are featured community demos and additional Bonsai demos hosted outside this repository.

| Demo | Description |
| --- | --- |
| Bonsai WebGPU | Community / HuggingFace — runs directly in your browser, no setup required |
| Bonsai GPU Demo | Our hosted demo on GPUs on HuggingFace Spaces |
| Google Colab Notebook | Run Bonsai in a Google Colab notebook |

Upstream Status for 1-bit

  • llama.cpp (GGUF) — C/C++, runs on Mac (Metal), Linux/Windows (CUDA, Vulkan, ROCm/HIP), and CPU.
  • MLX (MLX format) — Python, optimized for Apple Silicon.

Q1_0 support for CPU, Metal, CUDA, and Vulkan backends is already merged into upstream llama.cpp. Additional backends (optimized x86 CPU, AMD) are pending. In the meantime, our fork provides a more complete set of backends in one place:

| Backend | Status | PR |
| --- | --- | --- |
| CPU (generic) | ✅ Merged | #21273 |
| Metal | ✅ Merged | #21528 |
| CUDA | ✅ Merged | #21629 |
| Vulkan | ✅ Merged | #21539 (community contribution) |
| CPU (optimized x86) | ⏳ Pending | #21636 |
| MLX | ⏳ Pending | mlx#3161 |

Benchmarks

See community-benchmarks/ for results on different hardware and templates to submit your own.

Models

Three model sizes are available: 8B, 4B, and 1.7B, each in two formats:

(Figure: Bonsai accuracy vs. model size frontier)

| Model | Format | HuggingFace Repo |
| --- | --- | --- |
| Bonsai-8B | GGUF | prism-ml/Bonsai-8B-gguf |
| Bonsai-8B | MLX | prism-ml/Bonsai-8B-mlx-1bit |
| Bonsai-4B | GGUF | prism-ml/Bonsai-4B-gguf |
| Bonsai-4B | MLX | prism-ml/Bonsai-4B-mlx-1bit |
| Bonsai-1.7B | GGUF | prism-ml/Bonsai-1.7B-gguf |
| Bonsai-1.7B | MLX | prism-ml/Bonsai-1.7B-mlx-1bit |

Set BONSAI_MODEL to choose which size to download and run (default: 8B).
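The selection could be sketched like this (`resolve_model` is a hypothetical name for illustration; the actual handling lives in `scripts/common.sh`):

```shell
# Hypothetical sketch of BONSAI_MODEL resolution -- defaults to 8B,
# as the README states, and rejects unknown sizes.
resolve_model() {
  model="${1:-8B}"                 # default to 8B when unset/empty
  case "$model" in
    8B|4B|1.7B) echo "$model" ;;
    *) echo "unknown model size: $model" >&2; return 1 ;;
  esac
}

resolve_model "$BONSAI_MODEL"
```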


Quick Start

macOS / Linux

git clone https://github.com/PrismML-Eng/Bonsai-demo.git
cd Bonsai-demo

# (Optional) Choose a model size: 8B (default), 4B, or 1.7B
export BONSAI_MODEL=8B

# One command does everything: installs deps, downloads models + binaries
./setup.sh

Windows (PowerShell)

git clone https://github.com/PrismML-Eng/Bonsai-demo.git
cd Bonsai-demo

# (Optional) Choose a model size: 8B (default), 4B, or 1.7B
$env:BONSAI_MODEL = "8B"

# Run setup
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\setup.ps1

Switching models

You can download a different size and switch between them instantly — no full re-setup needed:

BONSAI_MODEL=4B ./scripts/download_models.sh
BONSAI_MODEL=4B ./scripts/run_llama.sh -p "Who are you? Introduce yourself in haiku"

What setup.sh Does

The setup script handles everything for you, even on a fresh machine:

  1. Checks/installs system deps — Xcode CLT on macOS, build-essential on Linux
  2. Installs uv — fast Python package manager (user-local, not global)
  3. Creates a Python venv and runs uv sync — installs cmake, ninja, huggingface-cli from pyproject.toml
  4. Downloads models from HuggingFace (needs PRISM_HF_TOKEN while repos are private)
  5. Downloads pre-built binaries from GitHub Release (or builds from source if you prefer)
  6. Builds MLX from source (macOS only) — clones our fork, then uv sync --extra mlx for the full ML stack

Re-running setup.sh is safe — it skips already-completed steps.
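A minimal sketch of that skip-if-done pattern (the marker path and helper name are hypothetical; the real checks live in `setup.sh`):

```shell
# Each step leaves an artifact on disk; re-runs check for it first.
step_done() { [ -e "$1" ]; }

if step_done "models/gguf/8B"; then    # hypothetical marker path
  echo "models already downloaded -- skipping"
else
  echo "downloading models"
fi
```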


Running the Model

All run scripts respect BONSAI_MODEL (default 8B). Set it to run a different size:

llama.cpp (Mac / Linux — auto-detects platform)

./scripts/run_llama.sh -p "What is the capital of France?"

# Run a different model size
BONSAI_MODEL=4B ./scripts/run_llama.sh -p "Who are you? Introduce yourself in haiku"

llama.cpp (Windows PowerShell)

.\scripts\run_llama.ps1 -p "What is the capital of France?"

# Run a different model size
$env:BONSAI_MODEL = "4B"
.\scripts\run_llama.ps1 -p "Who are you? Introduce yourself in haiku"

MLX — Mac (Apple Silicon)

source .venv/bin/activate
./scripts/run_mlx.sh -p "What is the capital of France?"

Chat Server

Start llama-server with its built-in chat UI:

./scripts/start_llama_server.sh    # http://localhost:8080

# Serve a different model size
BONSAI_MODEL=4B ./scripts/start_llama_server.sh

For Windows PowerShell:

.\scripts\start_llama_server.ps1

Context Size

The 8B model supports up to 65,536 tokens.

By default the scripts pass -c 0, which lets llama.cpp's --fit automatically size the KV cache to your available memory (no pre-allocation waste). If your build doesn't support -c 0, the scripts fall back to a safe value based on system RAM:

Estimates for Bonsai-8B (weights + KV cache + activations):

| Context Size | Est. Memory Usage |
| --- | --- |
| 8,192 tokens | ~2.5 GB |
| 32,768 tokens | ~5.9 GB |
| 65,536 tokens | ~10.5 GB |
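These rows are consistent with a fixed cost (weights plus overhead) and a linear per-token KV-cache cost; as a sanity check, deriving the 32K estimate from the 8K and 64K rows:

```shell
awk 'BEGIN {
  per_tok = (10.5 - 2.5) / (65536 - 8192)  # GB per token of context
  fixed   = 2.5 - 8192 * per_tok           # weights + fixed overhead
  printf "%.1f GB\n", fixed + 32768 * per_tok
}'
```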

Override with: ./scripts/run_llama.sh -c 8192 -p "Your prompt"


Open WebUI (Optional)

Open WebUI provides a ChatGPT-like browser interface. It auto-starts the backend servers if they're not already running. Ctrl+C stops everything.

# Install (heavy — separate from base deps)
source .venv/bin/activate
uv pip install open-webui

# One command — starts backends + opens http://localhost:9090
./scripts/start_openwebui.sh

Building from Source

If you prefer to build llama.cpp from source instead of using pre-built binaries:

Mac (Apple Silicon — Metal)

./scripts/build_mac.sh

Clones PrismML-Eng/llama.cpp, builds with Metal, outputs to bin/mac/.

Mac (Intel — CPU only)

./scripts/build_mac.sh

The script auto-detects Intel vs Apple Silicon. On Intel Macs, it builds with -DGGML_METAL=OFF (CPU only). MLX is also skipped automatically since it requires Apple Silicon.

Linux (CPU only)

./scripts/build_cpu_linux.sh

Builds a CPU-only binary with no GPU dependencies. Works on both x64 and arm64. Outputs to bin/cpu/.

Linux (CUDA)

./scripts/build_cuda_linux.sh

Auto-detects CUDA version. Pass --cuda-path /usr/local/cuda-12.8 to use a specific toolkit.

Linux (Vulkan)

# Install Vulkan SDK first (e.g. sudo apt install libvulkan-dev glslc)
git clone -b prism https://github.com/PrismML-Eng/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build -j$(nproc)
# Binaries in build/bin/

Linux (ROCm / AMD GPU)

# Requires ROCm toolkit (hipcc)
git clone -b prism https://github.com/PrismML-Eng/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIP=ON
cmake --build build -j$(nproc)
# Binaries in build/bin/

Windows (CUDA)

.\scripts\build_cuda_windows.ps1

Auto-detects CUDA toolkit. Pass -CudaPath "C:\path\to\cuda" to use a specific version. Requires Visual Studio Build Tools (or full Visual Studio) and CUDA toolkit.

Windows (CPU only)

git clone -b prism https://github.com/PrismML-Eng/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
# Binaries in build\bin\Release\

Requires Visual Studio Build Tools or full Visual Studio with C++ workload.


llama.cpp Pre-built Binary Downloads

All binaries are available from the GitHub Release:

  • macOS Apple Silicon (arm64)
  • macOS Apple Silicon (KleidiAI)
  • macOS Intel (x64)
  • Linux x64 (CPU)
  • Linux arm64 (CPU)
  • Linux x64 (CUDA 12.4)
  • Linux x64 (CUDA 12.8)
  • Linux x64 (Vulkan)
  • Linux arm64 (Vulkan)
  • Linux x64 (ROCm 7.2)
  • Windows x64 (CPU)
  • Windows arm64 (CPU)
  • Windows x64 (CUDA 12.4)
  • Windows x64 (Vulkan)
  • Windows x64 (HIP/ROCm)
  • iOS (XCFramework)
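Matching a platform to a flavor could look like the hypothetical helper below; the actual selection logic lives in `scripts/download_binaries.sh`, and the real release asset names may differ:

```shell
# Hypothetical OS/arch -> binary-flavor mapping (illustrative names only).
pick_flavor() {
  case "$1-$2" in
    Darwin-arm64)  echo "macos-arm64" ;;
    Darwin-x86_64) echo "macos-x64" ;;
    Linux-x86_64)  echo "linux-x64-cpu" ;;
    Linux-aarch64) echo "linux-arm64-cpu" ;;
    *)             echo "unknown" ;;
  esac
}

pick_flavor "$(uname -s)" "$(uname -m)"
```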

Folder Structure

After setup, the directory looks like this:

Bonsai-demo/
├── README.md
├── setup.sh                        # macOS/Linux setup
├── setup.ps1                       # Windows setup
├── pyproject.toml                  # Python dependencies
├── scripts/
│   ├── common.sh                   # Shared helpers + BONSAI_MODEL
│   ├── download_models.sh          # HuggingFace download
│   ├── download_binaries.sh        # GitHub release download
│   ├── run_llama.sh                # llama.cpp (auto-detects Mac/Linux)
│   ├── run_llama.ps1               # llama.cpp (Windows PowerShell)
│   ├── run_mlx.sh                  # MLX inference
│   ├── mlx_generate.py             # MLX Python script
│   ├── start_llama_server.sh       # llama.cpp server (port 8080)
│   ├── start_llama_server.ps1      # llama.cpp server (Windows PowerShell)
│   ├── start_mlx_server.sh         # MLX server (port 8081)
│   ├── start_openwebui.sh          # Open WebUI + auto-starts backends
│   ├── build_mac.sh                # Build llama.cpp for Mac
│   ├── build_cpu_linux.sh          # Build llama.cpp for Linux (CPU only)
│   ├── build_cuda_linux.sh         # Build llama.cpp for Linux CUDA
│   └── build_cuda_windows.ps1      # Build llama.cpp for Windows CUDA
├── models/                         # ← downloaded by setup
│   ├── gguf/
│   │   ├── 8B/                     # GGUF 8B model
│   │   ├── 4B/                     # GGUF 4B model
│   │   └── 1.7B/                   # GGUF 1.7B model
│   ├── Bonsai-8B-mlx/             # MLX 8B model (macOS)
│   ├── Bonsai-4B-mlx/             # MLX 4B model (macOS)
│   └── Bonsai-1.7B-mlx/           # MLX 1.7B model (macOS)
├── bin/                            # ← downloaded or built by setup
│   ├── mac/                        # macOS binaries (Metal or CPU)
│   ├── cuda/                       # CUDA binaries (Linux/Windows)
│   ├── cpu/                        # CPU-only binaries (Linux/Windows)
│   ├── vulkan/                     # Vulkan binaries
│   ├── rocm/                       # ROCm binaries (AMD Linux)
│   └── hip/                        # HIP binaries (AMD Windows)
├── mlx/                            # ← cloned by setup (macOS)
└── .venv/                          # ← created by setup

Items marked with ← are created at setup time and excluded from git.


Appendix — FAQ

CUDA source build runs out of memory or freezes

Symptom: cmake --build hangs, the system becomes unresponsive, or the build process is killed with an OOM error when building llama.cpp from source with CUDA enabled.

Cause: Compiling CUDA kernels is memory-intensive — each parallel compile job can consume several GB of GPU VRAM and/or system RAM. Running make -j$(nproc) on a machine with a low-VRAM GPU (< 16 GB) or limited system RAM can exhaust available memory.

How the build scripts handle this: build_cuda_linux.sh and build_cuda_windows.ps1 automatically detect the GPU's VRAM before building. If the maximum detected VRAM is less than 16 GB, the scripts cap parallelism at -j 2 instead of using all logical CPU cores. You will see a message like:

Detected GPU VRAM: 8.0 GB (< 16 GB) -- limiting CUDA build to -j 2

Manual override: If you still encounter OOM errors, reduce parallelism further by editing the build invocation in the relevant script, or close other GPU-heavy applications before building.
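The cap itself amounts to a simple threshold check; a minimal sketch (the real detection lives in the build scripts, and `vram_gb` would come from querying the GPU, e.g. via `nvidia-smi --query-gpu=memory.total`):

```shell
# Illustrative VRAM-based parallelism cap for the CUDA build.
vram_gb=8                 # assumed example value, normally detected at runtime
if [ "$vram_gb" -lt 16 ]; then
  jobs=2                  # low-VRAM GPU: cap compile parallelism
else
  jobs="$(nproc)"         # plenty of memory: use all logical cores
fi
echo "limiting CUDA build to -j $jobs"
```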
