Skip to content

bcourtney/mesh-llm

 
 

Repository files navigation

Mesh LLM

Mesh LLM logo

Mesh LLM

⚠️ Built with caffeine and anger. Harnesses used: Goose, pi, Claude Code. Models: Opus, GPT 5.x, some MiniMax M2.5 and GLM 4.7 Flash.

Pool spare GPU capacity to run LLMs at larger scale. Models that don't fit on one machine are automatically distributed — dense models via pipeline parallelism, MoE models via expert sharding with zero cross-node inference traffic. Have your agents gossip across the mesh — share status, findings, and questions without a central server.

Install

curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash

The installer probes your machine, recommends a flavor, and asks what you want to install.

If you want it to run as a per-user background service, see Background service.

The installer currently targets macOS and Linux release bundles. Windows is supported through source builds and published .zip release assets instead.

For non-interactive installs, set the flavor explicitly:

curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | MESH_LLM_INSTALL_FLAVOR=vulkan bash

Release bundles install flavor-specific llama.cpp binaries:

  • macOS: rpc-server-metal, llama-server-metal
  • Linux CPU: rpc-server-cpu, llama-server-cpu
  • Linux CUDA: rpc-server-cuda, llama-server-cuda
  • Linux ROCm: rpc-server-rocm, llama-server-rocm
  • Linux Vulkan: rpc-server-vulkan, llama-server-vulkan

If you keep more than one flavor installed in the same bin directory, select the one you want explicitly:

mesh-llm --llama-flavor vulkan --model Qwen2.5-32B

If you want a local GPU build from source instead:

git clone https://github.com/michaelneale/mesh-llm
cd mesh-llm
just build

Requires: just, cmake, Rust toolchain, Node.js 24 + npm. NVIDIA GPU builds need nvcc (CUDA toolkit). AMD GPU builds need ROCm/HIP. Vulkan GPU builds need the Vulkan development files plus glslc. CPU-only and Jetson/Tegra also work. For source builds, just build auto-detects CUDA vs ROCm vs Vulkan on Linux, or you can force backend=rocm or backend=vulkan. See CONTRIBUTING.md for details.

Windows source builds are also supported for cuda, rocm/hip, vulkan, and cpu via just build. Metal remains macOS-only. Tagged GitHub releases now publish Windows .zip bundles for cpu, cuda, rocm, and vulkan, and you can generate the same artifacts locally with just release-build-windows, just release-build-cuda-windows, just release-build-amd-windows, just release-build-vulkan-windows, and the matching release-bundle-*-windows recipes.

Run

Once installed, you can run:

mesh-llm --auto                            # join the best public mesh, start serving

That's it. Downloads a model for your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337.

Or start your own:

mesh-llm --model Qwen2.5-32B              # downloads model (~20GB), starts API + web console
mesh-llm --model Qwen2.5-3B               # or a small model first (~2GB)

Add another machine:

mesh-llm --join <token>                    # token printed by the first machine

Or discover and join public meshes:

mesh-llm --auto                            # find and join the best mesh
mesh-llm --client --auto                   # join as API-only client (no GPU)

How it works

Every node gets an OpenAI-compatible API at http://localhost:9337/v1. Distribution is automatic — you just say mesh-llm --model X and the mesh figures out the best strategy:

  • Model fits on one machine? → runs solo, full speed, no network overhead
  • Dense model too big? → pipeline parallelism — layers split across nodes
  • MoE model too big? → expert parallelism — experts split across nodes, zero cross-node traffic

If a node has enough VRAM, it always runs the full model. Splitting only happens when it has to. Currently using a lightly forked version of llama.cpp (see the Justfile for where it pulls branch from).

Pipeline parallelism — for dense models that don't fit on one machine, layers are distributed across nodes proportional to VRAM. llama-server runs on the highest-VRAM node and coordinates via RPC. Each rpc-server loads only its assigned layers from local disk. Latency-aware: peers are selected by lowest RTT first, with an 80ms hard cap — high-latency nodes stay in the mesh as API clients but don't participate in splits.

MoE expert parallelism — Mixture-of-Experts models (Qwen3-MoE, GLM, OLMoE, Mixtral, DeepSeek — increasingly the best-performing architectures) are auto-detected from the GGUF header. The mesh reads expert routing statistics to identify which experts matter most, then assigns each node an overlapping shard: a shared core of critical experts replicated everywhere, plus unique experts distributed across nodes. Each node gets a standalone GGUF with the full trunk + its expert subset and runs its own independent llama-server — zero cross-node traffic during inference. Sessions are hash-routed to nodes for KV cache locality.

Multi-model — different nodes serve different models simultaneously. The API proxy peeks at the model field in each request and routes to the right node via QUIC tunnel. /v1/models lists everything available.

Demand-aware rebalancing — a unified demand map tracks which models the mesh wants (from --model flags, API requests, and gossip). Demand signals propagate infectiously across all nodes and decay naturally via TTL. Standby nodes auto-promote to serve unserved models with active demand, or rebalance when one model is significantly hotter than others. When a model loses its last server, standby nodes detect it within ~60s.

Latency design — the key insight is that HTTP streaming is latency-tolerant while RPC is latency-multiplied. llama-server always runs on the same box as the GPU. The mesh tunnels HTTP, so cross-network latency only affects time-to-first-token, not per-token throughput. RPC only crosses the network for pipeline splits where the model physically doesn't fit on one machine.

Network optimizations

  • Zero-transfer GGUF loadingSET_TENSOR_GGUF tells rpc-server to read weights from local disk. Dropped model load from 111s → 5s.
  • RPC round-trip reduction — cached get_alloc_size, skip GGUF lookups for intermediates. Per-token round-trips: 558 → 8.
  • Direct server-to-server transfers — intermediate tensors pushed directly between rpc-servers via TCP, not relayed through the client.
  • Speculative decoding — draft model runs locally on the host, proposes tokens verified in one batched forward pass. +38% throughput on code (75% acceptance).

Usage

Start a mesh

mesh-llm --model Qwen2.5-32B

Starts serving a model and prints an invite token. This mesh is private — only people you share the token with can join.

To make it public (discoverable by others via --auto):

mesh-llm --model Qwen2.5-32B --publish

Join a mesh

mesh-llm --join <token>                    # join with invite token (GPU node)
mesh-llm --client --join <token>           # join as API-only client (no GPU)

Named mesh (buddy mode)

mesh-llm --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night"

Everyone runs the same command. First person creates it, everyone else discovers "poker-night" and joins automatically. --mesh-name implies --publish — named meshes are always published to the directory.

Auto-discover

mesh-llm --auto                            # discover, join, and serve a model
mesh-llm --client --auto                   # join as API-only client (no GPU)
mesh-llm discover                          # browse available meshes

Multi-model

mesh-llm --model Qwen2.5-32B --model GLM-4.7-Flash

# Route by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'

Different nodes serve different models. The API proxy routes by the model field.

No-arg behavior

mesh-llm                                   # no args — prints --help and exits

Does not start the console or bind any ports. Use the CLI flags shown in --help to start or join a mesh.

Background service

To install it as a per-user background service:

curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash -s -- --service

To seed the service with a custom startup command on first install:

curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash -s -- --service --service-args '--model Qwen2.5-3B'

Service installs are user-scoped:

  • macOS installs a launchd agent at ~/Library/LaunchAgents/com.mesh-llm.mesh-llm.plist
  • Linux installs a systemd --user unit at ~/.config/systemd/user/mesh-llm.service
  • Shared environment config lives in ~/.config/mesh-llm/service.env

The two platforms handle launch args differently:

  • macOS: launchd runs ~/.config/mesh-llm/run-service.sh, which reads ~/.config/mesh-llm/service.args. service.args is one mesh-llm CLI argument per line. The installer creates it with --auto by default and preserves your edits on reinstall unless you pass --service-args again.
  • Linux: the installer writes the mesh-llm argv directly into ExecStart= in ~/.config/systemd/user/mesh-llm.service. If you pass --service-args, those replace the current unit args; otherwise the installer preserves the existing unit args on reinstall.

service.env is optional and shared by both platforms. Use plain KEY=value lines, for example:

MESH_LLM_NO_SELF_UPDATE=1

If you edit the Linux unit manually, reload and restart it:

systemctl --user daemon-reload
systemctl --user restart mesh-llm.service

On Linux this is a user service, so if you want it to keep running after reboot before login, enable lingering once:

sudo loginctl enable-linger "$USER"

Web console

mesh-llm --model Qwen2.5-32B    # dashboard at http://localhost:3131

Live topology, VRAM bars per node, model picker, built-in chat. Everything comes from /api/status (JSON) and /api/events (SSE).

Development

Build-from-source and UI development instructions are in CONTRIBUTING.md.

Using with agents

mesh-llm exposes an OpenAI-compatible API on localhost:9337. Any tool that supports custom OpenAI endpoints works. /v1/models lists available models; the model field in requests routes to the right node.

For built-in launcher integrations (goose, claude):

  • If a mesh is already running locally on --port, it is reused.
  • If not, mesh-llm auto-starts a background client node that auto-joins the mesh.
  • If --model is omitted, the launcher picks the strongest tool-capable model available on the mesh.
  • When the harness exits (e.g. claude quits), the auto-started node is cleaned up automatically.

goose

Goose is available as both CLI (goose session) and desktop app (Goose.app).

mesh-llm goose

Use a specific model (example: MiniMax):

mesh-llm goose --model MiniMax-M2.5-Q4_K_M

This command writes/updates ~/.config/goose/custom_providers/mesh.json and launches Goose.

pi

  1. Start a mesh client:
mesh-llm --client --auto --port 9337
  1. Check what models are available:
curl -s http://localhost:9337/v1/models | jq '.data[].id'
  1. Add a mesh provider to ~/.pi/agent/models.json (adjust model IDs to match your mesh):
{
  "providers": {
    "mesh": {
      "api": "openai-completions",
      "apiKey": "mesh",
      "baseUrl": "http://localhost:9337/v1",
      "models": [
        {
          "id": "MiniMax-M2.5-Q4_K_M",
          "name": "MiniMax M2.5 (Mesh)",
          "contextWindow": 65536,
          "maxTokens": 8192,
          "reasoning": true,
          "input": ["text"],
          "compat": {
            "maxTokensField": "max_tokens",
            "supportsDeveloperRole": false,
            "supportsUsageInStreaming": false
          }
        }
      ]
    }
  }
}
  1. Run pi:
pi --model mesh/MiniMax-M2.5-Q4_K_M

Or switch models interactively with Ctrl+M inside pi.

opencode

OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:9337/v1 opencode -m openai/GLM-4.7-Flash-Q4_K_M

claude code

Claude Code can be launched directly through mesh-llm (no proxy required):

mesh-llm claude

Use a specific model (example: MiniMax):

mesh-llm claude --model MiniMax-M2.5-Q4_K_M

curl / any OpenAI client

curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'

Blackboard

The mesh doesn't just share compute — it shares knowledge. Agents and people post status updates, findings, and questions to a shared blackboard that propagates across the mesh.

Works standalone — you don't need to run models through the mesh. Using your own API keys or a cloud provider? Just run mesh-llm --client to give your agents a gossip layer. No GPU needed, no model needed.

mesh-llm --client

# Install the agent skill (works with pi, Goose, others)
mesh-llm blackboard install-skill

# Post what you're working on
mesh-llm blackboard "STATUS: [org/repo branch:main] refactoring billing module"

# Search the blackboard
mesh-llm blackboard --search "billing refactor"

# Check for unanswered questions
mesh-llm blackboard --search "QUESTION"

With the skill installed, agents proactively search before starting work, post their status, share findings, and answer each other's questions — all through the mesh.

Messages are ephemeral (48h), PII is auto-scrubbed, and everything stays within the mesh — no cloud, no external services.

MCP Server

The blackboard is available as an MCP server for agent integration. Any MCP-compatible agent (pi, Claude Code, Goose, etc.) can post, search, and read the feed directly:

# Run as MCP server over stdio
mesh-llm blackboard --mcp

Configure in your agent's MCP settings:

{
  "mcpServers": {
    "mesh-blackboard": {
      "command": "mesh-llm",
      "args": ["blackboard", "--mcp"]
    }
  }
}

Tools exposed: blackboard_post, blackboard_search, blackboard_feed.

Benchmarks

GLM-4.7-Flash-Q4_K_M (17GB), M4 Max + Mac Mini M4, WiFi:

Configuration tok/s
Solo (no mesh) 68
2-node split (85/15) 21
3-node split (62/31/8) 12-13

Cross-network (Sydney ↔ Queensland, ~20ms RTT): 10-25 tok/s. Overhead dominated by per-token RPC latency.

Stock llama.cpp RPC transfers 16.88GB on connect. This fork: 0 bytes, ~9 seconds.

Model catalog

mesh-llm download           # list models
mesh-llm download 32b       # Qwen2.5-32B (~20GB)
mesh-llm download 72b --draft  # Qwen2.5-72B + draft model

Draft pairings for speculative decoding:

Model Size Draft Draft size
Qwen2.5 (3B/7B/14B/32B/72B) 2-47GB Qwen2.5-0.5B 491MB
Qwen3-32B 20GB Qwen3-0.6B 397MB
Llama-3.3-70B 43GB Llama-3.2-1B 760MB
Gemma-3-27B 17GB Gemma-3-1B 780MB

Specifying models

--model accepts several formats. Hugging Face-backed models are cached in the standard Hugging Face cache (~/.cache/huggingface/hub/ by default) on first use.

# Catalog name (fuzzy match — finds Qwen3-8B-Q4_K_M)
mesh-llm --model Qwen3-8B

# Full catalog name
mesh-llm --model Qwen3-8B-Q4_K_M

# HuggingFace URL (any GGUF)
mesh-llm --model https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# HuggingFace shorthand (org/repo/file.gguf)
mesh-llm --model bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Local file path (legacy/raw file mode)
mesh-llm --gguf ~/my-models/custom-model.gguf

Catalog models are downloaded with resume support. Use the models subcommands to browse, inspect, and fetch exact refs.

Model storage and migration

  • Hugging Face repo snapshots are the canonical managed model store.
  • ~/.models/ is deprecated and will be removed in a future release.
  • Arbitrary local GGUF files remain supported through --gguf.
  • MoE split artifacts are cached separately under ~/.cache/mesh-llm/splits/.

Useful commands:

mesh-llm models recommended      # list built-in recommended models
mesh-llm models installed        # list installed local models
mesh-llm models search qwen 8b   # search Hugging Face GGUF repos
mesh-llm models search --catalog qwen
mesh-llm models show Qwen/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf
mesh-llm models download Qwen/Qwen3-8B-GGUF/Qwen3-8B-Q4_K_M.gguf
mesh-llm models migrate          # inspect deprecated ~/.models content
mesh-llm models migrate --apply  # materialize recognized HF-backed models into the HF cache
mesh-llm models updates --check  # check cached HF repos for newer upstream revisions
mesh-llm models updates --all    # refresh all cached HF repos
mesh-llm models updates Qwen/Qwen3-8B-GGUF

Local runtime control

Stage one supports local-only hot load/unload on a running node.

mesh-llm load Llama-3.2-1B-Instruct-Q4_K_M
mesh-llm unload Llama-3.2-1B-Instruct-Q4_K_M
mesh-llm status

REST endpoints on the management API:

curl localhost:3131/api/runtime
curl localhost:3131/api/runtime/processes
curl -X POST localhost:3131/api/runtime/models \
  -H 'Content-Type: application/json' \
  -d '{"model":"Llama-3.2-1B-Instruct-Q4_K_M"}'
curl -X DELETE localhost:3131/api/runtime/models/Llama-3.2-1B-Instruct-Q4_K_M

This is intentionally node-local in stage one. Mesh-wide rebalancing and distributed load/unload are stage two.

Community

Join the #mesh-llm channel on the Goose Discord for discussion, support, and development chat.

Contributing

See CONTRIBUTING.md for build and development workflows.

Project Structure

Path Purpose
llama.cpp/ Fork with zero-transfer RPC patches
mesh-llm/ Rust QUIC mesh (internals)

"You are all a bunch of dirty hackers" — Author's CompSci 101 professor

About

reference impl with llama.cpp compiled to distributed inference across machines, with real end to end demo

Resources

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Rust 74.3%
  • TypeScript 11.2%
  • Shell 8.6%
  • PowerShell 2.0%
  • Python 1.3%
  • Just 0.7%
  • Other 1.9%