    __  ___           __        __
   /  |/  /__  ____  / /_____ _/ /_
  / /|_/ / _ \/ __ \/ __/ __ `/ __/
 / /  / /  __/ / / / /_/ /_/ / /_
/_/  /_/\___/_/ /_/\__/\__,_/\__/
A Sovereign, Rust-Native Inference Engine for High-Performance Reasoning Models.
Mentat is a fully independent, fast, and secure platform for running AI models locally. Built from the ground up in Rust, it extracts maximum performance from consumer hardware while keeping all data on the user's machine and preserving architectural sovereignty.
graph TD
subgraph Input ["Input Layer"]
Prompt["User Prompt"]
Weights["Safetensors (.bin / .safetensors)"]
end
subgraph Core ["Mentat Inference Engine (Rust)"]
direction TB
Loader["Weight Loader (mmap)"]
Tokenizer["BPE Tokenizer"]
Parser["Harmony Parser"]
subgraph Brain ["Transformer Model"]
direction LR
Attn["Attention (GQA)"]
MoE["Mixture of Experts"]
KV["KV Cache"]
end
Math["Tensor Ops (MatMul, Add, Mul)"]
end
subgraph Tools ["Agentic Layer (Phase 4)"]
Python["Python (WASM Sandbox)"]
Browser["Headless Browser"]
FS["File Patcher"]
end
subgraph Hardware ["Hardware Acceleration (Phase 6)"]
Metal["Apple Metal"]
CUDA["NVIDIA CUDA"]
end
Prompt --> Tokenizer
Weights --> Loader
Loader --> Math
Tokenizer --> Brain
Brain --> Math
Math --> Attn
Math --> MoE
Attn --> KV
Brain --> Parser
Parser --> Python
Parser --> Browser
Parser --> FS
Math -.-> Hardware
Mentat is designed with a modular, "purity-first" approach, separating the mathematical engine from the agentic capabilities.
The foundation of Mentat is a custom tensor library implemented in pure Rust.
- Tensor Ops: Efficient implementations of `MatMul`, `Add`, and `Mul`.
- Memory Management: Leverages `memmap2` for zero-copy weight loading, allowing multi-gigabyte models to be loaded with minimal RAM overhead.
- Precision Support: Native support for `F32`, `F16`, and `BF16` (Brain-Float 16), ensuring compatibility with modern models like Llama 3 and GPT-OSS.
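
To make the precision and tensor-op bullets concrete, here is a minimal sketch in plain Rust. It is not Mentat's actual code: the function names `bf16_to_f32`, `f32_to_bf16_truncate`, and `matmul` are hypothetical, and the row-major layout is an assumption. It relies only on the fact that a BF16 value is the upper 16 bits of an IEEE-754 `f32`.

```rust
/// Widen a bfloat16 value (given as raw bits) to f32.
/// BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
/// of an f32, so decoding is a 16-bit left shift.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow an f32 to bfloat16 by truncation (no rounding).
/// Hypothetical helper for illustration; a real engine would likely round.
fn f32_to_bf16_truncate(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Naive row-major matrix multiply: (m x k) * (k x n) -> (m x n).
/// The i-p-j loop order keeps the innermost accesses contiguous in memory.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "lhs shape mismatch");
    assert_eq!(b.len(), k * n, "rhs shape mismatch");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    out
}

fn main() {
    // 1.0f32 has the bit pattern 0x3F80_0000, so its BF16 form is 0x3F80.
    println!("bf16 0x3F80 -> {}", bf16_to_f32(0x3F80));

    let a = [1.0, 2.0, 3.0, 4.0]; // 2x2
    let b = [5.0, 6.0, 7.0, 8.0]; // 2x2
    println!("{:?}", matmul(&a, &b, 2, 2, 2)); // prints [19.0, 22.0, 43.0, 50.0]
}
```

In a memory-mapped setting, the `u16` values would presumably be read straight out of the mapped byte slice, so weights can stay in BF16 on disk and be widened only when they are used.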
A sovereign implementation of the Transformer architecture:
- Transformer Blocks: Modular blocks featuring `RMSNorm` (Root Mean Square Normalization) for stability.
- Grouped-Query Attention (GQA): Optimized attention mechanism for high-speed context processing.
- Mixture of Experts (MoE): Implementation of Gated Routing logic, enabling massive models to run efficiently by activating only a subset of parameters (Experts) per token.
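- KV Cache: Caches the Key and Value tensors of previously processed tokens, so each decoding step computes them only for the new token. Per-token cost then grows linearly with context length instead of requiring the whole sequence to be recomputed.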
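
As a rough illustration of two of the components above, the sketch below shows RMSNorm and a top-k expert router in plain Rust. The function names `rms_norm` and `route_top_k` are hypothetical. Applying softmax over only the selected experts' logits is one common gating variant and is an assumption here, not necessarily what Mentat does.

```rust
/// RMSNorm: divide each element by the root mean square of the vector,
/// then apply a learned per-channel weight.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Top-k gated routing: pick the k experts with the highest router logits
/// and normalize their scores with a softmax over just those k logits.
/// Returns (expert_index, gate_weight) pairs, highest logit first.
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    if logits.is_empty() || k == 0 {
        return Vec::new();
    }
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // Stable sort in descending logit order; ties keep the lower index first.
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);

    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits[idx[0]];
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

fn main() {
    let normed = rms_norm(&[3.0, 4.0], &[1.0, 1.0], 1e-6);
    println!("normalized: {:?}", normed);

    // Four experts, two active per token.
    let gates = route_top_k(&[0.1, 2.0, -1.0, 2.0], 2);
    println!("selected experts and gate weights: {:?}", gates);
}
```

Only the experts returned by the router would run their feed-forward pass for that token, with the outputs combined using the gate weights. This is what lets a large MoE model activate only a subset of its parameters per token.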
- BPE Tokenizer: A high-performance Byte Pair Encoding implementation for text-to-ID conversion.
- Harmony Parser: A specialized parser for structured outputs, capable of live-extracting reasoning chains (`<think>`) and agentic tool calls (`<python>`, `<browser>`) from the model's stream.
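
The streaming extraction described for the parser could look roughly like the sketch below. `StreamParser`, `Segment`, and `feed` are hypothetical names, not Mentat's API. The sketch handles only the three tags named above and ignores nesting, attributes, and escaping.

```rust
/// Tags this sketch recognizes (taken from the examples in this README).
const TAGS: [&str; 3] = ["think", "python", "browser"];

#[derive(Debug, PartialEq)]
struct Segment {
    /// "think", "python", "browser", or "text" for untagged output.
    kind: &'static str,
    body: String,
}

/// Buffers streamed chunks and emits a segment once its closing tag arrives.
struct StreamParser {
    buf: String,
}

impl StreamParser {
    fn new() -> Self {
        StreamParser { buf: String::new() }
    }

    /// Append a chunk of model output and return every segment completed so far.
    fn feed(&mut self, chunk: &str) -> Vec<Segment> {
        self.buf.push_str(chunk);
        let mut out = Vec::new();
        loop {
            // Find the earliest opening tag currently in the buffer.
            let next = TAGS
                .iter()
                .filter_map(|&tag| self.buf.find(&format!("<{tag}>")).map(|pos| (pos, tag)))
                .min_by_key(|&(pos, _)| pos);
            let (start, tag) = match next {
                Some(hit) => hit,
                None => break,
            };
            let body_start = start + tag.len() + 2; // skip "<tag>"
            let close = format!("</{tag}>");
            let body_end = match self.buf[body_start..].find(&close) {
                Some(rel) => body_start + rel,
                None => break, // closing tag not streamed yet; wait for more input
            };
            // Untagged text that precedes the tag becomes a "text" segment.
            let text = self.buf[..start].trim();
            if !text.is_empty() {
                out.push(Segment { kind: "text", body: text.to_string() });
            }
            out.push(Segment { kind: tag, body: self.buf[body_start..body_end].to_string() });
            self.buf.drain(..body_end + close.len());
        }
        out
    }
}

fn main() {
    let mut parser = StreamParser::new();
    // The first chunk ends mid-tag, so nothing is emitted yet.
    println!("{:?}", parser.feed("<think>Reas"));
    // The second chunk completes the reasoning block and a tool call.
    for seg in parser.feed("oning...</think> <python>print(1)</python>") {
        println!("{}: {}", seg.kind, seg.body);
    }
}
```

Trailing untagged text stays in the buffer until another tag completes. A fuller version would presumably add a `finish` step to flush it when the stream ends.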
- Secure Sandbox: A WASM-based or Docker-isolated environment for executing model-generated Python code.
- Sovereign Browser: A headless navigation tool for real-time web research.
- Atomic File Patcher: Safe filesystem operations for direct codebase modifications.
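
One common way to make file modifications atomic is to write the new contents to a temporary file in the same directory and then rename it over the target. The sketch below assumes that technique. This README does not say how the File Patcher actually works, and `write_atomic` and the `.tmp` suffix are hypothetical.

```rust
use std::ffi::OsString;
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `path` with `contents` so readers see either the old file or the
/// new one, never a partially written file.
fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Keep the temporary file next to the target so the rename stays on one
    // filesystem (hypothetical naming scheme: "<target>.tmp").
    let mut tmp_name: OsString = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush data to disk before the rename makes it visible.
        file.sync_all()?;
    }

    // On POSIX systems, rename replaces the destination atomically.
    fs::rename(&tmp, path)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("mentat_patch_demo.txt");
    fs::write(&path, "old contents\n")?;
    write_atomic(&path, b"new contents\n")?;
    println!("{}", fs::read_to_string(&path)?);
    fs::remove_file(&path)
}
```

If the process dies before the rename, the original file is untouched and only a stray temporary file is left behind.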
- OpenAI-Compatible API: A local HTTP server that acts as a drop-in replacement for OpenAI endpoints.
- Static Binaries: Ensuring Mentat can be distributed as a single, dependency-free executable for Mac, Linux, and Windows.
- Apple Metal Support: Native GPU acceleration for Apple Silicon via `metal-rs`.
- CUDA Integration: High-performance kernels for NVIDIA hardware via `cudarc`.
- Deep Benchmarking: Built-in performance and memory profiling using `criterion` and `dhat`.
- Data Collection Pipelines: Local, opt-in privacy-first data recording.
- Native LoRA: Implementation of Low-Rank Adaptation to allow users to adapt models to their own data locally without Python.
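
For readers unfamiliar with Low-Rank Adaptation, the forward pass can be sketched as follows. The frozen weight `W` is left untouched, and a trainable low-rank update `B * A`, scaled by `alpha / rank`, is added on top. The `LoraLinear` struct, its field names, and the `matvec` helper are hypothetical and use plain `Vec<f32>` storage rather than Mentat's tensor type.

```rust
/// Row-major matrix-vector product: (rows x cols) * (cols) -> (rows).
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| m[r * cols..(r + 1) * cols].iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}

/// A linear layer with a LoRA adapter: y = W x + (alpha / rank) * B (A x).
struct LoraLinear {
    w: Vec<f32>, // frozen base weight, d_out x d_in
    a: Vec<f32>, // trainable down-projection, rank x d_in
    b: Vec<f32>, // trainable up-projection, d_out x rank
    d_in: usize,
    d_out: usize,
    rank: usize,
    alpha: f32,
}

impl LoraLinear {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let base = matvec(&self.w, self.d_out, self.d_in, x);
        // Project down to `rank` dimensions, then back up to d_out.
        let down = matvec(&self.a, self.rank, self.d_in, x);
        let up = matvec(&self.b, self.d_out, self.rank, &down);
        let scale = self.alpha / self.rank as f32;
        base.iter().zip(&up).map(|(y, d)| y + scale * d).collect()
    }
}

fn main() {
    let layer = LoraLinear {
        w: vec![1.0, 0.0, 0.0, 1.0], // 2x2 identity
        a: vec![1.0, 1.0],           // 1x2
        b: vec![1.0, 0.0],           // 2x1
        d_in: 2,
        d_out: 2,
        rank: 1,
        alpha: 2.0,
    };
    println!("{:?}", layer.forward(&[1.0, 2.0])); // prints [7.0, 2.0]
}
```

Because only `A` and `B` are trained, the number of trainable parameters is `rank * (d_in + d_out)` rather than `d_in * d_out`, which is what makes local adaptation on consumer hardware plausible.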
git clone https://github.com/mentat-ai/mentat
cd mentat
cargo build --release

Mentat provides a suite of tools for inspecting and testing models:
# 🔍 Inspect a model's internal architecture and tensors
cargo run --release -- inspect --model ./models/model.safetensors
# 📖 Test the BPE Tokenizer
cargo run --release -- tokenize "Hello, world!"
# 🧩 Test the Harmony Parser
cargo run --release -- parse "<think>Reasoning...</think> <python>print(1)</python>"
# 🛡️ Run with local, opt-in data collection for future fine-tuning
cargo run --release -- --opt-in-data-collection true tokenize "Hello, world!"

Apache 2.0 - See LICENSE for details.