This codebase supports inference for Llama 2/3 architecture models using both GGUF and Safetensors formats, and additionally contains Python bindings.
The GGUF format support includes loading models with various tensor types such as BF16, FP16, FP32, and the quantized types Q4_K_M, Q6_K, Q8_0, and Q8_K.
Note for Older Llama Models (Llama 2/TinyLlama): Some older GGUF files may not contain explicit BPE merge rules. The system automatically handles this by generating merge rules from vocabulary and token scores (similar to llama.cpp's approach), ensuring proper tokenization without requiring external files.
- Pure C++ inference core (CPU-based).
- Optional CUDA backend for GPU acceleration.
- Support for both safetensors and GGUF formats (various quantizations like Q4_K_M, Q6_K, Q8_0, Q8_K for GGUF).
- Python bindings.
- Built-in web server (cpp-httplib) for easy interaction via a web UI.
- Minimal external dependencies managed via CMake.
- Cross-platform (tested on Linux and Windows; requires a C++17 compiler).
This section outlines the necessary components to build and run the project, and how to obtain model files.
Core requirements to build and run the C++ application:
- CMake (>= 3.11): For building the project.
- C++17 Compliant Compiler: Such as g++, Clang, or MSVC.
- Boost Libraries (Specifically Regex & Xpressive): Needed for tokenizer functionalities, especially for Llama 3 tokenizers.
- Linux (Debian/Ubuntu): sudo apt update && sudo apt install build-essential cmake libboost-all-dev libomp-dev. Installing libboost-all-dev is simplest; for a minimal setup, ensure libboost-regex-dev and the Boost Xpressive headers are installed.
- Linux (Fedora/RHEL): sudo dnf install gcc-c++ cmake boost-devel libgomp (or libomp-devel).
- macOS (Homebrew): brew install cmake boost llvm (llvm for OpenMP; may need extra flags if clang isn't finding OpenMP).
- Windows (vcpkg): vcpkg install boost-regex nlohmann-json cpp-httplib (OpenMP usually included with MSVC).
  - Ensure the Xpressive headers are pulled in, often via boost-headers or a full boost package if boost-regex alone is insufficient.
- Windows (Chocolatey): choco install cmake visualstudio2022buildtools boost-msvc-14.3 (or similar for your VS and Boost versions).
- nlohmann/json & cpp-httplib: These are fetched automatically by CMake if not found system-wide (e.g., if not installed via vcpkg or a system package manager). Usually, no separate manual installation is needed.
- OpenMP (Optional but Recommended): For multi-threaded CPU acceleration. Often included with the compiler. If missing, install it (e.g., libomp-dev on Debian, libgomp on Fedora, from llvm on macOS, or part of MSVC). Performance will be lower without it.
- CUDA Toolkit (Optional - For GPU Acceleration):
  - Required only if you want GPU-accelerated inference. You'll need a compatible NVIDIA GPU and drivers.
  - Installation: Download from the NVIDIA CUDA Toolkit Archive and follow NVIDIA's official installation guide. Ensure nvcc is in your PATH.
  - Alternative (Linux - Simpler, May Be Older): On Debian/Ubuntu, sudo apt install nvidia-cuda-toolkit libcublas-dev can be used after the NVIDIA drivers are set up.
  - CMake (-DHAS_CUDA=ON) will detect it; nvcc (compiler) and cublas (library) are key.
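Before configuring the build, it can help to confirm the toolchain is visible on your PATH (a quick check; the nvcc line only applies if you installed the CUDA Toolkit):

cmake --version   # should report >= 3.11
g++ --version     # or clang++/cl.exe; any C++17-capable compiler
nvcc --version    # only if building the optional CUDA backend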
To run the model, you need both the model weights and tokenizer information. These should be placed in an accessible directory (e.g., data/ or models/ within your project structure).
- SafeTensors: This format typically involves three files:
  - config.json: Contains the model architecture, hyperparameters, and other metadata.
  - tokenizer.json: Defines the vocabulary, merge rules, and other tokenizer configurations. Required for the SafeTensors format.
  - model.safetensors: The file containing the model weights.
  - Data Types: The loader supports F32, BF16, and F16 weight types from SafeTensors. BF16 and F16 are automatically converted to F32 upon loading; internal computation then proceeds in F32.
- GGUF (GPT-Generated Unified Format): This format packages the model into a single self-contained file (.gguf).
  - Tokenizer Requirements:
    - All GGUF models: The tokenizer is embedded in the GGUF file. No external tokenizer files are required.
    - Llama 2/TinyLlama models: Older GGUF files may lack explicit BPE merge rules, but the system automatically generates them from the embedded vocabulary and scores.
    - Llama 3+ models: A full tokenizer with merge rules is typically embedded.
    - You can set tokenizer_path to the same path as the model file, or omit it entirely in the Python bindings.
  - Quantizations: Supports various tensor types including FP32, FP16, BF16, and common quantized types like Q4_K_M, Q6_K, Q8_0, Q8_K, etc., as supported by the underlying GGUF parsing library.
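To make the two layouts concrete, here is what a typical models directory might contain (an illustrative sketch; file names will vary with the model you download):

# SafeTensors: a directory holding three files
ls ./models/my-safetensors-model/
#   config.json  model.safetensors  tokenizer.json

# GGUF: a single self-contained file (tokenizer embedded)
ls ./models/
#   my-model.Q4_K_M.gguf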
It's recommended to download models from reputable sources like Hugging Face. Here are some examples that have been tested:
- TinyLlama 1.1B Chat v1.0:
  - SafeTensors: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (download config.json, tokenizer.json, model.safetensors)
  - GGUF (e.g., Q8_0): TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF (download the .gguf file)
- Llama-2 7B:
- SafeTensors (HF format): meta-llama/Llama-2-7b-hf
- GGUF (e.g., Q8_0): TheBloke/Llama-2-7B-GGUF
- Meta-Llama-3 8B:
- GGUF (e.g., Q4_K_M): QuantFactory/Meta-Llama-3-8B-GGUF
- For Llama 3 SafeTensors, you would typically download from the official Meta Llama repository or a trusted conversion, ensuring you get the Llama 3 specific tokenizer.json and config.json.
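As one way to fetch a tested GGUF file, the Hugging Face CLI can download it directly (a sketch; the exact filename inside the repository may differ, so check the repo's file listing first):

# Requires: pip install huggingface_hub
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
  --local-dir ./models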
Use CMake to build the project. Navigate to the project root directory in your terminal.
# Create a build directory (if it doesn't exist)
mkdir -p build
cd build
# Configure with CMake and Build
# Option 1: Auto-detect CUDA (Recommended if CUDA is installed and intended for use)
# CMake will try to find CUDA. If found, HAS_CUDA will be ON.
cmake ..
# Option 2: Explicitly Enable CUDA (if auto-detection fails or to be certain)
# cmake .. -DHAS_CUDA=ON
# Option 3: Explicitly Disable CUDA (if CUDA is installed but you want a CPU-only build)
# cmake .. -DHAS_CUDA=OFF
# (If CUDA is not installed, HAS_CUDA will automatically be OFF)
# Build the executables
# For Makefiles (Linux/macOS default):
make -j$(nproc) # $(nproc) gets the number of processing units
# For MSVC on Windows (from a Developer Command Prompt, typically):
# cmake --build . --config Release # Or Debug
# Alternatively, open the .sln file generated in the build/ directory with Visual Studio.

This process will create executables in the build/ directory (or a subdirectory like build/Release/ on Windows with MSVC). Key executables include:
- tinyllama: The main command-line interface for direct interaction (chat/prompt modes).
- tinyllama_server: A web server for interacting with SafeTensors models via a UI (this may be merged or evolve depending on project direction).
Running the tinyllama Executable:
# Example path, adjust if your executable is in a subdirectory like build/Release
./build/tinyllama <model_path> <tokenizer_path> <num_threads> <prompt|chat> [options...]

Key Command-Line Arguments for tinyllama:
- <model_path>: Path to model file (.gguf) or directory (SafeTensors).
- <tokenizer_path>: Path to tokenizer file.
  - Required for: SafeTensors format (must point to tokenizer.json)
  - For GGUF models: Use the same path as model_path (tokenizer is embedded)
- <num_threads>: Number of CPU threads for generation.
- <prompt|chat>: prompt for single generation, chat for interactive mode.
- --system-prompt "<text>" (Optional): System-level instruction.
- initial_user_prompt (Optional): First user message or main prompt text.
- --max-tokens <N> (Optional): Max new tokens to generate (Default: 256).
- --n-gpu-layers <N> (Optional): Layers to offload to GPU (-1 for all, 0 for none. Default: -1).
- --use-mmap <true|false> (Optional): Memory-map GGUF files (Default: true).
- --temperature <F> (Optional): Sampling temperature (e.g., 0.1. Default: 0.1).
- --top-k <N> (Optional): Top-K sampling parameter (0 to disable. Default: 40).
- --top-p <F> (Optional): Top-P/nucleus sampling parameter (0.0-1.0. Default: 0.9).
- --use-kv-quant <true|false> (Optional): Use INT8 KVCache on GPU (Default: false).
- --use-batch-generation <true|false> (Optional): Enable single-token batch generation (Default: false).
- --max-batch-size <N> (Optional): Maximum number of sequences for multi-prompt batch processing (Default: 1).
Note on Sampling Parameters: The tinyllama executable supports --temperature, --top-k, and --top-p via command line for full control over text generation sampling.
Example Invocation:
# For GGUF models (tokenizer embedded in file)
./build/tinyllama ./models/model.Q4_K_M.gguf ./models/model.Q4_K_M.gguf 4 chat --system-prompt "You are a helpful AI." --n-gpu-layers -1 --use-kv-quant true --temperature 0.7 --top-k 50 --top-p 0.95 "Tell me a joke."
# For SafeTensors format (separate tokenizer required)
./build/tinyllama ./models/safetensors_directory ./models/tokenizer.json 4 chat --system-prompt "You are a helpful AI." --n-gpu-layers -1 --use-kv-quant true --temperature 0.7 --top-k 50 --top-p 0.95 "Tell me a joke."

For detailed operational logs, inspect debugging.log in the application's runtime directory.
Development Installation (CPU-only):
git clone https://github.com/JohnnyTeutonic/tinyllama.cpp.git
cd tinyllama.cpp
# Install from the project directory
pip install .

Development Installation with CUDA Support:
git clone https://github.com/JohnnyTeutonic/tinyllama.cpp.git
cd tinyllama.cpp
# Set environment variable to enable CUDA build
export TINYLLAMA_CPP_BUILD_CUDA=1 # Linux/macOS
# or
set TINYLLAMA_CPP_BUILD_CUDA=1 # Windows CMD
# or
$env:TINYLLAMA_CPP_BUILD_CUDA=1 # Windows PowerShell
# Install from the project directory
pip install .

Development Installation with PyTorch Dependencies:
git clone https://github.com/JohnnyTeutonic/tinyllama.cpp.git
cd tinyllama.cpp
# Install from the project directory with PyTorch extras
pip install .[torch]

Editable Development Installation:
git clone https://github.com/JohnnyTeutonic/tinyllama.cpp.git
cd tinyllama.cpp
# Editable install from the project directory (CPU-only)
pip install -e .
# For CUDA support with editable install:
export TINYLLAMA_CPP_BUILD_CUDA=1 # Linux/macOS
set TINYLLAMA_CPP_BUILD_CUDA=1 # Windows
pip install -e .

Prerequisites for CUDA Build:
- NVIDIA CUDA Toolkit (11.0 or later) installed and in PATH
- Compatible NVIDIA GPU drivers
- CMake 3.18 or later
- C++17 compatible compiler
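After installation, a quick import check confirms the module built correctly (a sketch; assumes the package exposes __version__ via its generated _version.py):

python -c "import tinyllama_cpp; print(tinyllama_cpp.__version__)"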
Usage:
import tinyllama_cpp
# For SafeTensors format (tokenizer_path required)
session = tinyllama_cpp.TinyLlamaSession(
model_path="path/to/safetensors/directory",
tokenizer_path="path/to/tokenizer.json",
threads=4,
n_gpu_layers=-1 # Use GPU if available
)
# For GGUF models (tokenizer embedded, use same path for both)
session = tinyllama_cpp.TinyLlamaSession(
model_path="path/to/model.gguf",
tokenizer_path="path/to/model.gguf", # Same as model_path
threads=4,
n_gpu_layers=-1
)
response = session.generate("What is AI?", steps=64)
print(response)

For ease of use, comprehensive scripts are provided in the project root to automate common development and project tasks. These scripts simplify building, cleaning, running the applications, formatting code, generating documentation, and packaging releases.
First, make the script executable:
chmod +x manage.sh

Key Command Options (refer to ./manage.sh help for all options):
- ./manage.sh build [--build-type <Release|Debug>] [--cuda <ON|OFF>]
- ./manage.sh run-server [--model-dir <path>] [--tokenizer <path>] [--threads <num>] [--host <hostname>] [--port <num>] [--n-gpu-layers <num>] [--mmap <true|false>] [--no-log]
- ./manage.sh run-chat [--model-dir <path>] [--tokenizer <path>] [--threads <num>] [--system-prompt <text>] [--prompt <text>] [--steps <num>] [--n-gpu-layers <num>] [--mmap <true|false>]
  - (Note: run-chat-specific sampling parameters like temperature, top-k, and top-p are set to defaults in the C++ main.)
- ./manage.sh run-prompt [--model-dir <path>] [--tokenizer <path>] [--prompt <text>] [--steps <num>] [--threads <num>] [--n-gpu-layers <num>] [--mmap <true|false>] [--temperature <num>]
  - This command runs the model with a single provided prompt and then exits.
  - If --model-dir is not provided, you can specify the model directory/GGUF file path as a single positional argument after run-prompt.
  - Example: ./manage.sh run-prompt path/to/your/model --prompt "Translate to French: Hello"
- ./manage.sh install [--gpu|--cpu]
  - Installs the Python package in editable mode with optional GPU support.
  - --cpu (default): CPU-only installation
  - --gpu: Installation with CUDA support (requires CUDA toolkit)
  - Example: ./manage.sh install --gpu for GPU support or ./manage.sh install --cpu for CPU-only
It is recommended to use this script for most routine operations. For detailed options for each command, please run ./manage.sh help.
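Putting a few of these commands together, a typical CPU-only workflow might look like this (illustrative paths; with a GGUF file, the same path serves as both model and tokenizer):

./manage.sh build --build-type Release --cuda OFF
./manage.sh run-chat --model-dir ./models/model.Q4_K_M.gguf --tokenizer ./models/model.Q4_K_M.gguf --threads 4 --system-prompt "You are a helpful assistant."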
This script provides equivalent functionality to manage.sh for Windows users.
Running the script (example):
.\manage.ps1 build -BuildType Debug -Cuda OFF
.\manage.ps1 run-chat -ModelDir .\models\my_model.gguf -TokenizerPath .\models\tokenizer.json -Threads 2 -SystemPrompt "You are a helpful assistant."

Key Command Options (refer to .\manage.ps1 help for all options):
- .\manage.ps1 build [-BuildType <Release|Debug>] [-Cuda <ON|OFF>]
- .\manage.ps1 run-server [-ModelDir <path>] [-TokenizerPath <path>] [-Threads <num>] [-Host <hostname>] [-Port <num>] [-NGpuLayers <num>] [-Mmap <$true|$false>] [-NoLog]
- .\manage.ps1 run-chat [-ModelDir <path>] [-TokenizerPath <path>] [-Threads <num>] [-SystemPrompt <text>] [-Prompt <text>] [-Steps <num>] [-NGpuLayers <num>] [-Mmap <$true|$false>]
  - (Note: run-chat-specific sampling parameters like temperature, top-k, and top-p are set to defaults in the C++ main.)
- .\manage.ps1 run-prompt [-ModelDir <path>] [-TokenizerPath <path>] [-Prompt <text>] [-Steps <num>] [-Threads <num>] [-NGpuLayers <num>] [-Mmap <$true|$false>] [-Temperature <num>]
  - This command runs the model with a single provided prompt and then exits.
  - If -ModelDir is not provided, you can specify the model directory/GGUF file path as a single positional argument after run-prompt.
  - Example: .\manage.ps1 run-prompt -ModelDir path\to\your\model -Prompt "What is the capital of France?"
- .\manage.ps1 install [-Gpu|-Cpu]
  - Installs the Python package in editable mode with optional GPU support.
  - -Cpu (default): CPU-only installation
  - -Gpu: Installation with CUDA support (requires CUDA toolkit)
  - Example: .\manage.ps1 install -Gpu for GPU support or .\manage.ps1 install -Cpu for CPU-only
For detailed options for each command, run .\manage.ps1 help.
The main way to use this project is via the web server:
# Navigate back to the project root or ensure paths are correct
# Run the server, pointing it to your model data directory
./build/tinyllama_server ./data
# Example on Windows Release build:
# ./build/Release/tinyllama_server.exe ./data

- Replace ./data with the actual path to the directory containing your config.json, tokenizer.json, and model.safetensors.
- The server will start, load the model, and listen on http://localhost:8080 by default.
- Open your web browser and navigate to http://localhost:8080.
- You should see a basic chat interface where you can interact with the model.
For users interested in a Python-based reference or for direct PyTorch inference with SafeTensors models (compatible with Llama 2 / TinyLlama architecture), a dedicated implementation is available in the pytorch/ directory.
This directory contains:
- run_inference.py: The main script to execute inference using PyTorch.
- tinyllama.py: Contains the PyTorch model definition (e.g., for TinyLlama).
- utils.py: Utility helper functions.
- requirements.txt: Lists the necessary Python packages to run the PyTorch inference scripts. Install these using pip install -r pytorch/requirements.txt.
- README.md: A dedicated README within the pytorch/ directory provides more specific instructions on how to set up and run the PyTorch-based inference.
This can be useful for:
- Verifying model outputs against a pure PyTorch implementation.
- Experimenting with the model in a Python environment before using the C++ application.
- Users who prefer or require a PyTorch-based workflow for SafeTensors models.
Please refer to the pytorch/README.md for detailed usage instructions for this PyTorch implementation.
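A minimal starting point (a sketch; pytorch/README.md documents the actual arguments run_inference.py expects):

# Install the PyTorch-specific dependencies
pip install -r pytorch/requirements.txt
# Run the reference implementation; see pytorch/README.md for the required arguments
python pytorch/run_inference.py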
- CMakeLists.txt: Main build configuration defining dependencies, targets, and compilation options.
- pyproject.toml: Modern Python packaging configuration with optional dependencies for GPU and PyTorch support.
- manage.sh: Comprehensive management script for Linux/macOS (build, clean, run, format, docs, etc.).
- manage.ps1: Windows PowerShell equivalent of the management script.
- .clang-format: Code formatting configuration for consistent C++ style.
- Doxyfile: Doxygen configuration for generating API documentation.
- README.md: This comprehensive documentation file.
- main.cpp: Command-line interface entry point for the tinyllama executable.
- server.cpp: HTTP server implementation for web UI interaction (tinyllama_server executable).
- api.cpp / api.h: High-level TinyLlamaSession API for model loading and text generation.
- bindings.cpp: Python bindings using pybind11 with comprehensive documentation for help() support.
- model.cpp / model.h: Core Transformer architecture (attention, feed-forward, RoPE, etc.) with SIMD optimizations.
- model_constants.h: Architecture constants and configuration parameters.
- model_macros.h: Utility macros for cross-platform compatibility and safe operations.
- gguf_structs.h: Data structures and type definitions for GGUF format parsing.
- ggml_types.h: Type definitions compatible with GGML format specifications.
- tokenizer.cpp / tokenizer.h: BPE tokenization, chat template application, and multi-format tokenizer support.
- safetensors_loader.cpp / safetensors_loader.h: SafeTensors format parsing and tensor loading.
- gguf_parser.cpp / gguf_parser.h: GGUF format parsing with support for various quantizations.
- quantization.cpp / quantization.h: Quantization utilities and dequantization routines.
- utils.cpp / utils.h: General utility functions for string processing, file operations, and helper routines.
- cuda_kernels.cu / cuda_kernels.h: CUDA kernels for GPU-accelerated inference.
- logger.cpp / logger.h: Logging utilities with GPU memory monitoring.
- tinyllama_cpp/: Python package directory
  - __init__.py: Package initialization with dynamic versioning and error handling.
  - _version.py: Auto-generated version file (created during build).
- pytorch/: Pure PyTorch implementation for comparison and experimentation
  - run_inference.py: Main PyTorch inference script.
  - tinyllama.py: PyTorch model definition.
  - utils.py: Utility functions for the PyTorch implementation.
  - requirements.txt: PyTorch-specific dependencies.
  - README.md: PyTorch implementation documentation.
- examples/: Example scripts and usage demonstrations
- www/: Web interface assets for the HTTP server
- docs/: Generated documentation and additional guides
- build/: CMake build directory (created during compilation)
  - Contains compiled executables: tinyllama, tinyllama_server
- _skbuild/: Python build artifacts (created during pip install)
- debugging.log: Runtime debugging output (created during execution)
- Model Loading: Supports both GGUF (single file) and SafeTensors (multi-file) formats
- Tokenization: Handles Llama/Llama2 SentencePiece and Llama3 TikToken tokenizers
- Inference: CPU with OpenMP + optional SIMD, GPU with CUDA acceleration
- Python Bindings: Full-featured with comprehensive help documentation
- Batch Processing: Efficient parallel processing of multiple prompts
- Memory Management: KV cache with optional INT8 quantization, memory mapping support
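Several of these features are driven by the command-line flags described earlier; an illustrative invocation combining them (assuming a GGUF model at ./models/model.Q4_K_M.gguf) might be:

./build/tinyllama ./models/model.Q4_K_M.gguf ./models/model.Q4_K_M.gguf 8 prompt --max-tokens 128 --use-kv-quant true --use-batch-generation true "Summarize the key features of this project."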