cturan/makarna

Experimental Project (Archived)

This is an experimental project and is no longer maintained.

The implementation and algorithms are heavily inspired by llama.cpp and vLLM.

Makarna Engine

High-performance LLM inference engine in Go, optimized with SIMD (AVX2/AVX512).

Installation

Build with Makefile:

make build

This produces binaries in bin/: makarna, quantize, convert.

Build with CUDA:

make build-cuda

Produces bin/makarna-cuda.

Alternatively, use Go install:

go install ./cmd/...

Commands

convert

Convert HuggingFace models (.safetensors) to .mak format.

convert <hf_dir> <output.mak> [flags]

Flags:

  • --quant <type> Options: q2_k, q3_k, q4_k, q5_k, q6_k, q8_k.
  • --mix Enable smart mix quantization.
  • --workers <n> Number of parallel workers.
  • --max-inflight-mb <n> Memory limit during conversion.

quantize

Quantize an existing .mak file to a K-quant format.

quantize <input.mak> <output.mak> <type> [flags]

Flags:

  • --mix Enable smart mix mode.

run-model

Inference CLI.

run-model -model <file.mak> -prompt "text" [flags]

Common Flags:

  • -steps <n> Max tokens (default 10).
  • -temp <f> Temperature (default 0.7).
  • -top-k <n> Top-K (default 40).
  • -top-p <f> Top-P (default 0.9).
  • -rep-penalty <f> Repetition penalty (default 1.1).
  • -chat Use chat formatting.
  • -threads <n> CPU threads (-1 = 90% of cores).
  • -n-gpu-layers <n> Layers to offload to GPU (-1=auto).
  • -gpu-budget <f> GPU memory fraction (0.0-1.0).
  • -mmap Use mmap for weights.
  • -profile-log <val> Profile output (true, report, or …).
  • -listen <addr> Start an OpenAI-compatible server on the given address.
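Taken together, -temp, -top-k, and -top-p implement standard LLM sampling. As a rough illustration of how temperature scaling and top-k filtering compose (this is not the engine's actual sampler; all names are made up, and top-p and repetition penalty are omitted):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sampleProbs applies temperature scaling and top-k filtering to raw
// logits and returns a normalized probability distribution over the
// surviving token indices. Illustrative only.
func sampleProbs(logits []float64, temp float64, topK int) map[int]float64 {
	type scored struct {
		idx   int
		logit float64
	}
	s := make([]scored, len(logits))
	for i, l := range logits {
		s[i] = scored{i, l / temp} // temperature scaling
	}
	sort.Slice(s, func(a, b int) bool { return s[a].logit > s[b].logit })
	if topK < len(s) {
		s = s[:topK] // keep only the k most likely tokens
	}
	// softmax over the survivors
	var sum float64
	for _, t := range s {
		sum += math.Exp(t.logit)
	}
	probs := make(map[int]float64, len(s))
	for _, t := range s {
		probs[t.idx] = math.Exp(t.logit) / sum
	}
	return probs
}

func main() {
	// With -temp 0.7 -top-k 2, only the two strongest logits survive.
	probs := sampleProbs([]float64{2.0, 1.0, 0.5, -1.0}, 0.7, 2)
	fmt.Printf("%d tokens survive; p(token 0) = %.3f\n", len(probs), probs[0])
}
```

Lower -temp sharpens the distribution before the cutoff, which is why the two flags interact.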

openai

Dedicated OpenAI-compatible API server.

openai -model <file.mak> [flags]

Flags:

  • -listen <addr> Default is :8080.
  • -max-seq-len <n> Max context length.
  • -n-gpu-layers <n> Number of GPU layers.

Quantization Types

MAK v2 supports K-quants (block size 256):

  • q8_k: 8-bit.
  • q6_k: 6-bit.
  • q5_k: 5-bit.
  • q4_k: 4-bit (recommended).
  • q3_k: 3-bit.
  • q2_k: 2-bit.
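The general shape of a 256-value quantization block can be sketched as follows. This is only an illustration of block quantization with one shared scale, assuming a plain symmetric 8-bit scheme; the actual .mak q8_k layout (sub-block scales, mins, packing) may differ:

```go
package main

import (
	"fmt"
	"math"
)

const blockSize = 256 // K-quants operate on blocks of 256 values

// q8Block holds one quantized block: a single float32 scale plus 256
// signed 8-bit values. Illustrative layout, not the on-disk format.
type q8Block struct {
	scale float32
	data  [blockSize]int8
}

// quantizeQ8 maps 256 floats to int8 using one shared scale derived
// from the block's largest absolute value.
func quantizeQ8(src [blockSize]float32) q8Block {
	var amax float32
	for _, v := range src {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	b := q8Block{scale: amax / 127}
	if b.scale == 0 {
		return b // all-zero block
	}
	for i, v := range src {
		b.data[i] = int8(math.Round(float64(v / b.scale)))
	}
	return b
}

// dequantize reverses the mapping (lossily).
func (b q8Block) dequantize(i int) float32 {
	return float32(b.data[i]) * b.scale
}

func main() {
	var src [blockSize]float32
	for i := range src {
		src[i] = float32(i) / 100 // 0.00 .. 2.55
	}
	q := quantizeQ8(src)
	fmt.Printf("scale=%.5f round-trip err=%.5f\n",
		q.scale, src[100]-q.dequantize(100))
}
```

Fewer bits per value (q6_k down to q2_k) shrink the file further at the cost of larger round-trip error, which is the trade-off behind the "recommended" q4_k default.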

Examples

Convert and quantize:

convert /models/Qwen3-1.7B-Instruct model-q4k.mak --quant q4_k --mix

Run inference:

run-model -model model-q4k.mak -prompt "Explain quantum physics" -steps 100

Start API server:

run-model -model model-q4k.mak -listen :8080 -chat
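Since the server is OpenAI-compatible, any OpenAI-style client should work against it. A minimal Go client sketch (the /v1/chat/completions path follows the OpenAI API convention; the model name and port here are assumptions, not confirmed from this repo):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage and chatRequest mirror the minimal OpenAI
// chat-completions request payload.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// buildChatRequest marshals a single-turn chat request body.
func buildChatRequest(model, prompt string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
}

func main() {
	body, err := buildChatRequest("model-q4k", "Explain quantum physics")
	if err != nil {
		panic(err)
	}
	// POST to the server started above; fails harmlessly if it isn't running.
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("server not reachable:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```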

Development

Tests:

go test ./...
go test -tags cuda ./... # Requires GPU

Benchmarks:

go test -bench=. ./pkg/tensor/...
