Tested configs, launchers, and benchmark helpers for running Qwen3.5 GGUF models on a single 16GB NVIDIA GPU with llama.cpp.
This repo is aimed at people who want a fast local Qwen setup without reverse-engineering a pile of Discord messages, Reddit comments, and half-working launch flags.
It stays strict about evidence, but it should still feel useful at a glance: what works, what was measured, what ships in the repo, and what still needs more data.
| What you get | Current repo stance |
|---|---|
| 35B coding preset that fits a 16GB card | Verified on the RTX 5080 test machine documented in this repo |
| Vision-capable local setup | Implemented and usable, but image throughput claims are kept narrower than text benchmarks |
| One-command launchers and API helper | Shipped in this repo |
| Benchmark numbers | Backed by checked-in JSON artifacts |
| Cross-GPU claims | Treated as estimates until reproduced elsewhere |
- Why this repo exists
- Recommended presets
- Reference measurements
- Quick start
- Terminal chat and API helper
- Context guidance
- The 35B preset is the main attraction: a practical local coding model that still feels fast on a single 16GB GPU.
- The repo keeps the launchers, config, helper scripts, and benchmark artifacts in one place so people can reproduce the setup instead of copy-pasting random command lines.
- The goal is not to sound maximalist. The goal is to help someone get from zero to a working, measured setup quickly.
- Primary test machine: RTX 5080 16GB, Windows 11, `llama.cpp` b8196+.
- Primary workflow: one server at a time.
- Primary 35B preset: `Qwen3.5-35B-A3B-Q3_K_S.gguf` with `mmproj` loaded and `--parallel 1`.
- Shipped 35B context in this repo: `262144` tokens (256K) for maximum local context.
- Recommended day-to-day 35B operating point: 64K to 120K for better headroom and speed.
- The 35B `Q3_K_S` preset can keep all layers on GPU on the tested RTX 5080 16GB machine.
- `--parallel 1` is required for the 35B preset to avoid a major throughput drop on that setup.
- Text generation stays fast with `mmproj` loaded. The checked-in 35B headline artifacts are text-prompt benchmarks against a server that had the vision projector enabled.
- The repo includes working image-input tooling through the OpenAI-compatible `image_url` request format (see the sketch below).
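The shipped entry points for image input are `chat.py` (`/img`) and `tests/vision_test.py`. As a rough illustration of the request format itself, the sketch below posts a local image to the `fast_vision` server; the port and image path are assumptions, and the payload shape is the generic OpenAI-compatible one rather than an API specific to this repo.

```python
# Minimal sketch: send a local image to a vision-capable llama.cpp server via
# the OpenAI-compatible /v1/chat/completions endpoint using an image_url part.
# Assumes the fast_vision preset is listening on port 8003; adjust as needed.
import base64
import json
import urllib.request

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "local",  # llama-server hosts a single model; the name is not critical
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
}

req = urllib.request.Request(
    "http://127.0.0.1:8003/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])
```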
This repo does not claim:

- Exact speeds on every 16GB NVIDIA card.
- A universal `155,904`-token cliff on all GPUs and operating systems.
- Full multimodal throughput parity between text-only and image requests.
- Direct PDF or video pipelines. In practice those need preprocessing to images first.
| Key | Port | Model | Default Context | Estimated VRAM | Use |
|---|---|---|---|---|---|
| `coding` | 8002 | `Qwen3.5-35B-A3B-Q3_K_S.gguf` | 256K | 15.7 GB | Maximum local context, edge-fit on 16 GB |
| `fast_vision` | 8003 | `Qwen3.5-9B-UD-Q4_K_XL.gguf` | 256K | 10.6 GB | Fast image input and lighter chat |
| `quality_vision` | 8004 | `Qwen3.5-27B-Q3_K_S.gguf` | 96K | 14.5 GB | Higher quality output, slower generation |
Best measured presets from the March 7-8 benchmark sweep:
- 35B strongest overall: `Q3_K_S + iq4_nl`
- 27B best general preset: `IQ4_XS + iq4_nl + 32K`
- 9B best practical preset: `UD-Q4_K_XL + q8_0 + 256K`

The canonical settings live in `config/servers.yaml`.
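As a rough illustration of how those settings can be consumed programmatically, the sketch below lists the presets from `config/servers.yaml`. The top-level `servers` key and the per-preset field names (`port`, `model`, `ctx_size`) are assumptions about the file layout, not a documented schema; check the file itself before relying on them.

```python
# Sketch only: print the presets defined in config/servers.yaml.
# The "servers" key and the field names below are assumptions about the layout.
import yaml  # pip install pyyaml

with open("config/servers.yaml") as f:
    config = yaml.safe_load(f)

for name, server in config.get("servers", config).items():
    if not isinstance(server, dict):
        continue  # skip top-level scalars if the file uses a flat layout
    print(f"{name}: port={server.get('port')}, "
          f"model={server.get('model')}, ctx={server.get('ctx_size')}")
```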
Checked-in artifacts for the 35B preset:
- `results/benchmark_35b_128k_vis_final_20260304.json`
  - Text prompts, `mmproj` loaded, `131072` context, `119.7` avg gen t/s, `523.9` avg prompt t/s
- `results/benchmark_35b_152k_vis_final_20260304.json`
  - Text prompts, `mmproj` loaded, `155904` context, `124.7` avg gen t/s, `538.4` avg prompt t/s
Important caveat: those files measure text generation on a server with vision support enabled. They do not prove that image requests have identical throughput.
Additional notes and historical summaries live in `docs/`.
If a narrative doc and a checked-in JSON file disagree, prefer the raw JSON artifact.
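When comparing runs, it helps to pull the headline numbers straight from the artifacts rather than quoting a summary doc. The sketch below does that for the 35B files; the field names (`avg_gen_tps`, `avg_prompt_tps`) are guesses about the JSON structure, so open one of the files to confirm the real keys first.

```python
# Sketch: print headline throughput from the checked-in 35B benchmark artifacts.
# The field names are assumptions; inspect the JSON to confirm the real keys.
import json
from pathlib import Path

for path in sorted(Path("results").glob("benchmark_35b_*_vis_final_*.json")):
    data = json.loads(path.read_text())
    print(f"{path.name}: "
          f"gen={data.get('avg_gen_tps', 'n/a')} t/s, "
          f"prompt={data.get('avg_prompt_tps', 'n/a')} t/s")
```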
Place a CUDA build in ./llama-bin/, or build a native SM120 binary if you are on RTX 5080 or 5090.
- Build guide: docs/RTX5080-NATIVE-BUILD.md
- Release downloads: ggml-org/llama.cpp releases
The repo keeps separate local projector filenames for 35B, 27B, and 9B so the shared ./models/unsloth-gguf/ folder does not overwrite one model family's mmproj with another. Upstream Unsloth repos currently publish these projector files as mmproj-F16.gguf, so download them and rename them locally as shown below:
```bash
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-Q3_K_S.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-35B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  Qwen3.5-9B-UD-Q4_K_XL.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-9B-F16.gguf

huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  Qwen3.5-27B-Q3_K_S.gguf \
  mmproj-F16.gguf \
  --local-dir ./models/unsloth-gguf/
mv ./models/unsloth-gguf/mmproj-F16.gguf ./models/unsloth-gguf/mmproj-27B-F16.gguf
```

On Windows PowerShell, use `Rename-Item` instead of `mv`, or use `scripts/windows/download_model.ps1` to download and rename the files automatically.
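If you would rather script the download-and-rename step, here is a rough cross-platform sketch using the `huggingface_hub` Python API (the library behind `huggingface-cli`), shown for the 35B files only. It assumes `huggingface_hub` is installed and is not the workflow implemented by `scripts/windows/download_model.ps1`.

```python
# Sketch: download the 35B model plus projector and rename the projector,
# mirroring the CLI commands above. Assumes `pip install huggingface_hub`.
from pathlib import Path
from huggingface_hub import hf_hub_download

local_dir = Path("./models/unsloth-gguf")

for filename in ["Qwen3.5-35B-A3B-Q3_K_S.gguf", "mmproj-F16.gguf"]:
    hf_hub_download(
        repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",
        filename=filename,
        local_dir=local_dir,
    )

# Give the projector a family-specific name so other downloads cannot clobber it.
(local_dir / "mmproj-F16.gguf").rename(local_dir / "mmproj-35B-F16.gguf")
```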
Windows launcher:

```bat
start_servers_speed.bat coding
start_servers_speed.bat vision
start_servers_speed.bat quality
```

Cross-platform Python manager:

```bash
python server_manager.py start --server coding
python server_manager.py start --server fast_vision
python server_manager.py start --server quality_vision
python server_manager.py stop
```

Linux and macOS shell scripts:

```bash
./scripts/start_servers.sh coding
./scripts/stop_servers.sh
```

Health check:

```bash
curl http://127.0.0.1:8002/health
curl http://127.0.0.1:8002/v1/models
```

Benchmark and validation scripts:

```bash
python tests/simple_benchmark.py 8002
python tests/health_check.py
python tests/compare_models.py
python tests/vision_test.py path/to/image.png
```

Note: `vision_test.py` sends actual image requests; `simple_benchmark.py` does not.
Terminal chat:

```bash
python chat.py
python chat.py --port 8003
```

Useful in-chat commands: `/img <path> [question]`, `/speed`, `/clear`, `/quit`.
Python helper:

```python
from qwen_api import api_35b, api_9b_vision, SamplingMode

response = api_35b.chat(
    prompt="Write a Python function to reverse a list.",
    mode=SamplingMode.THINKING_CODING,
)

vision = api_9b_vision.vision(
    prompt="Describe this image.",
    image_path="example.png",
)
```
- The shipped `coding` preset is now 256K for maximum local context.
- The better day-to-day 35B operating point is still 64K to 120K when speed and headroom matter more than maximum context (see the launch sketch after this list).
- The `155,904` figure is a measured reference point on the tested RTX 5080 machine, not a promise for every other GPU.
- The explanation in `DISCOVERY.md` is an informed hypothesis based on observed buffers and timings, not a proven `llama.cpp` root-cause analysis.
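For reference, a day-to-day launch at that reduced operating point might look like the sketch below, which starts `llama-server` directly instead of going through the shipped launchers. The flag values are illustrative only; the canonical preset lives in `config/servers.yaml`, and on Windows the binary inside `./llama-bin/` is `llama-server.exe`.

```python
# Sketch: launch the 35B preset at roughly 96K context instead of the shipped
# 256K maximum. Paths follow the layout described in this README; the exact
# flag values are a starting point, not the canonical preset.
import subprocess

cmd = [
    "./llama-bin/llama-server",
    "-m", "./models/unsloth-gguf/Qwen3.5-35B-A3B-Q3_K_S.gguf",
    "--mmproj", "./models/unsloth-gguf/mmproj-35B-F16.gguf",
    "-c", "98304",        # ~96K tokens, inside the recommended 64K-120K band
    "-ngl", "99",         # keep all layers on the GPU
    "--parallel", "1",    # avoids the throughput drop seen on the test machine
    "--port", "8002",
]
subprocess.run(cmd, check=True)  # blocks for as long as the server runs
```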
```
config/                   Canonical server settings
docs/                     Technical notes and analysis
results/                  Checked-in benchmark artifacts
tests/                    Benchmark and validation scripts
chat.py                   Terminal client with image support
qwen_api.py               Minimal Python API helper
server_manager.py         Cross-platform launcher and process manager
start_servers_speed.bat   Windows single-server launcher
scripts/windows/          Legacy Windows helpers, demos, and extra benchmark scripts
```
If you update numbers or claims:
- Keep launchers aligned with config/servers.yaml.
- Commit raw JSON results with any new benchmark summary.
- Separate measured facts from extrapolation.
- Avoid claiming support for workflows that are not implemented in the repo.
This repo now uses a two-worktree local workflow:

- use one dev worktree on `personal/dev` for everyday development
- use a separate release worktree on `main` for clean review and pushes
- keep the two worktrees as sibling folders on the same machine
- only reviewed commits get cherry-picked into the release worktree
- only the release worktree is allowed to push to GitHub
Windows helper scripts:

```powershell
scripts/windows/setup-worktrees.ps1
scripts/windows/promote-to-release.ps1 <commit-sha>
scripts/windows/check-release.ps1
scripts/windows/push-release.ps1
```

Recommended flow:

- Work and commit in the dev worktree on `personal/dev`
- Promote selected commits into the release worktree
- Run the release checks there
- Push only from the release worktree