Add OpenVINO backend #15307
Conversation
Hello, in this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported. Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described here? A few other questions:
Thank you for your work!
Hi @SearchSavior,
Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation-mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all the GGML operators are mapped/translated to OpenVINO).
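To make this concrete, here is a minimal illustrative sketch, not the actual code in this PR, of how a dispatch table could map GGML ops to builders of equivalent OpenVINO nodes; the names and structure below are hypothetical.

```cpp
// Illustrative sketch only: a translation table from GGML ops to functions that
// build equivalent OpenVINO nodes. The real frontend in ggml/src/ggml-openvino
// is more elaborate; names and structure here are hypothetical.
#include <functional>
#include <map>
#include <memory>

#include <openvino/op/ops.hpp>  // core OpenVINO operation classes
#include "ggml.h"               // ggml_op enum

using NodeBuilder = std::function<ov::Output<ov::Node>(
        const ov::Output<ov::Node> &, const ov::Output<ov::Node> &)>;

static const std::map<ggml_op, NodeBuilder> k_op_translators = {
    {GGML_OP_ADD, [](const ov::Output<ov::Node> & a, const ov::Output<ov::Node> & b) {
         return std::make_shared<ov::op::v1::Add>(a, b)->output(0);
     }},
    {GGML_OP_MUL, [](const ov::Output<ov::Node> & a, const ov::Output<ov::Node> & b) {
         return std::make_shared<ov::op::v1::Multiply>(a, b)->output(0);
     }},
    {GGML_OP_MUL_MAT, [](const ov::Output<ov::Node> & a, const ov::Output<ov::Node> & b) {
         // ggml_mul_mat computes src1 times the transpose of src0.
         return std::make_shared<ov::op::v0::MatMul>(b, a, /*transpose_a=*/false,
                                                     /*transpose_b=*/true)->output(0);
     }},
};
```

With a table like this, the frontend can walk a GGML compute graph node by node, so coverage grows per operator rather than per model architecture.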
The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.
We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
Hey @ravi9, thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.
I can't wait for OpenVINO support to get upstreamed.
Overview
This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, with performance improvements via OpenVINO's graph compilation and kernel fusion.
Key Features:
New backend implementation in ggml/src/ggml-openvino.
Supported precisions
Supported devices: Intel CPU, GPU, and NPU (a device enumeration sketch follows below).
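As a quick sanity check that the backend is available at runtime, here is a small sketch using the public ggml-backend device registry; it simply lists whatever devices are registered and does not assume specific OpenVINO device names.

```cpp
// Sketch: enumerate the ggml backend devices registered at runtime. With the
// OpenVINO backend compiled in, its devices should appear alongside the CPU.
#include <cstdio>

#include "ggml-backend.h"

int main() {
    const size_t n_devices = ggml_backend_dev_count();
    for (size_t i = 0; i < n_devices; ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s - %s\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }
    return 0;
}
```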
Tested Models
The following models are validated for functionality; accuracy and performance evaluation is a work in progress.
Llama-3.2-1B-Instruct-GGUF
Llama-3.1-8B-Instruct
microsoft/Phi-3-mini-4k-instruct-gguf
Qwen/Qwen2.5-1.5B-Instruct-GGUF
Qwen/Qwen3-8B
openbmb/MiniCPM-1B-sft-bf16
tencent/Hunyuan-7B-Instruct
mistralai/Mistral-7B-Instruct-v0.3
google/gemma-3-4b-it (without multimodal)
Work in Progress
Notes on quantization support
NOTE: Optimum-intel converts the fp16/bf16 token-embedding tensor and the weight tensor of the last matmul to channel-wise asymmetric int8 (config code).
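For reference, a generic channel-wise asymmetric int8 scheme (the exact recipe optimum-intel applies may differ) quantizes each channel $c$ of a weight tensor $x$ as

$$
s_c = \frac{\max_c x - \min_c x}{255}, \qquad
z_c = \operatorname{round}\!\left(\frac{-\min_c x}{s_c}\right), \qquad
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s_c}\right) + z_c,\ 0,\ 255\right)
$$

with dequantization $\hat{x} = s_c\,(q - z_c)$; the per-channel zero point $z_c$ is what makes the scheme asymmetric.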
CPU
GPU
NPU