
Conversation


@wine99 wine99 commented Aug 14, 2025

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, with performance improvements from OpenVINO's graph compilation and kernel fusion.

Key Features:

  • New backend implementation

    • Added OpenVINO backend in ggml/src/ggml-openvino.
    • Implemented translations for core GGML operations
  • Supported precisions

    • FP16 and BF16 GGUF models are supported.
    • Q4_0, Q4_1, Q4_K_M, and Q6_K models are partially supported (see the notes below).
  • Supported devices

    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (require driver version UD32 or newer)

Tested Models

The following models have been validated for functionality. Accuracy and performance validation is still in progress.

Work in Progress

  • Performance and memory optimizations
  • Broader quantization coverage
  • Support for additional model architectures
  • Extensive accuracy testing

Notes on quantization support

  • Both Q4_0 and Q4_1 models use Q6_K for the token_embedding tensor and for the weight tensor of the last matmul (in most models this is the same tensor as the token embedding); see the dequantization sketch after this list for what these block formats look like.
  • Q4_0 models will contain some Q4_1 tensors if an imatrix is provided when the model is quantized with the llama-quantize utility.
  • Q4_K_M models additionally contain Q6_K tensors, as well as Q5_K tensors (the latter only in Phi3 among the models validated in this PR).
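For reference, Q4_0 and Q4_1 are block-quantized formats: 32 weights share one scale (Q4_1 additionally stores a block minimum), and each weight is a 4-bit integer. Below is a minimal dequantization sketch of the Q4_0 layout, for illustration only; the struct is simplified (ggml stores the scale as FP16, a plain float is used here to keep the sketch short) and this is not the backend's actual code.

```cpp
#include <cstdint>

// Illustrative Q4_0 block: 32 weights, one shared scale, 4-bit values packed
// two per byte with an implicit offset of 8.
struct q4_0_block {
    float   d;        // ggml stores this as fp16
    uint8_t qs[16];   // 32 x 4-bit values
};

// Dequantize one block into 32 floats: x = d * (q - 8).
void dequantize_q4_0_block(const q4_0_block & b, float * out) {
    for (int j = 0; j < 16; ++j) {
        out[j]      = b.d * ((b.qs[j] & 0x0F) - 8);  // low nibble  -> element j
        out[j + 16] = b.d * ((b.qs[j] >>   4) - 8);  // high nibble -> element j + 16
    }
}
// Q4_1 follows the same idea but stores a scale d and a minimum m per block and
// reconstructs x = d * q + m; Q6_K and Q5_K use 6/5-bit values with per-group
// ("gs") scales, as described in the CPU/GPU/NPU notes below.
```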

NOTE: Optimum-Intel converts the FP16/BF16 token embedding tensor and the weight tensor of the last matmul to channel-wise asymmetric int8 (config code).

CPU

  • Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.
  • Q6_K tensors (6-bit, group size 16, symmetric) are converted to int8 gs16 sym.
  • Q5_K tensors (5-bit, group size 32, asymmetric) are converted to int8 gs32 asym.

GPU

  • Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym) are requantized to int8 gs32 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

NPU

  • The main quantization scheme for the models supported in this PR is Q4_0.
  • Q4_0 and Q4_1 tensors are requantized to int4 gs128 sym (see the sketch after this list).
  • Q6_K tensors are dequantized to FP16.
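To make the gsNN sym/asym shorthand concrete, here is a minimal sketch of group-wise symmetric requantization of already-dequantized weights, with the group size and bit width as parameters. This is an illustration under simplifying assumptions, not the backend's actual requantization code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric group-wise quantization: each group of `group_size` weights gets
// one scale; values are mapped to signed integers in [-qmax, qmax].
// For "int8 gs32 sym" use group_size = 32, bits = 8; for "int4 gs128 sym"
// use group_size = 128, bits = 4. 4-bit values are kept in int8_t containers
// here; real code would pack two per byte.
std::vector<int8_t> quantize_group_sym(const std::vector<float> & w,
                                       int group_size, int bits,
                                       std::vector<float> & scales) {
    const int qmax = (1 << (bits - 1)) - 1;  // 127 for int8, 7 for int4
    std::vector<int8_t> q(w.size());
    for (size_t g = 0; g < w.size(); g += group_size) {
        const size_t end = std::min(w.size(), g + group_size);
        float amax = 0.f;
        for (size_t i = g; i < end; ++i) amax = std::max(amax, std::fabs(w[i]));
        const float scale = amax > 0.f ? amax / qmax : 1.f;
        scales.push_back(scale);
        for (size_t i = g; i < end; ++i) {
            q[i] = (int8_t)std::clamp((int)std::lround(w[i] / scale), -qmax, qmax);
        }
    }
    return q;
}
// Asymmetric ("asym") variants additionally store a per-group zero point so the
// integer range covers [min, max] instead of being centered on zero; channel-wise
// quantization is the same idea with one group per output channel.
```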

@wine99 wine99 marked this pull request as draft August 14, 2025 09:09
@github-actions github-actions bot added the documentation, testing, devops, and ggml labels Aug 14, 2025

SearchSavior commented Aug 19, 2025

Hello,

In the repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported.

Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described here:

https://github.com/yangsu2022/GGUF-to-OpenVINO/blob/405a95e300f8307fb4b779a12d46cf86adf5a441/convert_llama3.1_gguf_to_torch.py#L14

A few other questions:

  • What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

  • Is this PR only trying to bring performance from the OpenVINO runtime into the llama.cpp use case?

  • Pipeline parallelism is coming in the next release (I think); will that be implemented here for heterogeneous execution in llama.cpp?

Thank you for your work!


ravi9 commented Aug 21, 2025

Hi @SearchSavior ,

Q: Will this feature in llama.cpp offer wider GGUF coverage via something like parameter mapping?

Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all the GGML operators are mapped/translated to OpenVINO).
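For illustration, the translation works roughly like the sketch below: each GGML node is inspected and an equivalent OpenVINO operation is emitted. This is a conceptual sketch, not the actual code in ggml/src/ggml-openvino or in OpenVINO's GGML frontend; the function name translate_node is hypothetical, only two operators are shown, and ggml's matmul layout handling is simplified.

```cpp
#include <memory>
#include <stdexcept>

#include <openvino/openvino.hpp>
#include <openvino/op/add.hpp>
#include <openvino/op/matmul.hpp>

#include "ggml.h"

// Conceptual sketch: map one GGML node to an OpenVINO node, given the
// already-translated OpenVINO outputs of its source tensors.
static std::shared_ptr<ov::Node> translate_node(const ggml_tensor * node,
                                                const ov::Output<ov::Node> & src0,
                                                const ov::Output<ov::Node> & src1) {
    switch (node->op) {
        case GGML_OP_ADD:
            return std::make_shared<ov::op::v1::Add>(src0, src1);
        case GGML_OP_MUL_MAT:
            // ggml's mul_mat multiplies against a transposed weight tensor;
            // the transpose flags here gloss over the exact layout handling.
            return std::make_shared<ov::op::v0::MatMul>(src1, src0,
                                                        /*transpose_a=*/false,
                                                        /*transpose_b=*/true);
        default:
            throw std::runtime_error("GGML op not mapped yet");
    }
}
```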

Q: What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
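As a plain OpenVINO illustration of what device-level graph compilation means (this is standard OpenVINO API usage, not the backend code; the model path is a placeholder, since the backend builds the model from the translated GGML graph rather than reading it from disk), compiling the same model for CPU, GPU, or NPU is just a device-string change:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // compile_model runs OpenVINO's device-specific graph transformations
    // (fusion, layout selection, kernel scheduling). Switching between
    // "CPU", "GPU", and "NPU" only changes the device string.
    auto compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

    auto request = compiled.create_infer_request();
    request.infer();  // input tensors omitted for brevity
    return 0;
}
```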

Q: Is this PR only trying to bring performance from the OpenVINO runtime into the llama.cpp use case?

The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.

Q: Will pipeline parallelism / heterogeneous execution be supported here?

We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
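For reference, OpenVINO's own async API already allows multiple infer requests to be in flight at once, which is the kind of building block a pipeline-parallel or heterogeneous scheme would rely on. A minimal sketch using standard OpenVINO calls (the model path and the two-device split are placeholders, not the backend's design):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Two compiled models on different devices, e.g. to split a pipeline
    // between a GPU stage and a CPU stage.
    auto gpu_compiled = core.compile_model(model, "GPU");
    auto cpu_compiled = core.compile_model(model, "CPU");
    auto gpu_req = gpu_compiled.create_infer_request();
    auto cpu_req = cpu_compiled.create_infer_request();

    // start_async() returns immediately; wait() blocks until completion,
    // so both stages can execute concurrently.
    gpu_req.start_async();
    cpu_req.start_async();
    gpu_req.wait();
    cpu_req.wait();
    return 0;
}
```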

@SearchSavior

Hey @ravi9 ,

Thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.

@Bionic-Squash

I can't wait for OpenVINO support to get upstreamed.

@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from e180b86 to 80f0969 on September 5, 2025 08:36
@wine99 wine99 force-pushed the dev_backend_openvino branch from 6ccaecf to 76ab76e on September 26, 2025 14:09
@wine99 wine99 force-pushed the dev_backend_openvino branch from 76ab76e to 2e1dd8d on September 28, 2025 03:25