
Conversation


@wine99 wine99 commented Aug 14, 2025

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, with performance improvements from OpenVINO's graph compilation and kernel fusion.

Key Features:

  • New backend implementation

    • Added OpenVINO backend in ggml/src/ggml-openvino.
    • Implemented translations for core GGML operations
  • Supported precisions

    • FP16 and BF16 GGUF models are supported.
    • Q4_0, Q4_1, Q4_K_M, and Q6_K models are partially supported (see the notes below).
  • Supported devices

    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (require driver version UD32 or newer)

Tested Models

The following models have been validated for functionality. Accuracy and performance validation is still in progress.

Work in Progress

  • Performance and memory optimizations
  • Broader quantization coverage
  • Support for additional model architectures
  • Extensive accuracy testing

Notes on quantization support

  • Both Q4_0 and Q4_1 models use Q6_K for the token_embedding tensor and for the weight tensor of the last matmul (in most models this is the same tensor as the token embedding); see the dequantization sketch after this list for what these block formats look like.
  • Q4_0 models will contain some Q4_1 tensors if an imatrix is provided when the model is quantized with the llama-quantize utility.
  • Q4_K_M models additionally contain Q6_K tensors, as well as Q5_K tensors (the latter only in Phi3 among the models validated in this PR).
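For reference, Q4_0 and Q4_1 are block-quantized formats: 32 weights share one scale (Q4_1 additionally stores a block minimum), and each weight is a 4-bit integer. Below is a minimal dequantization sketch of the Q4_0 layout, for illustration only; the struct is simplified (ggml stores the scale as FP16, a plain float is used here to keep the sketch short) and this is not the backend's actual code.

```cpp
#include <cstdint>

// Illustrative Q4_0 block: 32 weights, one shared scale, 4-bit values packed
// two per byte with an implicit offset of 8.
struct q4_0_block {
    float   d;        // ggml stores this as fp16
    uint8_t qs[16];   // 32 x 4-bit values
};

// Dequantize one block into 32 floats: x = d * (q - 8).
void dequantize_q4_0_block(const q4_0_block & b, float * out) {
    for (int j = 0; j < 16; ++j) {
        out[j]      = b.d * ((b.qs[j] & 0x0F) - 8);  // low nibble  -> element j
        out[j + 16] = b.d * ((b.qs[j] >>   4) - 8);  // high nibble -> element j + 16
    }
}
// Q4_1 follows the same idea but stores a scale d and a minimum m per block and
// reconstructs x = d * q + m; Q6_K and Q5_K use 6/5-bit values with per-group
// ("gs") scales, as described in the CPU/GPU/NPU notes below.
```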

NOTE: Optimum-Intel converts the FP16/BF16 token embedding tensor and the weight tensor of the last matmul to channel-wise asymmetric int8 (config code).

CPU

  • Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.
  • Q6_K tensors (6-bit, group size 16, symmetric) are converted to int8 gs16 sym.
  • Q5_K tensors (5-bit, group size 32, asymmetric) are converted to int8 gs32 asym.

GPU

  • Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym) are requantized to int8 gs32 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

NPU

  • The main quantization scheme for the models supported in this PR is Q4_0.
  • Q4_0 and Q4_1 tensors are requantized to int4 gs128 sym (see the sketch after this list).
  • Q6_K tensors are dequantized to FP16.
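To make the gsNN sym/asym shorthand concrete, here is a minimal sketch of group-wise symmetric requantization of already-dequantized weights, with the group size and bit width as parameters. This is an illustration under simplifying assumptions, not the backend's actual requantization code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric group-wise quantization: each group of `group_size` weights gets
// one scale; values are mapped to signed integers in [-qmax, qmax].
// For "int8 gs32 sym" use group_size = 32, bits = 8; for "int4 gs128 sym"
// use group_size = 128, bits = 4. 4-bit values are kept in int8_t containers
// here; real code would pack two per byte.
std::vector<int8_t> quantize_group_sym(const std::vector<float> & w,
                                       int group_size, int bits,
                                       std::vector<float> & scales) {
    const int qmax = (1 << (bits - 1)) - 1;  // 127 for int8, 7 for int4
    std::vector<int8_t> q(w.size());
    for (size_t g = 0; g < w.size(); g += group_size) {
        const size_t end = std::min(w.size(), g + group_size);
        float amax = 0.f;
        for (size_t i = g; i < end; ++i) amax = std::max(amax, std::fabs(w[i]));
        const float scale = amax > 0.f ? amax / qmax : 1.f;
        scales.push_back(scale);
        for (size_t i = g; i < end; ++i) {
            q[i] = (int8_t)std::clamp((int)std::lround(w[i] / scale), -qmax, qmax);
        }
    }
    return q;
}
// Asymmetric ("asym") variants additionally store a per-group zero point so the
// integer range covers [min, max] instead of being centered on zero; channel-wise
// quantization is the same idea with one group per output channel.
```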

@wine99 wine99 marked this pull request as draft August 14, 2025 09:09
@github-actions github-actions bot added the documentation, testing, devops, and ggml labels Aug 14, 2025

SearchSavior commented Aug 19, 2025

Hello,

In the repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported.

Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described here:

https://github.com/yangsu2022/GGUF-to-OpenVINO/blob/405a95e300f8307fb4b779a12d46cf86adf5a441/convert_llama3.1_gguf_to_torch.py#L14

A few other questions:

  • What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

  • Is this PR only trying to bring performance from the OpenVINO runtime into the llama.cpp use case?

  • Pipeline parallelism is coming in the next release (I think); will that be implemented here for heterogeneous execution in llama.cpp?

Thank you for your work!


ravi9 commented Aug 21, 2025

Hi @SearchSavior ,

Q: Will this feature in llama.cpp offer wider GGUF coverage via something like parameter mapping?

Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all the GGML operators are mapped/translated to OpenVINO).
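For illustration, the translation works roughly like the sketch below: each GGML node is inspected and an equivalent OpenVINO operation is emitted. This is a conceptual sketch, not the actual code in ggml/src/ggml-openvino or in OpenVINO's GGML frontend; the function name translate_node is hypothetical, only two operators are shown, and ggml's matmul layout handling is simplified.

```cpp
#include <memory>
#include <stdexcept>

#include <openvino/openvino.hpp>
#include <openvino/op/add.hpp>
#include <openvino/op/matmul.hpp>

#include "ggml.h"

// Conceptual sketch: map one GGML node to an OpenVINO node, given the
// already-translated OpenVINO outputs of its source tensors.
static std::shared_ptr<ov::Node> translate_node(const ggml_tensor * node,
                                                const ov::Output<ov::Node> & src0,
                                                const ov::Output<ov::Node> & src1) {
    switch (node->op) {
        case GGML_OP_ADD:
            return std::make_shared<ov::op::v1::Add>(src0, src1);
        case GGML_OP_MUL_MAT:
            // ggml's mul_mat multiplies against a transposed weight tensor;
            // the transpose flags here gloss over the exact layout handling.
            return std::make_shared<ov::op::v0::MatMul>(src1, src0,
                                                        /*transpose_a=*/false,
                                                        /*transpose_b=*/true);
        default:
            throw std::runtime_error("GGML op not mapped yet");
    }
}
```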

Q: What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
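As a plain OpenVINO illustration of what device-level graph compilation means (this is standard OpenVINO API usage, not the backend code; the model path is a placeholder, since the backend builds the model from the translated GGML graph rather than reading it from disk), compiling the same model for CPU, GPU, or NPU is just a device-string change:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // compile_model runs OpenVINO's device-specific graph transformations
    // (fusion, layout selection, kernel scheduling). Switching between
    // "CPU", "GPU", and "NPU" only changes the device string.
    auto compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

    auto request = compiled.create_infer_request();
    request.infer();  // input tensors omitted for brevity
    return 0;
}
```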

Q: Is this PR only trying to bring performance from the OpenVINO runtime into the llama.cpp use case?

The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.

Q: Will pipeline parallelism / heterogeneous execution be supported here?

We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
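For reference, OpenVINO's own async API already allows multiple infer requests to be in flight at once, which is the kind of building block a pipeline-parallel or heterogeneous scheme would rely on. A minimal sketch using standard OpenVINO calls (the model path and the two-device split are placeholders, not the backend's design):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Two compiled models on different devices, e.g. to split a pipeline
    // between a GPU stage and a CPU stage.
    auto gpu_compiled = core.compile_model(model, "GPU");
    auto cpu_compiled = core.compile_model(model, "CPU");
    auto gpu_req = gpu_compiled.create_infer_request();
    auto cpu_req = cpu_compiled.create_infer_request();

    // start_async() returns immediately; wait() blocks until completion,
    // so both stages can execute concurrently.
    gpu_req.start_async();
    cpu_req.start_async();
    gpu_req.wait();
    cpu_req.wait();
    return 0;
}
```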

@SearchSavior

Hey @ravi9 ,

Thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.

@Bionic-Squash

I can't wait for OpenVINO support to get upstreamed.

@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from e180b86 to 80f0969 on September 5, 2025 08:36
@wine99 wine99 force-pushed the dev_backend_openvino branch from 6ccaecf to 76ab76e on September 26, 2025 14:09
@wine99 wine99 force-pushed the dev_backend_openvino branch from 76ab76e to 2e1dd8d on September 28, 2025 03:25