
Conversation

@DajanaV DajanaV commented Nov 10, 2025

Mirrored from ggml-org/llama.cpp#16309

target #16148
save gg/fa-no-kq-pad-save

Gauging what it would take to remove the KQ mask padding along the batch dimension (ne31). Removing this padding would simplify the graph-building logic and reduce the amount of memory that we allocate and transfer for KQ masks.
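For context, here is a minimal, self-contained sketch of what padding along the batch dimension costs per mask. This is not the PR's actual diff: the GGML_PAD rounding macro is redefined locally, and the GGML_KQ_MASK_PAD granularity of 64 and the example sizes are assumptions, so treat the numbers as purely illustrative.

```cpp
// Illustrative sketch: padded vs. unpadded KQ mask allocation along ne31.
#include <cstdint>
#include <cstdio>

// Round x up to a multiple of n (local stand-in for ggml's padding macro).
#define GGML_PAD(x, n) (((x) + (n) - 1) / (n) * (n))
#define GGML_KQ_MASK_PAD 64  // assumed padding granularity

int main() {
    const int64_t n_kv     = 4096;  // ne30: KV cache cells visible to the batch
    const int64_t n_tokens = 130;   // ne31: tokens in the current batch

    // With padding: the batch dimension is rounded up, so the mask carries
    // rows that hold no real data but are still allocated and transferred.
    const int64_t padded   = n_kv * GGML_PAD(n_tokens, GGML_KQ_MASK_PAD);

    // Without padding: the mask covers exactly the tokens in the batch.
    const int64_t unpadded = n_kv * n_tokens;

    printf("padded:   %lld mask values\n", (long long) padded);
    printf("unpadded: %lld mask values\n", (long long) unpadded);
    printf("saved:    %lld mask values per mask\n", (long long) (padded - unpadded));
    return 0;
}
```

With these assumed sizes the padded mask rounds 130 tokens up to 192, so roughly a third of the mask rows exist only to satisfy the padding requirement.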

@loci-agentic-ai
Copy link

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 2df87aba compared to baseline 889fd4a4 reveals localized performance changes in non-core functions with minimal impact on LLaMA.cpp's primary inference capabilities.

Key Findings

Highest Performance Changes:

  • Response Time: stbi__setup_jpeg in libmtmd.so increased 28% (15 ns → 19 ns)
  • Throughput: std::unique_ptr::operator= (Neo-BERT context) in libllama.so increased 92% (77 ns → 147 ns)

Core Function Impact Assessment:
Neither affected function impacts primary inference operations. Core LLaMA functions (llama_decode, llama_encode, llama_tokenize) show no performance changes, indicating no impact on tokens per second throughput for model inference.

Power Consumption Analysis:

  • libmtmd.so: +0.157% increase (210,476 nJ → 210,807 nJ)
  • libllama.so: -0.021% decrease (280,860 nJ → 280,800 nJ)
  • libggml-base.so: +0.013% increase
  • Overall system power impact remains negligible across all binaries (arithmetic check below)
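As a quick sanity check (my own arithmetic, not part of the LOCI report), the stated percentages are consistent with the absolute energy figures above:

```math
\frac{210{,}807 - 210{,}476}{210{,}476} \approx +0.157\%, \qquad
\frac{280{,}800 - 280{,}860}{280{,}860} \approx -0.021\%
```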

Technical Analysis Insights:

Flame Graph Analysis: stbi__setup_jpeg shows single-frame execution with no call stack depth, confirming the 4 ns regression stems from internal instruction changes rather than algorithmic modifications.

CFG Comparison: Identical control flow structure between versions with only memory address offset changes (-28 bytes across three calculations). The performance degradation likely results from altered memory layout affecting cache locality or alignment, not instruction count changes.

Root Cause:
The systematic address offset reduction suggests intentional data structure reorganization or build system modifications affecting memory positioning. Both functions show structural preservation with implementation-level optimizations that introduce minor performance trade-offs.

Impact Scope:
Changes affect auxiliary components (JPEG processing, memory management utilities) rather than core inference pathways. The LLaMA.cpp inference engine's primary performance characteristics remain unchanged, with no measurable impact on model execution speed or token processing throughput.

@DajanaV DajanaV force-pushed the main branch 26 times, most recently from 6f7320f to 24733fb on November 13, 2025 11:55
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 4f731df to 8e6f6e8 on December 12, 2025 15:09