
Conversation

@DajanaV DajanaV commented Nov 10, 2025

Mirrored from ggml-org/llama.cpp#16309

target #16148
save gg/fa-no-kq-pad-save

Gauging what it would take to remove the KQ mask padding along the batch dimension (ne31). Removing this padding would simplify the graph-building logic and reduce the amount of memory that we allocate and transfer for KQ masks.
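For context, here is a minimal, self-contained sketch of what padding along the batch dimension costs per mask. This is not the PR's actual diff: the GGML_PAD rounding macro is redefined locally, and the GGML_KQ_MASK_PAD granularity of 64 and the example sizes are assumptions, so treat the numbers as purely illustrative.

```cpp
// Illustrative sketch: padded vs. unpadded KQ mask allocation along ne31.
#include <cstdint>
#include <cstdio>

// Round x up to a multiple of n (local stand-in for ggml's padding macro).
#define GGML_PAD(x, n) (((x) + (n) - 1) / (n) * (n))
#define GGML_KQ_MASK_PAD 64  // assumed padding granularity

int main() {
    const int64_t n_kv     = 4096;  // ne30: KV cache cells visible to the batch
    const int64_t n_tokens = 130;   // ne31: tokens in the current batch

    // With padding: the batch dimension is rounded up, so the mask carries
    // rows that hold no real data but are still allocated and transferred.
    const int64_t padded   = n_kv * GGML_PAD(n_tokens, GGML_KQ_MASK_PAD);

    // Without padding: the mask covers exactly the tokens in the batch.
    const int64_t unpadded = n_kv * n_tokens;

    printf("padded:   %lld mask values\n", (long long) padded);
    printf("unpadded: %lld mask values\n", (long long) unpadded);
    printf("saved:    %lld mask values per mask\n", (long long) (padded - unpadded));
    return 0;
}
```

With these assumed sizes the padded mask rounds 130 tokens up to 192, so roughly a third of the mask rows exist only to satisfy the padding requirement.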

@loci-agentic-ai
Copy link

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 2df87aba compared to baseline 889fd4a4 reveals localized performance changes in non-core functions with minimal impact on LLaMA.cpp's primary inference capabilities.

Key Findings

Highest Performance Changes:

  • Response Time: stbi__setup_jpeg in libmtmd.so increased 28% (15 ns → 19 ns)
  • Throughput: std::unique_ptr::operator= (Neo-BERT context) in libllama.so increased 92% (77 ns → 147 ns)

Core Function Impact Assessment:
Neither affected function impacts primary inference operations. Core LLaMA functions (llama_decode, llama_encode, llama_tokenize) show no performance changes, indicating no impact on tokens per second throughput for model inference.

Power Consumption Analysis:

  • libmtmd.so: +0.157% increase (210,476 nJ → 210,807 nJ)
  • libllama.so: -0.021% decrease (280,860 nJ → 280,800 nJ)
  • libggml-base.so: +0.013% increase
  • Overall system power impact remains negligible across all binaries (arithmetic check below)
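As a quick sanity check (my own arithmetic, not part of the LOCI report), the stated percentages are consistent with the absolute energy figures above:

```math
\frac{210{,}807 - 210{,}476}{210{,}476} \approx +0.157\%, \qquad
\frac{280{,}800 - 280{,}860}{280{,}860} \approx -0.021\%
```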

Technical Analysis Insights:

Flame Graph Analysis: stbi__setup_jpeg shows single-frame execution with no call stack depth, confirming the 4 ns regression stems from internal instruction changes rather than algorithmic modifications.

CFG Comparison: Identical control flow structure between versions with only memory address offset changes (-28 bytes across three calculations). The performance degradation likely results from altered memory layout affecting cache locality or alignment, not instruction count changes.

Root Cause:
The systematic address offset reduction suggests intentional data structure reorganization or build system modifications affecting memory positioning. Both functions show structural preservation with implementation-level optimizations that introduce minor performance trade-offs.

Impact Scope:
Changes affect auxiliary components (JPEG processing, memory management utilities) rather than core inference pathways. The LLaMA.cpp inference engine's primary performance characteristics remain unchanged, with no measurable impact on model execution speed or token processing throughput.

@DajanaV DajanaV force-pushed the main branch 26 times, most recently from 6f7320f to 24733fb on November 13, 2025 11:55
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 4f731df to 8e6f6e8 on December 12, 2025 15:09