
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17910

cont #16309

Simplify code now that we no longer need to pad the KQ mask for flash attention.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #513

Project: llama.cpp
PR #513: Remove GGML_KQ_MASK_PAD constant
Scope: 7 files modified (+19/-36 lines)


Summary

This PR removes the GGML_KQ_MASK_PAD constant and associated padding logic from attention mask operations. The changes affect mask tensor allocation in graph construction, KV cache indexing, and CLIP multimodal processing. Performance analysis reveals localized variations in STL container operations (allocator helpers, iterator arithmetic) with no direct impact on core inference functions. The observed function-level changes are compiler optimization artifacts rather than source code modifications in performance-critical paths.


Key Findings

Impact on Inference Performance

Core Inference Functions:
The primary inference functions are effectively unchanged:

  • llama_decode - No modifications
  • llama_encode - No modifications
  • llama_tokenize - No modifications
  • llama_graph_compute - No modifications
  • ggml_mul_mat - No modifications
  • ggml_flash_attn_ext - Validation logic simplified (a padding assertion removed; see the sketch after this list)

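As a rough illustration of the kind of check that disappears (sketched from memory of older ggml sources, not the verbatim code; the helper name below is hypothetical):

```cpp
#include "ggml.h"

// Hypothetical validation helper contrasting the old and new requirements:
// pre-PR, the mask's row count had to be padded up to GGML_KQ_MASK_PAD so GPU
// kernels could read whole tiles; post-PR it only has to cover the query rows.
static void check_kq_mask(const ggml_tensor * q, const ggml_tensor * mask) {
    // Pre-PR (approximate): GGML_ASSERT(mask->ne[1] >= GGML_PAD(q->ne[1], GGML_KQ_MASK_PAD));
    GGML_ASSERT(mask->ne[1] >= q->ne[1]);
}
```
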
Tokens Per Second Impact: None. The modified functions are not on the inference execution path. For reference, with the baseline configuration (smollm:135m on an Intel i7-1255U), a 2 ms regression in llama_decode corresponds to roughly a 7% tokens-per-second reduction; since no inference function shows a response-time change here, the measurable impact is zero.

Most-Impacted Functions

The top performance variations occur in STL utility functions, not application code:

Regressions:

  • _S_key (libllama.so): +81 ns response time, +135 ns throughput
  • _S_max_size (libmtmd.so, clip_image_size): +61 ns response time, +99 ns throughput
  • end (vector<llama_grammar_candidate>): +32 ns response time, +24 ns throughput

Improvements:

  • end (vector<sub_match>): -113 ns response time, -135 ns throughput
  • operator+ (Bit_const_iterator): -87 ns response time, -86 ns throughput

Analysis: These functions are STL template instantiations (allocators, iterators) showing compiler optimization differences. The absolute changes (20-135 ns) are negligible relative to inference operations (milliseconds). The flame graph and CFG analysis confirm these are inlining decisions and instruction selection changes, not source code modifications.

Power Consumption Analysis

Binary-Level Changes:

  • libllama.so: -66 nJ (-0.034%) - marginal improvement
  • libmtmd.so: +88 nJ (+0.067%) - marginal increase
  • libggml-base.so: +57 nJ (+0.096%) - marginal increase
  • Other binaries: No change (0.0%)

Assessment: Power consumption remains effectively neutral. The sub-0.1% variations are within measurement noise and indicate no meaningful energy efficiency impact.

Code Changes Analysis

Mask Allocation Simplification:
The PR modifies the mask tensor dimensions from [n_kv, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), 1, 1] to [n_kv, n_tokens, 1, 1]. Dropping the padding removes up to GGML_KQ_MASK_PAD - 1 trailing rows of n_kv elements each, saving roughly 8-32 KB per mask tensor for typical context sizes.
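
A minimal sketch of the allocation change (names follow llama-graph.cpp, but the helper below is hypothetical and not the verbatim diff):

```cpp
#include "ggml.h"

// Hypothetical helper mirroring the shape change: allocate the KQ mask with
// exactly one row per token, instead of rounding the row count up with
// GGML_PAD(n_tokens, GGML_KQ_MASK_PAD) as the pre-PR code did.
static ggml_tensor * build_kq_mask(ggml_context * ctx0, int64_t n_kv, int64_t n_tokens) {
    return ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, n_kv, n_tokens, 1, 1);
}
```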

Index Calculation Simplification:
In llama_kv_cache::set_input_kq_mask(), the index calculation changed from n_kv*(h*n_stream*n_tps_pad + s*n_tps_pad + ii) to n_kv*(h*n_stream*n_tps + s*n_tps + ii), eliminating one variable and simplifying arithmetic.
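
A hedged sketch of the indexing change (variable names follow the description above; the surrounding loop structure is omitted and the helper is hypothetical):

```cpp
#include <cstdint>

// Hypothetical helper reproducing the simplified index math from
// llama_kv_cache::set_input_kq_mask(): flat offset of the mask row for head h,
// stream s and token ii, using the real tokens-per-stream count n_tps instead
// of the old padded n_tps_pad.
static int64_t kq_mask_offset(int64_t n_kv, int64_t n_stream, int64_t n_tps,
                              int64_t h, int64_t s, int64_t ii) {
    // Pre-PR: n_kv*(h*n_stream*n_tps_pad + s*n_tps_pad + ii)
    return n_kv*(h*n_stream*n_tps + s*n_tps + ii);
}
```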

Dead Code Removal:
The loop for (int i = n_tokens; i < n_tokens; ++i) in cross-attention mask initialization is now a no-op and should be removed.
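
The no-op looks roughly like this (a sketch of the cross-attention mask initialization; the function wrapper and buffer layout are assumptions for illustration):

```cpp
#include <cmath>
#include <vector>

// Sketch of the now-dead padding loop: with the padded upper bound replaced by
// n_tokens, both bounds are equal and the body that used to fill padding rows
// with -INFINITY never executes.
static void fill_padding_rows(std::vector<float> & mask, int n_kv, int n_tokens) {
    for (int i = n_tokens; i < n_tokens; ++i) {      // no-op: equal lower and upper bound
        for (int j = 0; j < n_kv; ++j) {
            mask[(size_t) i*n_kv + j] = -INFINITY;   // unreachable
        }
    }
}
```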

Correctness Dependency:
The changes assume GPU flash attention kernels no longer require padded masks. The original padding was added for GPU kernel safety to prevent out-of-bounds accesses. This PR removes that safety margin based on updated kernel implementations.
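
For context, a minimal sketch of why the padding existed, in hypothetical kernel-style C++ (the real CUDA/Metal kernels differ): flash attention processes query rows in fixed-size tiles, so without either padded masks or an explicit bound check the last tile could read past the final valid mask row.

```cpp
#include <cstdint>

// Hypothetical tile loop illustrating the hazard the old padding guarded
// against: TILE is an assumed per-kernel tile height, not a real ggml constant.
constexpr int TILE = 32;

static void read_mask_tile(const float * mask, int64_t n_kv, int64_t n_tokens,
                           int64_t row0, float * dst) {
    for (int r = 0; r < TILE; ++r) {
        const int64_t row = row0 + r;
        if (row >= n_tokens) {
            continue; // without padded masks, kernels must skip or clamp the tail rows
        }
        for (int64_t c = 0; c < n_kv; ++c) {
            dst[r*n_kv + c] = mask[row*n_kv + c];
        }
    }
}
```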


Conclusion

PR #513 successfully simplifies attention mask handling with no measurable impact on inference performance or energy efficiency. The observed function-level variations are compiler artifacts in STL utilities, not related to the source code changes. The removal of padding logic improves code clarity and marginally reduces memory usage without affecting tokens per second throughput.

@loci-dev force-pushed the main branch 8 times, most recently from 78ff3d3 to 117bfc3 on December 11, 2025 18:11