
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17910

cont #16309

Simplify code now that we no longer need to pad the KQ mask for flash attention.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #513

Project: llama.cpp
PR #513: Remove GGML_KQ_MASK_PAD constant
Scope: 7 files modified (+19/-36 lines)


Summary

This PR removes the GGML_KQ_MASK_PAD constant and associated padding logic from attention mask operations. The changes affect mask tensor allocation in graph construction, KV cache indexing, and CLIP multimodal processing. Performance analysis reveals localized variations in STL container operations (allocator helpers, iterator arithmetic) with no direct impact on core inference functions. The observed function-level changes are compiler optimization artifacts rather than source code modifications in performance-critical paths.


Key Findings

Impact on Inference Performance

Core Inference Functions:
The primary inference functions are effectively unchanged:

  • llama_decode - No modifications
  • llama_encode - No modifications
  • llama_tokenize - No modifications
  • llama_graph_compute - No modifications
  • ggml_mul_mat - No modifications
  • ggml_flash_attn_ext - Validation logic simplified (a padding assertion removed; see the sketch after this list)

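As a rough illustration of the kind of check that disappears (sketched from memory of older ggml sources, not the verbatim code; the helper name below is hypothetical):

```cpp
#include "ggml.h"

// Hypothetical validation helper contrasting the old and new requirements:
// pre-PR, the mask's row count had to be padded up to GGML_KQ_MASK_PAD so GPU
// kernels could read whole tiles; post-PR it only has to cover the query rows.
static void check_kq_mask(const ggml_tensor * q, const ggml_tensor * mask) {
    // Pre-PR (approximate): GGML_ASSERT(mask->ne[1] >= GGML_PAD(q->ne[1], GGML_KQ_MASK_PAD));
    GGML_ASSERT(mask->ne[1] >= q->ne[1]);
}
```
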
Tokens Per Second Impact: None. The modified functions are not on the inference execution path. For reference, with the baseline configuration (smollm:135m on an Intel i7-1255U), a 2 ms regression in llama_decode corresponds to roughly a 7% tokens-per-second reduction; since no inference function shows a response-time change here, the measurable impact is zero.

Most-Impacted Functions

The top performance variations occur in STL utility functions, not application code:

Regressions:

  • _S_key (libllama.so): +81 ns response time, +135 ns throughput
  • _S_max_size (libmtmd.so, clip_image_size): +61 ns response time, +99 ns throughput
  • end (vector<llama_grammar_candidate>): +32 ns response time, +24 ns throughput

Improvements:

  • end (vector<sub_match>): -113 ns response time, -135 ns throughput
  • operator+ (Bit_const_iterator): -87 ns response time, -86 ns throughput

Analysis: These functions are STL template instantiations (allocators, iterators) showing compiler optimization differences. The absolute changes (20-135 ns) are negligible relative to inference operations (milliseconds). The flame graph and CFG analysis confirm these are inlining decisions and instruction selection changes, not source code modifications.

Power Consumption Analysis

Binary-Level Changes:

  • libllama.so: -66 nJ (-0.034%) - marginal improvement
  • libmtmd.so: +88 nJ (+0.067%) - marginal increase
  • libggml-base.so: +57 nJ (+0.096%) - marginal increase
  • Other binaries: No change (0.0%)

Assessment: Power consumption remains effectively neutral. The sub-0.1% variations are within measurement noise and indicate no meaningful energy efficiency impact.

Code Changes Analysis

Mask Allocation Simplification:
The PR modifies the mask tensor dimensions from [n_kv, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), 1, 1] to [n_kv, n_tokens, 1, 1]. Dropping the padding removes up to GGML_KQ_MASK_PAD - 1 trailing rows of n_kv elements each, saving roughly 8-32 KB per mask tensor for typical context sizes.
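
A minimal sketch of the allocation change (names follow llama-graph.cpp, but the helper below is hypothetical and not the verbatim diff):

```cpp
#include "ggml.h"

// Hypothetical helper mirroring the shape change: allocate the KQ mask with
// exactly one row per token, instead of rounding the row count up with
// GGML_PAD(n_tokens, GGML_KQ_MASK_PAD) as the pre-PR code did.
static ggml_tensor * build_kq_mask(ggml_context * ctx0, int64_t n_kv, int64_t n_tokens) {
    return ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, n_kv, n_tokens, 1, 1);
}
```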

Index Calculation Simplification:
In llama_kv_cache::set_input_kq_mask(), the index calculation changed from n_kv*(h*n_stream*n_tps_pad + s*n_tps_pad + ii) to n_kv*(h*n_stream*n_tps + s*n_tps + ii), eliminating one variable and simplifying arithmetic.
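
A hedged sketch of the indexing change (variable names follow the description above; the surrounding loop structure is omitted and the helper is hypothetical):

```cpp
#include <cstdint>

// Hypothetical helper reproducing the simplified index math from
// llama_kv_cache::set_input_kq_mask(): flat offset of the mask row for head h,
// stream s and token ii, using the real tokens-per-stream count n_tps instead
// of the old padded n_tps_pad.
static int64_t kq_mask_offset(int64_t n_kv, int64_t n_stream, int64_t n_tps,
                              int64_t h, int64_t s, int64_t ii) {
    // Pre-PR: n_kv*(h*n_stream*n_tps_pad + s*n_tps_pad + ii)
    return n_kv*(h*n_stream*n_tps + s*n_tps + ii);
}
```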

Dead Code Removal:
The loop for (int i = n_tokens; i < n_tokens; ++i) in cross-attention mask initialization is now a no-op and should be removed.
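
The no-op looks roughly like this (a sketch of the cross-attention mask initialization; the function wrapper and buffer layout are assumptions for illustration):

```cpp
#include <cmath>
#include <vector>

// Sketch of the now-dead padding loop: with the padded upper bound replaced by
// n_tokens, both bounds are equal and the body that used to fill padding rows
// with -INFINITY never executes.
static void fill_padding_rows(std::vector<float> & mask, int n_kv, int n_tokens) {
    for (int i = n_tokens; i < n_tokens; ++i) {      // no-op: equal lower and upper bound
        for (int j = 0; j < n_kv; ++j) {
            mask[(size_t) i*n_kv + j] = -INFINITY;   // unreachable
        }
    }
}
```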

Correctness Dependency:
The changes assume GPU flash attention kernels no longer require padded masks. The original padding was added for GPU kernel safety to prevent out-of-bounds accesses. This PR removes that safety margin based on updated kernel implementations.
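
For context, a minimal sketch of why the padding existed, in hypothetical kernel-style C++ (the real CUDA/Metal kernels differ): flash attention processes query rows in fixed-size tiles, so without either padded masks or an explicit bound check the last tile could read past the final valid mask row.

```cpp
#include <cstdint>

// Hypothetical tile loop illustrating the hazard the old padding guarded
// against: TILE is an assumed per-kernel tile height, not a real ggml constant.
constexpr int TILE = 32;

static void read_mask_tile(const float * mask, int64_t n_kv, int64_t n_tokens,
                           int64_t row0, float * dst) {
    for (int r = 0; r < TILE; ++r) {
        const int64_t row = row0 + r;
        if (row >= n_tokens) {
            continue; // without padded masks, kernels must skip or clamp the tail rows
        }
        for (int64_t c = 0; c < n_kv; ++c) {
            dst[r*n_kv + c] = mask[row*n_kv + c];
        }
    }
}
```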


Conclusion

PR #513 successfully simplifies attention mask handling with no measurable impact on inference performance or energy efficiency. The observed function-level variations are compiler artifacts in STL utilities, not related to the source code changes. The removal of padding logic improves code clarity and marginally reduces memory usage without affecting tokens per second throughput.

@loci-dev force-pushed the main branch 8 times, most recently from 78ff3d3 to 117bfc3 on December 11, 2025 18:11