Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17505

This PR makes the following changes to the CUDA FlashAttention code:

  • All kernels have been extended with support for attention masks that are not padded in the mask->ne[1] direction. This is done by applying a modulo to the mask column being read, so no conditional statements need to be evaluated. The impact on performance is negligible and I do not deem it necessary to compile additional template specializations. See ggml-org/llama.cpp#16309 (ggml : remove KQ mask padding). cc @ggerganov . A rough sketch of this indexing is given after this list.
  • The mma kernel has been extended with support for Volta tensor cores; previously the WMMA kernel was used for Volta. The WMMA kernel is now only needed for AMD, and once AMD support has been added to the mma kernel the WMMA kernel can be safely removed, leaving only 3 kernels to maintain going forward. On master the mma kernel has defects w.r.t. tile shapes that do not manifest as bugs; those should be fixed with this PR, and I think it is now feasible for other developers to add support for e.g. AMD WMMA instructions. cc @zhang-hui-yulo @jiachengjason @unverbraucht .
  • The tile template in mma.cuh has been extended with additional, optional arguments to safely handle situations where tiles of the same shape can have different physical data layouts.
  • The mma kernel is refactored to allow more flexible configuration. The configuration is now also done without templating, which seems to cause issues for __launch_bounds__ when using ROCm (as of right now the mma kernel is not used with ROCm).
  • The mma kernel is extended with support for out-of-bounds checks in the K->ne[1] direction. As with the tile kernel, because this comes at a cost to performance it is still preferable to pad the KV cache length. As of right now this padding is still required to be 256; for the currently supported GPUs it should be possible to lower it to 128 without issue once the WMMA kernel has been completely replaced. For Hopper it may still make sense to keep a padding of 256, but as it stands I have no idea whether the 256x64 instruction would actually have better performance than the 128x64 instruction.
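
To make the modulo-based mask read in the first bullet point concrete, here is a minimal sketch with hypothetical names and a plain `%` where the actual kernels use a precomputed fast-modulo helper; it illustrates the idea rather than the code in this PR:

```cuda
#include <cuda_fp16.h>

// Sketch only: reading a KQ mask that is no longer padded along one dimension.
// Instead of branching on whether the index is in range, the index is wrapped
// with a modulo so the load always stays inside the buffer. Out-of-range
// threads simply re-read valid data whose result is never used, so no
// conditional statements are evaluated on the hot path.
__device__ __forceinline__ half load_mask_elem(
        const half * mask,        // mask data
        const int    i,           // index along the contiguous dimension
        const int    j,           // index along the unpadded dimension, may overshoot
        const int    ne_unpadded, // true extent of the unpadded dimension
        const size_t stride) {    // elements per row of the mask
    const int j_wrapped = j % ne_unpadded; // the real kernels use a precomputed fast-modulo helper
    return mask[j_wrapped*stride + i];
}
```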

As of right now the interface in mma.cuh is suboptimal and long-term I intend to refactor it to allow the use of tensor cores in a more uniform way. However, I don't know the exact requirements until we have proper support for AMD WMMA and AMD MFMA instructions. So for now I think the correct choice is to prioritize getting working support for those at the cost of maintainability and to do a refactor afterwards.

V100 performance
| GPU | Model | Microbatch size | Test | t/s master | t/s 277014f50 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 84.06 | 89.23 | 1.06 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 88.28 | 86.50 | 0.98 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 122.04 | 134.50 | 1.10 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 159.61 | 204.43 | 1.28 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 187.50 | 274.82 | 1.47 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 208.08 | 340.50 | 1.64 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 196.49 | 312.07 | 1.59 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 217.64 | 371.18 | 1.71 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 227.55 | 408.51 | 1.80 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 250.76 | 432.14 | 1.72 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 1 | pp512@d32768 | 196.73 | 276.43 | 1.41 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 2 | pp512@d32768 | 341.32 | 472.67 | 1.38 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 4 | pp512@d32768 | 233.69 | 461.42 | 1.97 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 8 | pp512@d32768 | 433.09 | 705.18 | 1.63 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 16 | pp512@d32768 | 779.04 | 1095.12 | 1.41 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 32 | pp512@d32768 | 981.00 | 1506.68 | 1.54 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 64 | pp512@d32768 | 859.59 | 1260.66 | 1.47 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 128 | pp512@d32768 | 1032.55 | 1735.64 | 1.68 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 256 | pp512@d32768 | 1089.22 | 1833.70 | 1.68 |
| V100-PCIE-32GB | gemma 2B Q4_0 | 512 | pp512@d32768 | 995.95 | 1613.81 | 1.62 |
| V100-PCIE-32GB | llama 1B Q4_0 | 1 | pp512@d32768 | 237.92 | 323.72 | 1.36 |
| V100-PCIE-32GB | llama 1B Q4_0 | 2 | pp512@d32768 | 417.22 | 588.65 | 1.41 |
| V100-PCIE-32GB | llama 1B Q4_0 | 4 | pp512@d32768 | 448.34 | 838.65 | 1.87 |
| V100-PCIE-32GB | llama 1B Q4_0 | 8 | pp512@d32768 | 824.46 | 1445.37 | 1.75 |
| V100-PCIE-32GB | llama 1B Q4_0 | 16 | pp512@d32768 | 1435.92 | 1917.20 | 1.34 |
| V100-PCIE-32GB | llama 1B Q4_0 | 32 | pp512@d32768 | 1769.39 | 2566.43 | 1.45 |
| V100-PCIE-32GB | llama 1B Q4_0 | 64 | pp512@d32768 | 1991.61 | 2289.92 | 1.15 |
| V100-PCIE-32GB | llama 1B Q4_0 | 128 | pp512@d32768 | 2391.19 | 2843.04 | 1.19 |
| V100-PCIE-32GB | llama 1B Q4_0 | 256 | pp512@d32768 | 2312.60 | 2559.85 | 1.11 |
| V100-PCIE-32GB | llama 1B Q4_0 | 512 | pp512@d32768 | 1900.53 | 2137.76 | 1.12 |
| V100-PCIE-32GB | llama 8B Q4_0 | 1 | pp512@d32768 | 61.12 | 81.47 | 1.33 |
| V100-PCIE-32GB | llama 8B Q4_0 | 2 | pp512@d32768 | 115.57 | 154.44 | 1.34 |
| V100-PCIE-32GB | llama 8B Q4_0 | 4 | pp512@d32768 | 120.26 | 220.87 | 1.84 |
| V100-PCIE-32GB | llama 8B Q4_0 | 8 | pp512@d32768 | 215.88 | 323.48 | 1.50 |
| V100-PCIE-32GB | llama 8B Q4_0 | 16 | pp512@d32768 | 380.43 | 467.35 | 1.23 |
| V100-PCIE-32GB | llama 8B Q4_0 | 32 | pp512@d32768 | 470.78 | 656.82 | 1.40 |
| V100-PCIE-32GB | llama 8B Q4_0 | 64 | pp512@d32768 | 228.56 | 456.01 | 2.00 |
| V100-PCIE-32GB | llama 8B Q4_0 | 128 | pp512@d32768 | 278.85 | 670.43 | 2.40 |
| V100-PCIE-32GB | llama 8B Q4_0 | 256 | pp512@d32768 | 307.17 | 872.91 | 2.84 |
| V100-PCIE-32GB | llama 8B Q4_0 | 512 | pp512@d32768 | 314.34 | 932.41 | 2.97 |
Other GPU performance
| GPU | Model | Microbatch size | Test | t/s master | t/s e44ebb095 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| MI60 / MI50 | llama 8B Q4_0 | 1 | pp512@d32768 | 59.80 | 64.40 | 1.08 |
| MI60 / MI50 | llama 8B Q4_0 | 2 | pp512@d32768 | 106.46 | 113.46 | 1.07 |
| MI60 / MI50 | llama 8B Q4_0 | 4 | pp512@d32768 | 119.84 | 97.07 | 0.81 |
| MI60 / MI50 | llama 8B Q4_0 | 8 | pp512@d32768 | 162.89 | 167.55 | 1.03 |
| MI60 / MI50 | llama 8B Q4_0 | 16 | pp512@d32768 | 228.46 | 229.93 | 1.01 |
| MI60 / MI50 | llama 8B Q4_0 | 32 | pp512@d32768 | 269.06 | 268.69 | 1.00 |
| MI60 / MI50 | llama 8B Q4_0 | 64 | pp512@d32768 | 291.15 | 289.38 | 0.99 |
| MI60 / MI50 | llama 8B Q4_0 | 128 | pp512@d32768 | 335.13 | 332.27 | 0.99 |
| MI60 / MI50 | llama 8B Q4_0 | 256 | pp512@d32768 | 351.75 | 349.71 | 0.99 |
| MI60 / MI50 | llama 8B Q4_0 | 512 | pp512@d32768 | 357.18 | 355.12 | 0.99 |
| MI100 | llama 8B Q4_0 | 1 | pp512@d32768 | 77.78 | 82.66 | 1.06 |
| MI100 | llama 8B Q4_0 | 2 | pp512@d32768 | 133.33 | 139.16 | 1.04 |
| MI100 | llama 8B Q4_0 | 4 | pp512@d32768 | 164.44 | 169.21 | 1.03 |
| MI100 | llama 8B Q4_0 | 8 | pp512@d32768 | 232.70 | 236.51 | 1.02 |
| MI100 | llama 8B Q4_0 | 16 | pp512@d32768 | 424.09 | 431.27 | 1.02 |
| MI100 | llama 8B Q4_0 | 32 | pp512@d32768 | 559.43 | 563.32 | 1.01 |
| MI100 | llama 8B Q4_0 | 64 | pp512@d32768 | 648.34 | 648.77 | 1.00 |
| MI100 | llama 8B Q4_0 | 128 | pp512@d32768 | 671.01 | 668.83 | 1.00 |
| MI100 | llama 8B Q4_0 | 256 | pp512@d32768 | 696.50 | 692.00 | 0.99 |
| MI100 | llama 8B Q4_0 | 512 | pp512@d32768 | 706.38 | 700.32 | 0.99 |
| P40 | llama 8B Q4_0 | 1 | pp512@d32768 | 31.00 | 32.45 | 1.05 |
| P40 | llama 8B Q4_0 | 2 | pp512@d32768 | 59.14 | 61.75 | 1.04 |
| P40 | llama 8B Q4_0 | 4 | pp512@d32768 | 87.36 | 89.87 | 1.03 |
| P40 | llama 8B Q4_0 | 8 | pp512@d32768 | 122.68 | 122.31 | 1.00 |
| P40 | llama 8B Q4_0 | 16 | pp512@d32768 | 178.33 | 175.34 | 0.98 |
| P40 | llama 8B Q4_0 | 32 | pp512@d32768 | 189.92 | 190.07 | 1.00 |
| P40 | llama 8B Q4_0 | 64 | pp512@d32768 | 209.02 | 208.27 | 1.00 |
| P40 | llama 8B Q4_0 | 128 | pp512@d32768 | 217.96 | 217.49 | 1.00 |
| P40 | llama 8B Q4_0 | 256 | pp512@d32768 | 223.15 | 222.81 | 1.00 |
| P40 | llama 8B Q4_0 | 512 | pp512@d32768 | 219.45 | 219.48 | 1.00 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 1 | pp512@d32768 | 23.92 | 24.10 | 1.01 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 2 | pp512@d32768 | 43.49 | 43.68 | 1.00 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 4 | pp512@d32768 | 77.88 | 78.19 | 1.00 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 8 | pp512@d32768 | 108.82 | 96.17 | 0.88 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 16 | pp512@d32768 | 138.58 | 140.27 | 1.01 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 32 | pp512@d32768 | 151.39 | 152.96 | 1.01 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 64 | pp512@d32768 | 74.81 | 76.94 | 1.03 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 128 | pp512@d32768 | 101.46 | 102.30 | 1.01 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 256 | pp512@d32768 | 115.59 | 115.84 | 1.00 |
| Radeon 8060S Graphics | llama 8B Q4_0 | 512 | pp512@d32768 | 117.65 | 118.57 | 1.01 |
| RTX 3090 | llama 8B Q4_0 | 1 | pp512@d32768 | 87.54 | 92.96 | 1.06 |
| RTX 3090 | llama 8B Q4_0 | 2 | pp512@d32768 | 160.48 | 170.31 | 1.06 |
| RTX 3090 | llama 8B Q4_0 | 4 | pp512@d32768 | 293.48 | 303.46 | 1.03 |
| RTX 3090 | llama 8B Q4_0 | 8 | pp512@d32768 | 429.51 | 439.54 | 1.02 |
| RTX 3090 | llama 8B Q4_0 | 16 | pp512@d32768 | 844.62 | 874.15 | 1.03 |
| RTX 3090 | llama 8B Q4_0 | 32 | pp512@d32768 | 1184.30 | 1194.99 | 1.01 |
| RTX 3090 | llama 8B Q4_0 | 64 | pp512@d32768 | 1491.70 | 1495.43 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 128 | pp512@d32768 | 1612.42 | 1617.77 | 1.00 |
| RTX 3090 | llama 8B Q4_0 | 256 | pp512@d32768 | 1716.96 | 1697.92 | 0.99 |
| RTX 3090 | llama 8B Q4_0 | 512 | pp512@d32768 | 1470.93 | 1448.12 | 0.98 |
| RTX 4090 | llama 8B Q4_0 | 1 | pp512@d32768 | 98.14 | 102.76 | 1.05 |
| RTX 4090 | llama 8B Q4_0 | 2 | pp512@d32768 | 178.13 | 190.39 | 1.07 |
| RTX 4090 | llama 8B Q4_0 | 4 | pp512@d32768 | 349.90 | 366.50 | 1.05 |
| RTX 4090 | llama 8B Q4_0 | 8 | pp512@d32768 | 618.83 | 646.33 | 1.04 |
| RTX 4090 | llama 8B Q4_0 | 16 | pp512@d32768 | 1095.54 | 1140.84 | 1.04 |
| RTX 4090 | llama 8B Q4_0 | 32 | pp512@d32768 | 2007.89 | 2051.87 | 1.02 |
| RTX 4090 | llama 8B Q4_0 | 64 | pp512@d32768 | 3091.16 | 3089.09 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 128 | pp512@d32768 | 3188.55 | 3095.61 | 0.97 |
| RTX 4090 | llama 8B Q4_0 | 256 | pp512@d32768 | 2961.18 | 2892.63 | 0.98 |
| RTX 4090 | llama 8B Q4_0 | 512 | pp512@d32768 | 2464.56 | 2431.25 | 0.99 |
| RTX 5090 | llama 8B Q4_0 | 1 | pp512@d32768 | 155.78 | 167.41 | 1.07 |
| RTX 5090 | llama 8B Q4_0 | 2 | pp512@d32768 | 239.31 | 269.27 | 1.13 |
| RTX 5090 | llama 8B Q4_0 | 4 | pp512@d32768 | 461.48 | 486.56 | 1.05 |
| RTX 5090 | llama 8B Q4_0 | 8 | pp512@d32768 | 780.64 | 810.10 | 1.04 |
| RTX 5090 | llama 8B Q4_0 | 16 | pp512@d32768 | 1381.19 | 1408.61 | 1.02 |
| RTX 5090 | llama 8B Q4_0 | 32 | pp512@d32768 | 2253.55 | 2308.20 | 1.02 |
| RTX 5090 | llama 8B Q4_0 | 64 | pp512@d32768 | 2827.63 | 2828.64 | 1.00 |
| RTX 5090 | llama 8B Q4_0 | 128 | pp512@d32768 | 3009.14 | 3075.67 | 1.02 |
| RTX 5090 | llama 8B Q4_0 | 256 | pp512@d32768 | 3078.24 | 2981.31 | 0.97 |
| RTX 5090 | llama 8B Q4_0 | 512 | pp512@d32768 | 2698.04 | 2640.36 | 0.98 |
| RX 6800 | llama 8B Q4_0 | 1 | pp512@d32768 | 42.25 | 44.60 | 1.06 |
| RX 6800 | llama 8B Q4_0 | 2 | pp512@d32768 | 77.43 | 81.42 | 1.05 |
| RX 6800 | llama 8B Q4_0 | 4 | pp512@d32768 | 105.08 | 108.86 | 1.04 |
| RX 6800 | llama 8B Q4_0 | 8 | pp512@d32768 | 140.43 | 140.94 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 16 | pp512@d32768 | 173.28 | 175.32 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 32 | pp512@d32768 | 209.55 | 210.72 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 64 | pp512@d32768 | 235.46 | 235.80 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 128 | pp512@d32768 | 262.63 | 262.85 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 256 | pp512@d32768 | 274.40 | 274.65 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 512 | pp512@d32768 | 275.25 | 274.63 | 1.00 |
| RX 9060 XT | llama 8B Q4_0 | 1 | pp512@d32768 | 25.67 | 29.58 | 1.15 |
| RX 9060 XT | llama 8B Q4_0 | 2 | pp512@d32768 | 49.98 | 57.25 | 1.15 |
| RX 9060 XT | llama 8B Q4_0 | 4 | pp512@d32768 | 85.18 | 97.39 | 1.14 |
| RX 9060 XT | llama 8B Q4_0 | 8 | pp512@d32768 | 111.87 | 104.18 | 0.93 |
| RX 9060 XT | llama 8B Q4_0 | 16 | pp512@d32768 | 162.98 | 172.35 | 1.06 |
| RX 9060 XT | llama 8B Q4_0 | 32 | pp512@d32768 | 190.29 | 195.63 | 1.03 |
| RX 9060 XT | llama 8B Q4_0 | 64 | pp512@d32768 | 288.59 | 291.34 | 1.01 |
| RX 9060 XT | llama 8B Q4_0 | 128 | pp512@d32768 | 322.67 | 325.96 | 1.01 |
| RX 9060 XT | llama 8B Q4_0 | 256 | pp512@d32768 | 348.31 | 351.01 | 1.01 |
| RX 9060 XT | llama 8B Q4_0 | 512 | pp512@d32768 | 349.45 | 350.95 | 1.00 |

The performance numbers assume that the KQ mask is no longer being padded; this change is also part of this PR. I don't have a good overview of which other backends may still need support for this change and whether or not it should be reverted prior to merging.

@loci-dev force-pushed the upstream-PR17505-branch_JohannesGaessler-cuda-fa-mma-update-5 branch from 6455c6d to 2ef0c5f on November 25, 2025 23:35
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #328

PR Title: CUDA: Generalized MMA FlashAttention, Add Volta Support
Scope: CUDA FlashAttention kernel refactoring with Volta tensor core support


Overview

This PR refactors CUDA FlashAttention kernels to add Volta tensor core support and remove mask padding requirements. The changes affect 10 files with 935 additions and 735 deletions, primarily in CUDA kernel implementations. Performance metrics show negligible impact on core inference functions, with changes concentrated in utility functions and template infrastructure.

Key Findings

Performance-Critical Functions Impact

The analysis reveals no meaningful changes to core inference functions. The top 10 functions by Response Time change are utility functions in standard library and helper code:

Functions with increased Response Time:

  • begin in libmtmd.so: +176 ns (210% change, from 84 ns to 261 ns)
  • operator= in libllama.so: +71 ns (9% change, from 766 ns to 837 ns)
  • add_kv in libllama.so: +4 ns (17% change, from 25 ns to 29 ns)
  • stbi__setup_jpeg in libmtmd.so: +3 ns (19% change, from 18 ns to 21 ns)

Functions with decreased Response Time:

  • back_inserter in libllama.so: -180 ns (66% reduction, from 272 ns to 92 ns)
  • _S_max_size in libllama.so: -208 ns (56% reduction, from 371 ns to 162 ns)
  • rbegin in libllama.so: -116 ns (38% reduction, from 304 ns to 188 ns)

None of these functions are in the tokenization or inference critical path (llama_decode, llama_encode, llama_tokenize, llama_model_load).

Tokens Per Second Impact

No impact on inference throughput. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no Response Time or Throughput changes in this version. The modified CUDA kernels are template-based infrastructure changes that do not alter the execution path for existing GPU architectures. The performance benchmarks in the PR description show improvements ranging from 1.05x to 2.97x on V100 GPUs for various batch sizes, indicating the changes optimize GPU execution without affecting CPU inference paths.

Power Consumption Analysis

Power consumption changes are minimal across all binaries:

  • build.bin.libmtmd.so: +0.13% (+228 nJ)
  • build.bin.libllama.so: -0.02% (-52 nJ)
  • build.bin.libggml-base.so: -0.07% (-52 nJ)
  • All other binaries: 0.0% change

The power consumption variations are within measurement noise and reflect the minor changes in utility function execution times rather than algorithmic modifications.

Code Changes Analysis

Primary modifications:

  1. Mask padding removal: Changed GGML_KQ_MASK_PAD from 64 to 1, eliminating unnecessary padding overhead. The kernels now use modulo operations (fastmodulo) to handle non-padded masks without conditional branches.

  2. Volta tensor core support: Added DATA_SPLIT_MIRRORED tile types and Volta-specific MMA instructions. The tile structure now supports different data layouts across warp subgroups for Volta architecture (tile<8, 4, half2>).

  3. Configuration refactoring: Replaced template-based fattn_mma_f16_config structs with runtime configuration functions that pack parameters into uint32_t. This addresses ROCm compiler limitations with __launch_bounds__ templating. A rough sketch of the packing idea is given after this list.

  4. Out-of-bounds checking: Added oob_check template parameter to handle cases where ne11 % nbatch_fa != 0, zeroing out-of-bounds values instead of requiring padding.

  5. Type parameter changes: Changed ne01 from int32_t to uint3 for fast division support, and mask from half2* to half* for finer-grained access.
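
As a rough sketch of the packing idea behind item 3 (field names and bit widths are invented for illustration; the actual configuration fields and their layout differ):

```cuda
#include <cstdint>

// Sketch only: packing a few small kernel-configuration values into a single
// uint32_t so the kernel can be parameterized by one compile-time constant
// instead of being templated on every individual parameter.
constexpr uint32_t pack_fattn_config(uint32_t nwarps, uint32_t nbatch_fa, uint32_t nstages) {
    return (nwarps & 0xFF) | ((nbatch_fa & 0xFFF) << 8) | ((nstages & 0xF) << 20);
}

constexpr uint32_t cfg_nwarps   (uint32_t cfg) { return  cfg        & 0xFF;  }
constexpr uint32_t cfg_nbatch_fa(uint32_t cfg) { return (cfg >>  8) & 0xFFF; }
constexpr uint32_t cfg_nstages  (uint32_t cfg) { return (cfg >> 20) & 0xF;   }

// A kernel could then use the packed value directly, e.g.:
//   template <uint32_t cfg>
//   __launch_bounds__(32*cfg_nwarps(cfg))
//   __global__ void flash_attn_kernel(...);
```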

Impact on codebase:

The changes maintain backward compatibility for Turing and Ampere architectures while extending support to Volta. The refactoring consolidates three kernel variants (tile, WMMA, MMA) toward a unified MMA implementation. The WMMA kernel is now restricted to AMD GPUs (Volta was removed from the GGML_USE_WMMA_FATTN definition).

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp
PR #328: CUDA FlashAttention Generalization and Volta Support
Scope: 10 files modified (+935/-735 lines)


Analysis Outcome

This PR introduces CUDA FlashAttention kernel refactoring targeting GPU inference optimization. The analyzed performance metrics compare versions unrelated to this PR's commit (2ef0c5f63d4ac4d1b2622ae8b69be9406262f0f4). The measured changes affect CPU-side STL operations in libmtmd.so and libllama.so, not GPU kernels.

Code Changes: This PR modifies CUDA kernel implementations for V100 tensor core support, removes mask padding requirements (64→1), and refactors MMA configuration from templates to runtime lookup. Changes are isolated to GPU execution paths and do not affect CPU inference or tokenization functions.

Performance Metrics Context: The measured regressions in begin (+177 ns throughput), operator= (+70 ns throughput), and _M_allocate_buckets (+64 ns throughput) are STL container operations unrelated to GPU kernels. These functions are not in the inference critical path for token generation.

Inference Impact: No changes detected in llama_decode, llama_encode, or llama_tokenize functions. CPU-based token generation remains unaffected. The PR targets GPU-accelerated attention computation, which is orthogonal to the measured CPU-side metrics.

Power Consumption: libmtmd.so shows +0.134% (+228 nJ) increase, libllama.so shows -0.023% (-52 nJ) decrease. Net change across binaries is +176 nJ, representing negligible power impact. Changes stem from STL operations, not inference kernels.

Key Findings:

The performance metrics and code changes are decoupled. PR #328 implements GPU kernel optimizations (V100 support, mask handling improvements) that do not intersect with the measured CPU-side STL performance variations. The measured changes in iterator operations, memory allocation, and string handling are artifacts of different code paths or compiler optimizations unrelated to this PR's CUDA kernel modifications.

For GPU inference workloads, this PR is expected to improve V100 performance significantly while maintaining stability on Turing+ architectures based on the PR's benchmark data. CPU inference paths remain unchanged.

@loci-dev force-pushed the main branch 13 times, most recently from aaa8a85 to 9239ee7 on November 28, 2025 14:08
@loci-dev force-pushed the upstream-PR17505-branch_JohannesGaessler-cuda-fa-mma-update-5 branch from 2ef0c5f to 301ae30 on November 28, 2025 16:39
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #328 CUDA FlashAttention Optimization

Overview

PR #328 introduces CUDA FlashAttention enhancements targeting GPU inference optimization. The changes span 10 files with 938 additions and 738 deletions, primarily affecting CUDA kernel implementations. This analysis focuses on CPU-side performance metrics from static analysis, which show minimal impact on core inference paths.

Key Findings

Performance-Critical Functions Impact

The static analysis identified performance variations in STL container operations within libllama.so and libmtmd.so binaries. However, these functions are not in the critical inference path:

libllama.so STL Operations:

  • std::vector<std::pair<std::wstring, std::wstring>>::end(): Response time increased 113 ns (82 ns → 195 ns)
  • std::vector<llm_symbol>::end(): Response time increased 32 ns (50 ns → 82 ns)
  • std::back_inserter<std::vector<double>>(): Response time increased 32 ns (60 ns → 92 ns)

libmtmd.so Audio Functions:

  • ma_dr_flac__on_seek_memory(): Response time increased 24 ns (188 ns → 212 ns)
  • ma_dr_wav__on_seek_memory_write(): Response time increased 25 ns (233 ns → 259 ns)

These functions are utility operations for tokenization metadata and multimedia processing, not direct inference execution. The absolute changes are measured in nanoseconds, representing negligible overhead.

Inference Performance Impact

Critical Finding: No core inference functions (llama_decode, llama_encode, llama_tokenize) show performance changes in the static analysis. The PR targets CUDA GPU kernels, which operate independently from CPU-side tokenization and model execution measured in this analysis.

Tokens per Second Projection: Based on the reference point that a 2 ms slowdown in llama_decode corresponds to roughly a 7% TPS reduction, the observed nanosecond-level changes in non-inference functions translate to an unmeasurable TPS impact. The STL container operations occur during setup/teardown phases, not per-token processing.

Impacted Functions for Inference: None identified in CPU analysis. GPU kernel improvements (1.06-2.97x speedup on V100 per PR benchmarks) occur in CUDA code not captured by CPU binary analysis.

Power Consumption Analysis

libmtmd.so: Increased 508 nJ (130,247 nJ → 130,755 nJ), representing +0.39% change. This binary handles multimedia operations (audio decoding via miniaudio library), not core LLM inference.

libllama.so: Decreased 54 nJ (193,066 nJ → 193,012 nJ), representing -0.028% change. The reduction indicates slightly improved efficiency in the core library despite individual function variations.

Other Binaries: All other binaries (llama-bench, llama-quantize, llama-run, libggml.so, etc.) show 0.0% power consumption change, confirming the modifications are isolated to specific components.

Code Change Analysis

The PR implements:

  1. Volta tensor core support via generalized MMA kernel architecture
  2. Mask padding reduction (GGML_KQ_MASK_PAD: 64 → 1) for memory efficiency
  3. Out-of-bounds checking for flexible KV cache lengths (a minimal sketch follows this list)
  4. Configuration system refactoring using macro-based packing
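
A minimal sketch of the out-of-bounds handling described in item 3 (hypothetical helper; the real kernel integrates the check into its tiled loads and ensures out-of-bounds lanes do not influence the result):

```cuda
#include <cuda_fp16.h>

// Sketch only: when the KV length (ne11) is not a multiple of the
// per-iteration batch size, the last iteration may index past the end of K/V.
// With oob_check enabled, out-of-bounds elements are replaced by zero so the
// load never reads past the buffer; with oob_check disabled the branch
// compiles away entirely and costs nothing.
template <bool oob_check>
__device__ __forceinline__ half2 load_kv_elem(const half2 * data, const int idx, const int ne11) {
    if (oob_check && idx >= ne11) {
        return __float2half2_rn(0.0f);
    }
    return data[idx];
}
```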

These changes target GPU execution paths. The CPU-side STL regressions observed in static analysis stem from compiler optimization differences or structure layout changes, not algorithmic modifications. The wide string and symbol vector operations showing degradation are used in vocabulary management and Unicode handling during model initialization, not per-token inference loops.

Synthesis

The static analysis captures CPU binary performance while the PR optimizes GPU kernels. The observed nanosecond-level variations in STL containers and audio functions do not affect inference throughput. The power consumption changes are minimal across all binaries, with libllama.so showing slight improvement. For GPU inference workloads (the PR's target), the CUDA kernel optimizations deliver substantial speedups (up to 2.97x on Volta) without measurable CPU-side regression in core inference functions.

@loci-dev force-pushed the main branch 2 times, most recently from 9a74048 to af6127b on November 28, 2025 20:09
@loci-dev force-pushed the main branch 30 times, most recently from 4f731df to 8e6f6e8 on December 12, 2025 15:09