UPSTREAM PR #17505: CUDA: generalized (mma) FA, add Volta support #328
Conversation
6455c6d to 2ef0c5f
Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #328
PR Title: CUDA: Generalized MMA FlashAttention, Add Volta Support

Overview
This PR refactors CUDA FlashAttention kernels to add Volta tensor core support and remove mask padding requirements. The changes affect 10 files with 935 additions and 735 deletions, primarily in CUDA kernel implementations. Performance metrics show negligible impact on core inference functions, with changes concentrated in utility functions and template infrastructure.

Key Findings

Performance-Critical Functions Impact
The analysis reveals no meaningful changes to core inference functions. The top 10 functions by Response Time change are utility functions in standard library and helper code:

Functions with increased Response Time: …

Functions with decreased Response Time: …

None of these functions are in the tokenization or inference critical path (llama_decode, llama_encode, llama_tokenize, llama_model_load).

Tokens Per Second Impact
No impact on inference throughput. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no Response Time or Throughput changes in this version. The modified CUDA kernels are template-based infrastructure changes that do not alter the execution path for existing GPU architectures. The performance benchmarks in the PR description show improvements ranging from 1.05x to 2.97x on V100 GPUs for various batch sizes, indicating the changes optimize GPU execution without affecting CPU inference paths.

Power Consumption Analysis
Power consumption changes are minimal across all binaries: …

The power consumption variations are within measurement noise and reflect the minor changes in utility function execution times rather than algorithmic modifications.

Code Changes Analysis
Primary modifications: …
Impact on codebase: The changes maintain backward compatibility for Turing and Ampere architectures while extending support to Volta. The refactoring consolidates three kernel variants (tile, WMMA, MMA) toward a unified MMA implementation. The WMMA kernel is now restricted to AMD GPUs only; Volta no longer uses the WMMA path.
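To make the consolidation described above concrete, the sketch below shows one way the per-architecture dispatch could look. It is a hypothetical illustration, not the llama.cpp implementation: the enum, function name, and parameters are invented, and only the relationships stated in this summary are encoded (Volta now qualifies for the MMA path, WMMA remains only for AMD, and the tile kernel stays as the fallback).

```cuda
// Hypothetical sketch of the kernel-selection logic described above; names and
// thresholds are illustrative, not taken from the llama.cpp source.
enum class fattn_kernel { TILE, WMMA, MMA };

// cc: NVIDIA compute capability * 100 (Volta = 700); is_amd/has_rocwmma describe AMD builds.
static fattn_kernel choose_fattn_kernel(int cc, bool is_amd, bool has_rocwmma) {
    if (is_amd) {
        // After this PR the WMMA variant is kept only for AMD GPUs (via rocWMMA).
        return has_rocwmma ? fattn_kernel::WMMA : fattn_kernel::TILE;
    }
    if (cc >= 700) {
        // Volta (7.0) and newer NVIDIA GPUs share the generalized MMA kernel.
        return fattn_kernel::MMA;
    }
    // Older NVIDIA GPUs without usable tensor cores fall back to the tile kernel.
    return fattn_kernel::TILE;
}
```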
53eeb3f to 2531f8a
Explore the complete analysis inside the Version Insights

Performance Analysis Summary
Project: llama.cpp

Analysis Outcome
This PR introduces CUDA FlashAttention kernel refactoring targeting GPU inference optimization. The analyzed performance metrics compare versions unrelated to this PR's commit (…).

Code Changes: This PR modifies CUDA kernel implementations for V100 tensor core support, removes mask padding requirements (64→1), and refactors MMA configuration from templates to runtime lookup. Changes are isolated to GPU execution paths and do not affect CPU inference or tokenization functions.

Performance Metrics Context: The measured regressions in …

Inference Impact: No changes detected in …

Power Consumption: libmtmd.so shows a +0.134% (+228 nJ) increase, libllama.so shows a -0.023% (-52 nJ) decrease. Net change across binaries is +176 nJ, representing negligible power impact. Changes stem from STL operations, not inference kernels.

Key Findings: The performance metrics and code changes are decoupled. PR #328 implements GPU kernel optimizations (V100 support, mask handling improvements) that do not intersect with the measured CPU-side STL performance variations. The measured changes in iterator operations, memory allocation, and string handling are artifacts of different code paths or compiler optimizations unrelated to this PR's CUDA kernel modifications. For GPU inference workloads, this PR is expected to improve V100 performance significantly while maintaining stability on Turing+ architectures, based on the PR's benchmark data. CPU inference paths remain unchanged.
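To illustrate what "refactors MMA configuration from templates to runtime lookup" can look like in practice, here is a minimal hedged sketch. None of the names, fields, or table values below come from the PR; the point is only the shape of the change: per-head-size parameters selected from a small table at launch time instead of being baked into template arguments.

```cuda
// Hypothetical sketch of a template-to-runtime-lookup refactor; the struct,
// table contents, and function name are illustrative only.
struct mma_config {
    int head_size;      // head dimension handled by this configuration
    int cols_per_block; // columns processed per block
    int kq_stride;      // KQ tile stride
};

static const mma_config k_mma_configs[] = {
    { 64, 16, 128},
    { 80, 16, 128},
    {128, 32, 128},
    {256, 32, 256},
};

// Looked up at kernel-launch time instead of instantiating one template per configuration.
static const mma_config * get_mma_config(int head_size) {
    for (const mma_config & cfg : k_mma_configs) {
        if (cfg.head_size == head_size) {
            return &cfg;
        }
    }
    return nullptr; // unsupported head size
}
```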
aaa8a85 to 9239ee7
2ef0c5f to 301ae30
Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #328 CUDA FlashAttention Optimization

Overview
PR #328 introduces CUDA FlashAttention enhancements targeting GPU inference optimization. The changes span 10 files with 938 additions and 738 deletions, primarily affecting CUDA kernel implementations. This analysis focuses on CPU-side performance metrics from static analysis, which show minimal impact on core inference paths.

Key Findings

Performance-Critical Functions Impact
The static analysis identified performance variations in STL container operations within the libllama.so and libmtmd.so binaries. However, these functions are not in the critical inference path:

libllama.so STL Operations: …

libmtmd.so Audio Functions: …
These functions are utility operations for tokenization metadata and multimedia processing, not direct inference execution. The absolute changes are measured in nanoseconds, representing negligible overhead.

Inference Performance Impact
Critical Finding: No core inference functions (llama_decode, llama_encode, llama_tokenize) show performance changes in the static analysis. The PR targets CUDA GPU kernels, which operate independently from the CPU-side tokenization and model execution measured in this analysis.

Tokens per Second Projection: Based on the reference that a 2 ms slower llama_decode causes a 7% TPS reduction, the observed nanosecond-level changes in non-inference functions translate to unmeasurable TPS impact. The STL container operations occur during setup/teardown phases, not per-token processing.

Impacted Functions for Inference: None identified in CPU analysis. GPU kernel improvements (1.06-2.97x speedup on V100 per PR benchmarks) occur in CUDA code not captured by CPU binary analysis.

Power Consumption Analysis
libmtmd.so: Increased 508 nJ (130,247 nJ → 130,755 nJ), a +0.39% change. This binary handles multimedia operations (audio decoding via the miniaudio library), not core LLM inference.

libllama.so: Decreased 54 nJ (193,066 nJ → 193,012 nJ), a -0.028% change. The reduction indicates slightly improved efficiency in the core library despite individual function variations.

Other Binaries: All other binaries (llama-bench, llama-quantize, llama-run, libggml.so, etc.) show 0.0% power consumption change, confirming the modifications are isolated to specific components.

Code Change Analysis
The PR implements: …
These changes target GPU execution paths. The CPU-side STL regressions observed in static analysis stem from compiler optimization differences or structure layout changes, not algorithmic modifications. The wide string and symbol vector operations showing degradation are used in vocabulary management and Unicode handling during model initialization, not in per-token inference loops.

Synthesis
The static analysis captures CPU binary performance while the PR optimizes GPU kernels. The observed nanosecond-level variations in STL containers and audio functions do not affect inference throughput. The power consumption changes are minimal across all binaries, with libllama.so showing a slight improvement. For GPU inference workloads (the PR's target), the CUDA kernel optimizations deliver substantial speedups (up to 2.97x on Volta) without measurable CPU-side regression in core inference functions.
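As a back-of-the-envelope check on the Tokens per Second Projection above, the following snippet works through the stated reference (2 ms slower llama_decode ≈ 7% TPS loss) and compares it to a nanosecond-scale delta. The 300 ns figure is illustrative only, chosen to represent the "nanosecond-level" changes mentioned in the summary.

```cuda
#include <cstdio>

int main() {
    // From the reference above: T / (T + 2 ms) = 0.93  =>  T ~ 26.6 ms per decode call.
    const double slowdown_ms = 2.0;
    const double tps_ratio   = 0.93;
    const double baseline_ms = tps_ratio * slowdown_ms / (1.0 - tps_ratio);

    // A few hundred nanoseconds of extra STL time is roughly five orders of
    // magnitude smaller than that baseline, hence "unmeasurable TPS impact".
    const double delta_ms = 300e-6; // 300 ns, illustrative value

    printf("baseline decode ~%.1f ms; a 300 ns delta is %.5f%% of it\n",
           baseline_ms, 100.0 * delta_ms / baseline_ms);
    return 0;
}
```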
9a74048 to af6127b
4f731df to 8e6f6e8
Mirrored from ggml-org/llama.cpp#17505
This PR makes the following changes to the CUDA FlashAttention code:
- The KQ mask no longer needs to be padded in the mask->ne[1] direction. This is done by applying a modulo on the mask column that is being read so no conditional statements need to be evaluated (see the sketch after this list). The impact on performance is negligible and I do not deem it necessary to compile additional template specializations. See "ggml : remove KQ mask padding" ggml-org/llama.cpp#16309. cc @ggerganov.
- The tile template in mma.cuh has been extended with additional, optional arguments to safely handle situations where tiles of the same shape can have different physical data layouts.
- … __launch_bounds__ when using ROCm (as of right now ROCm is not used).
- The mma kernel no longer requires padding of K->ne[1]. As with the tile kernel, because this comes at a cost to performance it is still preferable to pad the KV cache length. As of right now this is still required to be 256; for the currently supported GPUs it should be possible to lower this to 128 without issue once the WMMA kernel has been completely replaced. For Hopper it may still make sense to have a padding of 256, but as it is I have no idea whether the 256x64 instruction would actually have better performance than the 128x64 instruction.
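A minimal sketch of the modulo trick from the first bullet, under assumed names (`mask`, `col`, `mask_ne1`, and the function itself are placeholders, not the actual kernel variables): wrapping the read index keeps every access in bounds without a divergent bounds check.

```cuda
#include <cuda_fp16.h>

// Illustrative device-side sketch of reading an unpadded KQ mask; this is not
// the actual FlashAttention kernel code. `mask` has only `mask_ne1` valid
// columns, but the tile processed by a block may extend past that.
__device__ float load_mask_unpadded(const half * mask, int col, int mask_ne1) {
    // Instead of branching on `if (col < mask_ne1)`, wrap the index with a modulo.
    // Out-of-range lanes read a valid but irrelevant value; presumably those
    // results are discarded downstream (assumption, not verified against the kernel).
    return __half2float(mask[col % mask_ne1]);
}
```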
As of right now the interface in mma.cuh is suboptimal and long-term I intend to refactor it to allow the use of tensor cores in a more uniform way. However, I don't know the exact requirements until we have proper support for AMD WMMA and AMD MFMA instructions. So for now I think the correct choice is to prioritize getting working support for those at the cost of maintainability and to do a refactor afterwards.

V100 performance
Other GPU performance
The performance numbers assume that the KQ mask is no longer being padded; this change is also in this PR. I don't have a good overview of which other backends may still need support for this change and whether or not it should be reverted prior to merging.
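For context on what dropping the mask padding means on the allocation side, here is a small hedged sketch. The 64-row padding figure is taken from the analysis comments above (64→1); the rounding macro is written locally rather than quoting the ggml headers, and all values are example numbers.

```cuda
#include <cstdio>

// Round x up to a multiple of n (n must be a power of two).
#define PAD_UP(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    const int n_kv     = 4096; // example KV cache length
    const int n_tokens = 1;    // single-token decode

    // Before: the KQ mask's token dimension was padded up to a multiple of 64
    // (per the analysis above). After this PR it can match the real token count.
    const int rows_padded   = PAD_UP(n_tokens, 64);
    const int rows_unpadded = n_tokens;

    printf("mask elements before: %d, after: %d\n",
           n_kv * rows_padded, n_kv * rows_unpadded);
    return 0;
}
```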