Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17512

Add an operator fusion optimization that improves performance by fusing compatible operations into a single kernel call. Currently, fusing ADD and RMS_NORM operations is supported.

Changes:

  • Add new environment variable GGML_CANN_OPERATOR_FUSION to enable/disable operator fusion (default: false)
  • Implement ggml_cann_op_add_rms_norm_fused() function that fuses ADD and RMS_NORM operations using aclnnAddRmsNorm API
  • Add ggml_cann_can_fuse() helper function to check if operations can be fused in CANN backend
  • Update evaluate_and_capture_cann_graph() to detect and apply operator fusion when enabled

This optimization reduces kernel launch and intermediate memory overhead between the two operations, improving overall computational efficiency for models that use the ADD followed by RMS_NORM pattern.
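For orientation, a minimal C++ sketch of the adjacency check behind ggml_cann_can_fuse() is shown below. It is written against ggml's public graph accessors; the helper name and body are an illustration of the pattern, not the PR's actual code.

```cpp
#include "ggml.h"

// Hypothetical sketch -- not the PR's actual body. Returns true when node i
// is an ADD whose output is consumed by the immediately following RMS_NORM
// node, making the pair a candidate for a single fused kernel launch.
static bool cann_can_fuse_add_rms_norm(ggml_cgraph * graph, int i) {
    if (i + 1 >= ggml_graph_n_nodes(graph)) {
        return false;
    }
    const ggml_tensor * add  = ggml_graph_node(graph, i);
    const ggml_tensor * norm = ggml_graph_node(graph, i + 1);

    return add->op  == GGML_OP_ADD      &&
           norm->op == GGML_OP_RMS_NORM &&
           norm->src[0] == add;  // RMS_NORM must read the ADD result directly
}
```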

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #331 - CANN Operator Fusion

Overview

This PR introduces operator fusion for ADD+RMS_NORM operations in the CANN backend, controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: disabled). The implementation adds a new fused kernel ggml_cann_op_add_rms_norm_fused() that combines two sequential operations into a single CANN API call.

Performance Metrics

Analysis shows no measurable performance differences between versions d62f9a1e-8714-4ac6-b038-7cd026d25a68 and aab9b31c-ad35-48ba-b9fe-4c0fd3dc2df2. All binaries exhibit 0.0% change in power consumption, and no functions show Response Time or Throughput Time variations.

Key Findings

Power Consumption Analysis:
All 16 analyzed binaries maintain identical power consumption profiles. The largest consumers remain unchanged:

  • build.bin.llama-tts: 285,154 nJ (0.0% change)
  • build.bin.llama-cvector-generator: 278,999 nJ (0.0% change)
  • build.bin.llama-run: 245,370 nJ (0.0% change)
  • build.bin.libllama.so: 228,844 nJ (-0.0% change, sub-nanojoule difference)

Inference Impact:
No functions in the tokenization/inference pipeline show performance changes. Core functions llama_decode, llama_encode, and llama_tokenize maintain baseline performance. Therefore, tokens per second remains unchanged for inference workloads.

Code Implementation:
The PR adds 112 lines implementing fusion logic in ggml-cann.cpp and aclnn_ops.cpp. The fusion path is only activated when GGML_CANN_OPERATOR_FUSION=true, explaining the zero performance delta in default configuration. The implementation uses CANN's native aclnnAddRmsNorm API to combine ADD and RMS_NORM operations, eliminating intermediate memory operations and kernel launch overhead when enabled.
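As an illustration of that gate, a getenv-based check along the following lines would produce this opt-in behavior. The variable name comes from the PR; the accepted truthy spellings and the one-time caching are assumptions.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the opt-in gate (the variable name is the PR's;
// the accepted spellings are an assumption). Evaluated once and cached.
static bool cann_fusion_enabled() {
    static const bool enabled = [] {
        const char * env = std::getenv("GGML_CANN_OPERATOR_FUSION");
        return env != nullptr && (std::strcmp(env, "1")    == 0 ||
                                  std::strcmp(env, "true") == 0 ||
                                  std::strcmp(env, "on")   == 0);
    }();
    return enabled;
}
```

With a gate like this, fusion is enabled per run (for example by exporting GGML_CANN_OPERATOR_FUSION=true before launching), while the default path stays unchanged.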

Affected Components:
Changes are isolated to the CANN backend (ggml/src/ggml-cann/ directory); core inference functions and the CPU/CUDA backends are untouched. Fusion detection occurs in evaluate_and_capture_cann_graph() during graph execution, which skips the fused RMS_NORM node after processing the combined operation.
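The detect-fuse-skip flow could take roughly the following shape, reusing the two helper sketches above. The context type and the two forward declarations are stand-ins for backend internals whose exact signatures this summary does not show.

```cpp
#include "ggml.h"

struct cann_ctx;  // stand-in for the backend's real context type (assumption)

// Assumed-shape declarations: the fused kernel comes from this PR, the
// per-op dispatcher from the existing backend; exact signatures may differ.
void ggml_cann_op_add_rms_norm_fused(cann_ctx & ctx, ggml_tensor * add, ggml_tensor * rms_norm);
void ggml_cann_compute_forward(cann_ctx & ctx, ggml_tensor * node);

// Sketch of the fusion-aware node loop in evaluate_and_capture_cann_graph().
static void evaluate_graph_with_fusion(cann_ctx & ctx, ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); i++) {
        ggml_tensor * node = ggml_graph_node(cgraph, i);

        if (cann_fusion_enabled() && cann_can_fuse_add_rms_norm(cgraph, i)) {
            // One fused launch covers both nodes ...
            ggml_cann_op_add_rms_norm_fused(ctx, node, ggml_graph_node(cgraph, i + 1));
            i++;  // ... so the already-handled RMS_NORM node is skipped.
            continue;
        }

        ggml_cann_compute_forward(ctx, node);  // regular one-kernel-per-op path
    }
}
```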

Conclusion:
The measured performance equivalence reflects the opt-in nature of this optimization. When enabled, the fusion is expected to reduce memory bandwidth and kernel launch overhead for models using ADD+RMS_NORM patterns, but default behavior remains unchanged, ensuring stability.

@loci-dev force-pushed the main branch 2 times, most recently from 53eeb3f to 2531f8a on November 26, 2025 at 08:11
@loci-dev force-pushed the upstream-PR17512-branch_noemotiovon-cann_opt_fused branch from 3d516d1 to 83a1a22 on November 26, 2025 at 08:41
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #331

Overview

This PR introduces operator fusion for ADD+RMS_NORM operations in the CANN backend. The analysis shows no measurable performance changes across all binaries and functions between versions.

Analysis Results

Performance Metrics:

  • All 16 binaries show zero or negligible power consumption change (< 0.001%)
  • No function-level Response Time or Throughput Time changes detected
  • build.bin.libllama.so: -0.36 nJ (effectively zero)
  • build.bin.llama-run: -0.22 nJ (measurement noise)
  • build.bin.llama-tts: +0.05 nJ (measurement noise)

Code Changes:
The PR adds a new fused kernel, ggml_cann_op_add_rms_norm_fused(), that combines ADD and RMS_NORM operations into a single CANN API call. The optimization is disabled by default and is enabled via the GGML_CANN_OPERATOR_FUSION environment variable.

Inference Impact:
No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no Response Time or Throughput Time changes. The fusion optimization targets CANN backend operations, which were not active in the measured configuration.

Power Consumption:
No meaningful power consumption changes across any binary. All variations are within measurement precision limits (sub-nanojoule range).

The versions are functionally equivalent from a performance perspective. The added fusion capability provides infrastructure for future optimization when enabled, but introduces no overhead when disabled.

@loci-dev force-pushed the main branch 22 times, most recently from fc0f51d to 89ba2e9 on November 29, 2025 at 21:07
@loci-dev force-pushed the main branch 30 times, most recently from b28744d to 4733ac4 on December 13, 2025 at 12:13