Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17512

Add an operator fusion optimization that improves performance by fusing compatible operations into a single kernel call. Currently, fusing ADD and RMS_NORM operations is supported.

Changes:

  • Add new environment variable GGML_CANN_OPERATOR_FUSION to enable/disable operator fusion (default: false)
  • Implement ggml_cann_op_add_rms_norm_fused() function that fuses ADD and RMS_NORM operations using aclnnAddRmsNorm API
  • Add ggml_cann_can_fuse() helper function to check if operations can be fused in CANN backend
  • Update evaluate_and_capture_cann_graph() to detect and apply operator fusion when enabled

This optimization reduces kernel launch and intermediate memory overhead between the two operations, improving overall computational efficiency for models that use the ADD followed by RMS_NORM pattern.
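For orientation, a minimal C++ sketch of the adjacency check behind ggml_cann_can_fuse() is shown below. It is written against ggml's public graph accessors; the helper name and body are an illustration of the pattern, not the PR's actual code.

```cpp
#include "ggml.h"

// Hypothetical sketch -- not the PR's actual body. Returns true when node i
// is an ADD whose output is consumed by the immediately following RMS_NORM
// node, making the pair a candidate for a single fused kernel launch.
static bool cann_can_fuse_add_rms_norm(ggml_cgraph * graph, int i) {
    if (i + 1 >= ggml_graph_n_nodes(graph)) {
        return false;
    }
    const ggml_tensor * add  = ggml_graph_node(graph, i);
    const ggml_tensor * norm = ggml_graph_node(graph, i + 1);

    return add->op  == GGML_OP_ADD      &&
           norm->op == GGML_OP_RMS_NORM &&
           norm->src[0] == add;  // RMS_NORM must read the ADD result directly
}
```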

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #331 - CANN Operator Fusion

Overview

This PR introduces operator fusion for ADD+RMS_NORM operations in the CANN backend, controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: disabled). The implementation adds a new fused kernel ggml_cann_op_add_rms_norm_fused() that combines two sequential operations into a single CANN API call.

Performance Metrics

Analysis shows no measurable performance differences between versions d62f9a1e-8714-4ac6-b038-7cd026d25a68 and aab9b31c-ad35-48ba-b9fe-4c0fd3dc2df2. All binaries exhibit 0.0% change in power consumption, and no functions show Response Time or Throughput Time variations.

Key Findings

Power Consumption Analysis:
All 16 analyzed binaries maintain identical power consumption profiles. The largest consumers remain unchanged:

  • build.bin.llama-tts: 285,154 nJ (0.0% change)
  • build.bin.llama-cvector-generator: 278,999 nJ (0.0% change)
  • build.bin.llama-run: 245,370 nJ (0.0% change)
  • build.bin.libllama.so: 228,844 nJ (-0.0% change, sub-nanojoule difference)

Inference Impact:
No functions in the tokenization/inference pipeline show performance changes. Core functions llama_decode, llama_encode, and llama_tokenize maintain baseline performance. Therefore, tokens per second remains unchanged for inference workloads.

Code Implementation:
The PR adds 112 lines implementing fusion logic in ggml-cann.cpp and aclnn_ops.cpp. The fusion path is only activated when GGML_CANN_OPERATOR_FUSION=true, explaining the zero performance delta in default configuration. The implementation uses CANN's native aclnnAddRmsNorm API to combine ADD and RMS_NORM operations, eliminating intermediate memory operations and kernel launch overhead when enabled.
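As an illustration of that gate, a getenv-based check along the following lines would produce this opt-in behavior. The variable name comes from the PR; the accepted truthy spellings and the one-time caching are assumptions.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the opt-in gate (the variable name is the PR's;
// the accepted spellings are an assumption). Evaluated once and cached.
static bool cann_fusion_enabled() {
    static const bool enabled = [] {
        const char * env = std::getenv("GGML_CANN_OPERATOR_FUSION");
        return env != nullptr && (std::strcmp(env, "1")    == 0 ||
                                  std::strcmp(env, "true") == 0 ||
                                  std::strcmp(env, "on")   == 0);
    }();
    return enabled;
}
```

With a gate like this, fusion is enabled per run (for example by exporting GGML_CANN_OPERATOR_FUSION=true before launching), while the default path stays unchanged.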

Affected Components:
Changes are isolated to the CANN backend (ggml/src/ggml-cann/ directory); core inference functions and the CPU/CUDA backends are untouched. Fusion detection occurs in evaluate_and_capture_cann_graph() during graph execution, which skips the fused RMS_NORM node after processing the combined operation.
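The detect-fuse-skip flow could take roughly the following shape, reusing the two helper sketches above. The context type and the two forward declarations are stand-ins for backend internals whose exact signatures this summary does not show.

```cpp
#include "ggml.h"

struct cann_ctx;  // stand-in for the backend's real context type (assumption)

// Assumed-shape declarations: the fused kernel comes from this PR, the
// per-op dispatcher from the existing backend; exact signatures may differ.
void ggml_cann_op_add_rms_norm_fused(cann_ctx & ctx, ggml_tensor * add, ggml_tensor * rms_norm);
void ggml_cann_compute_forward(cann_ctx & ctx, ggml_tensor * node);

// Sketch of the fusion-aware node loop in evaluate_and_capture_cann_graph().
static void evaluate_graph_with_fusion(cann_ctx & ctx, ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); i++) {
        ggml_tensor * node = ggml_graph_node(cgraph, i);

        if (cann_fusion_enabled() && cann_can_fuse_add_rms_norm(cgraph, i)) {
            // One fused launch covers both nodes ...
            ggml_cann_op_add_rms_norm_fused(ctx, node, ggml_graph_node(cgraph, i + 1));
            i++;  // ... so the already-handled RMS_NORM node is skipped.
            continue;
        }

        ggml_cann_compute_forward(ctx, node);  // regular one-kernel-per-op path
    }
}
```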

Conclusion:
The measured performance equivalence reflects the opt-in nature of this optimization. When enabled, the fusion is expected to reduce memory bandwidth and kernel launch overhead for models using ADD+RMS_NORM patterns, but default behavior remains unchanged, ensuring stability.

@loci-dev force-pushed the main branch 2 times, most recently from 53eeb3f to 2531f8a on November 26, 2025 at 08:11
@loci-dev force-pushed the upstream-PR17512-branch_noemotiovon-cann_opt_fused branch from 3d516d1 to 83a1a22 on November 26, 2025 at 08:41
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #331

Overview

This PR introduces operator fusion for ADD+RMS_NORM operations in the CANN backend. The analysis shows no measurable performance changes across all binaries and functions between versions.

Analysis Results

Performance Metrics:

  • All 16 binaries show zero or negligible power consumption change (< 0.001%)
  • No function-level Response Time or Throughput Time changes detected
  • build.bin.libllama.so: -0.36 nJ (effectively zero)
  • build.bin.llama-run: -0.22 nJ (measurement noise)
  • build.bin.llama-tts: +0.05 nJ (measurement noise)

Code Changes:
The PR adds a new fused kernel, ggml_cann_op_add_rms_norm_fused(), that combines ADD and RMS_NORM operations into a single CANN API call. The optimization is disabled by default and is enabled via the GGML_CANN_OPERATOR_FUSION environment variable.

Inference Impact:
No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no Response Time or Throughput Time changes. The fusion optimization targets CANN backend operations, which were not active in the measured configuration.

Power Consumption:
No meaningful power consumption changes across any binary. All variations are within measurement precision limits (sub-nanojoule range).

The versions are functionally equivalent from a performance perspective. The added fusion capability provides infrastructure for future optimization when enabled, but introduces no overhead when disabled.

@loci-dev force-pushed the main branch 22 times, most recently from fc0f51d to 89ba2e9 on November 29, 2025 at 21:07
@loci-dev force-pushed the main branch 30 times, most recently from b28744d to 4733ac4 on December 13, 2025 at 12:13