Conversation

@noemotiovon (Collaborator)

Add operator fusion optimization to improve performance by fusing compatible operations into single kernel calls. Currently supports fusing ADD and RMS_NORM operations.

Changes:

  • Add new environment variable GGML_CANN_OPERATOR_FUSION to enable/disable operator fusion (default: false)
  • Implement ggml_cann_op_add_rms_norm_fused() function that fuses ADD and RMS_NORM operations using aclnnAddRmsNorm API
  • Add ggml_cann_can_fuse() helper function to check if operations can be fused in CANN backend
  • Update evaluate_and_capture_cann_graph() to detect and apply operator fusion when enabled

This optimization reduces overhead between operations and improves overall computational efficiency for models using ADD followed by RMS_NORM patterns.
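
As a hedged illustration of how the environment-variable gate might look (GGML_CANN_OPERATOR_FUSION is the variable named in this PR; the helper name and parsing details below are assumptions, not the PR's actual code):

#include <cstdlib>
#include <cstring>

// Hypothetical sketch: read GGML_CANN_OPERATOR_FUSION once and cache the
// result. Defaults to false, matching the PR description; the accepted
// value "1" is an assumption.
static bool cann_operator_fusion_enabled() {
    const char * env = std::getenv("GGML_CANN_OPERATOR_FUSION");
    return env != nullptr && std::strcmp(env, "1") == 0;
}

With a gate like this, evaluate_and_capture_cann_graph() can consult a single cached boolean (opt_fusion in the review hunk below) instead of re-reading the environment for every node.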


@github-actions bot added labels on Nov 26, 2025: documentation (Improvements or additions to documentation), ggml (changes relating to the ggml tensor library for machine learning), Ascend NPU (issues specific to Ascend NPUs)
@noemotiovon (Collaborator Author)

Before:

GGML_CANN_OPERATOR_FUSION=0 ......
......
common_perf_print:    sampling time =     233.89 ms
common_perf_print:    samplers time =      85.53 ms /   341 tokens
common_perf_print:        load time =    1825.05 ms
common_perf_print: prompt eval time =      15.38 ms /    20 tokens (    0.77 ms per token,  1300.39 tokens per second)
common_perf_print:        eval time =    1392.89 ms /   320 runs   (    4.35 ms per token,   229.74 tokens per second)
common_perf_print:       total time =    2342.54 ms /   340 tokens
common_perf_print: unaccounted time =     700.38 ms /  29.9 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        318
llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CANN0 (Ascend910B4) | 30196 = 28457 + (1300 =   942 +      48 +     310) +         438 |
llama_memory_breakdown_print: |   - Host                |                   269 =   259 +       0 +       9                |

After:

GGML_CANN_OPERATOR_FUSION=1 ......
......
common_perf_print:    sampling time =     266.40 ms
common_perf_print:    samplers time =      99.42 ms /   411 tokens
common_perf_print:        load time =    1773.20 ms
common_perf_print: prompt eval time =      15.42 ms /    20 tokens (    0.77 ms per token,  1297.44 tokens per second)
common_perf_print:        eval time =    1679.44 ms /   390 runs   (    4.31 ms per token,   232.22 tokens per second)
common_perf_print:       total time =    2562.77 ms /   410 tokens
common_perf_print: unaccounted time =     601.52 ms /  23.5 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        388
llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CANN0 (Ascend910B4) | 30196 = 28458 + (1300 =   942 +      48 +     310) +         436 |
llama_memory_breakdown_print: |   - Host                |                   269 =   259 +       0 +       9                |

@noemotiovon (Collaborator Author)

Op Test:

new_pool_for_device: device 0 use vmm pool
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0): OK
  15/15 tests passed
  Backend CANN0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK


### GGML_CANN_OPERATOR_FUSION

Enable operator fusion during computation (default: false). This option fuses compatible operators (e.g., ADD + RMS_NORM) to reduce overhead and improve performance.

(Collaborator)

If this feature has the potential to improve performance, should we enable it by default?

(Collaborator)

It's better if the user doesn't have to set any parameters.

(Collaborator Author)

In some scenarios it may bring performance improvements, but it may also introduce unexpected issues. For now it should be treated as experimental; once the feature is stable, it will be enabled by default.

for (int i = 0; i < cgraph->n_nodes; i++) {
    ggml_tensor * node = cgraph->nodes[i];
    if (opt_fusion) {
        if (ggml_cann_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM })) {

(Collaborator)

Suggested change:
- if (ggml_cann_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM })) {
+ if (ggml_cann_can_fuse(cgraph, i, { cgraph->nodes[i]->op, cgraph->nodes[i+1]->op })) {

(Collaborator Author)

No changes are needed here. The underlying layer calls ggml's generic fuse check, which determines whether the operator sequence starting at the i-th node of the current cgraph matches.
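
A minimal sketch of such a wrapper, assuming it delegates to the generic ggml_can_fuse() helper from ggml-impl.h (the F32 restriction is an illustrative assumption, not something this PR confirms):

#include "ggml-impl.h"

// Sketch only: first run the generic fuse check, then a backend-specific
// constraint. ggml_can_fuse() verifies that the ops sequence matches starting
// at node_idx and that intermediate results have no other consumers.
static bool ggml_cann_can_fuse(const ggml_cgraph * cgraph, int node_idx,
                               std::initializer_list<enum ggml_op> ops) {
    if (!ggml_can_fuse(cgraph, node_idx, ops)) {
        return false;
    }
    // assumed CANN-specific constraint: fused aclnnAddRmsNorm path covers F32
    return cgraph->nodes[node_idx]->type == GGML_TYPE_F32;
}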

}

void ggml_cann_op_add_rms_norm_fused(ggml_backend_cann_context & ctx,
                                     ggml_tensor * dst,

(Collaborator)

Should the result be written to rms_norm's dst?
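
For context, a hypothetical dispatch following the convention used elsewhere in ggml backends, where the last node of the fused pattern (the RMS_NORM node) is passed as dst so the fused kernel writes the final result into that node's buffer (names and the trailing parameters elided in the quoted signature above are assumptions):

if (ggml_cann_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM })) {
    // the ADD node is cgraph->nodes[i]; its sources feed the fused kernel
    ggml_tensor * rms_norm = cgraph->nodes[i + 1];
    ggml_cann_op_add_rms_norm_fused(ctx, rms_norm /*, remaining args elided */);
    i++;      // skip the RMS_NORM node: the fused kernel already produced it
    continue;
}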

