Conversation

@noemotiovon (Collaborator)

Add operator fusion optimization to improve performance by fusing compatible operations into single kernel calls. Currently supports fusing ADD and RMS_NORM operations.

Changes:

  • Add new environment variable GGML_CANN_OPERATOR_FUSION to enable/disable operator fusion (default: false)
  • Implement ggml_cann_op_add_rms_norm_fused() function that fuses ADD and RMS_NORM operations using aclnnAddRmsNorm API
  • Add ggml_cann_can_fuse() helper function to check if operations can be fused in CANN backend
  • Update evaluate_and_capture_cann_graph() to detect and apply operator fusion when enabled

This optimization reduces overhead between operations and improves overall computational efficiency for models using ADD followed by RMS_NORM patterns.
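
As a hedged illustration of how the environment-variable gate might look (GGML_CANN_OPERATOR_FUSION is the variable named in this PR; the helper name and parsing details below are assumptions, not the PR's actual code):

#include <cstdlib>
#include <cstring>

// Hypothetical sketch: read GGML_CANN_OPERATOR_FUSION once and cache the
// result. Defaults to false, matching the PR description; the accepted
// value "1" is an assumption.
static bool cann_operator_fusion_enabled() {
    const char * env = std::getenv("GGML_CANN_OPERATOR_FUSION");
    return env != nullptr && std::strcmp(env, "1") == 0;
}

With a gate like this, evaluate_and_capture_cann_graph() can consult a single cached boolean (opt_fusion in the review hunk below) instead of re-reading the environment for every node.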


@github-actions bot added labels on Nov 26, 2025: documentation (Improvements or additions to documentation), ggml (changes relating to the ggml tensor library for machine learning), Ascend NPU (issues specific to Ascend NPUs)
@noemotiovon (Collaborator Author)

Before:

GGML_CANN_OPERATOR_FUSION=0 ......
......
common_perf_print:    sampling time =     233.89 ms
common_perf_print:    samplers time =      85.53 ms /   341 tokens
common_perf_print:        load time =    1825.05 ms
common_perf_print: prompt eval time =      15.38 ms /    20 tokens (    0.77 ms per token,  1300.39 tokens per second)
common_perf_print:        eval time =    1392.89 ms /   320 runs   (    4.35 ms per token,   229.74 tokens per second)
common_perf_print:       total time =    2342.54 ms /   340 tokens
common_perf_print: unaccounted time =     700.38 ms /  29.9 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        318
llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CANN0 (Ascend910B4) | 30196 = 28457 + (1300 =   942 +      48 +     310) +         438 |
llama_memory_breakdown_print: |   - Host                |                   269 =   259 +       0 +       9                |

After:

GGML_CANN_OPERATOR_FUSION=1 ......
......
common_perf_print:    sampling time =     266.40 ms
common_perf_print:    samplers time =      99.42 ms /   411 tokens
common_perf_print:        load time =    1773.20 ms
common_perf_print: prompt eval time =      15.42 ms /    20 tokens (    0.77 ms per token,  1297.44 tokens per second)
common_perf_print:        eval time =    1679.44 ms /   390 runs   (    4.31 ms per token,   232.22 tokens per second)
common_perf_print:       total time =    2562.77 ms /   410 tokens
common_perf_print: unaccounted time =     601.52 ms /  23.5 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        388
llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CANN0 (Ascend910B4) | 30196 = 28458 + (1300 =   942 +      48 +     310) +         436 |
llama_memory_breakdown_print: |   - Host                |                   269 =   259 +       0 +       9                |

@noemotiovon (Collaborator Author)

Op Test:

new_pool_for_device: device 0 use vmm pool
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000000,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000001,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.000100,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=0.100000,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[64,5,4,3],eps=1.000000,broadcast=1): OK
  ADD_RMS_NORM(type=f32,ne=[1,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[511,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[1025,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[8192,1,1,1],eps=0.000001,broadcast=0): OK
  ADD_RMS_NORM(type=f32,ne=[16896,1,1,1],eps=0.000001,broadcast=0): OK
  15/15 tests passed
  Backend CANN0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK


### GGML_CANN_OPERATOR_FUSION

Enable operator fusion during computation (default: false). This option fuses compatible operators (e.g., ADD + RMS_NORM) to reduce overhead and improve performance.

(Collaborator)

If this feature has the potential to improve performance, should we enable it by default?

(Collaborator)

It's better if the user doesn't have to set any parameters.

(Collaborator Author)

In some scenarios it may bring performance improvements, but it may also introduce unexpected issues. For now it should be treated as experimental; once the feature is stable, it will be enabled by default.

for (int i = 0; i < cgraph->n_nodes; i++) {
    ggml_tensor * node = cgraph->nodes[i];
    if (opt_fusion) {
        if (ggml_cann_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM })) {

(Collaborator)

Suggested change:
- if (ggml_cann_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM })) {
+ if (ggml_cann_can_fuse(cgraph, i, { cgraph->nodes[i]->op, cgraph->nodes[i+1]->op })) {

(Collaborator Author)

No changes are needed here. The underlying layer calls ggml's generic fuse check, which determines whether the operator sequence starting at the i-th node of the current cgraph matches.
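
A minimal sketch of such a wrapper, assuming it delegates to the generic ggml_can_fuse() helper from ggml-impl.h (the F32 restriction is an illustrative assumption, not something this PR confirms):

#include "ggml-impl.h"

// Sketch only: first run the generic fuse check, then a backend-specific
// constraint. ggml_can_fuse() verifies that the ops sequence matches starting
// at node_idx and that intermediate results have no other consumers.
static bool ggml_cann_can_fuse(const ggml_cgraph * cgraph, int node_idx,
                               std::initializer_list<enum ggml_op> ops) {
    if (!ggml_can_fuse(cgraph, node_idx, ops)) {
        return false;
    }
    // assumed CANN-specific constraint: fused aclnnAddRmsNorm path covers F32
    return cgraph->nodes[node_idx]->type == GGML_TYPE_F32;
}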

}

void ggml_cann_op_add_rms_norm_fused(ggml_backend_cann_context & ctx,
                                     ggml_tensor * dst,

(Collaborator)

Should the result be written to rms_norm's dst?
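
For context, a hypothetical dispatch following the convention used elsewhere in ggml backends, where the last node of the fused pattern (the RMS_NORM node) is passed as dst so the fused kernel writes the final result into that node's buffer (names and the trailing parameters elided in the quoted signature above are assumptions):

if (ggml_cann_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM })) {
    // the ADD node is cgraph->nodes[i]; its sources feed the fused kernel
    ggml_tensor * rms_norm = cgraph->nodes[i + 1];
    ggml_cann_op_add_rms_norm_fused(ctx, rms_norm /*, remaining args elided */);
    i++;      // skip the RMS_NORM node: the fused kernel already produced it
    continue;
}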

