Skip to content

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Dec 3, 2025

Mirrored from ggml-org/llama.cpp#17737

Description

We implement the SSM_CONV operator using depthwise 1D convolution.
We use high-level builtin aclnnConvolution function.

The goal is to compute the following:

$$ y[i,j,k] = \sum_{l=0}^{dconv}w[l,i] x[l+j, i, k] $$

where the shape of $y$ is $[dinner, nt, ns]$, $x$ is $[dconv - 1 + nt, dinner, ns]$ and $w$ is $[dconv, dinner]$.

In order to use aclnnConvolution to implement this formula, we reshape the tensors and set the groups parameter to d_inner to calculate the convolution for each channel independently.

Testing

We ran test-backend-ops test suite for SSM_CONV on two different cards: 310P3 and 910B3.

34293ac6f3d37fd4488e48435b1853a3

a03110740632632a52e408b865c0a025

For the 310P3 card, it requires setting the cubeMathType parameter to ALLOW_FP32_DOWN_PRECISION, and it seems that causes the computation to be done not in f32, which in turn causes the tests to not pass with a small error (NMSE 0.000000114, greater than the allowed 1e-7). We had to override max_nmse_err() method for test_ssm_conv to set the maximum error to 1e-6 which allows the tests to pass.

On the 910B card, the operator runs in f32 natively, it passes the tests at the original 1e-7 precision.

Co-authored-by: Aleksei Lobanov, <[email protected]>
Co-authored-by: Sujin Kang, <[email protected]>
@loci-agentic-ai
Copy link

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #416 - CANN SSM_CONV Operator Implementation

Overview

PR #416 implements the SSM_CONV operator for the CANN backend, adding support for state-space model convolution operations on Ascend NPUs. The changes introduce 137 new lines across 4 files with no deletions, representing a pure feature addition rather than a modification of existing code paths.

Performance Impact Analysis

Power Consumption: Analysis across all binaries shows 0.0% change in power consumption between versions. The measured values for key binaries remain identical:

  • libllama.so: 194,195 nJ (no change)
  • libggml-cpu.so: 116,810 nJ (no change)
  • llama-run: 218,940 nJ (no change)

Inference Performance: No functions in the core inference path (llama_decode, llama_encode, llama_tokenize) were modified. The new ggml_cann_ssm_conv function is an isolated addition to the CANN backend operator set and does not affect existing CPU or GPU inference paths. Tokens per second for standard transformer models remains unchanged.

Code Changes:

  • New function ggml_cann_ssm_conv implements depthwise 1D convolution using aclnnConvolution
  • Tensor reshaping logic converts between GGML layout (CLN format) and CANN NCL format
  • Platform-specific handling for Ascend 310P3 cards sets cubeMathType=1 for FP32 precision
  • Switch case additions in ggml_cann_compute_forward and ggml_backend_cann_supports_op register the new operator
  • Test tolerance adjustment from 1e-7 to 1e-6 accommodates 310P3 precision behavior

Scope: This PR exclusively affects state-space models (Mamba, RWKV architectures) running on CANN backend. Standard transformer models and non-CANN backends are unaffected. The implementation adds 123 lines of tensor manipulation and convolution setup code without modifying any existing operator implementations.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 9612097 to c217e38 Compare December 6, 2025 08:10
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from b28744d to 4733ac4 Compare December 13, 2025 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants