
@loci-dev

Mirrored from ggml-org/llama.cpp#17585

Improve Qwen3Next inference speed

This PR improves Qwen3Next inference speed by eliminating the fwd_exp_upd_recur_sts bottleneck (compare the reports below).

@pwilkin: I don't have a discrete GPU; could you check whether this PR improves the speed?

main

============================================================
                    Performance Report
============================================================
Operation Name           Total(ms)   Calls     Avg(ms)        Pct(%)
------------------------------------------------------------
attn_calc                26.204      3408      0.008          22.78
ffn_calc                 23.581      3408      0.007          20.50
attn_calc_linear         19.204      2556      0.008          16.69
fwd_exp_upd_recur_sts    15.883      2556      0.006          13.81   <- Bottleneck
build_delta_net          11.294      2556      0.004          9.82
build_dn_recur           8.553       2484      0.003          7.43
attn_calc_full           6.771       852       0.008          5.89
build_dn_chunking        2.575       72        0.036          2.24
decay_mask               0.314       2484      0.000          0.27
broadcast                0.160       2484      0.000          0.14
ssm_conv                 0.144       2556      0.000          0.13
value_mul_mat            0.107       2484      0.000          0.09
solve_tri_recur          0.097       2484      0.000          0.08
cumsum_dn                0.074       2484      0.000          0.06
fwd_expand               0.049       71        0.001          0.04
ggml_exp                 0.028       2484      0.000          0.02
------------------------------------------------------------
Total                    115.038     100.00
============================================================

Performance Bottleneck Analysis:
- Most time-consuming operation: attn_calc (26.204 ms)   <- bottleneck on main; ffn_calc should be the top operation here, as it is after the fix
- Longest average time: build_dn_chunking (0.036 ms)

PR

============================================================
                    Performance Report
============================================================
Operation Name           Total(ms)   Calls     Avg(ms)        Pct(%)
------------------------------------------------------------
ffn_calc                 56.509      4800      0.012          47.68
build_delta_net          15.740      3600      0.004          13.28
attn_calc                14.720      4800      0.003          12.42
build_dn_recur           13.227      3528      0.004          11.16
attn_calc_full           9.955       1200      0.008          8.40
attn_calc_linear         4.460       3600      0.001          3.76
build_dn_chunking        2.197       72        0.031          1.85
ssm_conv                 0.540       3600      0.000          0.46
decay_mask               0.312       3528      0.000          0.26
value_mul_mat            0.251       3528      0.000          0.21
broadcast                0.159       3528      0.000          0.13
fwd_exp_upd_recur_sts    0.150       3600      0.000          0.13
solve_tri_recur          0.149       3528      0.000          0.13
fwd_expand               0.082       100       0.001          0.07
ggml_exp                 0.041       3528      0.000          0.03
cumsum_dn                0.037       3528      0.000          0.03
------------------------------------------------------------
Total                    118.529     100.00
============================================================

Performance Bottleneck Analysis:
- Most time-consuming operation: ffn_calc (56.509 ms)
- Longest average time: build_dn_chunking (0.031 ms)

Added performance metrics tracking to qwen3next model, including operation timing, reporting, and CSV export functionality.
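
For context, here is a minimal sketch of what such instrumentation might look like; the actual PerformanceMetrics class in the PR may differ, and the names perf_entry, PERF_BEGIN, and PERF_END are illustrative, not taken from the diff:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct perf_entry {
    double total_ms = 0.0;
    long   calls    = 0;
};

class PerformanceMetrics {
public:
    using clock = std::chrono::steady_clock;

    clock::time_point begin() const { return clock::now(); }

    void end(const std::string & name, clock::time_point t0) {
        const double ms = std::chrono::duration<double, std::milli>(clock::now() - t0).count();
        perf_entry & e = entries[name]; // map lookup (and string allocation on first use)
        e.total_ms += ms;
        e.calls    += 1;
    }

    // sorts operations by total time and prints a table like the reports above
    void print_report() const {
        std::vector<std::pair<std::string, perf_entry>> rows(entries.begin(), entries.end());
        std::sort(rows.begin(), rows.end(), [](const auto & a, const auto & b) {
            return a.second.total_ms > b.second.total_ms;
        });
        double total = 0.0;
        for (const auto & r : rows) {
            total += r.second.total_ms;
        }
        std::printf("%-24s %-11s %-9s %-14s %s\n", "Operation Name", "Total(ms)", "Calls", "Avg(ms)", "Pct(%)");
        for (const auto & r : rows) {
            std::printf("%-24s %-11.3f %-9ld %-14.3f %.2f\n", r.first.c_str(), r.second.total_ms,
                        r.second.calls, r.second.total_ms / r.second.calls,
                        100.0 * r.second.total_ms / total);
        }
    }

    void export_csv(const char * path) const {
        if (std::FILE * f = std::fopen(path, "w")) {
            std::fprintf(f, "name,total_ms,calls\n");
            for (const auto & kv : entries) {
                std::fprintf(f, "%s,%f,%ld\n", kv.first.c_str(), kv.second.total_ms, kv.second.calls);
            }
            std::fclose(f);
        }
    }

private:
    std::unordered_map<std::string, perf_entry> entries;
};

// one timed region per scope; each region costs two clock reads plus a map
// lookup on PERF_END, which is the per-call overhead quantified below
#define PERF_BEGIN(m)     auto perf_t0_ = (m).begin()
#define PERF_END(m, name) (m).end((name), perf_t0_)

Usage would then be a macro pair around each graph-building step, e.g. PERF_BEGIN(metrics); before building the attention block and PERF_END(metrics, "attn_calc"); after it.
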
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #356: Qwen3Next Performance Instrumentation

This PR adds 290 lines of performance monitoring code to src/models/qwen3next.cpp, introducing a PerformanceMetrics class with timing macros throughout the Qwen3Next model implementation. The changes affect a single file but instrument multiple critical functions in the model execution path.

Key Findings

Performance-Critical Functions Impact:

The instrumentation overhead manifests across model building functions:

  • build_delta_net_recurrent: Response time increased by 81549 ns, throughput time by 1646 ns. The function now executes 7 timing pairs per call, adding map lookups, string operations, and high-resolution clock reads.

  • build_layer_attn_linear: Response time increased by 115785 ns, throughput time by 399 ns. Contains 3 timing pairs plus a functional modification that removes the ggml_cpy operation for state updates, replacing it with only a view operation (a hedged sketch of this pattern follows the list).

  • llm_build_qwen3next constructor: Response time increased by 195758 ns, throughput time by 622 ns. Adds 4 timing pairs per layer iteration and calls print_report() after construction, performing map iteration, sorting, and console I/O.

  • build_delta_net_chunking: Response time increased by 11564 ns, throughput time by 152 ns. Single timing pair with minimal overhead.
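
To illustrate the build_layer_attn_linear point above, here is a hedged sketch of the before/after pattern the analysis describes, using the public ggml API; the function and variable names (state_writeback_*, state_buf, new_state, offset) are placeholders, not the PR's actual code:

#include "ggml.h"

// before: the freshly computed recurrent state is copied back into the
// persistent state buffer through a 1-d view, a common write-back pattern
static void state_writeback_copy(struct ggml_context * ctx, struct ggml_cgraph * gf,
                                 struct ggml_tensor * state_buf,
                                 struct ggml_tensor * new_state, size_t offset) {
    struct ggml_tensor * dst = ggml_view_1d(ctx, state_buf, ggml_nelements(new_state), offset);
    ggml_build_forward_expand(gf, ggml_cpy(ctx, new_state, dst));
}

// after, as the analysis describes it: the ggml_cpy node is dropped and only
// the view is expanded into the graph; note this is equivalent only if the
// producing op already writes into state_buf in place
static void state_writeback_view_only(struct ggml_context * ctx, struct ggml_cgraph * gf,
                                      struct ggml_tensor * state_buf,
                                      struct ggml_tensor * new_state, size_t offset) {
    struct ggml_tensor * dst = ggml_view_1d(ctx, state_buf, ggml_nelements(new_state), offset);
    (void) new_state; // no copy node added; consumers read through the view
    ggml_build_forward_expand(gf, dst);
}

Dropping the copy removes one tensor-sized copy node per call from the graph, which lines up with the fwd_exp_upd_recur_sts timer falling from 15.883 ms on main to 0.150 ms in the PR report, though the analysis does not state this connection explicitly.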

Tokens Per Second Impact:

The instrumented functions (build_delta_net_recurrent, build_layer_attn_linear, build_delta_net_chunking) are part of the model graph construction phase, not the inference execution path. Functions like llama_decode, llama_encode, and llama_tokenize are not modified in this PR. The performance changes occur during model initialization and graph building, which happens once per inference session rather than per token. Therefore, tokens per second during inference remains unaffected by this instrumentation.

Power Consumption Analysis:

Binary-level analysis shows build.bin.libllama.so increased power consumption by 2796 nJ (1.448% increase). This correlates with the cumulative throughput time increases across instrumented functions. The timing infrastructure adds per-operation overhead through std::chrono calls, std::unordered_map operations, and string allocations. Other binaries show negligible changes: build.bin.llama-run (+62 nJ, 0.033%), build.bin.llama-quantize (+44 nJ, 0.13%).

@loci-dev force-pushed the main branch 27 times, most recently from eb7b6bf to 47c3f0a on December 2, 2025 at 11:09.
@loci-dev force-pushed the main branch 30 times, most recently from b28744d to 4733ac4 on December 13, 2025 at 12:13.