Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary `gtmp1 + gtmp2` #630

phjmsycc · 2025-08-27T13:23:09Z

Summary

When using EulerHeun() with a sparse noise_rate_prototype (non-diagonal noise), perform_step! allocates every step. The root cause is the second stage forming (gtmp1 + gtmp2)/2 and then multiplying by W.dW, which materializes a new SparseMatrixCSC each step.

This PR removes those allocations by not forming the sparse sum. Instead, it computes

$$ \tfrac12 (\mathrm{gtmp1} + \mathrm{gtmp2})\ \mathrm{W.dW} = \tfrac12 (\mathrm{gtmp1}\ \mathrm{W.dW}) + \tfrac12 (\mathrm{gtmp2}\ \mathrm{W.dW}) $$

using the BLAS-like signature of mul!:

# was: build (gtmp1 + gtmp2)/2 then mul!(...)
# now: y := 0.5*(gtmp2*W.dW) + 0.5*y; (y already holds gtmp1*W.dW from stage 1)
mul!(nrtmp, gtmp2, W.dW, 0.5, 0.5)

This preserves numerical results and makes each step allocation-free for both dense and sparse paths.

Motivation

With in-place f!/g! and an allocation-free noise process, users expect per-step execution to allocate nothing. Dense paths already satisfy this. Sparse paths, however, incurred per-step allocations proportional to length(u) and the stage count, impacting throughput and predictability. Aligning sparse with dense eliminates those costs.

Minimal Reproducer

using StochasticDiffEq, SparseArrays, LinearAlgebra, DiffEqNoiseProcess, BenchmarkTools

f!(du,u,p,t) = (@. du = 0.999u)

# 2×2 identical-column block; g! only writes nzval into an existing sparsity pattern
function sparse_proto(N)
    I = Vector{Int}(undef, 4N); J = similar(I); V = ones(Float64, 4N)
    @inbounds for i in 1:N
        I[4i-3]=2i-1; J[4i-3]=2i-1
        I[4i-2]=2i;   J[4i-2]=2i-1
        I[4i-1]=2i-1; J[4i-1]=2i
        I[4i  ]=2i;   J[4i  ]=2i
    end
    sparse(I,J,V,2N,2N)
end

struct P; N::Int; end
@inline function ensure_pattern!(G::SparseMatrixCSC{T,Int}, N) where T
    s = one(T)
    @inbounds for i in 1:N
        G[2i-1,2i-1]=s; G[2i,2i-1]=s; G[2i-1,2i]=s; G[2i,2i]=s
    end
end

function g!(G::SparseMatrixCSC{T}, u, p::P, t) where T
    if nnz(G) < 4p.N; ensure_pattern!(G, p.N); end
    c012=T(0.12); c18=T(1.8)
    @inbounds for i in 1:p.N
        off = G.colptr[2i-1]; G.nzval[off]=c012*u[2i-1]; G.nzval[off+1]=c18*u[2i]
        off = G.colptr[2i  ]; G.nzval[off]=c012*u[2i-1]; G.nzval[off+1]=c18*u[2i]
    end
end

function make_integrator_sparse(N)
    A = sparse_proto(N)
    p = P(N)
    W = SimpleWienerProcess!(0.0, zeros(2N); save_everystep=false)
    prob = SDEProblem(f!, g!, ones(2N), (0.0, 1.0), p; noise_rate_prototype=A, noise=W)
    integ = init(prob, EulerHeun(); dt=0.01, adaptive=false, save_on=false)
    step!(integ) # warm-up (pattern + JIT)
    integ
end

# Before this PR: @ballocated(step!($integ)) > 0 for sparse
# After  this PR: @ballocated(step!($integ)) == 0

Change Details

Only the non-diagonal noise path of perform_step!(::EulerHeunCache) (second stage) changes:

-    if is_diagonal_noise(integrator.sol.prob)
-        @.. nrtmp=(1/2)*W.dW*(gtmp1+gtmp2)
-    else
-        @.. gtmp1 = (1/2)*(gtmp1+gtmp2)
-        mul!(nrtmp, gtmp1, W.dW)
-    end
+    if is_diagonal_noise(integrator.sol.prob)
+        @.. nrtmp = (0.5) * W.dW * (gtmp1 + gtmp2)
+    else
+        # nrtmp already holds gtmp1*W.dW from stage 1:
+        # nrtmp := 0.5*gtmp2*W.dW + 0.5*nrtmp  (no sparse addition)
+        mul!(nrtmp, gtmp2, W.dW, 0.5, 0.5)
+    end

Keeps exact numerical semantics (0.5*(g1+g2)*dW == 0.5*g1*dW + 0.5*g2*dW).
Avoids materializing g1+g2 (SparseMatrixCSC), eliminating per-step allocations.
The EulerHeunConstantCache path is unaffected functionally; the hot in-place path is where allocations occurred.

Results

Julia 1.11.6, StochasticDiffEq v6.81.0 (per-step @btime step!(integ)):

Before:
N=1    Dense 65.9 ns (0)     Sparse 122.0 ns (6 alloc, 272 B)
N=10   Dense 188 ns (0)      Sparse 319 ns  (6 alloc, 1.00 KiB)
N=100  Dense 6.14 μs (0)     Sparse 2.18 μs (8 alloc, 8.05 KiB)
N=1000 Dense 2.35 ms (0)     Sparse 19.8 μs (9 alloc, 78.40 KiB)
N=10000 Dense 317 ms (0)     Sparse 215 μs  (9 alloc, 781.52 KiB)

After:
N=1    Dense 68.3 ns (0)     Sparse 50.6 ns  (0)
N=10   Dense 178 ns (0)      Sparse 140 ns   (0)
N=100  Dense 3.91 μs (0)     Sparse 0.877 μs (0)
N=1000 Dense 348 μs (0)      Sparse 8.30 μs  (0)
N=10000 Dense 132 ms (0)     Sparse 88.4 μs  (0)

Tests

Added a regression test ensuring @ballocated step!(integ) == 0 for sparse non-diagonal noise with an in-place g! that preserves the sparsity structure (placed under the Interface3 group).

Backward Compatibility

No public API changes.
Numerical results are identical (refactor of evaluation order by linearity).

Changelog

Performance: remove per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary sparse additions and using mul!(y, A, x, α, β).

Checklist

Allocation-regression test added
No public API changes
Conforms to SciML style / COLPRAC
Only public API used in tests

- non-diagonal noise path of `perform_step!(::EulerHeunCache)` second stage

src/perform_step/low_order.jl

test/nondiag_noise_eulerheun_test.jl

src/perform_step/low_order.jl

Co-authored-by: Christopher Rackauckas <[email protected]>

ChrisRackauckas · 2025-09-05T00:45:30Z

src/perform_step/low_order.jl

+        g2::StridedMatrix{T},
+        dW::StridedVector{T}
+) where {T <: LinearAlgebra.BlasFloat}
+    LinearAlgebra.BLAS.gemv!('N', T(0.5), g2, dW, T(0.5), y)


Is this necessary?

Short answer: Not for correctness or runtime performance.
The 5-arg mul!(y, A, x, α, β) already dispatches to BLAS for BlasFloat element types, so this specialization is functionally redundant.

I added the BLAS specialization only to make AllocCheck deterministic on the dense path. With the 5-arg mul!, AllocCheck reports conservative false positives coming from LinearAlgebra.gemv!’s generic wrappers (MulAddMul, wrap(…), etc.), even though runtime @allocated == 0. Calling BLAS.gemv! directly avoids those wrappers, so check_allocs comes back empty.

I’m happy to remove this helper if you’d prefer to keep the kernel simpler. If we drop it, I’ll keep the 5-arg mul! for dense and adjust tests accordingly:

Use AllocCheck for the sparse non-diagonal path (the target of this PR); and

Use @allocated == 0 for the dense BlasFloat path (since the AllocCheck warnings there are known false positives from LinearAlgebra).

Let me know which direction you prefer and I’ll update the PR.

it's deterministic without it

I can reproduce the same AllocCheck failure on four platforms.

Important: per-step heap allocations are actually zero — @allocated returns 0 on all of my machines. The failure comes from check_allocs reporting allocation sites in the typed IR even when those objects are elided or stack-allocated by the optimizer.

Where check_allocs flags (false positives)

From the generic gemv!/mul! wrappers in LinearAlgebra/src/matmul.jl:

construction of MulAddMul{…} (isbits, optimized away),

wrap(A, tA) returning Transpose/Adjoint/Symmetric/Hermitian views,

a Char argument,

and a “dynamic dispatch to _generic_matvecmul!” marker.

These are not observable heap allocations at runtime (hence @allocated == 0), but check_allocs still reports them as sites.

Environments

1) macOS (Apple Silicon)

BLAS.vendor() => :lbt BLAS.get_config => [ILP64] libopenblas64_.dylib Julia => 1.11.6 OS => macOS (arm64-apple-darwin24), Apple M1 Pro

Same failure; allocs come from LinearAlgebra/src/matmul.jl wrappers as listed above.

2) Linux (bare metal, Ubuntu 22.04.5 LTS)

BLAS.vendor() => :lbt BLAS.get_config => [ILP64] libopenblas64_.so Julia => 1.11.6 (x86_64-linux-gnu) Kernel => 6.0.19

Allocations reported at:

gemv! -> _generic_matvecmul! with MulAddMul(α, β)

wrap(A, tA) returning Transpose/Adjoint/Symmetric/Hermitian

Char at the call site

(Representative stack excerpt)

... LinearAlgebra/src/matmul.jl:446/460/462 ... wrap(A,tA) at LinearAlgebra.jl:538 ... _eh_accum_stage2! at StochasticDiffEq/src/perform_step/low_order.jl:115

3) WSL2 (Ubuntu 22.04.5 LTS)

BLAS.vendor() => :lbt BLAS.get_config => [ILP64] libopenblas64_.so Julia => 1.11.6 (x86_64-linux-gnu) Kernel => 6.6.87.2-microsoft-standard-WSL2

Same allocation sites and stacks as Linux.

4) Windows 11 (native)

BLAS.vendor() => :lbt BLAS.get_config => [ILP64] OpenBLAS (LBT) Julia => 1.11.6 (x86_64-w64-mingw32)

Same failure; identical allocation sites as above (omitting repetitive stack for brevity).

Observation

Keeping the dense path as a 5-arg generic mul!(y, A, x, α, β) makes check_allocs fail (false positive), even though runtime heap alloc is zero.

Switching that path to a direct BLAS.gemv! for BlasFloat + Strided removes those IR sites, so check_allocs becomes green on macOS / Linux / WSL2 / Windows (performance unchanged).

Proposal

If we want the AllocCheck-based zero-allocation assertion to stay green and stable across platforms, keep a tiny specialization that calls BLAS.gemv! directly.

Alternatively, for the dense path only, replace the AllocCheck assertion with a runtime @allocated == 0 check (and keep AllocCheck for the sparse non-diagonal path, which is the point of this PR).

Those are error throws IIRC, so they are fine.

Thanks — agreed: the sites check_allocs flags maybe in throw-only code paths of the generic LinearAlgebra gemv!/mul! wrappers. The hot path is allocation-free (@allocated == 0 on my machines across macOS, Linux, WSL2, and Windows with OpenBLAS ILP64).

To keep this robust across platforms, I’ll proceed as follows:

keep the implementation unchanged (no BLAS-specific specialization; use the 5-arg mul!),

switch the dense + BLAS test to a runtime @allocated == 0 assertion,

keep check_allocs for the sparse / non-diagonal path (the focus of this PR),

add a brief inline comment explaining why the dense case uses @allocated.

Unless you’d prefer a different direction, I’ll update the tests accordingly.

@ChrisRackauckas I’ve applied the above plan.
If you’d prefer a different direction, let me know.

…g `mul!`.

phjmsycc added 3 commits August 27, 2025 20:03

fix: numerically identical change for avoid memory allocation

6051e12

- non-diagonal noise path of `perform_step!(::EulerHeunCache)` second stage

fix: add test for check alloc

af6ec8e

fix: add test

19f9fd2

ChrisRackauckas reviewed Aug 27, 2025

View reviewed changes

src/perform_step/low_order.jl Outdated Show resolved Hide resolved

ChrisRackauckas reviewed Aug 27, 2025

View reviewed changes

test/nondiag_noise_eulerheun_test.jl Outdated Show resolved Hide resolved

ChrisRackauckas reviewed Aug 27, 2025

View reviewed changes

src/perform_step/low_order.jl Outdated Show resolved Hide resolved

ChrisRackauckas reviewed Aug 27, 2025

View reviewed changes

src/perform_step/low_order.jl Show resolved Hide resolved

phjmsycc and others added 5 commits August 28, 2025 15:07

Update src/perform_step/low_order.jl

1c9394d

Co-authored-by: Christopher Rackauckas <[email protected]>

Update src/perform_step/low_order.jl

0b8db78

Co-authored-by: Christopher Rackauckas <[email protected]>

fix: add AllocCheck to test target dependencies

ee48b72

fix: add comments

1d41bba

fix: apply AllocCheck

4f8b266

ChrisRackauckas reviewed Sep 5, 2025

View reviewed changes

Switch assert with @allocated == 0 for Dense path and keep the 5-ar…

22ad48d

…g `mul!`.

ChrisRackauckas approved these changes Sep 12, 2025

View reviewed changes

ChrisRackauckas merged commit f90d49d into SciML:master Sep 12, 2025
29 of 39 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary `gtmp1 + gtmp2` #630

Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary `gtmp1 + gtmp2` #630

Uh oh!

phjmsycc commented Aug 27, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ChrisRackauckas Sep 5, 2025

Uh oh!

phjmsycc Sep 5, 2025

Uh oh!

ChrisRackauckas Sep 5, 2025

Uh oh!

phjmsycc Sep 5, 2025 •

edited

Loading

Uh oh!

ChrisRackauckas Sep 5, 2025

Uh oh!

phjmsycc Sep 6, 2025

Uh oh!

phjmsycc Sep 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary gtmp1 + gtmp2 #630

Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary gtmp1 + gtmp2 #630

Uh oh!

Conversation

phjmsycc commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Minimal Reproducer

Change Details

Results

Tests

Backward Compatibility

Changelog

Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ChrisRackauckas Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

phjmsycc Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

ChrisRackauckas Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

phjmsycc Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Where check_allocs flags (false positives)

Environments

1) macOS (Apple Silicon)

2) Linux (bare metal, Ubuntu 22.04.5 LTS)

3) WSL2 (Ubuntu 22.04.5 LTS)

4) Windows 11 (native)

Observation

Proposal

Uh oh!

ChrisRackauckas Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

phjmsycc Sep 6, 2025

Choose a reason for hiding this comment

Uh oh!

phjmsycc Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary `gtmp1 + gtmp2` #630

Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary `gtmp1 + gtmp2` #630

phjmsycc commented Aug 27, 2025 •

edited

Loading

phjmsycc Sep 5, 2025 •

edited

Loading

Where `check_allocs` flags (false positives)