Skip to content

Conversation

@phjmsycc
Copy link
Contributor

@phjmsycc phjmsycc commented Aug 27, 2025

Summary

When using EulerHeun() with a sparse noise_rate_prototype (non-diagonal noise), perform_step! allocates every step. The root cause is the second stage forming (gtmp1 + gtmp2)/2 and then multiplying by W.dW, which materializes a new SparseMatrixCSC each step.

This PR removes those allocations by not forming the sparse sum. Instead, it computes

$$ \tfrac12 (\mathrm{gtmp1} + \mathrm{gtmp2})\ \mathrm{W.dW} = \tfrac12 (\mathrm{gtmp1}\ \mathrm{W.dW}) + \tfrac12 (\mathrm{gtmp2}\ \mathrm{W.dW}) $$

using the BLAS-like signature of mul!:

# was: build (gtmp1 + gtmp2)/2 then mul!(...)
# now: y := 0.5*(gtmp2*W.dW) + 0.5*y; (y already holds gtmp1*W.dW from stage 1)
mul!(nrtmp, gtmp2, W.dW, 0.5, 0.5)

This preserves numerical results and makes each step allocation-free for both dense and sparse paths.


Motivation

With in-place f!/g! and an allocation-free noise process, users expect per-step execution to allocate nothing. Dense paths already satisfy this. Sparse paths, however, incurred per-step allocations proportional to length(u) and the stage count, impacting throughput and predictability. Aligning sparse with dense eliminates those costs.


Minimal Reproducer

using StochasticDiffEq, SparseArrays, LinearAlgebra, DiffEqNoiseProcess, BenchmarkTools

f!(du,u,p,t) = (@. du = 0.999u)

# 2×2 identical-column block; g! only writes nzval into an existing sparsity pattern
function sparse_proto(N)
    I = Vector{Int}(undef, 4N); J = similar(I); V = ones(Float64, 4N)
    @inbounds for i in 1:N
        I[4i-3]=2i-1; J[4i-3]=2i-1
        I[4i-2]=2i;   J[4i-2]=2i-1
        I[4i-1]=2i-1; J[4i-1]=2i
        I[4i  ]=2i;   J[4i  ]=2i
    end
    sparse(I,J,V,2N,2N)
end

struct P; N::Int; end
@inline function ensure_pattern!(G::SparseMatrixCSC{T,Int}, N) where T
    s = one(T)
    @inbounds for i in 1:N
        G[2i-1,2i-1]=s; G[2i,2i-1]=s; G[2i-1,2i]=s; G[2i,2i]=s
    end
end

function g!(G::SparseMatrixCSC{T}, u, p::P, t) where T
    if nnz(G) < 4p.N; ensure_pattern!(G, p.N); end
    c012=T(0.12); c18=T(1.8)
    @inbounds for i in 1:p.N
        off = G.colptr[2i-1]; G.nzval[off]=c012*u[2i-1]; G.nzval[off+1]=c18*u[2i]
        off = G.colptr[2i  ]; G.nzval[off]=c012*u[2i-1]; G.nzval[off+1]=c18*u[2i]
    end
end

function make_integrator_sparse(N)
    A = sparse_proto(N)
    p = P(N)
    W = SimpleWienerProcess!(0.0, zeros(2N); save_everystep=false)
    prob = SDEProblem(f!, g!, ones(2N), (0.0, 1.0), p; noise_rate_prototype=A, noise=W)
    integ = init(prob, EulerHeun(); dt=0.01, adaptive=false, save_on=false)
    step!(integ) # warm-up (pattern + JIT)
    integ
end

# Before this PR: @ballocated(step!($integ)) > 0 for sparse
# After  this PR: @ballocated(step!($integ)) == 0

Change Details

Only the non-diagonal noise path of perform_step!(::EulerHeunCache) (second stage) changes:

-    if is_diagonal_noise(integrator.sol.prob)
-        @.. nrtmp=(1/2)*W.dW*(gtmp1+gtmp2)
-    else
-        @.. gtmp1 = (1/2)*(gtmp1+gtmp2)
-        mul!(nrtmp, gtmp1, W.dW)
-    end
+    if is_diagonal_noise(integrator.sol.prob)
+        @.. nrtmp = (0.5) * W.dW * (gtmp1 + gtmp2)
+    else
+        # nrtmp already holds gtmp1*W.dW from stage 1:
+        # nrtmp := 0.5*gtmp2*W.dW + 0.5*nrtmp  (no sparse addition)
+        mul!(nrtmp, gtmp2, W.dW, 0.5, 0.5)
+    end
  • Keeps exact numerical semantics (0.5*(g1+g2)*dW == 0.5*g1*dW + 0.5*g2*dW).
  • Avoids materializing g1+g2 (SparseMatrixCSC), eliminating per-step allocations.
  • The EulerHeunConstantCache path is unaffected functionally; the hot in-place path is where allocations occurred.

Results

Julia 1.11.6, StochasticDiffEq v6.81.0 (per-step @btime step!(integ)):

Before:
N=1    Dense 65.9 ns (0)     Sparse 122.0 ns (6 alloc, 272 B)
N=10   Dense 188 ns (0)      Sparse 319 ns  (6 alloc, 1.00 KiB)
N=100  Dense 6.14 μs (0)     Sparse 2.18 μs (8 alloc, 8.05 KiB)
N=1000 Dense 2.35 ms (0)     Sparse 19.8 μs (9 alloc, 78.40 KiB)
N=10000 Dense 317 ms (0)     Sparse 215 μs  (9 alloc, 781.52 KiB)

After:
N=1    Dense 68.3 ns (0)     Sparse 50.6 ns  (0)
N=10   Dense 178 ns (0)      Sparse 140 ns   (0)
N=100  Dense 3.91 μs (0)     Sparse 0.877 μs (0)
N=1000 Dense 348 μs (0)      Sparse 8.30 μs  (0)
N=10000 Dense 132 ms (0)     Sparse 88.4 μs  (0)

Tests

  • Added a regression test ensuring @ballocated step!(integ) == 0 for sparse non-diagonal noise with an in-place g! that preserves the sparsity structure (placed under the Interface3 group).

Backward Compatibility

  • No public API changes.
  • Numerical results are identical (refactor of evaluation order by linearity).

Changelog

Performance: remove per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary sparse additions and using mul!(y, A, x, α, β).


Checklist

  • Allocation-regression test added
  • No public API changes
  • Conforms to SciML style / COLPRAC
  • Only public API used in tests

g2::StridedMatrix{T},
dW::StridedVector{T}
) where {T <: LinearAlgebra.BlasFloat}
LinearAlgebra.BLAS.gemv!('N', T(0.5), g2, dW, T(0.5), y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer: Not for correctness or runtime performance.
The 5-arg mul!(y, A, x, α, β) already dispatches to BLAS for BlasFloat element types, so this specialization is functionally redundant.

I added the BLAS specialization only to make AllocCheck deterministic on the dense path. With the 5-arg mul!, AllocCheck reports conservative false positives coming from LinearAlgebra.gemv!’s generic wrappers (MulAddMul, wrap(…), etc.), even though runtime @allocated == 0. Calling BLAS.gemv! directly avoids those wrappers, so check_allocs comes back empty.

I’m happy to remove this helper if you’d prefer to keep the kernel simpler. If we drop it, I’ll keep the 5-arg mul! for dense and adjust tests accordingly:

  • Use AllocCheck for the sparse non-diagonal path (the target of this PR); and
  • Use @allocated == 0 for the dense BlasFloat path (since the AllocCheck warnings there are known false positives from LinearAlgebra).

Let me know which direction you prefer and I’ll update the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's deterministic without it

Copy link
Contributor Author

@phjmsycc phjmsycc Sep 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reproduce the same AllocCheck failure on four platforms.

Important: per-step heap allocations are actually zero@allocated returns 0 on all of my machines. The failure comes from check_allocs reporting allocation sites in the typed IR even when those objects are elided or stack-allocated by the optimizer.

Where check_allocs flags (false positives)

From the generic gemv!/mul! wrappers in LinearAlgebra/src/matmul.jl:

  • construction of MulAddMul{…} (isbits, optimized away),
  • wrap(A, tA) returning Transpose/Adjoint/Symmetric/Hermitian views,
  • a Char argument,
  • and a “dynamic dispatch to _generic_matvecmul!” marker.

These are not observable heap allocations at runtime (hence @allocated == 0), but check_allocs still reports them as sites.

Environments

1) macOS (Apple Silicon)

BLAS.vendor()    => :lbt
BLAS.get_config  => [ILP64] libopenblas64_.dylib
Julia            => 1.11.6
OS               => macOS (arm64-apple-darwin24), Apple M1 Pro

Same failure; allocs come from LinearAlgebra/src/matmul.jl wrappers as listed above.

2) Linux (bare metal, Ubuntu 22.04.5 LTS)

BLAS.vendor()    => :lbt
BLAS.get_config  => [ILP64] libopenblas64_.so
Julia            => 1.11.6  (x86_64-linux-gnu)
Kernel           => 6.0.19

Allocations reported at:

  • gemv! -> _generic_matvecmul! with MulAddMul(α, β)
  • wrap(A, tA) returning Transpose/Adjoint/Symmetric/Hermitian
  • Char at the call site

(Representative stack excerpt)

... LinearAlgebra/src/matmul.jl:446/460/462
... wrap(A,tA) at LinearAlgebra.jl:538
... _eh_accum_stage2! at StochasticDiffEq/src/perform_step/low_order.jl:115

3) WSL2 (Ubuntu 22.04.5 LTS)

BLAS.vendor()    => :lbt
BLAS.get_config  => [ILP64] libopenblas64_.so
Julia            => 1.11.6  (x86_64-linux-gnu)
Kernel           => 6.6.87.2-microsoft-standard-WSL2

Same allocation sites and stacks as Linux.

4) Windows 11 (native)

BLAS.vendor()    => :lbt
BLAS.get_config  => [ILP64] OpenBLAS (LBT)
Julia            => 1.11.6  (x86_64-w64-mingw32)

Same failure; identical allocation sites as above (omitting repetitive stack for brevity).


Observation

  • Keeping the dense path as a 5-arg generic mul!(y, A, x, α, β) makes check_allocs fail (false positive), even though runtime heap alloc is zero.
  • Switching that path to a direct BLAS.gemv! for BlasFloat + Strided removes those IR sites, so check_allocs becomes green on macOS / Linux / WSL2 / Windows (performance unchanged).

Proposal

  • If we want the AllocCheck-based zero-allocation assertion to stay green and stable across platforms, keep a tiny specialization that calls BLAS.gemv! directly.
  • Alternatively, for the dense path only, replace the AllocCheck assertion with a runtime @allocated == 0 check (and keep AllocCheck for the sparse non-diagonal path, which is the point of this PR).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are error throws IIRC, so they are fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — agreed: the sites check_allocs flags maybe in throw-only code paths of the generic LinearAlgebra gemv!/mul! wrappers. The hot path is allocation-free (@allocated == 0 on my machines across macOS, Linux, WSL2, and Windows with OpenBLAS ILP64).

To keep this robust across platforms, I’ll proceed as follows:

  • keep the implementation unchanged (no BLAS-specific specialization; use the 5-arg mul!),
  • switch the dense + BLAS test to a runtime @allocated == 0 assertion,
  • keep check_allocs for the sparse / non-diagonal path (the focus of this PR),
  • add a brief inline comment explaining why the dense case uses @allocated.

Unless you’d prefer a different direction, I’ll update the tests accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisRackauckas I’ve applied the above plan.
If you’d prefer a different direction, let me know.

@ChrisRackauckas ChrisRackauckas merged commit f90d49d into SciML:master Sep 12, 2025
29 of 39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants