Fix: Eliminate per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary gtmp1 + gtmp2
#630
Conversation
- non-diagonal noise path of `perform_step!(::EulerHeunCache)` second stage
Co-authored-by: Christopher Rackauckas <[email protected]>
src/perform_step/low_order.jl (outdated)

```julia
    g2::StridedMatrix{T},
    dW::StridedVector{T}
) where {T <: LinearAlgebra.BlasFloat}
    LinearAlgebra.BLAS.gemv!('N', T(0.5), g2, dW, T(0.5), y)
```
Is this necessary?
Short answer: Not for correctness or runtime performance.
The 5-arg mul!(y, A, x, α, β) already dispatches to BLAS for BlasFloat element types, so this specialization is functionally redundant.
I added the BLAS specialization only to make AllocCheck deterministic on the dense path. With the 5-arg mul!, AllocCheck reports conservative false positives coming from LinearAlgebra.gemv!’s generic wrappers (MulAddMul, wrap(…), etc.), even though runtime @allocated == 0. Calling BLAS.gemv! directly avoids those wrappers, so check_allocs comes back empty.
I’m happy to remove this helper if you’d prefer to keep the kernel simpler. If we drop it, I’ll keep the 5-arg mul! for dense and adjust tests accordingly:
- Use AllocCheck for the sparse non-diagonal path (the target of this PR); and
- Use `@allocated == 0` for the dense `BlasFloat` path (since the AllocCheck warnings there are known false positives from `LinearAlgebra`).
Let me know which direction you prefer and I’ll update the PR.
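For reference, the two call forms discussed above compute the same thing for `BlasFloat` strided arguments; a minimal standalone sketch (variable names are illustrative, not from the PR):

```julia
using LinearAlgebra

# y ← 0.5*A*x + 0.5*y, written two equivalent ways for Float64 inputs.
A = [1.0 2.0; 3.0 4.0]
x = [1.0, 1.0]

y1 = [1.0, 1.0]
mul!(y1, A, x, 0.5, 0.5)              # generic 5-arg mul! (dispatches to BLAS)

y2 = [1.0, 1.0]
BLAS.gemv!('N', 0.5, A, x, 0.5, y2)   # direct BLAS call, same arithmetic
```

Both produce `[2.0, 4.0]` here; the difference is only which IR sites AllocCheck sees along the way.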
it's deterministic without it
I can reproduce the same AllocCheck failure on four platforms.
Important: per-step heap allocations are actually zero — @allocated returns 0 on all of my machines. The failure comes from check_allocs reporting allocation sites in the typed IR even when those objects are elided or stack-allocated by the optimizer.
Where check_allocs flags (false positives)
From the generic gemv!/mul! wrappers in LinearAlgebra/src/matmul.jl:
- construction of `MulAddMul{…}` (isbits, optimized away),
- `wrap(A, tA)` returning `Transpose`/`Adjoint`/`Symmetric`/`Hermitian` views,
- a `Char` argument,
- and a "dynamic dispatch to `_generic_matvecmul!`" marker.
These are not observable heap allocations at runtime (hence @allocated == 0), but check_allocs still reports them as sites.
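The runtime side of that claim can be checked with a warm-up-then-measure pattern; a minimal sketch (the helper name `measure` is hypothetical):

```julia
using LinearAlgebra

# Steady-state heap allocations of the dense 5-arg mul! path.
# Wrapping the measurement in a function keeps the call type-stable.
measure(y, A, x) = @allocated mul!(y, A, x, 0.5, 0.5)

A = rand(8, 8); x = rand(8); y = rand(8)
measure(y, A, x)            # warm-up call (compilation)
allocs = measure(y, A, x)   # steady-state measurement
```

With `Float64` inputs this hits the BLAS `gemv` path, so the steady-state measurement is zero even while `check_allocs` flags IR sites.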
Environments
1) macOS (Apple Silicon)
BLAS.vendor() => :lbt
BLAS.get_config => [ILP64] libopenblas64_.dylib
Julia => 1.11.6
OS => macOS (arm64-apple-darwin24), Apple M1 Pro
Same failure; allocs come from LinearAlgebra/src/matmul.jl wrappers as listed above.
2) Linux (bare metal, Ubuntu 22.04.5 LTS)
BLAS.vendor() => :lbt
BLAS.get_config => [ILP64] libopenblas64_.so
Julia => 1.11.6 (x86_64-linux-gnu)
Kernel => 6.0.19
Allocations reported at:
- `gemv!` -> `_generic_matvecmul!` with `MulAddMul(α, β)`
- `wrap(A, tA)` returning `Transpose`/`Adjoint`/`Symmetric`/`Hermitian`
- `Char` at the call site
Representative stack excerpt:

```
... LinearAlgebra/src/matmul.jl:446/460/462
... wrap(A,tA) at LinearAlgebra.jl:538
... _eh_accum_stage2! at StochasticDiffEq/src/perform_step/low_order.jl:115
```
3) WSL2 (Ubuntu 22.04.5 LTS)
BLAS.vendor() => :lbt
BLAS.get_config => [ILP64] libopenblas64_.so
Julia => 1.11.6 (x86_64-linux-gnu)
Kernel => 6.6.87.2-microsoft-standard-WSL2
Same allocation sites and stacks as Linux.
4) Windows 11 (native)
BLAS.vendor() => :lbt
BLAS.get_config => [ILP64] OpenBLAS (LBT)
Julia => 1.11.6 (x86_64-w64-mingw32)
Same failure; identical allocation sites as above (omitting repetitive stack for brevity).
Observation
- Keeping the dense path as a 5-arg generic `mul!(y, A, x, α, β)` makes `check_allocs` fail (false positive), even though the runtime heap allocation is zero.
- Switching that path to a direct `BLAS.gemv!` for `BlasFloat` + `Strided` arguments removes those IR sites, so `check_allocs` becomes green on macOS / Linux / WSL2 / Windows (performance unchanged).
Proposal
- If we want the AllocCheck-based zero-allocation assertion to stay green and stable across platforms, keep a tiny specialization that calls `BLAS.gemv!` directly.
- Alternatively, for the dense path only, replace the AllocCheck assertion with a runtime `@allocated == 0` check (and keep AllocCheck for the sparse non-diagonal path, which is the point of this PR).
Those are error throws IIRC, so they are fine.
Thanks — agreed: the sites check_allocs flags may be in throw-only code paths of the generic LinearAlgebra gemv!/mul! wrappers. The hot path is allocation-free (@allocated == 0 on my machines across macOS, Linux, WSL2, and Windows with OpenBLAS ILP64).
To keep this robust across platforms, I’ll proceed as follows:
- keep the implementation unchanged (no BLAS-specific specialization; use the 5-arg `mul!`),
- switch the dense + BLAS test to a runtime `@allocated == 0` assertion,
- keep `check_allocs` for the sparse / non-diagonal path (the focus of this PR),
- add a brief inline comment explaining why the dense case uses `@allocated`.
Unless you’d prefer a different direction, I’ll update the tests accordingly.
@ChrisRackauckas I’ve applied the above plan.
If you’d prefer a different direction, let me know.
Summary
When using EulerHeun() with a sparse `noise_rate_prototype` (non-diagonal noise), `perform_step!` allocates every step. The root cause is the second stage forming `(gtmp1 + gtmp2)/2` and then multiplying by `W.dW`, which materializes a new `SparseMatrixCSC` each step.

This PR removes those allocations by not forming the sparse sum. Instead, it computes the two matrix–vector products directly into the output using the BLAS-like signature of `mul!`. This preserves numerical results and makes each step allocation-free for both dense and sparse paths.
Motivation
With in-place `f!`/`g!` and an allocation-free noise process, users expect per-step execution to allocate nothing. Dense paths already satisfy this. Sparse paths, however, incurred per-step allocations proportional to `length(u)` and the stage count, impacting throughput and predictability. Aligning sparse with dense eliminates those costs.

Minimal Reproducer
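A stdlib-only sketch of the allocation pattern (not the full StochasticDiffEq reproducer; the helper names are hypothetical, assuming the `g1 + g2` temporary is the only difference between the two paths):

```julia
using SparseArrays, LinearAlgebra

# Old path: forming g1 + g2 allocates a fresh SparseMatrixCSC each call.
sum_path(tmp, g1, g2, dW) = @allocated mul!(tmp, g1 + g2, dW, 0.5, 0.0)

# New path: accumulate both products in place with 5-arg mul!.
fused_path(tmp, g1, g2, dW) = @allocated begin
    mul!(tmp, g1, dW, 0.5, 0.0)
    mul!(tmp, g2, dW, 0.5, 1.0)
end

g1 = sprand(100, 100, 0.05); g2 = sprand(100, 100, 0.05)
dW = randn(100); tmp = zeros(100)

sum_path(tmp, g1, g2, dW); fused_path(tmp, g1, g2, dW)  # warm-up
a_sum   = sum_path(tmp, g1, g2, dW)    # > 0: sparse temporary
a_fused = fused_path(tmp, g1, g2, dW)  # in-place accumulation
```

The fused path's steady-state allocation count is strictly below the sum path's, which is the effect this PR targets inside `perform_step!`.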
Change Details
Only the non-diagonal noise path of `perform_step!(::EulerHeunCache)` (second stage) changes:

- Mathematically identical: `0.5*(g1+g2)*dW == 0.5*g1*dW + 0.5*g2*dW`.
- No temporary `g1+g2` (`SparseMatrixCSC`), eliminating per-step allocations.
- The `EulerHeunConstantCache` path is unaffected functionally; the hot in-place path is where allocations occurred.
Julia 1.11.6, StochasticDiffEq v6.81.0 (per-step `@btime step!(integ)`):

Tests
- `@ballocated step!(integ) == 0` for sparse non-diagonal noise with an in-place `g!` that preserves the sparsity structure (placed under the Interface3 group).

Backward Compatibility
Changelog
Performance: remove per-step allocations in EulerHeun with sparse non-diagonal noise by avoiding temporary sparse additions and using `mul!(y, A, x, α, β)`.

Checklist