Conversation

@KristofferC KristofferC commented Sep 30, 2025

Ref #528.

An MWE that shows the effect of this PR:

using DataFrames, Dates, Arrow, Tables, StatsBase, Random, InlineStrings

function generate_data(f)
    number_of_companies = 10000
    dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31))
    companyid = sample(100000:1000000, number_of_companies, replace = false)

    number_of_items = length(companyid)*length(dates)

    df = DataFrame(
            dates = repeat(dates, outer = number_of_companies),
            companyid = repeat(companyid, inner = length(dates)),
            item1 = rand(number_of_items),
            item2 = randn(number_of_items),
            item3 = rand(1:1000,number_of_items),
            item4 = repeat([String7(randstring(['a':'z' 'A':'Z'],5)) for _ in 1:number_of_companies],length(dates))
        )

    @info "Saving to $f"
    open(f, "w") do io
        # write one record batch per date group
        Arrow.write(io, Tables.partitioner(groupby(df, :dates)))
    end
end

f = joinpath(@__DIR__, "mytestdata.arrow")

if !isfile(f)
    generate_data(f)
end

Arrow.Table(f)
@time Arrow.Table(f)
@time Arrow.Table(f; threaded=false)

With the result (the first timing is the default threaded load, the second is threaded=false):

  9.964974 seconds (2.57 M allocations: 125.678 MiB, 0.14% gc time, 10610359 lock conflicts)
  0.148734 seconds (2.55 M allocations: 123.376 MiB, 9.03% gc time, 32 lock conflicts, 6.36% compilation time)

We can see that disabling this part of the Table threading avoids the insane lock contention at https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108 and keeps the package from grinding to a halt when threads are used.

This is a somewhat ugly workaround, but I'm not familiar enough with the packages involved to assess how reasonable it would be to implement good scaling here.
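
For anyone skimming the diff, the shape of the change is roughly the following. This is only a sketch with hypothetical names (load_partitions, decode_batch), not the actual Arrow.jl internals; the real threaded path goes through the OrderedSynchronizer linked above, which is where the contention shows up.

# Minimal sketch of gating the threaded path behind a keyword:
# decode partitions on separate tasks when threaded=true, or serially
# (no tasks, no locks) when threaded=false. load_partitions and
# decode_batch are hypothetical stand-ins, not Arrow.jl functions.
function load_partitions(batches; threaded::Bool = true)
    results = Vector{Any}(undef, length(batches))
    if threaded
        @sync for (i, b) in enumerate(batches)
            Threads.@spawn (results[i] = decode_batch(b))
        end
    else
        for (i, b) in enumerate(batches)
            results[i] = decode_batch(b)
        end
    end
    return results
end

decode_batch(b) = sum(b)  # trivial stand-in for the real decoding work

load_partitions([rand(10) for _ in 1:4]; threaded = false)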

codecov-commenter commented Sep 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.97%. Comparing base (3712291) to head (d52c2e6).
⚠️ Report is 36 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #568      +/-   ##
==========================================
- Coverage   87.43%   86.97%   -0.47%     
==========================================
  Files          26       27       +1     
  Lines        3288     3401     +113     
==========================================
+ Hits         2875     2958      +83     
- Misses        413      443      +30     

@KristofferC
Contributor Author

A follow-up question: does the threading here ever actually give good performance, or should it be disabled permanently? Or is there something pathological about the MWE shared here that makes the scaling so bad?
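
For reference, one way to probe the scaling is to run just the load portion of the MWE under different thread counts, for example with julia -t 1 versus julia -t auto, and compare the timings (a sketch; it reuses the mytestdata.arrow file generated above):

# Sketch of a scaling check: run the same script with different -t values
# and compare the timings. Reuses mytestdata.arrow from the MWE above.
using Arrow

f = joinpath(@__DIR__, "mytestdata.arrow")
Arrow.Table(f)                          # warm up compilation
@show Threads.nthreads()
@time Arrow.Table(f)                    # default (threaded) path
@time Arrow.Table(f; threaded = false)  # serial path added by this PR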

@KristofferC
Contributor Author

cc @quinnj

quinnj commented Oct 24, 2025

Sorry for the slow response here; let me dig into this a bit tonight.

kou commented Oct 24, 2025

@quinnj Do you want to include this in the next release? If so, I'll start a release process after this is completed.

quinnj commented Oct 24, 2025

I think #570 should fix the original issue; let's review/test/merge that and then cut the release.

quinnj added a commit that referenced this pull request Oct 24, 2025