Implement oneTBB-based parallelism for C lib #445

silvanshade · 2025-02-08T14:36:02Z

This PR implements oneTBB-based parallelism for the C library.

The implementation works essentially the same way as the Rayon one: we recursively create a task_group on the work-stealing scheduler to run closures for the left and right calculations, then wait on their results.
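For readers unfamiliar with the pattern, here is a minimal sketch of the recursive fork-join shape described above. It is not the code from this PR: it uses std::async from the standard library as a stand-in for a oneTBB task_group, sums bytes instead of hashing, and the names kLeafLen, sum_leaf and sum_subtree are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical leaf size: at or below this, work runs on the calling thread.
constexpr std::size_t kLeafLen = 1024;

// Stand-in for the per-subtree work (the real code produces chaining values).
std::uint64_t sum_leaf(const std::uint8_t* input, std::size_t len) {
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < len; ++i) total += input[i];
  return total;
}

// Recursive fork-join: split the input, run the left half as a separate task,
// run the right half on the current thread, then wait for the left result.
std::uint64_t sum_subtree(const std::uint8_t* input, std::size_t len) {
  assert(input != nullptr || len == 0);
  if (len <= kLeafLen) return sum_leaf(input, len);
  const std::size_t left_len = len / 2;
  // std::async is only a stand-in here; the PR uses a oneTBB task_group.
  std::future<std::uint64_t> left = std::async(
      std::launch::async, [=] { return sum_subtree(input, left_len); });
  const std::uint64_t right = sum_subtree(input + left_len, len - left_len);
  return left.get() + right;
}
```

Note that std::launch::async starts a new thread per task, whereas a work-stealing scheduler queues tasks onto a fixed pool of workers, so this sketch shows the control flow only and says nothing about performance.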

I haven't included benchmarks here but informal testing on a Zen4 7950X workstation shows nearly identical performance to the Rayon implementation for large memory-mapped input.

I have another PR I'll submit soon which also adds mmap support and will try to include benchmarks there.

Related Issues

Rationale for oneTBB

Rationale for choosing oneTBB versus some other multi-threading framework:

fast
easy to integrate
portable
flexible: oneTBB has a lot of additional functionality we could potentially use in future refinements
widely adopted: Clang and GCC use it for parallel C++ and it's also used for mold

The downside, if you consider it one, is that it's C++ where the rest of libblake3 is C.

Summary of Changes

Here is a summary of changes I've made to integrate TBB:

Added blake3_tbb.cpp which defines blake3_compress_subtree_wide_join_tbb
Removed static from blake3_compress_subtree_wide so it can be called from C++ TU
Declared those two functions as BLAKE3_PRIVATE in blake3_impl.h
Renamed blake3_hasher_update to blake3_hasher_update_base and added use_tbb param
Defined blake3_hasher_update and blake3_hasher_update_tbb which call the _base function
Updated the documentation to mention multi-threading and blake3_hasher_update_tbb
Refactored c/CMakeLists.txt:
- Added support for finding (or fetching) oneTBB
- Modified handling of BLAKE3_NO_* options to control compiled sources (avoids need for rm as with Makefile tests)
- Added target for compiling example.c executable
- Added target for compiling main.c executable
- Added ctest test target for running c/test.py
Refactored CI c_tests job to use CMake instead of Makefile (for TBB compatibility)
Refactored CI cmake_build job:
- Fixed matrix selection of different compilers (didn't do anything before)
- Added more configurations for Windows
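The _base split mentioned in the summary can be pictured roughly as follows. This is a toy sketch, not the libblake3 code: toy_hasher and the toy_hasher_update* functions are hypothetical stand-ins, and the "hash" is just a running checksum so that the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for blake3_hasher; it only tracks a running checksum.
struct toy_hasher {
  std::uint64_t state;
};

// Shared implementation with an extra flag, mirroring the *_base pattern.
static void toy_hasher_update_base(toy_hasher* self, const void* input,
                                   std::size_t len, bool use_tbb) {
  assert(self != nullptr);
  const std::uint8_t* bytes = static_cast<const std::uint8_t*>(input);
  // A real implementation would switch to a parallel join when use_tbb is
  // set; this toy computes the same result on both paths.
  (void)use_tbb;
  for (std::size_t i = 0; i < len; ++i) {
    self->state = self->state * 31 + bytes[i];
  }
}

// Public single-threaded entry point: forwards with the flag off.
void toy_hasher_update(toy_hasher* self, const void* input, std::size_t len) {
  toy_hasher_update_base(self, input, len, false);
}

// Public multi-threaded entry point: forwards with the flag on.
void toy_hasher_update_tbb(toy_hasher* self, const void* input,
                           std::size_t len) {
  toy_hasher_update_base(self, input, len, true);
}
```

The point of the pattern is that existing callers keep the original signature, and both entry points must produce identical output for the same input.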

Design Considerations

One potential issue with these changes is that blake3_compress_subtree_wide is no longer static and now has external linkage. If the user compiles libblake3 as a shared library, BLAKE3_PRIVATE will cause the linker to complain about hidden visibility, but it won't do that if they compile as a static library.

I'm not sure if this is really a problem or not.

One potentially radical solution would be to just make blake3.c into blake3.cpp and then we wouldn't need the separate TU for TBB support. This would have some impact on portability and may not be a good trade off given the project goals.

Another option could be to make blake3_compress_subtree_wide_join_tbb accept a function pointer to static blake3_compress_subtree_wide. But then blake3_compress_subtree_wide_join_tbb still has external visibility in a static library which we can't really eliminate.
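To make the function-pointer option concrete, here is a rough single-file sketch. The signature is heavily simplified and every name in it (subtree_fn, join_via_pointer, copy_worker, split_and_join) is hypothetical; the comments mark which translation unit each piece would belong to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified worker signature (the real one takes more arguments).
using subtree_fn = std::size_t (*)(const std::uint8_t* input, std::size_t len,
                                   std::uint8_t* out);

// Would live in the C++ TU. It never names the worker; it only calls through
// the pointer, so the worker itself can stay static in the C TU.
void join_via_pointer(subtree_fn worker,
                      const std::uint8_t* l_input, std::size_t l_len,
                      std::uint8_t* l_out, std::size_t* l_n,
                      const std::uint8_t* r_input, std::size_t r_len,
                      std::uint8_t* r_out, std::size_t* r_n) {
  assert(worker != nullptr);
  // A parallel backend would run these two calls concurrently.
  *l_n = worker(l_input, l_len, l_out);
  *r_n = worker(r_input, r_len, r_out);
}

// Would live in the C TU, with internal linkage.
static std::size_t copy_worker(const std::uint8_t* input, std::size_t len,
                               std::uint8_t* out) {
  for (std::size_t i = 0; i < len; ++i) out[i] = input[i];
  return len;
}

// Caller in the C TU: passes its static worker by address.
std::size_t split_and_join(const std::uint8_t* input, std::size_t len,
                           std::uint8_t* out) {
  const std::size_t l_len = len / 2;
  std::size_t l_n = 0;
  std::size_t r_n = 0;
  join_via_pointer(copy_worker, input, l_len, out, &l_n,
                   input + l_len, len - l_len, out + l_len, &r_n);
  return l_n + r_n;
}
```

As noted above, this only moves the problem: the worker regains internal linkage, but join_via_pointer itself is still an externally visible symbol.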

BurningEnlightenment

First of all, thank you for contributing.

Aside from high-level decisions like the API shape or whether we want to add oneTBB as a dependency, there are a few technical and hygiene issues with the current changeset.

c/CMakeLists.txt

c/blake3.h

c/blake3_impl.h

c/CMakeLists.txt

silvanshade · 2025-02-10T12:49:46Z

Thanks for the feedback.

I have some additional changes incoming as part of some refactoring I've already been doing to clean things up a bit and to work better with the upcoming mmap PR I plan to submit. I'll also address some of the suggestions you made with those changes.

c/blake3_tbb.cpp

oconnor663 · 2025-02-10T21:25:34Z

We definitely need something like this. I've been kicking the can down the road for years, and I'm glad you've come along and done it :) Some scattered thoughts:

How would you summarize the OpenMP vs oneTBB tradeoff? I see that oneTBB is used by common compilers and by mold, so that's a vote of confidence. On the other hand, bringing a C++ compiler into the build is kind of a drag.
I care a lot about the "compile by hand" workflow described in https://github.com/BLAKE3-team/BLAKE3/tree/master/c#example and https://github.com/BLAKE3-team/BLAKE3/tree/master/c#building. Thanks for keeping that unchanged for folks who don't use threading. For folks who do, I'm curious how that's going to look. Does something like this seem right?
```
$ g++ -c -O3 blake3_tbb.cpp -o blake3_tbb.o
$ gcc -O3 -o example example.c blake3.c blake3_dispatch.c blake3_portable.c \
    blake3_sse2_x86-64_unix.S blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S \
    blake3_avx512_x86-64_unix.S blake3_tbb.o -ltbb -lstdc++
```
Is there a way to do this in one command instead of two? (The intrinsics builds already require multiple commands, so that's not the end of the world.)
On the Rust side of things there's an update_mmap_rayon method, and it seems like something similar (blake3_hasher_update_mmap_tbb?) is what the vast majority of callers will want. Is that something like what you're planning?

sneves · 2025-02-10T21:49:02Z

How would you summarize the OpenMP vs oneTBB tradeoff? I see that oneTBB is used by common compilers and by mold, so that's a vote of confidence. On the other hand, bringing a C++ compiler into the build is kind of a drag.

I wondered the same. With OpenMP I think the right way to do it is

   // Recurse! If this implementation adds multi-threading support in the
   // future, this is where it will go.
-  size_t left_n = blake3_compress_subtree_wide(input, left_input_len, key,
-                                               chunk_counter, flags, cv_array);
-  size_t right_n = blake3_compress_subtree_wide(
-      right_input, right_input_len, key, right_chunk_counter, flags, right_cvs);
-
+  size_t left_n = -1;
+  size_t right_n = -1;
+  #pragma omp taskgroup
+  {
+    #pragma omp task shared(left_n, cv_array)
+    left_n = blake3_compress_subtree_wide(input, left_input_len, key, chunk_counter, flags, cv_array);
+    #pragma omp task shared(right_n, right_cvs)                              
+    right_n = blake3_compress_subtree_wide(right_input, right_input_len, key, right_chunk_counter, flags, right_cvs);
+  }
   // The special case again. If simd_degree=1, then we'll have left_n=1 and
   // right_n=1. Rather than compressing them into a single output, return
   // them directly, to make sure we always have at least two outputs.
@@ -342,8 +347,13 @@ INLINE void compress_subtree_to_parent_node(
 #endif
 
   uint8_t cv_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN];
-  size_t num_cvs = blake3_compress_subtree_wide(input, input_len, key,
+  size_t num_cvs = -1;
+  #pragma omp parallel
+  {
+    #pragma omp single nowait
+    num_cvs = blake3_compress_subtree_wide(input, input_len, key,
                                                 chunk_counter, flags, cv_array);
+  }
   assert(num_cvs <= MAX_SIMD_DEGREE_OR_2);

and add -fopenmp to the compiler switches to enable it. The result is a mixed bag:

The baseline time to hash a 3-4 GB file on my laptop is ~0.27s with b3sum / TBB.
With Clang's OpenMP implementation I get ~0.32s.
With GCC's OpenMP implementation I get ~0.85s (!)

Long story short, I don't think OpenMP implementations are tuned for this kind of task-based parallelism? Perf traces show a lot of spinlock activity on the GCC implementation; I'm not sure why.

On another note, the TBB implementation could be simplified to

oneapi::tbb::parallel_invoke(
    [=] { *l_n = blake3_compress_subtree_wide(l_input, l_input_len, key, l_chunk_counter, flags, l_cvs, use_tbb); },
    [=] { *r_n = blake3_compress_subtree_wide(r_input, r_input_len, key, r_chunk_counter, flags, r_cvs, use_tbb); }
);

silvanshade · 2025-02-10T23:28:43Z

How would you summarize the OpenMP vs oneTBB tradeoff? I see that oneTBB is used by common compilers and by mold, so that's a vote of confidence. On the other hand, bringing a C++ compiler into the build is kind of a drag.

Honestly I'm not an expert in either and only just learned oneTBB for this specific feature.

But my impression (which seems to be validated by the comment from @sneves) is that OpenMP is generally not suitable for nested parallelism, which is exactly how the problem is being solved here with Rayon and the oneTBB implementation.

Most of the performance recommendations I see regarding OpenMP suggest avoiding nested parallelism at all costs because it's known to be inefficient.

The general recommendation (again shown by @sneves below) seems to be to prefer tasks and related functionality instead of nested parallelism, but even then the performance isn't always great, whereas the oneTBB implementation is within margin of error of Rayon.

But even if we could somehow make the OpenMP task approach work, the other main problem is that OpenMP is poorly supported on MSVC/Windows, since Microsoft only supports version 2.0, which doesn't even have the task functionality.

Support on macOS is also a bit problematic since Apple disables OpenMP support in Apple Clang, I believe, so users would also have to jump through extra hoops there.

Does something like this seem right?

Yes, that generally looks right. I can add some documentation for exactly what commands should work for compiling by hand.

Is there a way to do this in one command instead of two?

You might be able to do it with some additional flags like -x<language> and related but I'm not sure. It'd probably be more verbose and less portable though.

On the Rust side of things there's an update_mmap_rayon method, and it seems like something similar (blake3_hasher_update_mmap_tbb?) is what the vast majority of callers will want. Is that something like what you're planning?

Yes, this is exactly what I intend.

The upcoming mmap PR is based on llfio. Unfortunately that's another C++ library, but I chose it by criteria similar to those for oneTBB: it's fast, portable, flexible, battle-tested, etc.

oconnor663 · 2025-02-11T19:33:20Z

Yes, that generally looks right. I can add some documentation for exactly what commands should work for compiling by hand.

That would be awesome, thanks. I'd like to think that I have all our non-CMake downstream users in mind, but the sad truth is that I've just never learned CMake properly. (And @BurningEnlightenment enables my laziness by being so on top of things.)

oconnor663 · 2025-02-11T19:38:51Z

OpenMP is generally not suitable for nested parallelism, which is exactly how the problem is being solved here with Rayon and the oneTBB implementation ... the other main problem is that OpenMP is poorly supported on MSVC/Windows ... Support on macOS is also a bit problematic

Got it. Sounds clear enough to me.

silvanshade · 2025-02-11T23:27:13Z

I made some more changes renaming the CMake options to BLAKE3_USE_TBB and BLAKE3_FETCH_TBB and updated the build documentation.

silvanshade · 2025-02-17T20:09:03Z

@BurningEnlightenment Are there any remaining changes requested? I think I've addressed all of the issues raised so far and marked them as resolved.

BurningEnlightenment · 2025-02-17T20:21:34Z

@silvanshade I haven't had time to re-review. I'll try to take a look tomorrow. Sorry to string you along.

silvanshade · 2025-03-04T20:47:46Z

I rebased the PR to fix the merge conflict and made one additional change where I modified the manual build instructions to use pkg-config for tbb.

silvanshade · 2025-03-04T21:10:03Z

If you remove reference to the implementation, that precludes the possibility of adding another alternate parallel backend in the future and being able to access both within the same application unless you complicate the design.

OTOH it basically enshrines an implementation detail in the public API. If Intel decides to discontinue oneTBB, we might be forced to switch to a different library. In turn our users would be forced to change their code for no practical gain.

For the record I think this is unlikely to happen given that TBB has been around a lot longer than BLAKE3 and is quite widespread in usage in several central pieces of software.

It's also included as part of the oneAPI specification under the direction of the UXL Foundation, which has several steering members aside from Intel.

I also see no reason why an application would want to explicitly refer to the parallelization library.

The main reason is to allow for dynamic selection of the parallel runtime.

If BLAKE3 were to, for example, add a GPU-accelerated or heterogeneous compute backend, it would not be hard to imagine an application that might choose to use different algorithms at different times where it's known that one will be more efficient than another based on the current workload.

One could also imagine a GUI application with a widget for selecting which algorithm to use.

If you make the selection build-configuration based you no longer allow for that possibility.

If different parallelization approaches exhibit sufficiently different trade-offs, it'd make more sense to name the trade-off or defining property in the API name. Switching between similar backends shouldn't be more involved than a recompilation with different options.

If you make the naming characteristic-based it still suffers from this issue of reduced granularity because you could have multiple backends of the same category.

I don't have a strong opinion about the specific naming per se, my concern is more about flexibility.

One option that might satisfy your concern would be to use an enum parameter to control the backend. This would allow for being precise and selecting the specific backend but also allow us to provide a default option that could be changed in the future with less possibility of affecting code downstream.
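A rough sketch of what such an enum parameter could look like follows. None of these names exist in BLAKE3; toy_backend and toy_resolve_backend are invented for illustration, and silently falling back to the serial path when TBB is unavailable is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical backend selector; none of these names exist in BLAKE3.
enum toy_backend {
  TOY_BACKEND_DEFAULT = 0,  // library decides; the choice may change later
  TOY_BACKEND_SERIAL = 1,   // always single-threaded
  TOY_BACKEND_TBB = 2,      // explicitly request the oneTBB backend
};

// Resolve the caller's request to a concrete backend in one place, so that
// changing the default later does not touch downstream code.
toy_backend toy_resolve_backend(toy_backend requested, bool tbb_available) {
  assert(requested >= TOY_BACKEND_DEFAULT && requested <= TOY_BACKEND_TBB);
  if (requested == TOY_BACKEND_DEFAULT) {
    return tbb_available ? TOY_BACKEND_TBB : TOY_BACKEND_SERIAL;
  }
  // Assumed policy for this sketch: fall back instead of failing.
  if (requested == TOY_BACKEND_TBB && !tbb_available) {
    return TOY_BACKEND_SERIAL;
  }
  return requested;
}
```

With this shape, an application can still pick a backend at run time, while code that passes TOY_BACKEND_DEFAULT is insulated from backend changes.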

c/CMakeLists.txt

oconnor663 · 2025-03-09T23:11:03Z

I also see no reason why an application would want to explicitly refer to the parallelization library.

The reason you might care that, for example, the Rust *_rayon methods use Rayon internally is that it's possible to configure the global Rayon thread pool, and it's also possible to create a local thread pool and run arbitrary code in that context. I don't know whether oneTBB has similar features, but my friendly neighborhood LLM suggests it does :)

silvanshade · 2025-03-10T00:00:30Z

I also see no reason why an application would want to explicitly refer to the parallelization library.

I don't know whether oneTBB has similar features, but my friendly neighborhood LLM suggests it does :)

Correct, it does. This is also what I was referring to with the backend specific set up. I think most such frameworks probably offer similar functionality.

From a code readability and maintainability perspective, given an example application which uses the parallel hashing, I think it would be clearer to see the parallelism framework being mentioned in the function name rather than trying to infer which framework is implicitly being selected by the BLAKE3 library based on the current build configuration, which will involve more non-local information.

oconnor663 · 2025-03-10T00:41:23Z

I've got an example branch that integrates this feature with the blake3_c_rust_bindings test harness in this repo. We could include that in this PR, or I could land it after, either is fine with me. Here's the commit: 6d05313. (Integrating the LLFIO feature is still a question mark for me, in terms of how to fetch it. That commit just assumes that TBB is installed globally.)

oconnor663 · 2025-03-11T19:51:20Z

c/blake3_tbb.cpp

+      [=]() {
+        *r_n = blake3_compress_subtree_wide(
+            r_input, r_input_len, key, r_chunk_counter, flags, r_cvs, use_tbb);
+      });


I think you've copied what I did on the Rust/Rayon side of things, which is to keep spawning tasks "all the way down", until input_len <= blake3_simd_degree() * BLAKE3_CHUNK_LEN. In other words, each leaf task hashes 1024 bytes / 16 blocks per SIMD lane. When I benchmark this stuff, that's in the neighborhood of 2 microseconds per leaf task, which does seem awfully small. On the other hand, I've never been able to measure any significant speedup from hardcoding single-threaded execution below some looser bound (e.g. 128 KiB). So I don't think this needs to change, but since we're looking at it I'm curious if any other folks have thoughts about task spawning overhead.
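To put rough numbers on the leaf-task question, here is a small counting sketch. It assumes a chunk length of 1024 bytes and a split rule where the left subtree takes the largest power-of-two number of chunks that still leaves at least one byte on the right, which is my understanding of the BLAKE3 tree layout; the function names and the SIMD degree of 16 used in the note below are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kChunkLen = 1024;  // assumed value of BLAKE3_CHUNK_LEN

// Largest power of two that is <= x, for x >= 1.
std::size_t round_down_to_power_of_2(std::size_t x) {
  assert(x >= 1);
  std::size_t p = 1;
  while (p <= x / 2) p *= 2;
  return p;
}

// Count the leaf tasks produced when splitting continues until
// len <= leaf_len (leaf_len plays the role of simd_degree * chunk length).
std::size_t count_leaf_tasks(std::size_t len, std::size_t leaf_len) {
  assert(leaf_len >= kChunkLen);
  if (len <= leaf_len) return 1;
  // Left side: largest power-of-two chunk count leaving the right non-empty.
  const std::size_t full_chunks = (len - 1) / kChunkLen;
  const std::size_t left_len = round_down_to_power_of_2(full_chunks) * kChunkLen;
  return count_leaf_tasks(left_len, leaf_len) +
         count_leaf_tasks(len - left_len, leaf_len);
}
```

Under these assumptions, a SIMD degree of 16 gives 16 KiB leaves, so a 1 GiB input yields 65,536 leaf tasks; at roughly 2 microseconds each that is about 0.13 seconds of leaf work in total, spread across all worker threads. Raising the cutoff to 128 KiB would cut the task count by a factor of 8.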

silvanshade · 2025-03-11T19:54:11Z

I've got an example branch that integrates this feature with the blake3_c_rust_bindings test harness in this repo. We could include that in this PR, or I could land it after, either is fine with me. Here's the commit: 6d05313. (Integrating the LLFIO feature is still a question mark for me, in terms of how to fetch it. That commit just assumes that TBB is installed globally.)

Sounds good. I'll add that commit to the PR here.

oconnor663 · 2025-03-11T20:06:47Z

@silvanshade Update: 97ce1c8 (https://github.com/BLAKE3-team/BLAKE3/tree/rust_bindings_tbb) is a better commit, with functioning CI.

oconnor663 · 2025-03-13T19:16:34Z

LGTM! Thanks for putting so much work into this. I have some edits I want to make to the c/README.md (if you can forgive me, I'm going to move blake3_hasher_update_tbb to the "Less Common Functions" section), and I'll do those as a follow-up commit.

@silvanshade

Changes since 1.6.1:
- The C implementation has gained multithreading support, based on Intel's oneTBB library. This works similarly to the Rayon-based multithreading used in the Rust implementation. See c/README.md for details. Contributed by @silvanshade (#445).
- The Rust implementation has gained a WASM SIMD backend, gated by the `wasm32_simd` Cargo feature. Under Wasmtime on my laptop, this is a 6x performance improvement for large inputs. This backend is currently Rust-only. Contributed by @monoid (#341).
- Fixed cross-compilation builds targeting Windows with cargo-xwin. Contributed by @Sporif and @toothbrush7777777 (#230).
- Added `b3sum --tag`, which changes the output format. This is for compatibility with GNU checksum tools (which use the same flag) and BSD checksum tools (which use the output format this flag turns on). Contributed by @leahneukirchen (#453) and @dbohdan (#430).

lelik107 · 2025-04-03T07:00:07Z

@silvanshade @oconnor663
Sorry for disturbing you, guys, but I'm neither a coder nor a developer.
But is there a way for an end user to get a working sample of multi-threaded C code compiled with TBB support?

silvanshade · 2025-04-03T14:16:15Z

@lelik107 There is an example at https://github.com/BLAKE3-team/BLAKE3/blob/master/c/example_tbb.c

You can compile and run it like this:

# from the BLAKE3 directory
cmake --fresh -S c -B c/build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=c/build/install -DBLAKE3_USE_TBB=ON -DBLAKE3_EXAMPLES=ON
cmake --build c/build --target install
c/build/bin/install/blake3-example-tbb FILES

lelik107 · 2025-04-03T16:58:03Z

@silvanshade Thank you, but if the developers specifically mentioned this PR in the v1.7.0 changelog and, as I see from #457, decided that OpenMP isn't very appropriate for BLAKE3, why then haven't they provided the C binaries with TBB for the Windows users, who are not very used to building the code themselves, to test the performance against the Rust Rayon implementation, for example?
@sneves, being a C professional, is more than capable of doing this; I know this from using BLAKE2 for a decade.

silvanshade · 2025-04-03T17:49:02Z

@lelik107 Compiled libraries aren't provided for any of the platforms directly as far as I know. That's usually up to the package maintainers. It would probably be better to open up a separate issue if you'd like to discuss that in particular.

BurningEnlightenment · 2025-04-03T18:00:22Z

why then haven't they provided the C binaries with TBB for the Windows users, who are not very used to building the code themselves, to test the performance against the Rust Rayon implementation

There is no official end-user b3sum C implementation; you should always use the Rust-based one.

lelik107 · 2025-04-04T04:11:26Z

@silvanshade Thank you, I understood.
#465

silvanshade force-pushed the tbb-parallelism branch 5 times, most recently from 60ae84d to e890259 Compare February 9, 2025 21:26

BurningEnlightenment requested changes Feb 10, 2025

BurningEnlightenment reviewed Feb 10, 2025

silvanshade force-pushed the tbb-parallelism branch from e890259 to 6225f9c Compare February 11, 2025 04:11

silvanshade requested a review from BurningEnlightenment February 11, 2025 04:19

silvanshade force-pushed the tbb-parallelism branch from 6225f9c to 05531b0 Compare February 11, 2025 18:31

silvanshade force-pushed the tbb-parallelism branch from 05531b0 to e06c0c5 Compare February 11, 2025 23:25

silvanshade force-pushed the tbb-parallelism branch 6 times, most recently from 9b98827 to 87412d4 Compare February 12, 2025 03:59

silvanshade mentioned this pull request Feb 12, 2025

Implement llfio-based memory-mapped IO for C lib #446

Closed

silvanshade force-pushed the tbb-parallelism branch from 87412d4 to d765804 Compare February 12, 2025 05:34

silvanshade requested a review from oconnor663 February 17, 2025 20:03

silvanshade mentioned this pull request Feb 17, 2025

Add BLAKE3 hashing algorithm NixOS/nix#12379

Merged

silvanshade force-pushed the tbb-parallelism branch from 3ff1fc4 to 10ed141 Compare March 4, 2025 20:45

silvanshade force-pushed the tbb-parallelism branch from 10ed141 to eba3b87 Compare March 4, 2025 21:35

oconnor663 reviewed Mar 9, 2025

oconnor663 reviewed Mar 11, 2025

silvanshade force-pushed the tbb-parallelism branch from eba3b87 to 61ea63f Compare March 12, 2025 22:21

silvanshade and others added 2 commits March 12, 2025 16:25

Implement TBB-based parallelism for C lib

da95d24

tbb support in blake3_c_rust_bindings

58d13d6

silvanshade force-pushed the tbb-parallelism branch from 61ea63f to 58d13d6 Compare March 12, 2025 22:25

silvanshade requested a review from oconnor663 March 12, 2025 22:26

oconnor663 merged commit 057586a into BLAKE3-team:master Mar 13, 2025
63 checks passed

silvanshade mentioned this pull request Mar 16, 2025

libblake3: 1.6.1 -> 1.7.0; enable TBB multi-threading support NixOS/nixpkgs#390458

Merged

1 task

ibmibmibm mentioned this pull request Mar 19, 2025

Implement OpenMP-based parallelism for C lib #457

Open

samuelburnham mentioned this pull request Mar 21, 2025

perf: Blake3 C vs Rust impl argumentcomputer/ix#29

Closed

BurningEnlightenment mentioned this pull request Apr 2, 2025

Fix TBB transitive dependencies for libblake3 #460

Merged

check4game mentioned this pull request Oct 3, 2025

Multi-threaded С Binaries with TBB/OMP #525

Open
