@silvanshade
Contributor

silvanshade commented Feb 8, 2025

This PR implements oneTBB-based parallelism for the C implementation.

The implementation is essentially the same as how the Rayon one works: we recursively create a task_group on the work stealing scheduler to run closures for the left and right calculations and wait on their results.
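As a rough illustration of that left/right recursion, here is a minimal sketch that uses neither oneTBB nor any BLAKE3 code. It swaps in std::async from the C++ standard library for the task_group and a plain byte sum for the hash; sum_subtree, kLeafLen, and the depth cap are made up for the example. Unlike a work-stealing scheduler, std::async(std::launch::async) starts a thread per call, which is why the sketch limits how deep it forks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical leaf size: inputs this small are processed serially.
constexpr std::size_t kLeafLen = 1024;

// Toy stand-in for hashing a subtree: sums the bytes in [data, data + len).
// `depth` caps how many levels fork, because std::async with launch::async
// starts a real thread per call instead of queuing work on a shared pool.
inline std::uint64_t sum_subtree(const std::uint8_t* data, std::size_t len,
                                 int depth) {
  assert(data != nullptr || len == 0);
  if (len <= kLeafLen || depth == 0) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < len; ++i) total += data[i];
    return total;
  }
  const std::size_t left_len = len / 2;
  // Fork: run the left half on another thread.
  std::future<std::uint64_t> left =
      std::async(std::launch::async, sum_subtree, data, left_len, depth - 1);
  // Compute the right half on the current thread in the meantime.
  const std::uint64_t right =
      sum_subtree(data + left_len, len - left_len, depth - 1);
  // Join: wait for the left result before combining.
  return left.get() + right;
}
```

With a depth of 3 and a large enough buffer, a call such as sum_subtree(buf.data(), buf.size(), 3) forks at three levels, so at most seven extra threads are started.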

I haven't included benchmarks here but informal testing on a Zen4 7950X workstation shows nearly identical performance to the Rayon implementation for large memory-mapped input.

I have another PR I'll submit soon which also adds mmap support and will try to include benchmarks there.

Related Issues

Rationale for oneTBB

Rationale for choosing oneTBB versus some other multi-threading framework:

  • fast
  • easy to integrate
  • portable
  • flexible: oneTBB has a lot of additional functionality we could potentially use in future refinements
  • widely adopted: Clang and GCC use it for parallel C++ and it's also used for mold

The downside, if you consider it one, is that it's C++ where the rest of libblake3 is C.

Summary of Changes

Here is a summary of changes I've made to integrate TBB:

  • Added blake3_tbb.cpp which defines blake3_compress_subtree_wide_join_tbb
  • Removed static from blake3_compress_subtree_wide so it can be called from the C++ translation unit (TU)
  • Declared those two functions as BLAKE3_PRIVATE in blake3_impl.h
  • Renamed blake3_hasher_update to blake3_hasher_update_base and added use_tbb param
  • Defined blake3_hasher_update and blake3_hasher_update_tbb which call the _base function
  • Updated the documentation to mention multi-threading and blake3_hasher_update_tbb
  • Refactored c/CMakeLists.txt:
    • Added support for finding (or fetching) oneTBB
    • Modified handling of BLAKE3_NO_* options to control compiled sources (avoids need for rm as with Makefile tests)
    • Added target for compiling example.c executable
    • Added target for compiling main.c executable
    • Added ctest test target for running c/test.py
  • Refactored CI c_tests job to use CMake instead of Makefile (for TBB compatibility)
  • Refactored CI cmake_build job:
    • Fixed matrix selection of different compilers (didn't do anything before)
    • Added more configurations for Windows
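
The _base/wrapper arrangement from the list above can be sketched roughly as follows. This is not the code from the PR: toy_hasher and the toy_* functions are hypothetical stand-ins, and the body only records what was requested instead of hashing anything.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for blake3_hasher; the real struct holds hash state.
struct toy_hasher {
  std::size_t bytes_seen = 0;
  bool last_call_was_parallel = false;
};

// Shared implementation: one code path, with a flag choosing the join strategy.
static void toy_hasher_update_base(toy_hasher* self, const void* input,
                                   std::size_t input_len, bool use_tbb) {
  assert(self != nullptr);
  (void)input;  // A real implementation would compress the input here.
  self->bytes_seen += input_len;
  self->last_call_was_parallel = use_tbb;
}

// Serial entry point: the signature existing callers already use.
void toy_hasher_update(toy_hasher* self, const void* input,
                       std::size_t input_len) {
  toy_hasher_update_base(self, input, input_len, /*use_tbb=*/false);
}

// Parallel entry point: same arguments, different strategy.
void toy_hasher_update_tbb(toy_hasher* self, const void* input,
                           std::size_t input_len) {
  toy_hasher_update_base(self, input, input_len, /*use_tbb=*/true);
}
```

Keeping the flag out of the public signatures means callers who never opt into threading see no API change.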

Design Considerations

One potential issue with these changes is that blake3_compress_subtree_wide is no longer static and now has external linkage. If the user compiles libblake3 as a shared library, BLAKE3_PRIVATE will cause the linker to complain about hidden visibility, but it won't do that if they compile it as a static library.

I'm not sure if this is really a problem or not.

One potentially radical solution would be to just make blake3.c into blake3.cpp, and then we wouldn't need the separate TU for TBB support. This would have some impact on portability and may not be a good trade-off given the project goals.

Another option could be to make blake3_compress_subtree_wide_join_tbb accept a function pointer to static blake3_compress_subtree_wide. But then blake3_compress_subtree_wide_join_tbb still has external visibility in a static library which we can't really eliminate.
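
A minimal sketch of that function-pointer option, with deliberately simplified signatures: subtree_fn, toy_join, and toy_worker are hypothetical names, the real blake3_compress_subtree_wide takes more parameters, and the two calls inside toy_join run serially here where a parallel backend would run them as separate tasks.

```cpp
#include <cassert>
#include <cstddef>

// Signature of the worker that the join helper calls back into (hypothetical).
using subtree_fn = std::size_t (*)(const unsigned char* input, std::size_t len);

// Would live in the C++ TU. It still needs external linkage so the C side can
// call it, but it no longer needs to see the worker's symbol.
void toy_join(subtree_fn worker, const unsigned char* left,
              std::size_t left_len, const unsigned char* right,
              std::size_t right_len, std::size_t* left_n,
              std::size_t* right_n) {
  assert(worker != nullptr);
  // A parallel backend would run these two calls as separate tasks.
  *left_n = worker(left, left_len);
  *right_n = worker(right, right_len);
}

// Would live in the C TU and can stay static (internal linkage), because only
// its address crosses the TU boundary.
static std::size_t toy_worker(const unsigned char* input, std::size_t len) {
  if (len <= 4) return 1;  // Leaf: pretend this produces one output.
  const std::size_t half = len / 2;
  std::size_t l = 0;
  std::size_t r = 0;
  toy_join(toy_worker, input, half, input + half, len - half, &l, &r);
  return l + r;
}
```

As noted above, this hides the worker but not the join helper, which remains visible in a static library.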

@silvanshade silvanshade force-pushed the tbb-parallelism branch 5 times, most recently from 60ae84d to e890259 Compare February 9, 2025 21:26
Collaborator

@BurningEnlightenment BurningEnlightenment left a comment

First of all, thank you for contributing.

Aside from high-level decisions like the API shape or whether we want to add oneTBB as a dependency, there are a few technical and hygiene issues with the current changeset.

@silvanshade
Contributor Author

Thanks for the feedback.

I have some additional changes incoming as part of some refactoring I've already been doing to clean things up a bit and to work better with the upcoming mmap PR I plan to submit. I'll also address some of the suggestions you made with those changes.

@oconnor663
Member

We definitely need something like this. I've been kicking the can down the road for years, and I'm glad you've come along and done it :) Some scattered thoughts:

  • How would you summarize the OpenMP vs oneTBB tradeoff? I see that oneTBB is used by common compilers and by mold, so that's a vote of confidence. On the other hand, bringing a C++ compiler into the build is kind of a drag.

  • I care a lot about the "compile by hand" workflow described in https://github.com/BLAKE3-team/BLAKE3/tree/master/c#example and https://github.com/BLAKE3-team/BLAKE3/tree/master/c#building. Thanks for keeping that unchanged for folks who don't use threading. For folks who do, I'm curious how that's going to look. Does something like this seem right?

    $ g++ -c -O3 blake3_tbb.cpp -o blake3_tbb.o
    $ gcc -O3 -o example example.c blake3.c blake3_dispatch.c blake3_portable.c \
        blake3_sse2_x86-64_unix.S blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S \
        blake3_avx512_x86-64_unix.S blake3_tbb.o -ltbb -lstdc++
    

    Is there a way to do this in one command instead of two? (The intrinsics builds already require multiple commands, so that's not the end of the world.)

  • On the Rust side of things there's an update_mmap_rayon method, and it seems like something similar (blake3_hasher_update_mmap_tbb?) is what the vast majority of callers will want. Is that something like what you're planning?

@sneves
Collaborator

sneves commented Feb 10, 2025

  • How would you summarize the OpenMP vs oneTBB tradeoff? I see that oneTBB is used by common compilers and by mold, so that's a vote of confidence. On the other hand, bringing a C++ compiler into the build is kind of a drag.

I wondered the same. With OpenMP I think the right way to do it is

   // Recurse! If this implementation adds multi-threading support in the
   // future, this is where it will go.
-  size_t left_n = blake3_compress_subtree_wide(input, left_input_len, key,
-                                               chunk_counter, flags, cv_array);
-  size_t right_n = blake3_compress_subtree_wide(
-      right_input, right_input_len, key, right_chunk_counter, flags, right_cvs);
-
+  size_t left_n = -1;
+  size_t right_n = -1;
+  #pragma omp taskgroup
+  {
+    #pragma omp task shared(left_n, cv_array)
+    left_n = blake3_compress_subtree_wide(input, left_input_len, key, chunk_counter, flags, cv_array);
+    #pragma omp task shared(right_n, right_cvs)                              
+    right_n = blake3_compress_subtree_wide(right_input, right_input_len, key, right_chunk_counter, flags, right_cvs);
+  }
   // The special case again. If simd_degree=1, then we'll have left_n=1 and
   // right_n=1. Rather than compressing them into a single output, return
   // them directly, to make sure we always have at least two outputs.
@@ -342,8 +347,13 @@ INLINE void compress_subtree_to_parent_node(
 #endif
 
   uint8_t cv_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN];
-  size_t num_cvs = blake3_compress_subtree_wide(input, input_len, key,
+  size_t num_cvs = -1;
+  #pragma omp parallel
+  {
+    #pragma omp single nowait
+    num_cvs = blake3_compress_subtree_wide(input, input_len, key,
                                                 chunk_counter, flags, cv_array);
+  }
   assert(num_cvs <= MAX_SIMD_DEGREE_OR_2);

and add -fopenmp to the compiler switches to enable it. The result is a mixed bag:

  • The baseline time to hash a 3-4 GB file on my laptop is ~0.27s with b3sum / TBB.
  • With Clang's OpenMP implementation I get ~0.32s.
  • With GCC's OpenMP implementation I get ~0.85s (!)

Long story short, I don't think OpenMP implementations are tuned for this kind of task-based parallelism? Perf traces show a lot of spinlock activity on the GCC implementation; I'm not sure why.

On another note, the TBB implementation could be simplified to

oneapi::tbb::parallel_invoke(
    [=] { *l_n = blake3_compress_subtree_wide(l_input, l_input_len, key, l_chunk_counter, flags, l_cvs, use_tbb); },
    [=] { *r_n = blake3_compress_subtree_wide(r_input, r_input_len, key, r_chunk_counter, flags, r_cvs, use_tbb); }
);

@silvanshade
Contributor Author

  • How would you summarize the OpenMP vs oneTBB tradeoff? I see that oneTBB is used by common compilers and by mold, so that's a vote of confidence. On the other hand, bringing a C++ compiler into the build is kind of a drag.

Honestly I'm not an expert in either and only just learned oneTBB for this specific feature.

But my impression (which seems to be validated by the comment from @sneves) is that OpenMP is generally not suitable for nested parallelism, which is exactly how the problem is being solved here with Rayon and the oneTBB implementation.

Most of the performance recommendations I see regarding OpenMP suggest avoiding nested parallelism at all costs because it's known to be inefficient.

The general recommendation (again shown by @sneves above) seems to be to prefer tasks and related functionality instead of nested parallelism, but even then the performance isn't always great, whereas the oneTBB implementation is within margin of error of Rayon.

But if we could make the OpenMP task approach somehow work, the other main problem is that OpenMP is poorly supported on MSVC/Windows since Microsoft only supports version 2.0 which doesn't even have the task functionality.

Support on macOS is also a bit problematic, since I believe Apple disables OpenMP support in Apple Clang, so users would have to jump through extra hoops there too.

  • Does something like this seem right?

Yes, that generally looks right. I can add some documentation for exactly what commands should work for compiling by hand.

Is there a way to do this in one command instead of two?

You might be able to do it with some additional flags like -x<language> and related but I'm not sure. It'd probably be more verbose and less portable though.

  • On the Rust side of things there's an update_mmap_rayon method, and it seems like something similar (blake3_hasher_update_mmap_tbb?) is what the vast majority of callers will want. Is that something like what you're planning?

Yes, this is exactly what I intend.

The upcoming mmap PR is based on llfio. Unfortunately that's another C++ library, but I chose it using criteria similar to those for oneTBB: it's fast, portable, flexible, battle-tested, etc.

@oconnor663
Member

Yes, that generally looks right. I can add some documentation for exactly what commands should work for compiling by hand.

That would be awesome, thanks. I'd like to think that I have all our non-CMake downstream users in mind, but the sad truth is that I've just never learned CMake properly. (And @BurningEnlightenment enables my laziness by being so on top of things.)

@oconnor663
Member

OpenMP is generally not suitable for nested parallelism, which is exactly how the problem is being solved here with Rayon and the oneTBB implementation ... the other main problem is that OpenMP is poorly supported on MSVC/Windows ... Support on macOS is also a bit problematic

Got it. Sounds clear enough to me.

@silvanshade
Contributor Author

I made some more changes renaming the CMake options to BLAKE3_USE_TBB and BLAKE3_FETCH_TBB and updated the build documentation.

@silvanshade
Contributor Author

@BurningEnlightenment Are there any remaining changes requested? I think I've addressed all of the issues raised so far and marked them as resolved.

@BurningEnlightenment
Collaborator

@silvanshade I haven't had time to re-review. I'll try to take a look tomorrow. Sorry to string you along.

@silvanshade
Contributor Author

I rebased the PR to fix the merge conflict and made one additional change where I modified the manual build instructions to use pkg-config for tbb.

@silvanshade
Contributor Author

If you remove reference to the implementation, that precludes the possibility of adding another alternate parallel backend in the future and being able to access both within the same application unless you complicate the design.

OTOH it basically enshrines an implementation detail in the public API. If Intel decides to discontinue oneTBB, we might be forced to switch to a different library. In turn our users would be forced to change their code for no practical gain.

For the record I think this is unlikely to happen given that TBB has been around a lot longer than BLAKE3 and is quite widespread in usage in several central pieces of software.

It's also included as part of the oneAPI specification under the direction of the UXL Foundation, which has several steering members aside from Intel.

I also see no reason why an application would want to explicitly refer to the parallelization library.

The main reason is to allow for dynamic selection of the parallel runtime.

If BLAKE3 were to, for example, add a GPU-accelerated or heterogeneous compute backend, it would not be hard to imagine an application that might choose to use different algorithms at different times where it's known that one will be more efficient than another based on the current workload.

One could also imagine a GUI application with a widget for selecting which algorithm to use.

If you make the selection build-configuration based you no longer allow for that possibility.

If different parallelization approaches exhibit sufficiently different trade-offs, it'd make more sense to name the trade-off or defining property in the API name. Switching between similar backends shouldn't be more involved than a recompilation with different options.

If you make the naming characteristic-based it still suffers from this issue of reduced granularity because you could have multiple backends of the same category.

I don't have a strong opinion about the specific naming per se, my concern is more about flexibility.

One option that might satisfy your concern would be to use an enum parameter to control the backend. This would allow for being precise and selecting the specific backend but also allow us to provide a default option that could be changed in the future with less possibility of affecting code downstream.
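
A rough sketch of what such an enum parameter might look like. None of these names exist in libblake3; toy_backend and toy_resolve_backend are hypothetical, and the silent fallback to serial is just one possible policy (a real API might report an error instead).

```cpp
#include <cassert>

// Hypothetical backend selector; not part of the BLAKE3 API.
enum toy_backend {
  TOY_BACKEND_DEFAULT = 0,  // Library picks; the mapping may change later.
  TOY_BACKEND_SERIAL,
  TOY_BACKEND_TBB,
};

// Resolves a request to a concrete backend. `tbb_available` models whether
// the library was built with TBB support.
inline toy_backend toy_resolve_backend(toy_backend requested,
                                       bool tbb_available) {
  switch (requested) {
    case TOY_BACKEND_DEFAULT:
      // The default can be remapped in a future release without touching
      // callers that never named a specific backend.
      return tbb_available ? TOY_BACKEND_TBB : TOY_BACKEND_SERIAL;
    case TOY_BACKEND_TBB:
      // Assumed policy: fall back to serial if TBB was not compiled in.
      return tbb_available ? TOY_BACKEND_TBB : TOY_BACKEND_SERIAL;
    case TOY_BACKEND_SERIAL:
      return TOY_BACKEND_SERIAL;
  }
  return TOY_BACKEND_SERIAL;
}
```

Callers that need a specific runtime name it explicitly, while callers passing TOY_BACKEND_DEFAULT pick up whatever the library considers best.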

@oconnor663
Member

oconnor663 commented Mar 9, 2025

I also see no reason why an application would want to explicitly refer to the parallelization library.

The reason you might care that, for example, the Rust *_rayon methods use Rayon internally is that it's possible to configure the global Rayon thread pool, and it's also possible to create a local thread pool and run arbitrary code in that context. I don't know whether oneTBB has similar features, but my friendly neighborhood LLM suggests it does :)

@silvanshade
Contributor Author

silvanshade commented Mar 10, 2025

I also see no reason why an application would want to explicitly refer to the parallelization library.

I don't know whether oneTBB has similar features, but my friendly neighborhood LLM suggests it does :)

Correct, it does. This is also what I was referring to with the backend-specific setup. I think most such frameworks probably offer similar functionality.

From a code readability and maintainability perspective, given an example application which uses the parallel hashing, I think it would be clearer to see the parallelism framework being mentioned in the function name rather than trying to infer which framework is implicitly being selected by the BLAKE3 library based on the current build configuration, which will involve more non-local information.

@oconnor663
Member

I've got an example branch that integrates this feature with the blake3_c_rust_bindings test harness in this repo. We could include that in this PR, or I could land it after, either is fine with me. Here's the commit: 6d05313. (Integrating the LLFIO feature is still a question mark for me, in terms of how to fetch it. That commit just assumes that TBB is installed globally.)

[=]() {
  *r_n = blake3_compress_subtree_wide(
      r_input, r_input_len, key, r_chunk_counter, flags, r_cvs, use_tbb);
});
Member

I think you've copied what I did on the Rust/Rayon side of things, which is to keep spawning tasks "all the way down", until input_len <= blake3_simd_degree() * BLAKE3_CHUNK_LEN. In other words, each leaf task hashes 1024 bytes / 16 blocks per SIMD lane. When I benchmark this stuff, that's in the neighborhood of 2 microseconds per leaf task, which does seem awfully small. On the other hand, I've never been able to measure any significant speedup from hardcoding single-threaded execution below some looser bound (e.g. 128 KiB). So I don't think this needs to change, but since we're looking at it I'm curious if any other folks have thoughts about task spawning overhead.
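
To put rough numbers on the two bounds mentioned above, here is a small sketch that counts how many leaf tasks a given input would produce. It assumes a SIMD degree of 16 (so a 16 KiB leaf) and assumes the split rule is "the largest power-of-two number of chunks that still leaves at least one byte for the right side"; kChunkLen, left_len, and count_leaf_tasks are names chosen for the example, not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kChunkLen = 1024;  // BLAKE3 chunk size in bytes

// Largest power of two that is <= n (n must be nonzero).
inline std::size_t round_down_to_power_of_2(std::size_t n) {
  assert(n != 0);
  std::size_t p = 1;
  while (p <= n / 2) p *= 2;
  return p;
}

// Assumed split rule: the left side takes the largest power-of-two number of
// chunks that still leaves at least one byte for the right side.
inline std::size_t left_len(std::size_t content_len) {
  assert(content_len > kChunkLen);
  const std::size_t full_chunks = (content_len - 1) / kChunkLen;
  return round_down_to_power_of_2(full_chunks) * kChunkLen;
}

// Number of leaf tasks if the recursion stops splitting at `leaf_len` bytes.
inline std::size_t count_leaf_tasks(std::size_t input_len,
                                    std::size_t leaf_len) {
  assert(leaf_len >= kChunkLen);
  if (input_len <= leaf_len) return 1;
  const std::size_t left = left_len(input_len);
  return count_leaf_tasks(left, leaf_len) +
         count_leaf_tasks(input_len - left, leaf_len);
}
```

Under these assumptions a 1 MiB input yields 64 leaf tasks with a 16 KiB leaf and 8 with a 128 KiB leaf, so the looser bound cuts the number of spawned tasks by a factor of eight while each task does eight times the work.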

@silvanshade
Contributor Author

I've got an example branch that integrates this feature with the blake3_c_rust_bindings test harness in this repo. We could include that in this PR, or I could land it after, either is fine with me. Here's the commit: 6d05313. (Integrating the LLFIO feature is still a question mark for me, in terms of how to fetch it. That commit just assumes that TBB is installed globally.)

Sounds good. I'll add that commit to the PR here.

@oconnor663
Member

@silvanshade Update: 97ce1c8 (https://github.com/BLAKE3-team/BLAKE3/tree/rust_bindings_tbb) is a better commit, with functioning CI.

@oconnor663
Member

oconnor663 commented Mar 13, 2025

LGTM! Thanks for putting so much work into this. I have some edits I want to make to the c/README.md (if you can forgive me, I'm going to move blake3_hasher_update_tbb to the "Less Common Functions" section), and I'll do those as a follow-up commit.

@oconnor663 oconnor663 merged commit 057586a into BLAKE3-team:master Mar 13, 2025
63 checks passed
oconnor663 added a commit that referenced this pull request Mar 18, 2025
Changes since 1.6.1:
- The C implementation has gained multithreading support, based on
  Intel's oneTBB library. This works similarly to the Rayon-based
  multithreading used in the Rust implementation. See c/README.md for
  details. Contributed by @silvanshade (#445).
- The Rust implementation has gained a WASM SIMD backend, gated by the
  `wasm32_simd` Cargo feature. Under Wasmtime on my laptop, this is a 6x
  performance improvement for large inputs. This backend is currently
  Rust-only. Contributed by @monoid (#341).
- Fixed cross-compilation builds targeting Windows with cargo-xwin.
  Contributed by @Sporif and @toothbrush7777777 (#230).
- Added `b3sum --tag`, which changes the output format. This is for
  compatibility with GNU checksum tools (which use the same flag) and
  BSD checksum tools (which use the output format this flag turns on).
  Contributed by @leahneukirchen (#453) and @dbohdan (#430).
@lelik107

lelik107 commented Apr 3, 2025

@silvanshade @oconnor663
Sorry for disturbing you, guys, but I'm neither a coder nor a developer.
Is there a way for an end user to get a working sample of multi-threaded C code compiled with TBB support?

@silvanshade
Contributor Author

silvanshade commented Apr 3, 2025

@lelik107 There is an example at https://github.com/BLAKE3-team/BLAKE3/blob/master/c/example_tbb.c

You can compile and run it like this:

# from the BLAKE3 directory
cmake --fresh -S c -B c/build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=c/build/install -DBLAKE3_USE_TBB=ON -DBLAKE3_EXAMPLES=ON
cmake --build c/build --target install
c/build/install/bin/blake3-example-tbb FILES

@lelik107

lelik107 commented Apr 3, 2025

@silvanshade Thank you, but if the developers specifically mentioned this PR in the v1.7.0 changelog and, as I see from here:
#457
decided that OpenMP isn't very appropriate for BLAKE3, why haven't they provided C binaries with TBB for Windows users, who are not used to building the code themselves, so they can test the performance against the Rust Rayon implementation, for example?
@sneves, being a C professional, is more than capable of doing this; I know this from using BLAKE2 for a decade.

@silvanshade
Contributor Author

@lelik107 Compiled libraries aren't provided for any of the platforms directly as far as I know. That's usually up to the package maintainers. It would probably be better to open up a separate issue if you'd like to discuss that in particular.

@BurningEnlightenment
Collaborator

why then they haven't provided the C binaries with TBB for the Windows users, who are not very used to build the code themselves to test the performance against Rust Rayon implementation

There is no official end-user b3sum C implementation; you should always use the Rust-based one.

@lelik107

lelik107 commented Apr 4, 2025

@silvanshade Thank you, I understood.
#465
