Implement oneTBB-based parallelism for C lib #445
Conversation
(force-pushed 60ae84d to e890259)
BurningEnlightenment left a comment:
First of all, thank you for contributing.
Aside from high-level decisions like the API shape or whether we want to add oneTBB as a dependency, there are a few technical and hygiene issues with the current changeset.
Thanks for the feedback. I have some additional changes incoming as part of some refactoring I've already been doing to clean things up a bit and to work better with the upcoming mmap PR I plan to submit. I'll also address some of the suggestions you made with those changes.
We definitely need something like this. I've been kicking the can down the road for years, and I'm glad you've come along and done it :) Some scattered thoughts:
I wondered the same. With OpenMP I think the right way to do it is:

```diff
   // Recurse! If this implementation adds multi-threading support in the
   // future, this is where it will go.
-  size_t left_n = blake3_compress_subtree_wide(input, left_input_len, key,
-                                               chunk_counter, flags, cv_array);
-  size_t right_n = blake3_compress_subtree_wide(
-      right_input, right_input_len, key, right_chunk_counter, flags, right_cvs);
+  size_t left_n = -1;
+  size_t right_n = -1;
+  #pragma omp taskgroup
+  {
+    #pragma omp task shared(left_n, cv_array)
+    left_n = blake3_compress_subtree_wide(input, left_input_len, key, chunk_counter, flags, cv_array);
+    #pragma omp task shared(right_n, right_cvs)
+    right_n = blake3_compress_subtree_wide(right_input, right_input_len, key, right_chunk_counter, flags, right_cvs);
+  }
   // The special case again. If simd_degree=1, then we'll have left_n=1 and
   // right_n=1. Rather than compressing them into a single output, return
   // them directly, to make sure we always have at least two outputs.
```

and add:

```diff
@@ -342,8 +347,13 @@ INLINE void compress_subtree_to_parent_node(
 #endif
   uint8_t cv_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN];
-  size_t num_cvs = blake3_compress_subtree_wide(input, input_len, key,
+  size_t num_cvs = -1;
+  #pragma omp parallel
+  {
+    #pragma omp single nowait
+    num_cvs = blake3_compress_subtree_wide(input, input_len, key,
                                            chunk_counter, flags, cv_array);
+  }
   assert(num_cvs <= MAX_SIMD_DEGREE_OR_2);
```
Long story short, I don't think OpenMP implementations are tuned for this kind of task-based parallelism? Perf traces show a lot of spinlock activity on the GCC implementation; I'm not sure why.

On another note, the TBB implementation could be simplified to:

```cpp
oneapi::tbb::parallel_invoke(
    [=] { *l_n = blake3_compress_subtree_wide(l_input, l_input_len, key, l_chunk_counter, flags, l_cvs, use_tbb); },
    [=] { *r_n = blake3_compress_subtree_wide(r_input, r_input_len, key, r_chunk_counter, flags, r_cvs, use_tbb); });
```
Honestly I'm not an expert in either and only just learned oneTBB for this specific feature. But my impression (which seems to be validated by the comment from @sneves) is that OpenMP is generally not suitable for nested parallelism, which is exactly how the problem is being solved here with Rayon and the oneTBB implementation. Most of the performance recommendations I see regarding OpenMP suggest avoiding nested parallelism at all costs because it's known to be inefficient. The general recommendation (again shown by @sneves below) seems to be to prefer tasks and related functionality instead of nested parallelism, but even then the performance isn't always great, whereas the oneTBB implementation is within margin of error of Rayon. Even if we could somehow make the OpenMP task approach work, the other main problem is that OpenMP is poorly supported on MSVC/Windows, since Microsoft only supports version 2.0, which doesn't even have the task functionality. Support on macOS is also a bit problematic, since Apple disables OpenMP support in Apple Clang (I believe), so users would have to jump through extra hoops there as well.
Yes, that generally looks right. I can add some documentation for exactly what commands should work for compiling by hand.
You might be able to do it with some additional flags like
Yes, this is exactly what I intend. The upcoming mmap PR is based on llfio. Unfortunately it's another C++ library, but again with similar criteria for choosing it as with oneTBB: it's fast, portable, flexible, battle-tested, etc.
(force-pushed e890259 to 6225f9c)
(force-pushed 6225f9c to 05531b0)
That would be awesome, thanks. I'd like to think that I have all our non-CMake downstream users in mind, but the sad truth is that I've just never learned CMake properly. (And @BurningEnlightenment enables my laziness by being so on top of things.)
Got it. Sounds clear enough to me.
(force-pushed 05531b0 to e06c0c5)
I made some more changes renaming the CMake options to
(force-pushed 9b98827 to 87412d4)
(force-pushed 87412d4 to d765804)
@BurningEnlightenment Are there any remaining changes requested? I think I've addressed all of the issues raised so far and marked them as resolved.
@silvanshade I haven't had time to re-review. I'll try to take a look tomorrow. Sorry to string you along. |
(force-pushed 3ff1fc4 to 10ed141)
I rebased the PR to fix the merge conflict and made one additional change where I modified the manual build instructions to use
For the record, I think this is unlikely to happen, given that TBB has been around a lot longer than BLAKE3 and is in widespread use in several central pieces of software. It's also included as part of the oneAPI specification under the direction of the UXL Foundation, which has several steering members aside from Intel.
The main reason is to allow for dynamic selection of the parallel runtime. If BLAKE3 were to, for example, add a GPU-accelerated or heterogeneous-compute backend, it would not be hard to imagine an application that chooses different algorithms at different times, when it's known that one will be more efficient than another for the current workload. One could also imagine a GUI application with a widget for selecting which algorithm to use. If you make the selection build-configuration-based, you no longer allow for that possibility.
If you make the naming characteristic-based, it still suffers from this issue of reduced granularity, because you could have multiple backends of the same category. I don't have a strong opinion about the specific naming per se; my concern is more about flexibility. One option that might satisfy your concern would be to use an enum parameter to control the backend. This would allow for being precise and selecting the specific backend, but also allow us to provide a default option that could be changed in the future with less possibility of affecting code downstream.
(force-pushed 10ed141 to eba3b87)
The reason you might care that for example the Rust
Correct, it does. This is also what I was referring to with the backend-specific setup. I think most such frameworks probably offer similar functionality. From a code readability and maintainability perspective, given an example application which uses the parallel hashing, I think it would be clearer to see the parallelism framework mentioned in the function name, rather than trying to infer which framework is implicitly being selected by the BLAKE3 library based on the current build configuration, which involves more non-local information.
I've got an example branch that integrates this feature with the
```cpp
    [=]() {
      *r_n = blake3_compress_subtree_wide(
          r_input, r_input_len, key, r_chunk_counter, flags, r_cvs, use_tbb);
    });
```
I think you've copied what I did on the Rust/Rayon side of things, which is to keep spawning tasks "all the way down", until input_len <= blake3_simd_degree() * BLAKE3_CHUNK_LEN. In other words, each leaf task hashes 1024 bytes / 16 blocks per SIMD lane. When I benchmark this stuff, that's in the neighborhood of 2 microseconds per leaf task, which does seem awfully small. On the other hand, I've never been able to measure any significant speedup from hardcoding single-threaded execution below some looser bound (e.g. 128 KiB). So I don't think this needs to change, but since we're looking at it I'm curious if any other folks have thoughts about task spawning overhead.
Sounds good. I'll add that commit to the PR here.
@silvanshade Update: 97ce1c8 (https://github.com/BLAKE3-team/BLAKE3/tree/rust_bindings_tbb) is a better commit, with functioning CI. |
(force-pushed eba3b87 to 61ea63f)
(force-pushed 61ea63f to 58d13d6)
LGTM! Thanks for putting so much work into this. I have some edits I want to make to the
Changes since 1.6.1:

- The C implementation has gained multithreading support, based on Intel's oneTBB library. This works similarly to the Rayon-based multithreading used in the Rust implementation. See c/README.md for details. Contributed by @silvanshade (#445).
- The Rust implementation has gained a WASM SIMD backend, gated by the `wasm32_simd` Cargo feature. Under Wasmtime on my laptop, this is a 6x performance improvement for large inputs. This backend is currently Rust-only. Contributed by @monoid (#341).
- Fixed cross-compilation builds targeting Windows with cargo-xwin. Contributed by @Sporif and @toothbrush7777777 (#230).
- Added `b3sum --tag`, which changes the output format. This is for compatibility with GNU checksum tools (which use the same flag) and BSD checksum tools (which use the output format this flag turns on). Contributed by @leahneukirchen (#453) and @dbohdan (#430).
@silvanshade @oconnor663 |
@lelik107 There is an example at https://github.com/BLAKE3-team/BLAKE3/blob/master/c/example_tbb.c You can compile and run it like this:

```shell
# from the BLAKE3 directory
cmake --fresh -S c -B c/build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=c/build/install -DBLAKE3_USE_TBB=ON -DBLAKE3_EXAMPLES=ON
cmake --build c/build --target install
c/build/bin/install/blake3-example-tbb FILES
```
@silvanshade Thank you, but if the developers specifically mentioned this PR in the v1.7.0 changelog, and as I see from here:
@lelik107 Compiled libraries aren't provided for any of the platforms directly, as far as I know. That's usually up to the package maintainers. It would probably be better to open a separate issue if you'd like to discuss that in particular.
There is no official end-user b3sum C implementation; you should always use the Rust-based one.
@silvanshade Thank you, I understood. |
This PR implements oneTBB-based parallelism for C.

The implementation is essentially the same as how the Rayon one works: we recursively create a `task_group` on the work-stealing scheduler to `run` closures for the `left` and `right` calculations and `wait` on their results.

I haven't included benchmarks here, but informal testing on a Zen4 7950X workstation shows nearly identical performance to the Rayon implementation for large memory-mapped input. I have another PR I'll submit soon which also adds mmap support and will try to include benchmarks there.
Related Issues

- `b3sum` crate into a library (with an executable) #132

Rationale for oneTBB

Rationale for choosing `oneTBB` versus some other multi-threading framework:

- `oneTBB` has a lot of additional functionality we could potentially use in future refinements

The downside, if you consider it one, is that it's C++ where the rest of `libblake3` is C.

Summary of Changes
Here is a summary of changes I've made to integrate TBB:

- Added `blake3_tbb.cpp`, which defines `blake3_compress_subtree_wide_join_tbb`
- Removed `static` from `blake3_compress_subtree_wide` so it can be called from the C++ TU
- Added `BLAKE3_PRIVATE` in `blake3_impl.h`
- Renamed `blake3_hasher_update` to `blake3_hasher_update_base` and added a `use_tbb` param
- Added `blake3_hasher_update` and `blake3_hasher_update_tbb`, which call the `_base` function
- `blake3_hasher_update_tbb`
- Changes to `c/CMakeLists.txt`:
  - `oneTBB`
  - `BLAKE3_NO_*` options to control compiled sources (avoids need for `rm` as with Makefile tests)
  - `example.c` executable
  - `main.c` executable
  - `c/test.py`
- Updated the `c_tests` job to use CMake instead of Makefile (for TBB compatibility)
- `cmake_build` job:

Design Considerations
One potential issue with these changes is that `blake3_compress_subtree_wide` is no longer `static` and has external visibility for linkage. If the user compiles `libblake3` as a shared library, `BLAKE3_PRIVATE` will cause the linker to complain about hidden visibility, but it won't do that if they compile as a static library. I'm not sure if this is really a problem or not.

One potentially radical solution would be to just make `blake3.c` into `blake3.cpp`, and then we wouldn't need the separate TU for TBB support. This would have some impact on portability and may not be a good trade-off given the project goals.

Another option could be to make `blake3_compress_subtree_wide_join_tbb` accept a function pointer to `static blake3_compress_subtree_wide`. But then `blake3_compress_subtree_wide_join_tbb` still has external visibility in a static library, which we can't really eliminate.