-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Implement ZStandard Stream, Encoder, Decoder #119575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
src/native/external/zstd.patch
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be upstreamed. You can link the PR in -version.txt file (see libunwind-version.txt). We've avoided adding patch files like these to the repo in the past.
| internal SafeZstdCDictHandle? _dictionary; | ||
| internal MemoryHandle? _prefixHandle; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unusual to see SafeHandles holding anything more than just the handle/pointer. What's the purpose of these fields, and how do they get used for interop? Same comment applies to all types in this file.
Typically the pattern for a safehandle is something very simple -- some IntPtr that represents data that has a special method required for closing. You only have the SafeHandle responsible for that data. If you have multiple different handles like this, create one for each unique type that has it's unique close method. If you need to group handles together, consider a higher level type that aggregates the handles rather than the handles themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ZstandardDictionaries, we share the underlying byte array between ZSTD_DDict and ZSTD_CDict to save some memory, so we need to ensure that the array will outlive both of these object, the way this is achieved is by each (C|D)Dict object holding onto its own GCHandle for the byte array that keeps it alive at least until we free the native objects. We have been using using similar pattern to keep some data alive in SslStream safe handles.
Similar story is happening for MemoryHandles stored in SafeZstd(Dec|C)ompressHandle, there the _prefixHandle is coming from SetPrefix method. Here we need to make sure the ReadOnlyMemory passed to SetPrefix does not move/deallocate while Encoder/Decoder is still referencing it.
That being said, I am not 100% sure if disposing MemoryHandle from ReleaseHandle is valid, as MemoryHandle may reference other Managed objects (IPinable) that may(?) be finalizable. But at the same time, the MemoryManager implementations we have seem to be resilient against this case....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd want to get an interop/GC expert to weigh in here. These sort of cases are hard to test and prove out and easy to miss some nuance in how they work. It's better if we can stick to some hard-fast rules which are known to be safe.
For ZstandardDictionaries, we share the underlying byte array between ZSTD_DDict and ZSTD_CDict to save some memory, so we need to ensure that the array will outlive both of these object,
Can't you make the wrapping managed object be responsible for that lifetime, so that it would ref-count both the handle and the byte array? Make the safe-handles only responsible for the zstandard-owned objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you make the wrapping managed object be responsible for that lifetime, so that it would ref-count both the handle and the byte array? Make the safe-handles only responsible for the zstandard-owned objects?
I think that does not solve the problem when somebody disposes e.g. Encoder while the compression is in progress. The ZSTD_CCTX handle would survive due to refcounting around P/Invoke stubs, but we would drop the GC handle for the Prefix and allow GC to move it around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd want to get an interop/GC expert to weigh in here. These sort of cases are hard to test and prove out and easy to miss some nuance in how they work. It's better if we can stick to some hard-fast rules which are known to be safe.
I'd feel more confident as well, but I am not sure who would be the best person to ask.
src/libraries/System.IO.Compression.Zstandard/Directory.Build.props
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression.Zstandard/src/System.IO.Compression.Zstandard.csproj
Outdated
Show resolved
Hide resolved
| /// </summary> | ||
| /// <param name="size">The size of the source data in bytes.</param> | ||
| /// <remarks> | ||
| /// This method must be called before writing data to the stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, if this always must be set for compression, should we make it a constructor parameter? Requiring up-front knowledge of size seems to limit possible streaming scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not need to be set, you can stream data without needing to know the length upfront, but then the uncompressed length is not available in the header.
If you want the compressed data to contain length information in the header, then you must specify it before you start compressing data.
We chose not to put it in the options/ctor parameter, because you can reuse options for multiple Encoders (as well as the encoder itself via Reset())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case please improve the remarks so that they don't use "must", or qualify it to make the consequences clear. Describe what happens if set after writing data, or if never setting at all.
| <UseSystemZlib Condition="'$(UseSystemZlib)' == '' and !Exists('$(IlcFrameworkNativePath)libz.a')">true</UseSystemZlib> | ||
| <!-- Use libbrotlicommon.a as the sentinel for the three brotli libs. --> | ||
| <UseSystemBrotli Condition="'$(UseSystemBrotli)' == '' and !Exists('$(IlcFrameworkNativePath)libbrotlicommon.a')">true</UseSystemBrotli> | ||
| <UseSystemZstd Condition="'$(UseSystemZstd)' == '' and !Exists('$(IlcFrameworkNativePath)libzstd.a')">true</UseSystemZstd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UseSystemXX is paired with CLR_CMAKE_USE_SYSTEM_XX support for distros to use system installed lib. That typically happens when we are confident that distro provided version is working on major distros like Fedora and Ubuntu. See git grep CLR_CMAKE_USE_SYSTEM_BROTLI. In cases where we have many functionality-related patches to make the lib work (like llvm-libunwind has tons of local / hard-to-upstream patches: #72655), distros cannot use the system provided variant and we use the regular in-tree version.
TODOs:
- Add
CLR_CMAKE_USE_SYSTEM_ZSTDsupport in src/native/libs and VMR'srepo-projects/runtime.proj(next toCLR_CMAKE_USE_SYSTEM_BROTLI) - Test with system provided libs at least on fedora and ubuntu
- If the test fails, remove UseSystemZstd in NAOT buildintegration and file an issue to follow up
- Otherwise, check-in those (
CLR_CMAKE_USE_SYSTEM_ZSTD) changes.
|
Edit: seem to be fixed now There seems to be some problems with linux x86 build that I am unable to debug on my own. The build fails because it fails to copy libzstd.a to the coreclr build output, because it attempts to do so before the library is built. The weird part for me is that on all other builds that I checked, the "Installing: ...." steps are happening at the very end of the build, but for this particular pipeline they are happening sometime in the middle, see e.g. https://dev.azure.com/dnceng-public/public/_build/results?buildId=1214394&view=logs&jobId=d82b64b5-3dbf-5e48-910e-ffa3eb07d628&j=d82b64b5-3dbf-5e48-910e-ffa3eb07d628&t=3aaa1282-8fdd-59be-7f23-95b71c611ad2 I think I have exhausted the more reasonable debugging Ideas, Can somebody assist here? Maybe @jkoritzinsky? Your name pops up a lot in CMakeLists git blame. |
| set_property(TARGET libzstd_static APPEND PROPERTY COMPILE_DEFINITIONS "ZSTD_DLL_EXPORT=1") | ||
| endif() | ||
|
|
||
| if (ANDROID) | ||
| # qsort_r is not available during android build and zstd's autodetection does not seem | ||
| # to handle this case correctly | ||
| # This should no longer be needed once we update to zstd 1.5.8 in the future | ||
| set_property(TARGET libzstd_static APPEND PROPERTY COMPILE_DEFINITIONS "ZSTD_USE_C90_QSORT=1") | ||
| endif() | ||
|
|
||
| # disable warnings that occur in the zstd library | ||
| target_compile_options(libzstd_static PRIVATE $<$<COMPILE_LANG_AND_ID:C,MSVC>:/wd4242>) | ||
| target_compile_options(libzstd_static PRIVATE $<$<COMPILE_LANG_AND_ID:C,Clang,AppleClang,GNU>:-Wno-implicit-fallthrough>) No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't adding lib prefix in other places:
runtime/src/native/external/brotli.cmake
Line 32 in 59a29ee
| set_target_properties(brotli PROPERTIES EXCLUDE_FROM_ALL ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are referring to the target name (libzstd*), then those names are coming from zstd cmakefiles which we are not changing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is our cmake, I meant we should use target_compile_options(zstd_static PRIVATE $<$<COMPILE_LANG_AND_ID:C,Clang,AppleClang,GNU>:-Wno-implicit-fallthrough>) instead of libzstd_static etc. cmake automatically uses platform convention: libzstd_static on Unix and zstd_static on Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remove the lib prefix, I get
CMake Error at /source/dotnet/runtime/src/native/external/zstd.cmake:34 (target_compile_options):
Cannot specify compile options for target "zstd_static" which is not built
by this project.
Call Stack (most recent call first):
System.IO.Compression.Native/CMakeLists.txt:24 (include)
so I think we need to keep it as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, looks good. This is a problem with FetchContent approach that we have to inherit poorly authored config conventions. Projects which don't use this mechanism are libunwind and llvm-libunwind, our cmake for them are completely independent and we select exactly which files to compile and inherit our compiler configs out of eng/native.
Implements #59591
Ignore native code in src/native/external/zstd (verbatim copy of latest zstd release, included in the very first commit)
There are some patches applied to vendored zstd sources which are listed in src/native/external, neither of them should be necessary once we update to next release (maintainers didn't commit to a release date, but hopefully they will release something early next year).
Rest of the implementation mirrors the approach done for Brotli.