Description
Describe the bug
ZDICT_trainFromBuffer, and likewise ZDICT_optimizeTrainFromBuffer_fastCover, fail to select any dictionary on some inputs for which ZDICT_trainFromBuffer_fastCover with default parameters works well, without any warnings about the samples or their sizes. The problem seems to be in COVER_selectDict, which calls ZDICT_finalizeDictionary with dictBufferCapacity set to dictContentSize [1] [2]. Could this be a bug?
To Reproduce
Steps to reproduce the behavior:
- Download example.zip, which contains the files:
samples (flat buffer with samples), sizes (list of sizes of samples), data (single sample to compress)
- Run ZDICT_trainFromBuffer or ZDICT_optimizeTrainFromBuffer_fastCover with dictCapacity == 1000 and params:
ZDICT_fastCover_params_t {
steps = 4, f = 20, accel = 1, d = 8,
zParams = ZDICT_params_t {
notificationLevel = 4
}
}
- Check logs for 'Failed to select dictionary' warnings
- The result is ZSTD_error_GENERIC
Trace
[ZDICT_optimizeTrainFromBuffer_fastCover]
Trying 5 different sets of parameters
d=8
Training on 75 samples of total size 12225
Testing on 25 samples of total size 2825
Computing frequencies
k=50
[FASTCOVER_tryParameters]
[FASTCOVER_buildDictionary]
Breaking content into 20 epochs of size 610
dictBufferCapacity 1000, tail 850, dictBufferCapacity - tail == 150
[COVER_selectDict]
[ZDICT_finalizeDictionary]
dictBufferCapacity 150, dictContentSize 150, nbSamples 75 <==== (!)
Failed with dstSize_tooSmall (dictBufferCapacity < ZDICT_DICTSIZE_MIN)
Failed to select dictionary
20% k=537
Breaking content into 1 epochs of size 12218
Failed to select dictionary
40% k=1024
FASTCOVER parameters incorrect
k=1511
FASTCOVER parameters incorrect
k=1998
FASTCOVER parameters incorrect
Error (generic)
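The arithmetic in the first attempt of the trace (k=50) can be reproduced in isolation. The following is only a minimal model, not zstd code: the names `model_content_size`, `model_finalize_check` and `MODEL_DICTSIZE_MIN` are hypothetical, and the value 256 is an assumption about what ZDICT_DICTSIZE_MIN is in the zstd sources. It models only the size check named in the trace, plus a basic capacity-versus-content check.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors ZDICT_DICTSIZE_MIN in the zstd sources. */
#define MODEL_DICTSIZE_MIN 256

typedef enum { MODEL_OK, MODEL_DST_SIZE_TOO_SMALL } model_result;

/* Content size left after dictionary building, as printed in the trace:
 * dictBufferCapacity - tail. */
static size_t model_content_size(size_t dictBufferCapacity, size_t tail)
{
    assert(tail <= dictBufferCapacity);
    return dictBufferCapacity - tail;
}

/* Hypothetical stand-in for the size checks that reject the call in the trace. */
static model_result model_finalize_check(size_t dictBufferCapacity,
                                         size_t dictContentSize)
{
    if (dictBufferCapacity < dictContentSize) return MODEL_DST_SIZE_TOO_SMALL;
    if (dictBufferCapacity < MODEL_DICTSIZE_MIN) return MODEL_DST_SIZE_TOO_SMALL;
    return MODEL_OK;
}
```

With the numbers from the trace, `model_content_size(1000, 850)` gives 150. Passing 150 as both capacity and content size is rejected by this model, while passing the real capacity of 1000 with content size 150 is accepted.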
Expected behavior
A dictionary is expected. For comparison, running ZDICT_trainFromBuffer_fastCover with the default compression level and these params:
ZDICT_fastCover_params_t {
k = 50, f = 20, d = 8,
zParams = ZDICT_params_t {
notificationLevel = 4
}
}
results in:
Training on 100 samples of total size 15050
Testing on 100 samples of total size 15050
Computing frequencies
Building dictionary
Breaking content into 20 epochs of size 752
statistics ...
Constructed dictionary of size 255
Raw data size 1000
Compressed data without dict size 138
Compressed data with dict size 29
Desktop
- OS: Win10 x64 2004 with WSL2 Ubuntu 20.04
- Compiler: x86_64-w64-mingw32-gcc
- Flags: -DZSTD_MULTITHREAD -pthread -static-libgcc -s
- Build system: make
- Zstd version: 1.4.5 or dev branch
Additional context
So I tried passing the real dictBufferCapacity to COVER_selectDict and ZDICT_finalizeDictionary, and this works well on the specified input data:
Dictionary size 251
Raw data size 1000
Compressed data without dict size 138
Compressed data with dict size 26
C# code to generate the specified samples:
var buffer = Enumerable.Range(0, 100000).Select(i => unchecked((byte)(i * i))).ToArray();
var samples = Enumerable.Range(0, 100)
.Select(i => buffer.Skip(i).Take(200 - i).ToArray())
.ToArray();
var data = buffer.Skip(12345).Take(1000).ToArray();
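For anyone reproducing this directly from C, the generator above could be ported roughly as follows. This is a sketch under the assumption that the flat `samples` buffer is simply the samples concatenated in order; the function and constant names are hypothetical and not part of example.zip.

```c
#include <assert.h>
#include <stddef.h>

enum { NB_SAMPLES = 100, DATA_OFFSET = 12345, DATA_SIZE = 1000 };

/* buffer[i] in the C# code: low byte of i * i. Unsigned arithmetic gives the
 * same low byte as C#'s unchecked int multiplication. */
static unsigned char source_byte(unsigned i)
{
    return (unsigned char)((i * i) & 0xFFu);
}

/* Writes the 100 samples back to back into `samples` and their lengths into
 * `sizes`. Sample i is buffer.Skip(i).Take(200 - i). Returns the total size. */
static size_t make_samples(unsigned char *samples, size_t capacity,
                           size_t sizes[NB_SAMPLES])
{
    size_t pos = 0;
    for (unsigned i = 0; i < NB_SAMPLES; i++) {
        size_t len = 200u - i;
        assert(pos + len <= capacity);
        for (size_t j = 0; j < len; j++)
            samples[pos + j] = source_byte(i + (unsigned)j);
        sizes[i] = len;
        pos += len;
    }
    return pos;
}

/* data = buffer.Skip(12345).Take(1000) */
static void make_data(unsigned char data[DATA_SIZE])
{
    for (unsigned j = 0; j < DATA_SIZE; j++)
        data[j] = source_byte(DATA_OFFSET + j);
}
```

The total size returned by `make_samples` is 15050 bytes, which matches the "100 samples of total size 15050" line in the log above; the first 75 sample sizes sum to 12225, matching the training split in the trace.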