ZDICT_optimizeTrainFromBuffer_[fast]Cover fails to select dictionary on some inputs #2371

@dscheg

Describe the bug
ZDICT_trainFromBuffer and, correspondingly, ZDICT_optimizeTrainFromBuffer_fastCover fail to select any dictionary on some inputs for which ZDICT_trainFromBuffer_fastCover with default parameters works well, without any warnings about the samples or their sizes. The problem seems to be in COVER_selectDict, which calls ZDICT_finalizeDictionary with dictBufferCapacity set to dictContentSize [1] [2]. Could this be a bug?

To Reproduce
Steps to reproduce the behavior:

  1. Download example.zip. It contains three files: samples (a flat buffer of concatenated samples), sizes (the list of sample sizes), and data (a single sample to compress).
  2. Run ZDICT_trainFromBuffer or ZDICT_optimizeTrainFromBuffer_fastCover with dictCapacity == 1000 and the following params:
ZDICT_fastCover_params_t {
  steps = 4, f = 20, accel = 1, d = 8,
  zParams = ZDICT_params_t {
      notificationLevel = 4
  }
}
  3. Check the logs for 'Failed to select dictionary' warnings
  4. The result is ZSTD_error_GENERIC

Trace

[ZDICT_optimizeTrainFromBuffer_fastCover]
  Trying 5 different sets of parameters
  d=8
  Training on 75 samples of total size 12225
  Testing on 25 samples of total size 2825
  Computing frequencies
  k=50
  [FASTCOVER_tryParameters]
    [FASTCOVER_buildDictionary]
      Breaking content into 20 epochs of size 610
      dictBufferCapacity 1000, tail 850, dictBufferCapacity - tail == 150
    [COVER_selectDict]
      [ZDICT_finalizeDictionary]
        dictBufferCapacity 150, dictContentSize 150, nbSamples 75    <====    (!)
        Failed with dstSize_tooSmall (dictBufferCapacity < ZDICT_DICTSIZE_MIN)
    Failed to select dictionary
  20%       k=537
      Breaking content into 1 epochs of size 12218
    Failed to select dictionary
  40%       k=1024
  FASTCOVER parameters incorrect
  k=1511
  FASTCOVER parameters incorrect
  k=1998
  FASTCOVER parameters incorrect
Error (generic)
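
The k values in the trace (50, 537, 1024, 1511, 1998) are consistent with a linear grid over the range 50..2000 split into steps = 4 intervals. The sketch below reproduces that arithmetic. The bounds 50 and 2000 are inferred from the trace rather than taken from the zstd sources, and k_grid is a hypothetical helper, not a zstd function.

```c
#include <stddef.h>

/* Hypothetical helper: fills out[] with the k values of a linear grid
 * from k_min to k_max using (k_max - k_min) / steps as the stride.
 * Returns the number of values written.
 * Assumption: this mirrors how the optimizer walks k; the real loop
 * may differ in details. */
static unsigned k_grid(unsigned k_min, unsigned k_max, unsigned steps,
                       unsigned *out, unsigned out_cap)
{
    unsigned stride;
    unsigned n = 0;
    unsigned k;

    if (steps == 0 || k_max < k_min) return 0;

    stride = (k_max - k_min) / steps;
    if (stride == 0) stride = 1;   /* avoid an endless loop on tiny ranges */

    for (k = k_min; k <= k_max && n < out_cap; k += stride) {
        out[n++] = k;
    }
    return n;
}
```

Calling k_grid(50, 2000, 4, ks, 16) yields five values, which matches the "Trying 5 different sets of parameters" line. Of these, only k=50 and k=537 reach dictionary selection in the trace; the rest are rejected as incorrect parameters.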

Expected behavior
A dictionary is expected. Running ZDICT_trainFromBuffer_fastCover with the default compression level and the following params:

ZDICT_fastCover_params_t {
    k = 50, f = 20, d = 8,
    zParams = ZDICT_params_t {
        notificationLevel = 4
    }
}

results in:

Training on 100 samples of total size 15050
Testing on 100 samples of total size 15050
Computing frequencies
Building dictionary
Breaking content into 20 epochs of size 752
statistics ...
Constructed dictionary of size 255

Raw data size 1000
Compressed data without dict size 138
Compressed data with dict size 29

Desktop

  • OS: Win10 x64 2004 with WSL2 Ubuntu 20.04
  • Compiler: x86_64-w64-mingw32-gcc
  • Flags: -DZSTD_MULTITHREAD -pthread -static-libgcc -s
  • Build system: make
  • Zstd version: 1.4.5 or dev branch

Additional context
I tried passing the real dictBufferCapacity to COVER_selectDict and ZDICT_finalizeDictionary instead, and this works well on the input data above:

Dictionary size 251
Raw data size 1000
Compressed data without dict size 138
Compressed data with dict size 26
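
To illustrate why the capacity value matters, here is a minimal model of the one size check that the trace names (dictBufferCapacity < ZDICT_DICTSIZE_MIN). It is not the zstd implementation: model_capacity_check is a hypothetical name, and the minimum of 256 is my reading of zdict.h, so treat it as an assumption.

```c
#include <stddef.h>

/* Assumed value of ZDICT_DICTSIZE_MIN; check zdict.h for your version. */
#define MODEL_DICTSIZE_MIN 256

/* Hypothetical model of the check reported in the trace.
 * Returns 1 if the capacity passes, 0 if it would be rejected
 * with dstSize_tooSmall. */
static int model_capacity_check(size_t dictBufferCapacity)
{
    return dictBufferCapacity >= MODEL_DICTSIZE_MIN;
}
```

With the capacity from the trace (150, equal to dictContentSize) the check rejects the call, while the caller's real capacity (1000) passes, which is consistent with the result above.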

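C# code to generate the samples above:

using System.Linq;

// Source buffer: byte i is the low 8 bits of i * i (the int overflow is intentional).
var buffer = Enumerable.Range(0, 100000).Select(i => unchecked((byte)(i * i))).ToArray();
// 100 samples; sample i starts at offset i and has length 200 - i.
var samples = Enumerable.Range(0, 100)
    .Select(i => buffer.Skip(i).Take(200 - i).ToArray())
    .ToArray();
// Single 1000-byte sample to compress.
var data = buffer.Skip(12345).Take(1000).ToArray();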
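
For anyone who wants to reproduce the input without .NET, the same data can be generated in C. This is a sketch of a port, not code from the original report: all function and constant names are hypothetical, and it produces the flat buffer plus sizes array layout that the ZDICT training functions take.

```c
#include <stddef.h>
#include <stdint.h>

enum { NB_SAMPLES = 100, DATA_OFFSET = 12345, DATA_SIZE = 1000 };

/* Low 8 bits of i * i; unsigned arithmetic wraps, matching the
 * unchecked cast in the C# version. */
static uint8_t byte_at(uint32_t i)
{
    return (uint8_t)(i * i);
}

/* Writes the 100 samples back to back into flat[] and their lengths
 * into sizes[]. Sample i starts at source offset i and has length
 * 200 - i. Returns the total number of bytes written, or 0 if cap
 * is too small. */
static size_t make_samples(uint8_t *flat, size_t cap, size_t *sizes)
{
    size_t pos = 0;
    unsigned i;

    for (i = 0; i < NB_SAMPLES; i++) {
        size_t len = 200u - i;
        size_t j;
        if (pos + len > cap) return 0;
        for (j = 0; j < len; j++) {
            flat[pos + j] = byte_at((uint32_t)(i + j));
        }
        sizes[i] = len;
        pos += len;
    }
    return pos;
}

/* Fills out[0..DATA_SIZE) with the single sample to compress. */
static void make_data(uint8_t *out)
{
    size_t j;
    for (j = 0; j < DATA_SIZE; j++) {
        out[j] = byte_at((uint32_t)(DATA_OFFSET + j));
    }
}
```

The total written by make_samples is 15050 bytes, matching the "100 samples of total size 15050" line in the expected output.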
