Releases · NVIDIA/kvpress
v0.2.2
v0.2.1
v0.2.0
Transformers v4.48 introduced breaking changes that are handled in this release. The release also features `AdaKVPress`, the first press to allow head-wise compression, implemented by patching the attention functions registered in `ALL_ATTENTION_FUNCTIONS` since v4.48. Combined with `ExpectedAttentionPress`, `AdaKVPress` achieved the best results observed yet on the RULER benchmark (see this post).
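Below is a minimal usage sketch of that combination, assuming the `kv-press-text-generation` pipeline that kvpress registers on import and that `AdaKVPress` wraps a scorer press as the note above suggests; the model name and compression ratio are placeholders:

```python
from transformers import pipeline
from kvpress import AdaKVPress, ExpectedAttentionPress

# kvpress registers a custom "kv-press-text-generation" pipeline on import.
# Model name and device are placeholders, not recommendations.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cuda",
)

context = "..."  # long context whose KV cache is compressed during pre-filling
question = "..."  # question answered against the compressed cache

# AdaKVPress wraps a scorer press and reallocates the pruning budget head-wise.
press = AdaKVPress(ExpectedAttentionPress(compression_ratio=0.5))
answer = pipe(context, question=question, press=press)["answer"]
```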
v0.1.1
What's Changed
- #33 by @SimJeg fixes a small bug in the pipeline
- #36 by @maxjeblick pins the `transformers` dependency to `<4.48`
Full Changelog: v0.1.0...v0.1.1
v0.1.0
#24 by @maxjeblick and #29 by @SimJeg introduce a non-breaking refactoring:
- a press does not require the `compression_ratio` input argument anymore, as some presses do not explicitly require it (e.g. `ThinKPress`, `SimLayerKVPress`). However, every press must have a `compression_ratio` attribute after any forward pass (an assertion was added in tests) to allow measuring the average compression ratio on a benchmark
- the core compression logic has been moved from `BasePress.forward_hook` to `BasePress.compress`. `BasePress.forward_hook` now only checks whether `compress` must be called (pre-filling vs. decoding), de-quantizes the cache before `compress`, and re-quantizes it afterwards
- `BasePress` does not implement a `score` method anymore; this has been moved to `ScorerPress` along with the associated `ScorerPress.compress` method (see the sketch after this list)
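As a concrete illustration of the new split, here is a minimal sketch of a custom press: it only implements `score`, while `ScorerPress.compress` performs the pruning and `BasePress.forward_hook` decides when compression runs. The class is hypothetical, and the `score` signature is assumed to match the scorer presses shipped in the repository:

```python
import torch
from dataclasses import dataclass
from kvpress import ScorerPress

@dataclass
class RandomScorePress(ScorerPress):
    """Hypothetical press that scores KV pairs at random, so pruning keeps a random subset."""

    compression_ratio: float = 0.0

    def score(self, module, hidden_states, keys, values, attentions, kwargs):
        # One score per KV pair, shape (batch, num_kv_heads, seq_len);
        # ScorerPress.compress drops the lowest-scoring pairs.
        return torch.rand(*keys.shape[:-1], device=keys.device)
```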
Other features:
- Add `SimLayerKVPress`, #28 by @SimJeg and @dame-cell
- Add `ComposedPress`, #29 by @SimJeg (usage sketch below)
- Add `KeyReRotationPress`, #31 by @maxjeblick and @giulio98
- Fix `QuantizedCache`, #30 by @maxjeblick
- Add new tests, including an integration test on a sample from RULER
v0.0.4
v0.0.3
v0.0.2
Initial release
v0.0.1
install poetry in workflows (#1)