{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 1: Basics on Compression and the blosc2.NDArray Object" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import blosc2\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic compress2/decompress2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's create a simple NumPy tensor\n", "a = np.arange(1000_000).reshape(1000, 1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "compressed_a = blosc2.compress2(a)\n", "cbytes = len(compressed_a)\n", "nbytes = a.size * a.itemsize\n", "print(f\"cbytes: {cbytes}, cratio: {nbytes / cbytes:.2f}x\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a2 = blosc2.decompress2(compressed_a)\n", "print(f\"{type(a2)=}, {len(a2)=}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Restore the array: decompress2 returns raw bytes, so dtype and shape must be supplied by hand\n", "a3 = np.frombuffer(a2, dtype=a.dtype).reshape(a.shape)\n", "a3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fancier compression for tensors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "compressed_t = blosc2.pack_tensor(a)\n", "cbytes = len(compressed_t)\n", "print(f\"cbytes: {cbytes}, cratio: {nbytes / cbytes:.2f}x\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Restore (simpler: dtype and shape are stored alongside the data)\n", "a4 = blosc2.unpack_tensor(compressed_t)\n", "print(f\"{type(a4)=}, {len(a4)=}\")\n", "a4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compression params\n", "Let's see how to compress the same data whilst altering the compression parameters. 
This may be useful in many contexts, for example testing how changing the codec of an existing array affects the compression ratio." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cparams = blosc2.CParams(\n", "    codec=blosc2.Codec.LZ4,\n", "    clevel=9,\n", "    filters=[blosc2.Filter.SHUFFLE, blosc2.Filter.BYTEDELTA],\n", ")\n", "\n", "compressed_t = blosc2.pack_tensor(a, cparams=cparams)\n", "cbytes = len(compressed_t)\n", "print(f\"cbytes: {cbytes}, cratio: {nbytes / cbytes:.2f}x\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case the compression ratio is considerably lower than before, since we have changed to a different codec that is optimised for compression speed, not compression ratio. In general, there is a tradeoff between the two." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Native Blosc2 codecs\n", "Blosc2 supports many standard codecs, since there is no one-size-fits-all compression solution: one codec may be perfect for one context, but quite suboptimal in another.\n", "* ZLIB codec: uses the DEFLATE algorithm, is standard, and works well for images.\n", "* ZSTD codec: better compression ratio than ZLIB and faster compression/decompression (the default in Blosc2).\n", "* LZ4 codec: faster compression/decompression than ZSTD, but a reduced compression ratio.\n", "* BLOSCLZ codec: a new implementation of the simple FASTLZ algorithm; similar tradeoff to LZ4, but the latter is generally better.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**: Use different codecs (BLOSCLZ, LZ4, LZ4HC, ZLIB and ZSTD), clevels (0-9) and filters (SHUFFLE, BITSHUFFLE, BYTEDELTA) above and see how the cratios vary.\n", "\n", "Hint: the BYTEDELTA filter is meant to be applied after a SHUFFLE filter; otherwise, the results will be sub-optimal." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your solution here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, via package extensions to Blosc2, one may access the JPEG2000 family of compression algorithms, which aim for a compromise between compression ratio and image quality; Blosc2 implements plugins for [GROK](https://github.com/Blosc/blosc2_grok) and [OPENHTJ2K](https://github.com/Blosc/blosc2_openhtj2k) for a convenient way to access JPEG2000 compression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NDArray: An N-Dimensional, Compressed Data Container\n", "\n", "NDArray objects let users perform different operations on arrays, such as setting, copying or slicing them. In this section, we are going to see how to create and manipulate these NDArray arrays, which possess metadata and data. The data is *chunked* and *compressed*; the metadata gives information about the data itself, as well as the chunking and compression. Chunking and compression are the features that make NDArray arrays very efficient for working with large data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an array\n", "Let's start by creating a 2D array with 100M elements filled with ``arange``. We can then print out the metadata, which contains information about the array data (such as ``shape`` and ``dtype``) and about how the data is compressed and stored, such as chunk- and block-shapes (``chunks`` and ``blocks``) and compression params (``CParams``). 
See [here](https://www.blosc.org/python-blosc2/getting_started/overview.html) for an explanation of chunking and blocking.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shape = (10_000, 10_000)\n", "array = blosc2.arange(np.prod(shape), shape=shape, dtype=np.int32)\n", "print(array.info)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``cratio`` field tells us how effective the compression is, since it gives the ratio between the number of bytes required to store the array in uncompressed and compressed form. Here we require almost 300x less space for the compressed array! Note that all the compression and decompression parameters are set to their defaults, and ``chunks`` and ``blocks`` have been selected automatically; playing around with them will affect the ``cratio`` (as well as compression and decompression speed).\n", "\n", "We can also create an NDArray by compressing a NumPy array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nparray = np.linspace(0, 100, np.prod(shape), dtype=np.float64).reshape(shape)\n", "b2array = blosc2.asarray(nparray)\n", "print(b2array.info)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or from an iterator:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = 1000_000\n", "rng = np.random.default_rng()\n", "it = ((-x + 1, x - 2, rng.normal()) for x in range(N))\n", "sa = blosc2.fromiter(it, dtype=\"i4,f4,f8\", shape=(N,))\n", "print(sa.info)\n", "print(f\"first 3 rows of sa: {sa[:3]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** Create a 2D NDArray with shape (1000, 10_000), filled with sequential integers, using the `range` iterator. Then use `blosc2.arange` to create the same array, and check that the two arrays are equal. Use the `%time` magic command to time the two operations. 
What do you notice about the time taken? Why do you think this is happening?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Code your solution here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Reading and modifying data\n", "NDArray arrays cannot be read directly, since they are compressed, and so must be decompressed first (to NumPy arrays, which are stored in memory). This can be done for the full array using the ``[:]`` operator, which returns a NumPy array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp = array[:] # This will decompress the full array\n", "type(temp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "However, it is often not necessary (or desirable) to load the whole array into memory. We can quickly read just small parts of an NDArray array into a NumPy array via standard indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res1 = array[0] # get first row\n", "res2 = array[6:10] # get slice of rows\n", "print(f\"Got one row (of shape {res1.shape}) and slice of shape {res2.shape}.\")\n", "print(res1)\n", "print(res2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can modify the data in the array using standard NumPy indexing too, using either NumPy or NDArray arrays as the data source. For example, we can set the first row to zeros (using an NDArray array) and the first column to ones (using a NumPy array)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "array[0, :] = blosc2.zeros(10000, dtype=array.dtype)\n", "array[:, 0] = np.ones(10000, dtype=array.dtype)\n", "print(array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that ``array`` is still an NDArray array. Let's check that the entries were correctly modified." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(array[0, 0])\n", "print(array[0, :])\n", "print(array[:, 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enlarging the array\n", "Existing arrays can be enlarged. This is one operation that is greatly enhanced by the chunking procedure implemented in NDArray arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time array.resize((10_001, 10_000))\n", "print(array.shape)\n", "array[10_000, :] = 1\n", "array[10_000, :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time nparray2 = np.resize(nparray, (10_001, 10_000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Enlarging a NumPy array is very costly because the underlying data is stored contiguously in memory, so a full copy is required: new memory to hold the extended array is allocated, the old data is copied to part of the new memory, and then the new data is written to the remaining new memory.\n", "Enlarging is a much faster operation for NDArray arrays because the data is chunked, and the chunks may be stored non-contiguously in memory, so one may simply write the necessary new chunks to some arbitrary address in memory and leave the old chunks untouched. The references to the new chunk addresses are then added to the NDArray container, which is a very quick operation.\n", "\n", "You can also shrink the array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "array.resize((9_000, 10_000))\n", "print(array.shape)\n", "print(array[8_999]) # This works\n", "# array[9_000] # This will raise an exception" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Persistent data\n", "We can use the `save()` method to store the array on disk. 
This is very useful when you are working with a large array but do not need to access it often.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "array.save(\"array_tutorial.b2nd\", mode=\"w\")\n", "!ls -lh array_tutorial.b2nd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "For arrays, it is usual to use the `.b2nd` extension. Now let's open the saved array and check that the data was saved correctly (decompressing first to be able to compare):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "array2 = blosc2.open(\"array_tutorial.b2nd\")\n", "np.array_equal(array2[:], array[:]) # Decompress both, then make sure the saved array matches the original" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the ``urlpath`` parameter, we can also write directly to disk using the other constructors we saw previously." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Native constructor\n", "arr = blosc2.arange(np.prod(shape), shape=shape, dtype=np.int32, urlpath=\"arange.b2nd\", mode=\"w\")\n", "print(\"first 3 rows of arr:\", arr[:3])\n", "\n", "# Via Python iterator\n", "it = ((-x + 1, x - 2, rng.normal()) for x in range(N))\n", "sa = blosc2.fromiter(it, dtype=\"i4,f4,f8\", shape=(N,), urlpath=\"sa-1M.b2nd\", mode=\"w\")\n", "print(\"\\nfirst 3 rows of sa:\", sa[:3])\n", "\n", "# From a NumPy array\n", "b2array = blosc2.asarray(nparray, urlpath=\"linspace.b2nd\", mode=\"w\")\n", "print(\"\\nfirst 3 rows of b2array:\", b2array[:3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": 
"python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 4 }