
hdf5plugin

Goal: Ease usage of HDF5 compression filters from the Python programming language with h5py:

  • HDF5 (Hierarchical Data Format v5) is a file format for large/complex data storage and a library to manage those files.
  • h5py is a thin, pythonic wrapper around the HDF5 library.
  • Most HDF5 data compression filters are available as third-party plugins.

hdf5plugin packages a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and makes them usable from the Python programming language through h5py.

 

Presenter: Thomas VINCENT

LEAPS-INNOV WP7 Meeting, October 11, 2021

In [2]:
from h5glance import H5Glance  # Browsing HDF5 files
H5Glance("data.h5")
Out[2]:
    • compressed_data_blosc: 1969 × 2961 entries, dtype: uint8
    • copyright: scalar entries, dtype: UTF-8 string
    • data: 1969 × 2961 entries, dtype: uint8
In [3]:
import h5py  # Pythonic HDF5 wrapper: https://docs.h5py.org/
import matplotlib.pyplot as plt  # Plotting

h5file = h5py.File("data.h5", mode="r")  # Open HDF5 file in read mode
data = h5file["/data"][()]               # Access HDF5 dataset "/data"
plt.imshow(data); plt.colorbar()         # Display data
Out[3]:
<matplotlib.colorbar.Colorbar at 0x1135755f8>
In [4]:
data = h5file["/compressed_data_blosc"][()]  # Access compressed dataset
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-07c82b2002f5> in <module>
----> 1 data = h5file["/compressed_data_blosc"][()]  # Access compressed dataset

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/venv/py37env/lib/python3.7/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
    760         mspace = h5s.create_simple(selection.mshape)
    761         fspace = selection.id
--> 762         self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
    763 
    764         # Patch up the output for NumPy

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.read()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)

hdf5plugin usage

Reading compressed datasets

To read compressed datasets that libHDF5 and h5py do not support out of the box: install hdf5plugin and import it.

In [ ]:
%%bash
pip3 install hdf5plugin

Or: conda install -c conda-forge hdf5plugin

In [5]:
import hdf5plugin
In [6]:
data = h5file["/compressed_data_blosc"][()]  # Access dataset
plt.imshow(data); plt.colorbar()             # Display data
Out[6]:
<matplotlib.colorbar.Colorbar at 0x115d40828>
In [7]:
h5file.close()  # Close the HDF5 file

Writing compressed datasets

When writing datasets with h5py, compression can be specified through the arguments of h5py.Group.create_dataset:

In [8]:
# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()
In [9]:
# Create a compressed dataset
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    compression=32001,  # blosc HDF5 filter identifier
    compression_opts=(0, 0, 0, 0, 5, 2, 1)  # 5: compression level, 2: bit-shuffle, 1: lz4 (first four values are reserved)
)
h5file.close()

hdf5plugin provides helpers to ease dealing with compression filters and their options:

In [10]:
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    **hdf5plugin.Blosc(
        cname='lz4',
        clevel=5,
        shuffle=hdf5plugin.Blosc.BITSHUFFLE),
)
h5file.close()
In [11]:
hdf5plugin.Blosc?
In [12]:
H5Glance("new_file_blosc_bitshuffle_lz4.h5")
Out[12]:
    • compressed_data: 1969 × 2961 entries, dtype: uint8
In [13]:
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="r")
plt.imshow(h5file["/compressed_data"][()]); plt.colorbar()
h5file.close()
In [14]:
!ls -l new_file*.h5
-rw-r--r--  1 tvincent  staff  4330674 Oct 11 14:14 new_file_blosc_bitshuffle_lz4.h5
-rw-r--r--  1 tvincent  staff  5832257 Oct 11 14:14 new_file_uncompressed.h5

HDF5 compression filters

Available through h5py

Compression filters provided by h5py:

  • Provided by libHDF5: "gzip" and, if available, "szip"
  • Bundled with h5py: "lzf"

Pre-compression filter: Byte-Shuffle

In [15]:
h5file = h5py.File("new_file_shuffle_gzip.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_shuffle_gzip", data=data, shuffle=True, compression="gzip")
h5file.close()
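The byte-shuffle filter groups the i-th byte of every element together; for slowly varying multi-byte values this yields long runs of similar bytes that a generic compressor handles much better. A minimal numpy sketch of the idea (not the actual libHDF5 implementation), using zlib as a stand-in for gzip:

```python
import zlib

import numpy as np

# Slowly varying 32-bit integers: the three high bytes change rarely
data = np.arange(100_000, dtype=np.uint32) // 7

def byte_shuffle(arr):
    """Group the i-th byte of every element together (shuffle filter idea)."""
    nbytes = arr.dtype.itemsize
    return arr.view(np.uint8).reshape(-1, nbytes).T.copy()

raw = zlib.compress(data.tobytes(), 6)
shuffled = zlib.compress(byte_shuffle(data).tobytes(), 6)
# The shuffled stream compresses noticeably better than the raw one
```

This reordering is what shuffle=True requests in the cell above, applied per chunk before gzip runs.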

Provided by hdf5plugin

Additional compression filters provided by hdf5plugin: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard.

6 out of the 25 HDF5 registered filter plugins as of October 2021.

In [16]:
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data_bitshuffle_lz4",
    data=data,
    **hdf5plugin.Bitshuffle()
)
h5file.close()
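Bitshuffle pushes the same idea down to bit granularity: bit plane i of every element is stored contiguously, which pays off when only a few low-order bits actually vary. A rough numpy sketch of the principle (the real filter operates on fixed-size blocks and is heavily optimized):

```python
import numpy as np

def bit_shuffle(arr):
    """Group bit plane i of every byte together (Bitshuffle principle)."""
    bits = np.unpackbits(arr.view(np.uint8)).reshape(-1, 8)  # one row per byte
    return np.packbits(bits.T)  # bit plane 7 first, then plane 6, ...

def bit_unshuffle(shuffled, size):
    """Inverse transform: regroup the bit planes into bytes."""
    bits = np.unpackbits(shuffled).reshape(8, size)
    return np.packbits(bits.T)

data = (np.arange(10_000) % 4).astype(np.uint8)  # only the 2 low bits vary
out = bit_shuffle(data)
# Bit planes 7..2 are now long runs of zero bytes, easy to compress
```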

General purpose lossless compression

Equivalent filters

Blosc includes pre-compression filters and algorithms provided by other HDF5 compression filters:

  • HDF5 shuffle => Blosc with shuffle=hdf5plugin.Blosc.SHUFFLE
  • Bitshuffle() => Blosc("lz4", 5, hdf5plugin.Blosc.BITSHUFFLE)
  • LZ4() => Blosc("lz4", 9)
  • Zstd() => Blosc("zstd", 2)

Specific compression

A look at performance on a single use case

Benchmark

Multithreaded filter execution

Some filters can use multithreading:

  • Blosc:
    • Using a pool of threads
    • Disabled by default
    • Configurable with the BLOSC_NTHREADS environment variable
  • Bitshuffle, ZFP:
    • Using OpenMP
    • Enabled at compilation time
    • If enabled, configurable with OMP_NUM_THREADS environment variable
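These variables must be set before the filter code first runs; from Python this can be done through os.environ early in the process (the thread counts below are arbitrary examples):

```python
import os

# Must be set before the first compressed read/write triggers the filter
os.environ["BLOSC_NTHREADS"] = "4"    # size of the Blosc thread pool
os.environ["OMP_NUM_THREADS"] = "4"   # OpenMP threads for Bitshuffle / ZFP
```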

Comments on performance

Some limitations in the context of libHDF5/h5py:

  • Compressed data is accessed by "chunks", even when the compressor uses smaller blocks.
  • No multithreaded access to compressed data.
  • When reading compressed data, some memory copies could be avoided.
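The first limitation means that reading a single row still decompresses the entire chunk containing it. A small sketch with independently zlib-compressed chunks (chunk shape chosen arbitrarily) illustrates the access pattern:

```python
import zlib

import numpy as np

data = np.arange(1_000_000, dtype=np.uint32).reshape(1000, 1000)
chunk_rows = 100  # hypothetical chunk shape: (100, 1000)

# Compress each chunk independently, as HDF5 does
chunks = [zlib.compress(data[i:i + chunk_rows].tobytes())
          for i in range(0, data.shape[0], chunk_rows)]

# Reading row 42 requires decompressing all of chunk 0 (rows 0-99)
chunk0 = np.frombuffer(zlib.decompress(chunks[0]),
                       dtype=np.uint32).reshape(chunk_rows, 1000)
row42 = chunk0[42]
```

Choosing the chunk shape to match the typical access pattern is therefore important for read performance.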

Summary

Having different pre-compression filters and compression algorithms at hand offers different trade-offs between read/write speed and compression rate (and, for lossy compression, error rate).

Also keep availability/compatibility in mind: "gzip", included in libHDF5, is the most widely compatible filter (as is "lzf", bundled with h5py).

Using hdf5plugin filters with other applications

Note: In a notebook, prefixing a line with ! runs it as a shell command.

In [17]:
!h5dump -d /compressed_data_blosc -s "0,0" -c "5,10" data.h5 
HDF5 "data.h5" {
DATASET "/compressed_data_blosc" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      }
   }
}
}

Without the filter plugin, h5dump cannot decompress the dataset, hence the empty DATA block above. A solution: set the HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH.

In [18]:
# Directory where HDF5 compression filters are stored
hdf5plugin.PLUGINS_PATH
Out[18]:
'/Users/tvincent/venv/py37env/lib/python3.7/site-packages/hdf5plugin/plugins'
In [19]:
# Retrieve hdf5plugin.PLUGINS_PATH from the command line
!python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)" 
/Users/tvincent/venv/py37env/lib/python3.7/site-packages/hdf5plugin/plugins
In [20]:
!ls `python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"`
libh5blosc.dylib     libh5fcidecomp.dylib libh5zfp.dylib
libh5bshuf.dylib     libh5lz4.dylib       libh5zstd.dylib
In [21]:
# Set HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH
!HDF5_PLUGIN_PATH=`python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"` h5dump -d /compressed_data_blosc -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_blosc" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      (0,0): 53, 52, 53, 54, 54, 55, 55, 56, 56, 57,
      (1,0): 49, 50, 54, 55, 53, 54, 55, 56, 56, 58,
      (2,0): 50, 51, 54, 54, 53, 55, 56, 57, 58, 57,
      (3,0): 51, 54, 55, 54, 54, 55, 56, 57, 58, 59,
      (4,0): 53, 55, 54, 54, 56, 56, 58, 57, 57, 58
      }
   }
}
}

Note: Only works for reading compressed datasets, not for writing!
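The same environment variable can also be set from Python, for instance when launching an HDF5-based tool as a subprocess (the plugin path below is a placeholder; in practice use hdf5plugin.PLUGINS_PATH):

```python
import os
import subprocess  # to launch an external HDF5-based tool

# Placeholder path: in practice use hdf5plugin.PLUGINS_PATH
env = dict(os.environ, HDF5_PLUGIN_PATH="/path/to/hdf5plugin/plugins")

# A tool started with this environment can locate the filter plugins:
# subprocess.run(["h5dump", "-d", "/compressed_data_blosc", "data.h5"], env=env)
```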

A word about hdf5plugin license

The source code of hdf5plugin itself is licensed under the MIT license...

It also embeds the source code of the provided compression filters and libraries which are licensed under different open-source licenses (Apache, BSD-2, BSD-3, MIT, Zlib...) and copyrights.

Conclusion

hdf5plugin provides additional HDF5 compression filters (namely: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard) mainly for use with h5py.

Credits to hdf5plugin contributors: Thomas Vincent, Armando Sole, @Florian-toll, @fpwg, Jerome Kieffer, @Anthchirp, @mobiusklein, @junyuewang and to all contributors of embedded libraries.

Partially funded by the PaNOSC EU-project.

 This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823852.

Thanks for your attention! Questions?