
hdf5plugin

Goal: Ease usage of HDF5 compression filters from the Python programming language with h5py:

  • HDF5 (Hierarchical Data Format v5) is a file format for large/complex data storage and a library to manage those files.
  • h5py is a thin, pythonic wrapper around the HDF5 library.
  • Most HDF5 data compression filters are available as third-party plugins.

hdf5plugin packages a set of HDF5 compression filters (namely: blosc, bitshuffle, lz4, FCIDECOMP, ZFP, Zstandard) and makes them usable from the Python programming language through h5py.

 

Presenter: Thomas VINCENT

LEAPS-INNOV WP7 Meeting, October 11, 2021

In [2]:
from h5glance import H5Glance  # Browsing HDF5 files
H5Glance("data.h5")
Out[2]:
    • compressed_data_blosc [📋]: 1969 × 2961 entries, dtype: uint8
    • copyright [📋]: scalar entries, dtype: UTF-8 string
    • data [📋]: 1969 × 2961 entries, dtype: uint8
In [3]:
import h5py  # Pythonic HDF5 wrapper: https://docs.h5py.org/
import matplotlib.pyplot as plt  # Plotting library providing plt used below

h5file = h5py.File("data.h5", mode="r")  # Open HDF5 file in read mode
data = h5file["/data"][()]               # Access HDF5 dataset "/data"
plt.imshow(data); plt.colorbar()         # Display data
Out[3]:
<matplotlib.colorbar.Colorbar at 0x1135755f8>
In [4]:
data = h5file["/compressed_data_blosc"][()]  # Access compressed dataset
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-07c82b2002f5> in <module>
----> 1 data = h5file["/compressed_data_blosc"][()]  # Access compressed dataset

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/venv/py37env/lib/python3.7/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
    760         mspace = h5s.create_simple(selection.mshape)
    761         fspace = selection.id
--> 762         self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
    763 
    764         # Patch up the output for NumPy

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.read()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)
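
The error above is raised because the dataset was written with a filter that the HDF5 library cannot find. The following is a minimal sketch, assuming only h5py and NumPy, of how one might list the filters a dataset requires and check whether each is currently available. The helper name list_filters and the file name demo_filters.h5 are hypothetical, and the demo uses the built-in gzip filter so that it runs without any plugin:

```python
import os
import tempfile

import h5py
import numpy as np


def list_filters(dataset):
    """Return (filter_id, name, available) for each filter of a dataset."""
    plist = dataset.id.get_create_plist()  # Dataset creation property list
    filters = []
    for index in range(plist.get_nfilters()):
        filter_id, _flags, _values, name = plist.get_filter(index)
        if isinstance(name, bytes):  # The name may be returned as bytes
            name = name.decode(errors="replace")
        available = bool(h5py.h5z.filter_avail(filter_id))
        filters.append((filter_id, name, available))
    return filters


# Demo on a temporary file (hypothetical name) using the built-in gzip filter
path = os.path.join(tempfile.mkdtemp(), "demo_filters.h5")
with h5py.File(path, "w") as h5file:
    h5file.create_dataset(
        "data", data=np.zeros((64, 64), dtype=np.uint8), compression="gzip")

with h5py.File(path, "r") as h5file:
    found = list_filters(h5file["data"])
print(found)
```

Applied to "/compressed_data_blosc" before hdf5plugin is imported, such a helper would be expected to report the blosc filter identifier (32001) as not available.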

hdf5plugin usage

Reading compressed datasets

To read datasets compressed with filters that libHDF5 and h5py do not support out of the box: install hdf5plugin and import it. Importing the package is enough to make the filters available to h5py.

In [ ]:
%%bash
pip3 install hdf5plugin

Or: conda install -c conda-forge hdf5plugin

In [5]:
import hdf5plugin
In [6]:
data = h5file["/compressed_data_blosc"][()]  # Access dataset
plt.imshow(data); plt.colorbar()             # Display data
Out[6]:
<matplotlib.colorbar.Colorbar at 0x115d40828>
In [7]:
h5file.close()  # Close the HDF5 file

Writing compressed datasets

When writing datasets with h5py, compression is specified through the compression and compression_opts arguments of h5py.Group.create_dataset

In [8]:
# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()
In [9]:
# Create a compressed dataset
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    compression=32001,  # blosc HDF5 filter identifier
    compression_opts=(0, 0, 0, 0, 5, 2, 1)  # last 3 values: level=5, shuffle=2 (bitshuffle), compressor=1 (lz4)
)
h5file.close()

hdf5plugin provides helpers that make it easier to deal with compression filters and their options:

In [10]:
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
    "/compressed_data",
    data=data,
    **hdf5plugin.Blosc(
        cname='lz4',
        clevel=5,
        shuffle=hdf5plugin.Blosc.BITSHUFFLE),
)
h5file.close()
In [11]:
hdf5plugin.Blosc?
In [12]:
H5Glance("new_file_blosc_bitshuffle_lz4.h5")
Out[12]:
    • compressed_data [📋]: 1969 × 2961 entries, dtype: uint8
In [13]:
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="r")
plt.imshow(h5file["/compressed_data"][()]); plt.colorbar()
h5file.close()