Goal: Ease the use of HDF5 compression filters from Python with h5py:
hdf5plugin packages a set of HDF5 compression filters (namely: Blosc, Bitshuffle, LZ4, FCIDECOMP, ZFP, Zstandard) and makes them usable from Python through h5py.
Presenter: Thomas VINCENT
LEAPS-INNOV WP7 Meeting, October 11, 2021
from h5glance import H5Glance # Browsing HDF5 files
H5Glance("data.h5")
import h5py # Pythonic HDF5 wrapper: https://docs.h5py.org/
h5file = h5py.File("data.h5", mode="r") # Open HDF5 file in read mode
data = h5file["/data"][()] # Access HDF5 dataset "/data"
import matplotlib.pyplot as plt
plt.imshow(data); plt.colorbar()  # Display data
<matplotlib.colorbar.Colorbar at 0x1135755f8>
data = h5file["/compressed_data_blosc"][()] # Access compressed dataset
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-07c82b2002f5> in <module>
----> 1 data = h5file["/compressed_data_blosc"][()] # Access compressed dataset

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
~/venv/py37env/lib/python3.7/site-packages/h5py/_hl/dataset.py in __getitem__(self, args, new_dtype)
    760         mspace = h5s.create_simple(selection.mshape)
    761         fspace = selection.id
--> 762         self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
    763
    764         # Patch up the output for NumPy
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.read()
h5py/_proxy.pyx in h5py._proxy.dset_rw()

OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)
hdf5plugin usage

To enable reading compressed datasets not supported by libHDF5 and h5py: install hdf5plugin and import it.
%%bash
pip3 install hdf5plugin
Or: conda install -c conda-forge hdf5plugin
import hdf5plugin
data = h5file["/compressed_data_blosc"][()]  # Access compressed dataset
plt.imshow(data); plt.colorbar() # Display data
<matplotlib.colorbar.Colorbar at 0x115d40828>
h5file.close() # Close the HDF5 file
When writing datasets with h5py, compression can be specified with h5py.Group.create_dataset:
# Create a dataset with h5py without compression
h5file = h5py.File("new_file_uncompressed.h5", mode="w")
h5file.create_dataset("/data", data=data)
h5file.close()
# Create a compressed dataset
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
"/compressed_data",
data=data,
    compression=32001,  # Blosc HDF5 filter identifier
    compression_opts=(0, 0, 0, 0, 5, 2, 1)  # 5=compression level, 2=bit-shuffle, 1=lz4
)
h5file.close()
hdf5plugin provides helpers to ease dealing with compression filters and their options:
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
"/compressed_data",
data=data,
    **hdf5plugin.Blosc(
        cname='lz4',
        clevel=5,
        shuffle=hdf5plugin.Blosc.BITSHUFFLE)
)
h5file.close()
hdf5plugin.Blosc?
H5Glance("new_file_blosc_bitshuffle_lz4.h5")
h5file = h5py.File("new_file_blosc_bitshuffle_lz4.h5", mode="r")
plt.imshow(h5file["/compressed_data"][()]); plt.colorbar()
h5file.close()
!ls -l new_file*.h5
-rw-r--r--  1 tvincent  staff  4330674 Oct 11 14:14 new_file_blosc_bitshuffle_lz4.h5
-rw-r--r--  1 tvincent  staff  5832257 Oct 11 14:14 new_file_uncompressed.h5
h5py

Compression filters provided by h5py:
- From libhdf5: "gzip" and, optionally, "szip"
- From h5py: "lzf"
- Pre-compression filter: Byte-Shuffle
h5file = h5py.File("new_file_shuffle_gzip.h5", mode="w")
h5file.create_dataset(
"/compressed_data_shuffle_gzip", data=data, shuffle=True, compression="gzip")
h5file.close()
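For "gzip", the compression level (0-9, default: 4) can also be tuned through compression_opts. A minimal sketch, assuming h5py is installed (filename and sample data are illustrative):

```python
import numpy as np
import h5py

data = np.zeros((100, 100), dtype=np.uint8)  # Highly compressible sample data

with h5py.File("new_file_shuffle_gzip9.h5", mode="w") as f:
    f.create_dataset(
        "/compressed_data_shuffle_gzip9",
        data=data,
        shuffle=True,         # Byte-Shuffle pre-compression filter
        compression="gzip",
        compression_opts=9,   # gzip compression level (0-9, default: 4)
    )

# Reopen and inspect the compression settings stored in the file
with h5py.File("new_file_shuffle_gzip9.h5", mode="r") as f:
    dset = f["/compressed_data_shuffle_gzip9"]
    comp, opts = dset.compression, dset.compression_opts

print(comp, opts)  # gzip 9
```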
hdf5plugin

Additional compression filters provided by hdf5plugin: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard.
These are 6 out of the 25 HDF5 registered filter plugins as of October 2021.
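Each of these filters has an identifier registered with The HDF Group, which is what gets passed as the raw compression= argument when no helper is used. A small reference table as a Python dict (values from the HDF Group's registered-filters list):

```python
# HDF5 registered filter identifiers for the filters shipped by hdf5plugin
FILTER_IDS = {
    "blosc": 32001,
    "lz4": 32004,
    "bitshuffle": 32008,
    "zfp": 32013,
    "zstd": 32015,
    "fcidecomp": 32018,
}

# e.g. create_dataset(..., compression=FILTER_IDS["blosc"], compression_opts=...)
print(sorted(FILTER_IDS.values()))
```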
h5file = h5py.File("new_file_bitshuffle_lz4.h5", mode="w")
h5file.create_dataset(
"/compressed_data_bitshuffle_lz4",
data=data,
**hdf5plugin.Bitshuffle()
)
h5file.close()
Blosc supports several compression algorithms: blosclz, lz4, lz4hc, snappy (optional, requires C++11), zlib, zstd.

Blosc includes pre-compression filters and algorithms provided by other HDF5 compression filters:
- Byte-Shuffle: Blosc with shuffle=hdf5plugin.Blosc.SHUFFLE
- Bitshuffle() => Blosc("lz4", 5, hdf5plugin.Blosc.BITSHUFFLE)
- LZ4() => Blosc("lz4", 9)
- Zstd() => Blosc("zstd", 2)
Some limitations in the context of libhdf5/h5py:
- Some filters only support specific data types: e.g., (u)int8 or (u)int16 for some filters; float32, float64, (u)int32, (u)int64 for others.
- Multithreading requires hdf5plugin built from source (pre-built wheels are built with OMP_NUM_THREADS=1). The filters that can use multithreading are controlled through the BLOSC_NTHREADS environment variable (Blosc) and the OMP_NUM_THREADS environment variable.
Having different pre-compression filters and compression algorithms at hand offers different trade-offs between read/write speed and compression rate (and, for lossy filters, error rate).
Also keep availability/compatibility in mind: "gzip", as included in libHDF5, is the most compatible one (as is "lzf", included with h5py).
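The speed-versus-ratio trade-off can be illustrated outside HDF5 with zlib, the algorithm behind "gzip", using Python's standard zlib module on repetitive sample data (data and levels are illustrative):

```python
import zlib

data = bytes(range(256)) * 4096  # ~1 MB of repetitive sample data

# Compress the same data at increasing compression levels
sizes = {level: len(zlib.compress(data, level)) for level in (1, 6, 9)}

# Higher levels spend more CPU time for (usually) smaller output
print(sizes)
```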
Using hdf5plugin filters with other applications

Note: In a notebook, prefixing a command with ! runs it in the shell.
!h5dump -d /compressed_data_blosc -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_blosc" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      }
   }
}
}
A solution: set the HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH.
# Directory where HDF5 compression filters are stored
hdf5plugin.PLUGINS_PATH
'/Users/tvincent/venv/py37env/lib/python3.7/site-packages/hdf5plugin/plugins'
# Retrieve hdf5plugin.PLUGINS_PATH from the command line
!python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"
/Users/tvincent/venv/py37env/lib/python3.7/site-packages/hdf5plugin/plugins
!ls `python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"`
libh5blosc.dylib libh5fcidecomp.dylib libh5zfp.dylib libh5bshuf.dylib libh5lz4.dylib libh5zstd.dylib
# Set HDF5_PLUGIN_PATH environment variable to hdf5plugin.PLUGINS_PATH
!HDF5_PLUGIN_PATH=`python3 -c "import hdf5plugin; print(hdf5plugin.PLUGINS_PATH)"` h5dump -d /compressed_data_blosc -s "0,0" -c "5,10" data.h5
HDF5 "data.h5" {
DATASET "/compressed_data_blosc" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 1969, 2961 ) / ( 1969, 2961 ) }
   SUBSET {
      START ( 0, 0 );
      STRIDE ( 1, 1 );
      COUNT ( 5, 10 );
      BLOCK ( 1, 1 );
      DATA {
      (0,0): 53, 52, 53, 54, 54, 55, 55, 56, 56, 57,
      (1,0): 49, 50, 54, 55, 53, 54, 55, 56, 56, 58,
      (2,0): 50, 51, 54, 54, 53, 55, 56, 57, 58, 57,
      (3,0): 51, 54, 55, 54, 54, 55, 56, 57, 58, 59,
      (4,0): 53, 55, 54, 54, 56, 56, 58, 57, 57, 58
      }
   }
}
}
Note: Only works for reading compressed datasets, not for writing!
hdf5plugin license

The source code of hdf5plugin itself is licensed under the MIT license.
It also embeds the source code of the provided compression filters and libraries, which are licensed under different open-source licenses (Apache, BSD-2, BSD-3, MIT, Zlib...) and copyrights.
hdf5plugin provides additional HDF5 compression filters (namely: Bitshuffle, Blosc, FciDecomp, LZ4, ZFP, Zstandard) mainly for use with h5py.
Credits to hdf5plugin contributors: Thomas Vincent, Armando Sole, @Florian-toll, @fpwg, Jerome Kieffer, @Anthchirp, @mobiusklein, @junyuewang and to all contributors of embedded libraries.
Partially funded by the PaNOSC EU-project.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823852.