
hidefix's Introduction


HIDEFIX

This Rust and Python library provides an alternative reader for HDF5 files and NetCDF4 files (which use HDF5), supporting concurrent access to the data. This is achieved by building an index of the chunks, allowing a thread to use many file handles to read the file. The original (native) HDF5 library is used to build the index, but once the index has been created it is no longer needed. The index can be serialized to disk so that the indexing does not have to be repeated.

In Rust:

use hidefix::prelude::*;

let idx = Index::index("tests/data/coads_climatology.nc4").unwrap();
let mut r = idx.reader("SST").unwrap();

let values = r.values::<f32>(None, None).unwrap();

println!("SST: {:?}", values);

or with Python using Xarray:

import xarray as xr
import hidefix

ds = xr.open_dataset('file.nc', engine='hidefix')
print(ds)
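The index built in the Rust example above can also be persisted, as mentioned in the introduction. A rough sketch of what that might look like, assuming the Index type can be serialized with serde and that bincode is available as a dependency (both are assumptions made for illustration; the actual serialization mechanism may differ, see the crate documentation):

use hidefix::prelude::*;

fn main() {
    // Build the index once; this step needs the native HDF5 library.
    let idx = Index::index("tests/data/coads_climatology.nc4").unwrap();

    // ASSUMPTION: Index implements serde's Serialize/Deserialize and bincode
    // is a dependency; this is an illustrative sketch only.
    let bytes = bincode::serialize(&idx).unwrap();
    std::fs::write("coads.idx", &bytes).unwrap();

    // Later, or from another process: load the index without touching HDF5 again.
    let bytes = std::fs::read("coads.idx").unwrap();
    let idx: Index = bincode::deserialize(&bytes).unwrap();

    let mut r = idx.reader("SST").unwrap();
    let values = r.values::<f32>(None, None).unwrap();
    println!("SST: {} values", values.len());
}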

See the example for how to use hidefix for regular, parallel or concurrent reads.
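Below is a minimal sketch of a concurrent read (not taken from the repository's examples): several threads share one index and each creates its own reader for the same variable. It assumes the Index can be shared across threads, e.g. behind an Arc.

use hidefix::prelude::*;
use std::sync::Arc;
use std::thread;

fn main() {
    // Build the index once; it is then shared (read-only) by all threads.
    let idx = Arc::new(Index::index("tests/data/coads_climatology.nc4").unwrap());

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let idx = Arc::clone(&idx);
            thread::spawn(move || {
                // Each thread creates its own reader (and file handle).
                let mut r = idx.reader("SST").unwrap();
                r.values::<f32>(None, None).unwrap().len()
            })
        })
        .collect();

    for h in handles {
        println!("read {} values", h.join().unwrap());
    }
}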

Motivation

The HDF5 library requires internal locks to be thread-safe, since it relies on internal buffers which cannot be safely accessed or written to from multiple threads. This effectively forces multi-threaded applications into sequential reads while the threads compete for the locks, and the threads also appear to interfere with each other, perhaps by dropping cached chunks that other threads still need. HDF5 can be used safely from different processes, but that potentially carries much more overhead than multi-threaded or asynchronous code.

Some basic benchmarks

hidefix is intended to perform better when concurrent reads are made, whether to the same dataset, the same file, or different files from a single process. In basic benchmarks of standard sequential reads its performance is on par with, or slightly better than, the native HDF5 library (through its Rust bindings). Where hidefix shines is when multiple threads in the same process try to read from an HDF5 file simultaneously, in any combination.

This simple benchmark tries to read a small dataset sequentially or concurrently using the cached reader from hidefix and the native reader from HDF5. The dataset is chunked, shuffled and compressed (using gzip):

$ cargo bench --bench concurrency -- --ignored

test shuffled_compressed::cache_concurrent_reads  ... bench:  15,903,406 ns/iter (+/- 220,824)
test shuffled_compressed::cache_sequential        ... bench:  59,778,761 ns/iter (+/- 602,316)
test shuffled_compressed::native_concurrent_reads ... bench: 411,605,868 ns/iter (+/- 35,346,233)
test shuffled_compressed::native_sequential       ... bench: 103,457,237 ns/iter (+/- 7,703,936)

Inspiration and other projects

This work is based in part on the DMR++ module of the OPeNDAP Hyrax server. The zarr format does something similar, and the same approach has been tested out on HDF5 as well.


hidefix's Issues

Using hidefix to determine byte ranges in HDF files?

I'm building VirtualiZarr, an evolution of kerchunk, which allows you to determine byte ranges of chunks in netCDF files and then concatenate the virtual representation of those chunks using xarray's API.

This works by creating a ChunkManifest object in-memory (one per netCDF Variable per file initially), then defining ways to merge those manifests.

What I'm wondering is if hidefix's Index class could be useful to me as a way to generate the ChunkManifest for a netCDF file without using kerchunk/fsspec (see this issue). In other words I use hidefix only to determine the byte ranges, not for actually reading the data. (I plan to actually read the bytes later using the rust object-store crate, see zarr-developers/zarr-python#1661).

Q's:

  • Is this idea dumb?
  • Does hidefix.Index contain the byte range information I'm assuming it does?
  • Can hidefix read over S3?
  • Would I be better off just using h5py directly?

cc @norlandrhagen

xref pydata/xarray#7446

TypeError: HidefixBackendEntrypoint.open_dataset() got an unexpected keyword argument 'group'

Hi there,

Been following your work so far and it is looking really promising! Just reporting this 'bug' when attempting to use the 'group' parameter in xr.open_dataset.

Minimal Working Example

Using hidefix

Download this sample HDF5 file: https://github.com/suzanne64/ATL11/raw/125ee1a653d78e6b86864b35c9d0fcfd72d64a85/ATL11_test_case/ATL11_078805_0304_02_v002.h5, and read group pt2/corrected_h.

import hidefix
import xarray as xr

ds: xr.Dataset = xr.open_dataset(
    filename_or_obj="ATL11_078805_0304_02_v002.h5",
    engine="hidefix",
    group="pt2/corrected_h",
)
print(ds)

produces this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <timed exec>:2

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/backends/api.py:1012, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
   1009     open_ = open_dataset
   1010     getattr_ = getattr
-> 1012 datasets = [open_(p, **open_kwargs) for p in paths]
   1013 closers = [getattr_(ds, "_close") for ds in datasets]
   1014 if preprocess is not None:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/backends/api.py:1012, in <listcomp>(.0)
   1009     open_ = open_dataset
   1010     getattr_ = getattr
-> 1012 datasets = [open_(p, **open_kwargs) for p in paths]
   1013 closers = [getattr_(ds, "_close") for ds in datasets]
   1014 if preprocess is not None:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/backends/api.py:570, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    558 decoders = _resolve_decoders_kwargs(
    559     decode_cf,
    560     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    566     decode_coords=decode_coords,
    567 )
    569 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 570 backend_ds = backend.open_dataset(
    571     filename_or_obj,
    572     drop_variables=drop_variables,
    573     **decoders,
    574     **kwargs,
    575 )
    576 ds = _dataset_from_backend_dataset(
    577     backend_ds,
    578     filename_or_obj,
   (...)
    588     **kwargs,
    589 )
    590 return ds

TypeError: HidefixBackendEntrypoint.open_dataset() got an unexpected keyword argument 'group'

Using h5netcdf engine (expected results)

ds: xr.Dataset = xr.open_dataset(
    filename_or_obj="ATL11_078805_0304_02_v002.h5",
    engine="h5netcdf",
    group="pt2/corrected_h",
)
print(ds)

produces

<xarray.Dataset>
Dimensions:                  (cycle_number: 2, ref_pt: 1404)
Coordinates:
  * cycle_number             (cycle_number) int64 3 4
  * ref_pt                   (ref_pt) float64 6.016e+05 6.016e+05 ... 6.086e+05
Data variables:
    delta_time               (ref_pt, cycle_number) timedelta64[ns] ...
    h_corr                   (ref_pt, cycle_number) float64 ...
    h_corr_sigma             (ref_pt, cycle_number) float64 ...
    h_corr_sigma_systematic  (ref_pt, cycle_number) float64 ...
    latitude                 (ref_pt) float64 ...
    longitude                (ref_pt) float64 ...
    quality_summary          (ref_pt, cycle_number) float64 ...

Library versions

Using hidefix=0.6.5. Other library versions from xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.167-147.601.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: 4.9.2

xarray: 2023.7.0
pandas: 1.5.1
numpy: 1.23.5
scipy: 1.9.3
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.1.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2022.11.0
distributed: 2022.11.0
matplotlib: 3.6.2
cartopy: 0.21.1
seaborn: 0.12.1
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 22.3.1
conda: None
pytest: 7.2.0
mypy: None
IPython: 8.6.0
sphinx: 4.5.0

Would be keen to help implement the 'group' reading feature somehow, but I might need a lot of pointers on how to handle the Rust bindings to figure things out!

blosc2 decompression

The dataset needs to be compressed with blosc2 first.

The reader may not need to cache the decompressed chunks since blosc doesn't need to decompress the entire chunk (I think).

Can't compile as dependencies

Compiling an empty project with the following Cargo.toml

[package]
name = "hidefix-test"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
hidefix = { version = "0.8", git = "https://github.com/gauteh/hidefix.git"}

yields a compile error:

..:~/.../hidefix-test$ cargo +nightly build -r
...
error[E0599]: no method named `chunks_visit` found for reference `&hdf5::Dataset` in the current scope
   --> /home/arenaud/.cargo/git/checkouts/hidefix-8eedc5ee6b6656b5/ecfb111/src/idx/dataset/dataset.rs:106:24
    |
106 |                     ds.chunks_visit(|ci| {
    |                     ---^^^^^^^^^^^^ help: there is a method with a similar name: `chunk_info`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `hidefix` (lib) due to previous error

Support HDF5 groups

A more proper solution, as part of #24, is to support groups properly (a rough sketch follows the list below):

  • Create a Group struct
  • Move most of Index and the hashmap of datasets to Group
  • Put a hashmap of groups in index
  • Deref<Target = Group> on Index to the root group ("/").
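A rough, hypothetical sketch of what that layout could look like (type and field names here are illustrative only, not the crate's actual API):

use std::collections::HashMap;
use std::ops::Deref;

// Placeholder for the existing per-dataset index entry.
struct Dataset;

// Hypothetical Group holding the datasets that currently live directly in Index.
struct Group {
    datasets: HashMap<String, Dataset>,
}

// Hypothetical Index holding a map of groups keyed by path, e.g. "/" or "/pt2/corrected_h".
struct Index {
    groups: HashMap<String, Group>,
}

// Deref to the root group so existing dataset lookups on Index keep working.
impl Deref for Index {
    type Target = Group;

    fn deref(&self) -> &Group {
        &self.groups["/"]
    }
}

fn main() {
    let root = Group { datasets: HashMap::new() };
    let idx = Index { groups: HashMap::from([("/".to_string(), root)]) };
    println!("root group has {} datasets", idx.datasets.len()); // field access via Deref
}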

Index of a compressed file: can maybe be used to avoid chunking

https://github.com/mattgodbolt/zindex

If efficient, this would possibly avoid the entire chunking / transpose issue. The index would still need to store zlib initialization data at intervals, so some form of chunking remains, but it would be independent of the data. Maybe it is easier to store the original data in https://www.blosc.org/, where chunking may not be needed.
