hdfgroup / h5pyd Goto Github PK

View Code? Open in Web Editor NEW

108.0 108.0 39.0 1.82 MB

h5py distributed - Python client library for HDF Rest API

License: Other

Python 99.92% Shell 0.03% Dockerfile 0.05%

h5pyd's People

Contributors

Stargazers

Watchers

Forkers

drkenho hyperplaneorg drancom anlab kotfic alistairwalsh rsignell-usgs aashish24 shwetagopaul92 zuomatthew matteodg t-sommer powersym paulmueller murlock crerecombinase yarikoptic jacobduckett kmuehlbauer shifu-engineer qiuwei jjaraalm openedi untereiner joeykleingers redesignscience jjq-asti janilazar hickey jonrkarr grantbuster loichuder riizeron sangwonx ideaconsult simentha mattjala jwsblokland assaron

h5pyd's Issues

Implement Sessions

Use of session objects should improve performance by allow re-use of SSL connections. See: http://docs.python-requests.org/en/master/user/advanced/#session-objects.

Determining open status of remote file

In h5py, I use the file id to determine if a file is open or not. So defining this function:

def isopen(f):
    if f:
        print('File is open')
    else:
        print('File is closed')

I get the following with a local HDF5 file.

>>> a=h5.File('chopper.nxs')
>>> isopen(a)
File is open
>>> a.close()
>>> isopen(a)
File is closed

With h5pyd, I get the following:

>>> a=h5d.File('chopper.exfac', mode='r', endpoint='http://some.server:5000')
>>> isopen(a)
File is open
>>> a.close()
>>> isopen(a)
File is open

I can use the File id.uuid property instead, since that is set to 0 when the file is closed, but the current behavior is not fully compatible with h5py.

Add releases

Even something preliminary, like release candidates or alpha or beta or whatever, would be useful.

Reasons for:

it's a prerequisite to #11
it's useful for maintaining dockerfiles that install this

Get dimension scales working for netcdf4 access

In order to use h5netcdf as a netcdf4 interface on top of h5pyd, we first need dimensions scales working, which will enable the shared dimensions in netcdf4.

Support fill values

fill_value dataset creation properties are not implemented.

Release?

Seems there is mention of pip install as a mechanism of installation. Though I don't see it on PyPI and there are tags on the repo. Can a release be made?

Fancy indexing is not supported

@jreadey , I think you were aware of it since you mentioned Coordinate list (dset[(x,y,z),:]) not being supported yet (not sure if you actually meant list). But I considered it as the one of most used selection type that is worth the efforts to be added.

ds_local[1,[1,3,5]]
Out[64]: array([ 0.,  0.,  0.], dtype=float32)
ds_remote[1,[1,3,5]]
Traceback (most recent call last):
  File "/home/wjiang2/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-65-bc569229dc91>", line 1, in <module>
    ds_remote[1,[1,3,5]]
  File "/home/wjiang2/.local/lib/python3.6/site-packages/h5pyd/_hl/dataset.py", line 796, in __getitem__
    raise ValueError("selection type not supported")
ValueError: selection type not supported

Support ACL ops

Add support for ACL operations.

hsls , but hsdel ?

This is just a minor annoyance, but thought I'd mention it anyway:

Why Linux style hsls, but Windows style hsdel?

Would have thought Linux like: hsls and hsrm
or Windows like: hsdir and hsdel
or would be more consistent...

Round-tripping with hsload and hsget

@jreadey, you used hsload to put our Hurricane Sandy netcdf4 file on HSDS:

(IOOS) rsignell@0e6be50c3dc2:~$ hsls /home/john/sandy.nc/

john                            domain   2017-09-07 22:11:07 /home/john/sandy.nc
1 items

If I try to use hsget to get that dataset back, I get errors:

(IOOS) rsignell@0e6be50c3dc2:~$ hsget /home/john/sandy.nc sandy.nc
2017-10-14 14:00:39,424 ERROR: failed to create dataset: Scalar datasets don't support chunk/filter options
ERROR: failed to create dataset: Scalar datasets don't support chunk/filter options
2017-10-14 14:01:50,324 ERROR: failed to create dataset: Scalar datasets don't support chunk/filter options

And although I do end up with a sandy.nc file, if I try to ncdump it, it doesn't work (see below). I guess that is not too surprising in light of #32, right?

But do you think one day we will be able to round-trip a dataset using hsload and hsget?


(IOOS) rsignell@0e6be50c3dc2:~$ ncdump -h sandy.nc
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 140414440146688:
  #000: H5L.c line 1183 in H5Literate(): link iteration failed
    major: Symbol table
    minor: Iteration failed
  #001: H5Gint.c line 844 in H5G_iterate(): error iterating over links
    major: Symbol table
    minor: Iteration failed
  #002: H5Gobj.c line 708 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #003: H5Gstab.c line 566 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #004: H5B.c line 1221 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #005: H5B.c line 1177 in H5B_iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #006: H5Gnode.c line 1039 in H5G__node_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 140414440146688:
  #000: H5L.c line 1183 in H5Literate(): link iteration failed
    major: Symbol table
    minor: Iteration failed
  #001: H5Gint.c line 844 in H5G_iterate(): error iterating over links
    major: Symbol table
    minor: Iteration failed
  #002: H5Gobj.c line 708 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #003: H5Gstab.c line 566 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #004: H5B.c line 1221 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #005: H5B.c line 1177 in H5B_iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #006: H5Gnode.c line 1039 in H5G__node_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
ncdump: sandy.nc: NetCDF: HDF error
(IOOS) rsignell@0e6be50c3dc2:~$

Internal Server Error when reading boolean values

I have an HDF5 file that contains several datasets containing boolean values, both scalar and arrays, along with many other datasets and groups. Trying to read these using h5pyd returns an Internal Server Error, which doesn't seem to happen with datasets of other types. Here is a trace:

>>> import h5pyd as h5
>>> a=h5.File('mullite_300K.mullite.exfac', mode='r', endpoint='http://some.server:5000')
>>> a['/f1/instrument/detector/pixel_mask']
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-16-2a29cb1cdcaa> in <module>()
----> 1 a['/f1/instrument/detector/pixel_mask_applied']

/Users/rosborn/anaconda/envs/py27/lib/python2.7/site-packages/h5pyd/_hl/group.pyc in __getitem__(self, name)
    314         if link_class == 'H5L_TYPE_HARD':
    315             #print "hard link, collection:", link_json['collection']
--> 316             tgt = getObjByUuid(link_json['collection'], link_json['id'])
    317         elif link_class == 'H5L_TYPE_SOFT':
    318             h5path = link_json['h5path']

/Users/rosborn/anaconda/envs/py27/lib/python2.7/site-packages/h5pyd/_hl/group.pyc in getObjByUuid(collection_type, uuid)
    287             elif link_json['collection'] == 'datasets':
    288                 req = "/datasets/" + uuid
--> 289                 dataset_json = self.GET(req)
    290                 tgt = Dataset(DatasetID(self, dataset_json))
    291             else:

/Users/rosborn/anaconda/envs/py27/lib/python2.7/site-packages/h5pyd/_hl/base.pyc in GET(self, req, format)
    522 
    523         if rsp.status_code != 200:
--> 524             raise IOError(rsp.reason)
    525         if rsp.headers['Content-Type'] == "application/octet-stream":
    526             self.log.info("returning binary content, length: " +

IOError: Internal Server Error

Compression not handled

I get the impression that 'gzip' compression is not taking effect. I set the compression when creating the dataset.
My file size has doubled after adding one entry, which would be about 1/20 the size of my entire file.

grpOut.create_dataset("C_PEAT",data=PEAT_emissions, compression="gzip")

This is the same method I used to create my entire file originally.

Setup travis CI

Create Travis test script.

"pip install h5pyd" installs v0.3.3 instead of 0.4.0

Running pip install h5pyd installs version 0.3.3 instead of the current 0.4.0.

$ pip --no-cache-dir install h5pyd
Collecting h5pyd
  Using cached https://files.pythonhosted.org/packages/4e/00/513f05db05e5dc3b599f541b042d8b47f9ec7c4ca62312f92fd33e11b607/h5pyd-0.3.3.tar.gz
Requirement already satisfied: ...
...
...
Building wheels for collected packages: h5pyd
  Building wheel for h5pyd (setup.py) ... done
...
Successfully built h5pyd
Installing collected packages: h5pyd
Successfully installed h5pyd-0.3.3

Other useful information:

$ python --version
Python 3.6.8
$ pip --version
pip 19.1.1
$ pip search h5pyd
h5pyd (0.4.0)  - h5py compatible client lib for HDF REST API
  INSTALLED: 0.3.3
  LATEST:    0.4.0

I know pip has had its issues over time but this is for a fairly fresh Python virtual environment and the latest pip.

How do I write the configuration file after installing h5pyd

What is the profile name, under which file, and what is the save type of the file

Implement object refs

Implement Obj refs.

The `-` operator is not supported in numpy (kind of)

For me, code like the following:

# specify a chunk layout
f.create_dataset("chunked_data", (1024,1024,1024), dtype='f4',chunks=(1,1024,1024))
dset = f["chunked_data"]
dset.chunks

Is producing the following numpy type error:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-1e2d51a33867> in <module>()
      1 # specify a chunk layout
----> 2 f.create_dataset("chunked_data", (1024,1024,1024), dtype='f4',chunks=(1,1024,1024))
      3 dset = f["chunked_data"]
      4 dset.chunks

~/src/h5pyd/h5pyd/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    148 
    149         with phil:
--> 150             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    151             dset = dataset.Dataset(dsid)
    152 

~/src/h5pyd/h5pyd/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
    116     tmp_shape = maxshape if maxshape is not None else shape
    117     # Validate chunk shape
--> 118     if isinstance(chunks, tuple) and (-numpy.array([ i>=j for i,j in zip(tmp_shape,chunks) if i is not None])).any():
    119         errmsg = "Chunk shape must not be greater than data shape in any dimension. "\
    120                  "{} is not compatible with {}".format(chunks, shape)

TypeError: The numpy boolean negative, the `-` operator, is not supported, use the `~` operator or the logical_not function instead.

I am running python 3.6.1 and numpy 1.13.1 on archlinux inside of a virtual environment.

Use of the '-' operator is apparently being debated. Either way it looks like numpy is trying to move away from its usage.

Selection consistency between distributed and local h5py?

Working with the tall data distributed with h5pyd, a simple selection generates invalid point argument error; below, the same operation succeeds with local hdf5 resource. Is there a reference on essential discrepancies between the two approaches?

%vjcair> python
Python 2.7.12 (default, Nov 17 2016, 17:26:31)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

import h5pyd as h5py
f = h5py.File("tall.data.hdfgroup.org", "r", endpoint="https://data.hdfgroup.org:7258")
g2 = f['g2']
dset22 = g2['dset2.2']
dset22[[1,2]]
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/site-packages/h5pyd-0.1.0-py2.7.egg/h5pyd/_hl/dataset.py", line 664, in getitem
raise ValueError("invalid point argument")
ValueError: invalid point argument
h5py.version.version
'0.0.1'

import h5py
f = h5py.File("tall.h5")
g2 = f['g2']
dset22 = g2['dset2.2']
dset22[[1,2]]
array([[ 0. , 0.2 , 0.40000001, 0.60000002, 0.80000001],
[ 0. , 0.30000001, 0.60000002, 0.89999998, 1.20000005]], dtype=float32)
h5py.version.version
'2.7.0'

Unable to open object (Component not found)

have a file f3.h5 with s single uint8 dataset called '/raw'
I start the h5serv docker image and it finds f3.h5 while building the toc
when I try to access the dataset in python, I get a KeyError:

>>> import h5pyd as h5py
>>> a = h5py.File('f3.hdfgroup.org',mode='r',endpoint='http://localhost:5001')
>>> a['/raw']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/h5pyd/_hl/group.py", line 397, in __getitem__
    parent_uuid, link_json = self.get_link_json(name)
  File "/usr/local/lib/python2.7/dist-packages/h5pyd/_hl/group.py", line 303, in get_link_json
    raise KeyError("Unable to open object (Component not found)")
KeyError: 'Unable to open object (Component not found)'

...am I missing something? I looked at the read_example.py in the examples folder of this repo and it seemed to me it should work like above.

When I try to get the items that are in there I get

>>> a.items()
[(u'f3', <HDF5 group "/" (1 members)>)]

but both seem to be infinite loops:

>>> a.items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['f3'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['f3']['f3'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['f3']['f3']['f3'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['f3']['f3']['f3']['f3'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]

>>> a.items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['/'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['/']['/'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['/']['/']['/'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>> a['/']['/']['/']['/'].items()
[(u'f3', <HDF5 group "/" (1 members)>)]
>>>

region selection failed on size of 10000

It worked ok for the smaller region size.

(xstart, xend)
Out[155]: (15986, 25986)
(ystart, yend)
Out[156]: (59448, 69448)
vals3 = ds_remote[xstart:xend, ystart:yend]
Traceback (most recent call last):
  File "/home/wjiang2/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-153-2ab7845b65c0>", line 1, in <module>
    vals3 = ds_remote[xstart:xend, ystart:yend]
  File "/home/wjiang2/.local/lib/python3.6/site-packages/h5pyd/_hl/dataset.py", line 759, in __getitem__
    page_arr = numpy.reshape(arr1d, page_mshape)
  File "/app/python3/3.6.0/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 232, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/app/python3/3.6.0/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 10498 into shape (10000,10000)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wjiang2/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2850, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/wjiang2/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2927, in run_code
    self.showtraceback(running_compiled_code=True)
TypeError: showtraceback() got an unexpected keyword argument 'running_compiled_code'

Incorrect domains on external links

We have an h5serv server running, and loading regular HDF5 files works well. However, if the file contains an external link, it cannot access the external file because the file path is not converted to a valid domain.

Here's some output when accessing a file with path mullite/mullite_300K.nxs with respect to the h5serv datapath, with the root domain name of exfac (the config file sets the file extension to be .nxs on this server):

>>> b=h5pyd.File('mullite_300K.mullite.exfac',  mode='r',  endpoint='http://some.server:5000')
>>> c=b['/entry/transform/v']

KeyErrorTraceback (most recent call last)
<ipython-input-5-610e75ab34f0> in <module>()
----> 1 c=b['/entry/transform/v']

/Users/rosborn/anaconda/envs/py27/lib/python2.7/site-packages/h5pyd/_hl/group.pyc in __getitem__(self, name)
    327             except IOError:
    328                 # unable to find external link
--> 329                 raise KeyError("Unable to open file: " + link_json['h5domain'])
    330             return f[link_json['h5path']]
    331 

KeyError: u'Unable to open file: 300K/transform.nxs'

Presumably, h5pyd should convert the external file path to a valid domain string. In this case, the file path is relative to the parent HDF5 file - I'm not sure what a correct domain name would be if the file path was absolute.

Recreating a dataset name doesn't raise error, results in orphaned dataset at `/db/{datasets}/`

If I create a dataset with create_dataset, write to it, and then call create_dataset again with the same name, then (a) no error occurs and (b) it seems like the old one is moved to /__db__/{datasets}/. Behavior (a) differs from h5py, which I think throws a KeyError when a dataset already exists of the name given to create_dataset.

To me (b) seems like a "dataset leak". Is this a feature or a bug?

I'd expect one of two behaviors:

Keep the pre-existing dataset and throw a KeyError (like h5py, I think)
delete the pre-existing dataset, create a new one

FillValue no longer working

I upgraded to the latest h5pyd, and now the fillvalues in my dataset are not working:
https://gist.github.com/rsignell-usgs/c3555fd60c391699197d53dcd0cb007c

The fill values in the original netcdf4 file were 1.0e37 but after writing with h5load the new h5pyd thinks the fillvalue is 0.0.

This same notebook worked at the ESIP summer meeting.

Code ran ok on Sep 21 commit but display IOError: [Errno 403] Forbidden with latest build

Dear John,

I have encountered an error after pulling the latest codes.

File "/usr/local/lib/python2.7/dist-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/files.py", line 161, in init
raise IOError(rsp.status_code, rsp.reason)
IOError: [Errno 403] Forbidden

When I reverted it back to b03d59b commit, there is no problem.

Not sure what exactly changed and I don't seem to find any changes on line 161 in _hl/files.py.

Would you be able to give me some clue or suggestion to resolve this?

Thanks
Ken

IOError "Bad Request: invalid linkname, '/' not allowed" for create_dataset

My input dataset does not yet have a group called 2015. I am not able to create a dataset inside of it. In h5py, I was able to do so without first creating the dataset.

fileOut.create_dataset('2015/newdset',data=newdata,compression='gzip')

  File "...\h5pyd\_hl\group.py", line 148, in create_dataset
    self[name] = dset
  File "...\h5pyd\_hl\group.py", line 420, in __setitem__
    self.PUT(req, body=body)
  File "...\h5pyd\_hl\base.py", line 418, in PUT
    raise IOError(rsp.reason)
IOError: Bad Request: invalid linkname, '/' not allowed

create_group clears file

I ran the following code on my file. The new group gets added, however all of the old data in the file disappears. The file size is now 8kb from previous 50mb, so it definitely got wiped.

fileOut = h5py.File("My_File.hdfgroup.org","w")
fileOut.create_group('2015')

This is a file that previously had ACL authentication, which I removed. Before removing it I got an IOError and no changes were made to the file.

Use == for comparisons

You can't compare tuples with is since they will be given a new ID every time you construct a tuple.

>>> (Ellipsis,) is (Ellipsis,)
False
>>> (Ellipsis,) == (Ellipsis,)
True

The error is here in the code:

h5pyd/h5pyd/_hl/dataset.py

Line 581 in 8f92ac1

if args is (Ellipsis,) or args is tuple():

Implement RegionRefs

Implement regionrefs

Try dask on top of h5pyd

Try dask on h5pyd instead of h5py to see if there are issues.

setup.py chokes on required pkgconfig

I'm trying to add h5pyd to a project's requirements.txt, however, when pip install -r requirements.txt I get the following:

Obtaining h5pyd from git+http://github.com/HDFGroup/[email protected]#egg=h5pyd (from -r requirements.txt (line 61))
  Updating ./env/src/h5pyd clone (to v0.3.0)
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.python.org/simple/pkgconfig/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
    Couldn't find index page for 'pkgconfig' (maybe misspelled?)
    Download error on https://pypi.python.org/simple/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
    No local packages or working download links found for pkgconfig
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/src/h5pyd/setup.py", line 43, in <module>
        'hsconfigure = h5pyd._apps.hsconfigure:main']
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/setuptools/__init__.py", line 128, in setup
        _install_setup_requires(attrs)
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/setuptools/__init__.py", line 123, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/setuptools/dist.py", line 455, in fetch_build_eggs
        replace_conflicting=True,
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 866, in resolve
        replace_conflicting=replace_conflicting
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1146, in best_match
        return self.obtain(req, installer)
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1158, in obtain
        return installer(requirement)
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/setuptools/dist.py", line 522, in fetch_build_egg
        return cmd.easy_install(req)
      File "/Users/nlaws/projects/reopt/webtool/reopt_api/env/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 667, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pkgconfig')

Does the version of pkgconfig need to be specified in setup.py (for the setup_requires arg)? If so, which version?

Creating a new dataset containing byte strings does not work w/ data param

array of byte strings to be loaded into h5pyd
time_index = pd.date_range('2016-01-01 00:30:00', '2016-12-31 23:30:00', freq='h')
time_index = np.array(time_index.astype(str), dtype='S20')
Loading using the data param in create_dataset
with h5pyd.File('/home/mrossol/nsrdb_tmy.h5', 'w') as f:
f.create_dataset('time_index', time_index.shape, dtype=time_index.dtype, data=time_index)
Produces the following error:
/anaconda/lib/python3.6/json/encoder.py in default(self, o)
178 “”"
179 raise TypeError(“Object of type ‘%s’ is not JSON serializable” %
--> 180 o.class.name)
181
182 def encode(self, o):

TypeError: Object of type ‘bytes’ is not JSON serializable

If you create the dataset and then load the array it works:
with h5pyd.File('/home/mrossol/nsrdb_tmy.h5', 'w') as f:
t_index = f.create_dataset('time_index', time_index.shape, dtype=time_index.dtype)

t_index[...] = time_index

xarray can't identify time units in HSDS dataset

In this notebook
https://gist.github.com/rsignell-usgs/07143a5ab54afb8ad6eb1af255d025c9
we use xarray to open a local netcdf4 file and then the same dataset that was 'hsload'ed to hsds.

xarray automatically recognized the CF-compliant time units and converts the time coordinate to datetime so that the plot is correctly labeled in cell [6].

But time is not recognized for the the HSDS dataset plot in cell [5].

Any idea what the problem is?

typo in latest release

v0,2.8 should be renamed v0.2.8

mode ='r' seems to have no effect

f_remote = h5pyd.File(domain, "r")
ds_remote = f_remote["/data"]

ds_remote is still writable and I've verified it by f_remote.close() and reopening it.

Shape inconsistency with h5py

If I load a two-dimensional slab from a three-dimensional array, I get a numpy array with ndim=2 in h5py, but ndim=3 in h5pyd, with one of the dimensions of size 1.

The following accesses the same file on the remote server and stored locally:

>>> import h5pyd as h5d
>>> a=h5d.File('mullite_300K.mullite.exfac', mode='r', endpoint='http://some.server:5000')
>>> a['/entry/transform/v'][400].shape
(1, 901, 901)
>>> import h5py as h5
>>> b=h5.File('mullite/mullite_300K.nxs')
>>> b['/entry/transform/v'][400].shape
(901, 901)

[Errno 503] Service Unavailable

My set-up is:

Windows 10 with "Bash on Windows" (WSL), Ubuntu Xenial.
HDF Server (h5serv) running on WSL on default port 5000, exposing "Novartis" dataset.
HDF Server is accessible from Windows browser:
http://localhost:5000/
{"root": "ddfa84c2-d5bc-11e7-bcec-d43d7e31e165", "lastModified": "2017-11-30T10:54:41Z", "created": "2017-11-30T10:54:41Z", "hrefs": [{"href": "http://localhost:5000/", "rel": "self"}, {"href": "http://localhost:5000/datasets", "rel": "database"}, {"href": "http://localhost:5000/groups", "rel": "groupbase"}, {"href": "http://localhost:5000/datatypes", "rel": "typebase"}, {"href": "http://localhost:5000/groups/ddfa84c2-d5bc-11e7-bcec-d43d7e31e165", "rel": "root"}]}
HDF Server is accessible via h5pyd from WSL:

>>> f = h5pyd.File('', 'r')
>>> print(list(f))
['Novartis']

The HDF Server is inaccessible via h5pyd from Windows:

File "C:\Program Files\Python 3.5\lib\site-packages\h5pyd\_hl\files.py", line 161, in __init__
    raise IOError(rsp.status_code, rsp.reason)
OSError: [Errno 503] Service Unavailable

Any pointers how to debug and eliminate the issue are appreciated!

Link for "Reporting Issues" on Readme is invalid

Requests fail for some firewall configurations

Some firewall software will alter the host header in requests sent to the server, resulting in the operation to fail.

Cannot access compound dataset which contains array of enum

I am trying to access a dataset which contains an enum array via h5serv, however h5pyd throws the following exception:

  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/group.py", line 335, in __getitem__
    tgt = getObjByUuid(link_json['collection'], link_json['id'])
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/group.py", line 311, in getObjByUuid
    tgt = Dataset(DatasetID(self, dataset_json))
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/dataset.py", line 416, in __init__
    self._dtype = createDataType(self.id.type_json)
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/h5type.py", line 725, in createDataType
    dt = createDataType(field['type'])  # recursive call
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/h5type.py", line 732, in createDataType
    dtRet = createBaseDataType(typeItem)  # create non-compound dt
  File "$HOME/project/venv/lib/python2.7/site-packages/h5pyd-0.2.6-py2.7.egg/h5pyd/_hl/h5type.py", line 638, in createBaseDataType
    raise TypeError("Array Type base type must be integer, float, or string")
TypeError: Array Type base type must be integer, float, or string

We can create a minimal dataset to reproduce the error using h5py as follows:

import h5py
import numpy as np

f = h5py.File('test.h5', 'w')
enum_type = h5py.special_dtype(enum=('i', {"FOO": 0, "BAR": 1, "BAZ": 2}))
comp_type = np.dtype([('my_enum_array', enum_type, 10), ('my_int', 'i'), ('my_string', np.str_, 32)])
dataset = f.create_dataset("test", (4,), comp_type)
f.close()

We then put it in h5serv's data directory and try to access it:

import h5pyd
f = h5pyd.File("test.hdfgroup.org", endpoint="http://127.0.0.1:5000")
print(f['test'])

This yields the above exception. Note that we are able to access the dataset as expected using regular h5py.

Applying the following patch to h5pyd prevents the exception and returns a dataframe, however it doesn't seem to give the correct behavior (the enum array seems to be treated as an int array):

diff --git a/h5pyd/_hl/h5type.py b/h5pyd/_hl/h5type.py
index 4ce6cb4..10ce562 100644
--- a/h5pyd/_hl/h5type.py
+++ b/h5pyd/_hl/h5type.py
@@ -637 +637 @@ def createBaseDataType(typeItem):
-            if arrayBaseType["class"] not in ('H5T_INTEGER', 'H5T_FLOAT', 'H5T_STRING'):
+            if arrayBaseType["class"] not in ('H5T_INTEGER', 'H5T_FLOAT', 'H5T_STRING', 'H5T_ENUM'):

I'm not sure how to properly proceed in working around this. Thanks in advance for your advice.

Speed improvements to loading HDF5 trees

In the nexusformat API, we load the entire HDF5 file tree by recursively walking through the groups in h5py, without reading in data values except for scalars and small arrays. On a local file, we can load files containing hundreds of objects without a significant time delay. For example, a file with 80 objects (groups, datasets, and attributes) takes 0.05s to load on my laptop. However, on h5pyd, the same load takes over 20s.

A call to load all the items in an HDF5 group requires two GET requests, and sometimes three, for each object, so there could be an improvement if all the metadata (shape, dtype, etc.) for each object were returned in a single call, and an even more significant one if all the items in a group could be returned with one GET request. Loading one group of 10 objects took 29 requests in my tests.

Binary data reads are fast, though.

And env based endpoint/user/etc.

Pull endpoint, user, password, etc. from env variable or config file if not set explicitly.

Submit package to pypi

Add package to cheeseshop!

Try xarray/dask/h5netcdf on top of h5pyd

h5netcdf is a pythonic interface to netcdf4 files using h5py.

It would be super cool to try h5netcdf on top of h5pyd instead.

If that worked we could try xarray with dask on top of h5pyd.

And if that worked, it would be amazing....

Client parameter 'domain' does not match server parameter 'host'

According to server restful api spec, parameter 'host' is needed. but in h5pyd, parameter seems to modified as 'domain'. It does now work.

pip install fails

Hi
I am trying to get h5pyd up and running on the h5serv docker image available here:

https://hub.docker.com/r/hdfgroup/h5serv/

Running the pip install command as documented gives the following:

# pip install h5pyd
Collecting h5pyd
  Could not find a version that satisfies the requirement h5pyd (from versions: )
No matching distribution found for h5pyd

This looks like it's trying to use versions from a local requirements.txt file, but it does not exist. Not quite sure whether this is a pip or h5pyd issue.

Thanks

handle 413 errors in point selection

coords[1:3]
Out[100]: [(441, 82852), (441, 88209)]
len(coords)
Out[101]: 2500

data = ds_remote[coords]
Traceback (most recent call last):
  File "/home/wjiang2/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-99-6ac77acf88d2>", line 1, in <module>
    data = ds_remote[coords]
  File "/home/wjiang2/.local/lib/python3.6/site-packages/h5pyd/_hl/dataset.py", line 848, in __getitem__
    rsp = self.POST(req, body=body)
  File "/home/wjiang2/.local/lib/python3.6/site-packages/h5pyd/_hl/base.py", line 477, in POST
    raise IOError(rsp.reason)
OSError: Request Entity Too Large

Dask tests

As requested here by @mrocklin: pangeo-data/pangeo#75 (comment)

Try XArray + Dask locally on the HSDS data to verify that it can be accessed concurrently from multiple threads
Try XArray + Dask.distributed locally on the HSDS data to verify that the h5pyd objects can survive being serialized
Try everything on a distributed cluster using KubeCluster and then look at the performance of scalable computing
Try this all again on a cluster on S3, where presumably we would expect 100-200MB/s network access from each node.

requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)