hdfgroup / hsds Goto Github PK

View Code? Open in Web Editor NEW

121.0 20.0 52.0 7.39 MB

Cloud-native, service based access to HDF data

Home Page: https://www.hdfgroup.org/solutions/hdf-kita/

License: Apache License 2.0

Python 88.55% HTML 10.23% Shell 0.77% Dockerfile 0.08% Batchfile 0.01% C 0.36%

data-analysis aws docker hdf5 python asyncio scientific-data multi-dimensional

hsds's Introduction

HSDS (Highly Scalable Data Service) - REST-based service for HDF5 data

Introduction

HSDS is a web service that implements a REST-based web service for HDF5 data stores. Data can be stored in either a POSIX files system, or using object-based storage such as AWS S3, Azure Blob Storage, or MinIO. HSDS can be run a single machine with or without Docker or on a cluster using Kubernetes (or AKS on Microsoft Azure).

Quick Start

With Github codespaces

Launch a Codespaces environment by clicking the banner "Open in GitHub Codespaces". Once the codespace is ready, type: python testall.py in the terminal window to run the test suite.

On your desktop/laptop

Make sure you have Python 3 and Pip installed, then:

Run install: $ ./build.sh --nolint from source tree OR install from pypi: $ pip install hsds
Create a directory the server will use to store data, example: $ mkdir ~/hsds_data
Start server: $ hsds --root_dir ~/hsds_data
Run the test suite. In a separate terminal run:
- Set user_name: $ export USER_NAME=$USER
- Set user_password: $ export USER_PASSWORD=$USER
- Set admin name: $ export ADMIN_USERNAME=$USER
- Set admin password: $ export ADMIN_PASSWORD=$USER
- Run test suite: $ python testall.py --skip_unit
(Optional) Install the h5pyd package for an h5py compatible api and tool suite: https://github.com/HDFGroup/h5pyd
(Optional) Post install setup (test data, home folders, cli tools, etc): docs/post_install.md

To shut down the server, and the server is not running in Docker, just control-C.

If using docker, run: $ ./stopall.sh

Note: passwords can (and should for production use) be modified by changing values in hsds/admin/config/password.txt and rebuilding the docker image. Alternatively, an external identity provider such as Azure Active Directory or KeyCloak can be used. See: docs/azure_ad_setup.md for Azure AD setup instructions or docs/keycloak_setup.md for KeyCloak.

Detailed Install Instructions

On AWS

For complete instructions to install on a single Azure VM with Docker:

See: docs/docker_install_aws.md

For complete instructions to install on AWS Kubernetes Service (EKS):

See: docs/kubernetes_install_aws.md

For complete instructions to install on AWS Lambda:

See: docs/aws_lambda_setup.md.

On Azure

For complete instructions to install on a single Azure VM with Docker:

See: docs/docker_install_azure.md

For complete instructions to install on Azure Kubernetes Service (AKS):

See: docs/kubernetes_install_azure.md

On Prem (POSIX-based storage)

For complete instructions to install on a desktop or local server:

See: docs/docker_install_posix.md

On DCOS (BETA)

For complete instructions to install on DCOS:

See: docs/docker_install_dcos.md

General Install Topics

Setting up docker:

See docs/setup_docker.md

Post install setup and testing:

See docs/post_install.md

Authorization, ACLs, and Role Based Access Control (RBAC):

See docs/authorization.md

Running serverless with h5pyd:

See https://github.com/HDFGroup/h5pyd/blob/master/README.rst

Writing Client Applications

As a REST service, clients be developed using almost any programming language. The test programs under: hsds/test/integ illustrate some of the methods for performing different operations using Python and HSDS REST API (using the requests package).

The related project: https://github.com/HDFGroup/h5pyd provides a (mostly) h5py-compatible interface to the server for Python clients.

For C/C++ clients, the HDF REST VOL is a HDF5 library plugin that enables the HDF5 API to read and write data using HSDS. See: https://github.com/HDFGroup/vol-rest. Note: requires v1.12.0 or greater version of the HDF5 library.

Uninstalling

HSDS only modifies the storage location that it is configured to use, so to uninstall just remove source files, Docker images, and S3 bucket/Azure Container/directory files.

Reporting bugs (and general feedback)

Create new issues at http://github.com/HDFGroup/hsds/issues for any problems you find.

For general questions/feedback, please use the HSDS forum: https://forum.hdfgroup.org/c/hsds.

License

HSDS is licensed under an APACHE 2.0 license. See LICENSE in this directory.

Azure Marketplace

VM Offer for Azure Marketplace. HSDS for Azure Marketplace provides an easy way to setup a Azure instance with HSDS. See: https://azuremarketplace.microsoft.com/en-us/marketplace/apps/thehdfgroup1616725197741.hsdsazurevm?tab=Overview for more information.

Websites

Main website: https://www.hdfgroup.org/solutions/highly-scalable-data-service-hsds/
Source code: https://github.com/HDFGroup/hsds
Forum: https://forum.hdfgroup.org/c/hsds
Documentation: http://h5serv.readthedocs.org (For REST API)

Other useful resources

Papers

restfulSE: A semantically rich interface for cloud-scale genomics with Bioconductor: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6392152
RESTful HDF5 White Paper: https://www.hdfgroup.org/pubs/papers/RESTful_HDF5.pdf

hsds's People

Contributors

Stargazers

Watchers

Forkers

tverbeiren paulmueller pjedwards whigg murlock loricooperhdf pcrespov jonosgit qiuwei t20100 loichuder jjaraalm s004pmg johnluck77 kadriumay bluequartzsoftware alai-arpas derobins fanszoro tysen untereiner brucevault-forks k3-communications olshansk farnoldius jeffreyjharris janilazar cjnygard redesignscience biosimulations bilalshaikh42 benfre riizeron lkurz holzman mbruggs sangwonx webclinic017 tmanhente almahmoud hyoklee jrhodes-gst trmt sofdesk ozanerturk sensemore mattjala harryjmoss dimak415 alisonaquinas assaron

hsds's Issues

Tool to update references to link files

Often times after a file is load with hsload -link, the storage location of the linked files changes (e.g. moved to a different bucket). Need a tool to update the HSDS domain without re-loading the entire file.

Don't write to bucket on startup

Rather than writing the cluster id to the "storeinfo" object, just have the head node maintain the cluster id. Move the logic to write to storeinfo into the DN nodes on first write.
This will remove the need to have an additional bucket ("SYS_BUCKET_NAME") when the primary bucket ("BUCKET_NAME") is a read-only resource.

Support Domain Checksums

Enable domain checksums - an aggregation of all ETag values for all objects within a domain.
These will be created asynchronously using the same process as with domain info.

ACLs are only set at the domain level

As far as I can tell, ACLs are only settable at the domain and not object level. This seems like a regression compared to h5serv where I believe they were settable at the object level. This is also not very obvious based on the h5pyd API were ACLs are settable/gettable on objects.

Are there any architectural reasons we can't have object-level ACLs?

HSDS Authentication via Keycloak (OpenID)

I'm trying to use OpenID (via Keycloak) as an authentication provider for our HSDS instance.
According to the feature list this should work but i dont' know how. Is there an example or tutorial on how to implement that?

Update base image

Update base image for Docker build (currently hdfgroup/python:3.7) to use Python 3.8 and include additional Python packages needed for the service.

Use IAM roles for S3 instead of AWS Access Keys

Our group is not able to make use of AWS access keys for S3 CRUD ops. However, we do make use of IAM roles at the EC2 level to achieve the same goals.

runall.sh does check that access keys are provided. However, does the software mandate using access keys if they are not provided? Will an EC2 level IAM role be sufficient?

Login does not work

Hi,

I was attempting to setup hsds on AWS k8s instance but for some odd reason the default password admin:admin does not seem to work even when set in passwd.txt file.

When trying to run $ hsconfigure it keeps giving the error Unauthorized (username/password or api key not valid)

My deployment yaml looks as follows (have some same for dn container too)

containers:
        -
          name: sn
          image: hdfgroup/hsds:latest
          ports:
          .....
          env:
          - name: HSDS_ENDPOINT
            value: http://localhost:5101

I did ssh into the sn container to check if the passwd.txt file was present and indeed it was.
I also created my own docker image with password file set as suggested in the document but with no luck.

Kindly let me know how to resolve this issue.

Thanks,
Akshay

Cannot keep another pod in same space as the one running hsds

Hi,

I observed that the following line throws an error if we have any pod apart from hsds running in our namespace
https://github.com/HDFGroup/hsds/blob/master/hsds/basenode.py#L305

Kindly look into the same.

Thanks,
Akshay

Implement smooth scale downs of a DCOS cluster

In the head node, the health check has some notion of handling the loss of a data or service node, but the logic is focused on replacing that failed node only. I believe that this can be safely expanded to also scale down from a larger to a smaller cluster, e.g. for DCOS.

Incompatibility with aiobotocore v0.12.0

There is an API break in version 0.12.0 of aiobotocore: the get_session function no longer takes a loop argument.
This makes hsds not working with it.

Remove passwords from build image

Including usernames and passwords as part of the Docker image is problematic - for one it makes it impossible to distribute images via Docker hub or other public repo.

Rather than building in the password file, either mount the file (for Docker deployments) or create a secret (for Kubernetes).

Use config file for settings

Rather than relying on environment variables, use config file for settings. This would be mounted (for docker) or loaded as a secret (Kubernetes).

Add support for running HSDS on DCOS

runall.sh fails when SN_PORT is not 80 and there is no load balancer

To my understanding, without load balancer, the SN is responding to front-end requests (https://github.com/HDFGroup/hsds/blob/master/runall.sh#L125).

Also, the check of STATUS_CODE (https://github.com/HDFGroup/hsds/blob/master/runall.sh#L160) uses the URL ${HSDS_ENDPOINT}/about which interrogates on default HTTP port 80 (or 443 if HSDS_ENDPOINT is HTTPS ?).

Therefore, changing SN_PORT makes this check fail while the server is running fine on SN_PORT.

I could make it work by changing the URL in the check to ${HSDS_ENDPOINT}:${SN_PORT}/about but it may not be the desirable behaviour with a load balancer.

Enable Azure Blob Storage

Enable Azure blob storage (which doesn't support the S3 API) as an option.

Enable Posix storage

Enable posix storage to be used rather than object storage.

0.6.0 Release Target

This issue is to document the set of features/bug fixes for the next (v0.6) HSDS release

AKS support

Enable HSDS to run on Azure Kubernetes

AttributeError on invoking lambda

Hi, we're testing out the lambda submission feature of HSDS and we're getting 500 errors on our read requests. I checked the SN logs and confirmed that it's trying to invoke the lambda but is running into the following AttributeError. A quick search for the error suggests that it may be related to aiobotocore?

ERROR> Unexpected exception for lamdea invoke: 'ClientCreatorContext' object has no attribute 'invoke', type: <class 'AttributeError'>
INFO> read_chunk_hyperslab, chunk_id: c-338aa60e-7a07b72a-80c1-34d2e1-85921c_249_0_0, slices: (slice(0, 500, 1), slice(2, 3, 1), slice(0, 3, 1)), bucket: redesign-hsds-prod, serverless: True
INFO> invoking lambda function chunk_read with payload: {'select': '[0:2,2:3,0:3]', 'bucket': 'redesign-hsds-prod', 'chunk_id': 'c-338aa60e-7a07b72a-80c1-34d2e1-85921c_249_0_0', 'dset_json': {'id': 'd-338aa60e-7a07b72a-80c1-34d2e1-85921c', 'root': 'g-338aa60e-7a07b72a-bb02-2e86f2-8f3fa2', 'created': 1602012530, 'lastModified': 1602012530, 'type': {'class': 'H5T_FLOAT', 'base': 'H5T_IEEE_F32LE'}, 'shape': {'class': 'H5S_SIMPLE', 'dims': [500, 46576, 3], 'maxdims': [0, 46576, 3]}, 'creationProperties': {'layout': {'class': 'H5D_CHUNKED', 'dims': [2, 46576, 3]}, 'fillTime': 'H5D_FILL_TIME_ALLOC', 'filters': [{'class': 'H5Z_FILTER_SHUFFLE', 'id': 2, 'name': 'shuffle'}, {'class': 'H5Z_FILTER_DEFLATE', 'id': 1, 'level': 1, 'name': 'gzip'}]}, 'layout': {'class': 'H5D_CHUNKED', 'dims': [2, 46576, 3]}, 'attributes': {'CLASS': {'type': {'class': 'H5T_STRING', 'charSet': 'H5T_CSET_ASCII', 'length': 6, 'strPad': 'H5T_STR_NULLPAD'}, 'shape': {'class': 'H5S_SCALAR'}, 'value': 'EARRAY', 'created': 1602012532.4391222}, 'EXTDIM': {'type': {'class': 'H5T_INTEGER', 'base': 'H5T_STD_I32LE'}, 'shape': {'class': 'H5S_SCALAR'}, 'value': 0, 'created': 1602012532.4724793}, 'TITLE': {'type': {'class': 'H5T_STRING', 'charSet': 'H5T_CSET_ASCII', 'length': 1, 'strPad': 'H5T_STR_NULLPAD'}, 'shape': {'class': 'H5S_SCALAR'}, 'value': 'E', 'created': 1602012532.5056005}, 'VERSION': {'type': {'class': 'H5T_STRING', 'charSet': 'H5T_CSET_ASCII', 'length': 3, 'strPad': 'H5T_STR_NULLPAD'}, 'shape': {'class': 'H5S_SCALAR'}, 'value': '1.1', 'created': 1602012532.5390193}, 'units': {'type': {'class': 'H5T_STRING', 'charSet': 'H5T_CSET_ASCII', 'length': 10, 'strPad': 'H5T_STR_NULLPAD'}, 'shape': {'class': 'H5S_SCALAR'}, 'value': 'nanometers', 'created': 1602012532.5726192}}}} start: 1602018839.911248

HSDS Posix on openshift, missing bucket?

I'm trying to deploy an HSDS server onto an openshift instance. The HSDS server should store its data in a POSIX way onto a "persistant volume claim". When I try to get the domains, do an hsinfo or try to create an hsds file I get responses from the server that seem to point to a missing bucket:

Request GET Domains
Response 404 Not found
Server Response

REQ> GET: /domains [#URL#]
DEBUG> num tasks: 15 active tasks: 7
DEBUG> no Authorization in header
INFO> get_domains for: / verbose: False
DEBUG> get_domains - no limit
INFO> get_domains - prefix: / bucket: hsds
DEBUG> get_domains - listing S3 keys for
DEBUG> _getStorageClient getting FileClient
INFO> getStorKeys('','/','', include_stats=False
INFO> list_keys('','/','', include_stats=False, bucket=hsds
DEBUG> fileClient listKeys for directory: /data/hsds
WARN> listkeys - /data/hsds not found

Command hsinfo
Response:

server name: Highly Scalable Data Service (HSDS)
server state: READY
endpoint: #URL#
username: #USER#
password: #PW#
Error: [Errno 404] Not Found

Server Response:

DEBUG> info request
INFO RSP> <200> (OK): /info
REQ> GET: /about [#URL#]
DEBUG> num tasks: 9 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
INFO RSP> <200> (OK): /about
REQ> GET: / [hsds/home]
DEBUG> num tasks: 9 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home bucket: hsds
INFO> got domain: hsds/home
INFO> getDomainJson(hsds/home, reload=True)
DEBUG> LRU ChunkCache node hsds/home removed from ChunkCache
DEBUG> ID hsds/home resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 200
DEBUG> setitem, key: hsds/home
DEBUG> LRU ChunkCache adding 1024 to cache, mem_size is now: 1024
DEBUG> LRU ChunkCache added new node: hsds/home [1024 bytes]
DEBUG> got domain_json: {'owner': '#USER#', 'acls': {'#USER#': {'create': True, 'read': True, 'update': True, 'delete': True, 'readACL': True, 'updateACL': True}, 'default': {'create': False, 'read': True, 'update': False, 'delete': False, 'readACL': False, 'updateACL': False}}, 'created': 1605698579.8284595, 'lastModified': 1605698579.8284595}
INFO> aclCheck: read for user: #USER#
DEBUG> href parent domain: hsds/
INFO RSP> <200> (OK): /
REQ> GET: /domains [hsds/home/]
DEBUG> num tasks: 9 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
INFO> get_domains for: /home/ verbose: True
DEBUG> get_domains - using Limit: 1000
INFO> get_domains - prefix: /home/ bucket: hsds
DEBUG> get_domains - listing S3 keys for home/
DEBUG> _getStorageClient getting FileClient
INFO> getStorKeys('home/','/','', include_stats=False
INFO> list_keys('home/','/','', include_stats=False, bucket=hsds
DEBUG> fileClient listKeys for directory: /data/hsds/home/
WARN> listkeys - /data/hsds/home/ not found

Command hstouch -u #USER# -p #PW# -u #USER# /home/#USER#/test.h5
Server Response:

REQ> GET: / [hsds/home/#USER#]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER# bucket: hsds
INFO> got domain: hsds/home/#USER#
INFO> getDomainJson(hsds/home/#USER#, reload=True)
DEBUG> LRU ChunkCache node hsds/home/#USER# removed from ChunkCache
DEBUG> ID hsds/home/#USER# resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 200
DEBUG> setitem, key: hsds/home/#USER#
DEBUG> LRU ChunkCache adding 1024 to cache, mem_size is now: 2048
DEBUG> LRU ChunkCache added new node: hsds/home/#USER# [1024 bytes]
DEBUG> got domain_json: {'owner': '#USER#', 'acls': {'#USER#': {'create': True, 'read': True, 'update': True, 'delete': True, 'readACL': True, 'updateACL': True}, 'default': {'create': False, 'read': True, 'update': False, 'delete': False, 'readACL': False, 'updateACL': False}}, 'created': 1605702901.1704721, 'lastModified': 1605702901.1704721}
INFO> aclCheck: read for user: #USER#
DEBUG> href parent domain: hsds/home
INFO RSP> <200> (OK): /
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500

I tried the fix as described in #13, this fails with the following response. I do think this may be because the script is meant for S3 storage though.
Command: python create_toplevel_domain_json.py --user=#USER# --domain=/home
Server Response:

got environment override for config-dir: ../#USER#/config/
checking config path: ../#USER#/config/config.yml
_load_cfg with '../#USER#/config/config.yml'
got env value override for hsds_endpoint
got env value override for root_dir
got env value override for bucket_name
got env value override for log_level
domain: /home
domain: hsds/home
s3_key: home/.domain.json
DEBUG> _getStorageClient getting FileClient
DEBUG> isStorObj hsds/home/.domain.json
INFO> is_key - filepath: /data/hsds/home/.domain.json
DEBUG> isStorObj home/.domain.json returning False
INFO> writing domain
DEBUG> _getStorageClient getting FileClient
INFO> putS3JSONObj(hsds/home/.domain.json)
WARN> fileClient.put_object - bucket at path: /data/hsds not found
Traceback (most recent call last):
File "create_toplevel_domain_json.py", line 181, in
main()
File "create_toplevel_domain_json.py", line 170, in main
loop.run_until_complete(createDomains(app, usernames, default_perm, domain_name=domain))
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "create_toplevel_domain_json.py", line 87, in createDomains
await createDomain(app, domain, domain_json)
File "create_toplevel_domain_json.py", line 104, in createDomain
await putStorJSONObj(app, s3_key, domain_json)
File "/usr/local/lib/python3.8/site-packages/hsds/util/storUtil.py", line 307, in putStorJSONObj
rsp = await client.put_object(key, data, bucket=bucket)
File "/usr/local/lib/python3.8/site-packages/hsds/util/fileClient.py", line 154, in put_object
raise HTTPNotFound()
aiohttp.web_exceptions.HTTPNotFound: Not Found

Support OpenShift Deployments

When deployed on OpenShift, HSDS containers fail to register. This looks to be due to this line: https://github.com/HDFGroup/hsds/blob/master/hsds/basenode.py:L314 (the list_pod_for_all_namespaces call). OpenShift doesn't permit listing of pods not managed by the deployer.

connection failures when async is busy

When async node is busy (e.g. when the service has recently started), client apps may fail with connection being dropped.

Enable AWS Lambda for chunk reads

For data selection cases where 100's of chunks need to be quieted, it will likely be more efficient to invoke a lamda function for each chunk rather than dispatching to the DN nodes. AWS supports up to 1000-way parallelism for Lambda, which will provide better scaling than most server configurations.

Support aiobotocore >= 1.0

There are some forward compatibility issues with the new aiobotorcore package. I've temporarily restricted the hsds build to use the old version (see: HDFGroup/hdf-docker@e17a346), but code should be updated to use the newer version.

hsload corrupted files during linked load

During an overnight load of large (750MB to 6GB) h5 files to HSDS using the hsload --link command, 8 out of 63 files were afterward found to be corrupted. Attempts to access those files resulted in the following error:

HDF5 REST VOL-DIAG: Error detected in HDF5 REST VOL (1.0.0) thread 139748614063872:
  #000: vol-rest/src/rest_vol_dataset.c line 321 in RV_dataset_open(): can't locate dataset by path
    major: Dataset
    minor: Problem with path to object
  #001: vol-rest/src/rest_vol.c line 1834 in RV_find_object_by_path(): can't locate parent group for object of unknown type
    major: Symbol table
    minor: Problem with path to object
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 139748614063872:
  #000: ../../src/H5D.c line 296 in H5Dopen2(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #001: ../../src/H5VLcallback.c line 1974 in H5VL_dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object
  #002: ../../src/H5VLcallback.c line 1941 in H5VL__dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object

Subsequent reloads of the corrupted files using the same hsload command, fixed the problem. During the load process, the following error (output from hsload) has been observed multiple times (requiring manual reloading of the file). It is assumed that this is the cause of the problem, but we have not been able to conclusively demonstrate it.

ERROR 2020-11-02 14:22:00,575 utillib.py:455 ERROR: failed to create dataset: Gateway Timeout
Traceback (most recent call last):
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5py-2.10.0-py3.8-linux-x86_64.egg/h5py/_hl/group.py", line 600, in proxy
    return func(name, self[name])
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/utillib.py", line 658, in object_create_helper
    create_dataset(obj, ctx)
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/utillib.py", line 457, in create_dataset
    return dset
UnboundLocalError: local variable 'dset' referenced before assignment

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./.pyenv/versions/3.8.3/bin/hsload", line 11, in <module>
    load_entry_point('h5pyd==0.8.0', 'console_scripts', 'hsload')()
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/hsload.py", line 309, in main
    load_file(fin, fout, verbose=verbose, dataload=dataload, s3path=s3path, compression=compression, compression_opts=compression_opts)
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_apps/utillib.py", line 698, in load_file
    fin.visititems(object_create_helper)
  File "/home/ubuntu/.pyenv/versions/3.8.3/lib/python3.8/site-packages/h5py-2.10.0-py3.8-linux-x86_64.egg/h5py/_hl/group.py", line 601, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
SystemError: <built-in function visit> returned a result with an error set

Throw Error when invalid username is given

There are a few operations (PUT domain with owner param, PUT ACL) where a username is provided that is not the requesting user.
Currently these requests succeed even when the username does not refer to an existing user.
The server should instead return an error (404?) in these cases.

"Edge Chunk" issues.

Re this conversation from zarr-developers/zarr-python#233 (comment) on using full size chunks along edges vs. using smaller chunks.

HSDS is using smaller chunks but we are not resizing the chunks during resize operations either.

Need to create test case to demonstrate the bug and figure out how to address.

Parent domain: / not found

Hi,

I am trying the docker installation instructions.

When I run hsinfo, I get:

2019-01-01 12:00:48,172 set log_level to 10
2019-01-01 12:00:48,173 GET: http://hsds.hdf.test/about [None]
2019-01-01 12:00:48,178 Starting new HTTP connection (1): hsds.hdf.test:80
2019-01-01 12:00:48,185 http://hsds.hdf.test:80 "GET /about HTTP/1.1" 200 165
2019-01-01 12:00:48,186 status: 200
server name: hsdstest
server state: READY
endpoint: http://hsds.hdf.test
username: admin
password: *****
2019-01-01 12:00:48,187 GET: http://hsds.hdf.test/ [/home]
2019-01-01 12:00:48,189 Starting new HTTP connection (1): hsds.hdf.test:80
2019-01-01 12:00:48,292 http://hsds.hdf.test:80 "GET /?domain=%2Fhome HTTP/1.1" 404 14
2019-01-01 12:00:48,293 status: 404
2019-01-01 12:00:48,294 status_code: 404
Error: [Errno 404] Not Found

And following that when I do hstouch /home/, I get

2019-01-01 12:02:16,324 set log_level to 10
2019-01-01 12:02:16,324 GET: http://hsds.hdf.test/domains [None]
2019-01-01 12:02:16,329 Starting new HTTP connection (1): hsds.hdf.test:80
2019-01-01 12:02:16,408 http://hsds.hdf.test:80 "GET /domains HTTP/1.1" 404 14
2019-01-01 12:02:16,411 status: 404
2019-01-01 12:02:16,411 status_code: 404
Parent domain: / not found

I feel like I am skipping some step to create the root domain.
Any help would be appreciated.
Thanks.

Typo 'clening' in documentation.

https://github.com/HDFGroup/hsds/blob/d1c19c54ce223bacc169364d0fa69f9a6e29446b/tests/load/kubeptwrite/README.md#clening-up

sh compatible installation scripts

The operator [[ is a bash specific improvement of [. It does not allow the use of others shells like Zsh.

Make hsds a python package

As it is hsds is not a python package, it is copied as scripts to the docker.

Wouldn't that make sense to turn it into a true python package so it can also be used outside docker and make it easier to get the structure?

If you are ready to review it, I can give it a try.

Make hsds an application

It would be convenient if hsds could "scale down" to a simple application so as to use it locally simply as an application without docker.

Support attribute names with '#' sign

This is not working because '#' in interpreted as a http fragment character.

[Question] is hsds suitable for data that requires frequent updates?

I have read the documents and examples.
From understanding, hsds is excellent for serving a huge static hdf5 file and enables slicing, downsampling etc operations efficiently.

In our use cases, we need to update the hdf5 file frequently, e.g., on a daily basis.
I am wondering whether we can use hsds for this scenario?

Thanks in advance!

support ARRAY of ARRAY types

Array types with an Array base type are not working.

E.g.:
{
"type": {
"class": "H5T_ARRAY",
"base": {
"class": "H5T_ARRAY",
"base": {
"class": "H5T_STRING",
"charSet": "H5T_CSET_ASCII",
"strPad": "H5T_STR_NULLPAD",
"length": 20
},
"dims": [ 49, 60, 124, 51, 48 ]
},
"dims": [ 49, 60, 124, 51, 48 ]
}

Timestamp all log messages

In trying to debug a cluster wide failure, I'm hampered by only some of the log messages having timestamps. I'm unable to associate what's happening across processes to do a proper postmortem.

This includes logs at all log levels. A common pattern in python is to use https://docs.python.org/3/library/logging.html, which allows sites to do configuration of what information is in the log message prefixes. Common enterprise patterns use log aggregation and indexing software, with site configurable prefixes for local logging and traceability requirements.

Update travis.yml to Python 3.8

See #44, I believe that this might be introducing some errant pyflakes messages because of the mismatch.

Enable HTTP Compression

Many dataset hyperslab and point selections would benefit from using HTTP Compression (c.f. https://en.wikipedia.org/wiki/HTTP_compression). For example, if 1024x1024 dataset is mostly zero's, reading the entire dataset would need to transmit relatively few bytes if the response is compressed compared to 10241024element_size for the uncompressed response.

Azure RBAC using AD groups

Support the use of AD groups for RBAC rather than relying on group list managed by the server.

[Question] Some configuration entries must be set as env variables ?

Seeing #50, I thought that I could set all parameters in config.yml.

I was however proven wrong when running runall.sh. Checks are made on ROOT_DIR and HSDS_ENDPOINT that therefore must be set as environment variables.

Is this the expected behaviour or am I missing something to use the root_dir and hsds_endpoint entries in the config.yml ?

Wrong root id with dataset post

This sequence:

Create domain
Post dataset with link to root
Delete domain
Create domain
Post dataset with link to root

Results in the response of the last POST having the root id of the original domain.

CORS problem when serving posix files on localhost

Hi there,

I have a local running instance of hsds serving POSIX files.
Requests using the Python module requests are succeeding.

However, requests from a browser (using a React App running also in local in my case) fail due to CORS (The Same Origin Policy disallows reading the remote resource).

Any thoughts to resolve this ?

Implement RBAC

Implement Role Based Access Control (RBAC) so that ACLs can reference roles in addition to individual user names.

Serving datasets with bitshuffle compression in POSIX files

I am trying to serve POSIX files that contain datasets compressed with the bitshuffle filter.

hsload --link works without any trouble as do requests to metadata and uncompressed datasets. However, requests to the compressed datasets fail with the following errors in the datanode:

Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/chunk_dn.py", line 259, in GET_Chunk
    chunk_arr = await get_chunk(app, chunk_id, dset_json, bucket=bucket, s3path=s3path, s3offset=s3offset, s3size=s3size, chunk_init=False)
  File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 386, in get_chunk
    chunk_arr = bytesToArray(chunk_bytes, dt, dims)
  File "/usr/local/lib/python3.8/site-packages/hsds/util/arrayUtil.py", line 453, in bytesToArray
    arr = np.frombuffer(data, dtype=dt)
ValueError: buffer size must be a multiple of element size

and

Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/chunk_dn.py", line 259, in GET_Chunk
    chunk_arr = await get_chunk(app, chunk_id, dset_json, bucket=bucket, s3path=s3path, s3offset=s3offset, s3size=s3size, chunk_init=False)
  File "/usr/local/lib/python3.8/site-packages/hsds/datanode_lib.py", line 386, in get_chunk
    chunk_arr = bytesToArray(chunk_bytes, dt, dims)
  File "/usr/local/lib/python3.8/site-packages/hsds/util/arrayUtil.py", line 459, in bytesToArray
    arr = arr.reshape(shape)
ValueError: cannot reshape array of size 6987 into shape (66,123)

How should I proceed to be able to request such datasets ? Given that I do hsload --link, should I look into HSDS rather than h5pyd ?

Wrong IP addresses registered when using Docker overlay network

Using 170: peername = request.transport.get_extra_info('peername') in headnode.py fetches the node IP of the overlay network, and not the IP of the registering service/data node.

I replaced line 170 with a few lines of code into the register function to accept an 'addr' field from the registering node:

   if 'addr' not in body:
       peername = request.transport.get_extra_info('peername')
       if peername is None:
           raise HTTPBadRequest(reason="Can not determine caller IP")
   elif len(body['addr']) == 2:
       peername = body['addr']
   else:
       raise HTTPBadRequest(reason="Can not determine caller IP")

and I added the following function to basenode.py to be invoked while sending the register signal:

def getNodeIp(node_type):
    """ Gets IP of local host (container) for the default route """
    log.info("Node type: %s" % node_type)
    if node_type == 'sn':
        if config.get("sn_port"):
            port = config.get("sn_port")
    elif node_type == 'dn':
        if config.get("dn_port"):
            port = config.get("dn_port")
    else:
        port = 6101
    IP = ((([ip for ip in socket.gethostbyname_ex(socket.gethostname())[2] if not ip.startswith("127.")] or [[(s.connect(("8.8.8.8", 53)), s.getsockname()[0], s.close()) for s in [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]][0][1]]) + ["no IP found"])[0], port)
    log.info("sending node ip: %s:%s" % IP)
    return IP

async def register(app):
    """ register node with headnode
    OK to call idempotently (e.g. if the headnode seems to have forgotten us)"""
    head_url = getHeadUrl(app)
    if not head_url:
        log.warn("head_url is not set, can not register yet")
        return
    req_reg = head_url + "/register"
    log.info("register: {}".format(req_reg))
    addr = getNodeIp(app["node_type"])
    if addr:
        body = {"id": app["id"], "port": app["node_port"], "node_type": app["node_type"], "addr": addr}
    else:
        body = {"id": app["id"], "port": app["node_port"], "node_type": app["node_type"]}
.... etc

Blosc compressor support

Enable use of any compressors available through the numcodecs package: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'

Enable Azure Active Directory Authentication

Enable requests to be authenticated using Active Directory rather than using HTTP Basic Auth.

Setup for throughput optimization ?

I've set up an instance of HSDS on EC2 running in docker containers as per these instructions. The S3 bucket is the NREL bucket like so:

BUCKET_NAME=nrel-pds-hsds
AWS_REGION=us-west-2
AWS_S3_GATEWAY=http://s3.us-west-2.amazonaws.com

I'm running some performance tests to see how much throughput I can squeeze out of HSDS. I've got it running on a m5a.12xlarge (48 cores, 192GB) and I started it with 44 nodes using ./runall.sh 44. Running docker ps confirms that all the expected containers are running. I set up a load test script to hit HSDS with concurrent requests simulating multiple users. As I scale up the load HSDS starts returning 503 errors. What's interesting is that through monitoring the server performance during the test I can see that the CPU and memory are barely being touched. I suspect I might be running docker in a way that is preventing the EC2 instance from really doing it's thing, or some other similar setup issue.

with 1200 total requests at a concurrency rate of 120 HSDS requests at a time my tests result in 97% request failure, all with 503 responses
with 200 total requests at a concurrency rate of 40 HSDS requests at a time my tests result in 63% failure, all with 503 responses
with 200 total requests at a concurrency rate of 20 HSDS requests at a time my tests result in 3% failure, all with 503 responses

Considering that minimal level of detail, are there any suggestions that spring to mind for me to try to either improve results, or research where the bottleneck is happening?

HSDS service node crash

The HSDS service node (started from the runall.sh script running in AWS), has crashed on three separate occasions out of the blue during normal use. The most recent two crashes has occurred using the master branch as of this morning (11/6/20).

Here is a capture of the area in the service node log where the crash occurs:

INFO> data_sel: (slice(25000000, 25217134, 1),)
INFO> node_state: READY
INFO> http_get('http://hsds_head:5100/nodestate')
INFO> http_get status: 200
INFO> health check ok
INFO> chunk_arr shape: (12500000,)
INFO> data_sel: (slice(0, 12500000, 1),)
INFO> chunk_arr shape: (12500000,)
INFO> data_sel: (slice(12500000, 25000000, 1),)
/entrypoint.sh: line 28:     6 Killed                  hsds-servicenode
hsds entrypoint
node type:  sn
running hsds-servicenode
INFO> Service node initializing
INFO> Application baseInit
INFO> baseInit - node_id: sn-eee6d node_port: 80
INFO> using bucket: ####
INFO> aws_iam_role set to: #####
INFO> aws_secret_access_key not set
INFO> aws_access_key_id not set
INFO> aws_region set to: us-west-2