Comments (10)
I tried a notebook example with dask on h5pyd here:
https://github.com/rsignell-usgs/hsds_examples/blob/dask/nrel/notebooks/nrel_dask_example.ipynb
and it mostly worked, but with some messages like:
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: 52.25.101.15
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: 52.25.101.15
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: 52.25.101.15
Also, the top of the plot looks incorrect, with repeated values in rows.
I don't really know what I'm doing here, so I'm likely doing something bad with dask or h5pyd or both.
cc: @mrocklin
from h5pyd.
Very cool. Some things to try:
- Increase chunk size in dask, perhaps doubling in each dimension
- It looks like h5pyd might not like multiple concurrent connections; you might try the lock=True option to da.from_array
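To make the lock=True suggestion concrete, here is a minimal standard-library sketch of the idea behind it: every read goes through one shared lock, so only one thread touches the dataset at a time. The LockedReader class and the rows list are hypothetical stand-ins for illustration; this is not how dask or h5pyd implement it internally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedReader:
    """Serialize reads on a dataset that may not tolerate concurrent access.

    Hypothetical helper for illustration only.
    """

    def __init__(self, dataset, lock=None):
        self._dataset = dataset
        self._lock = lock if lock is not None else threading.Lock()

    def __getitem__(self, key):
        # Only one thread at a time may read from the wrapped dataset.
        with self._lock:
            return self._dataset[key]


# Stand-in for a remote dataset: a plain list of rows (made-up data).
rows = [[r * 10 + c for c in range(10)] for r in range(100)]
reader = LockedReader(rows)

# Request ten row blocks from four threads; the lock serializes the reads.
blocks = [slice(i, i + 10) for i in range(0, 100, 10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(reader.__getitem__, blocks))
```

The trade-off is the expected one: reads no longer overlap, so throughput drops, but a client that cannot handle concurrent requests is never asked to.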
Increasing the chunk size fixed the connection pool problems. It looks like with the original chunk size, dask was sending thousands of HTTP requests to the server, which overwhelmed the HTTP connection pool.
The data still doesn't display correctly, though. I tried lock=True; it made the code run slower, but the data was still messed up.
I'll see if I can get a trace of the http requests.
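A rough way to see why chunk size matters for request volume is to count how many chunks are needed to tile the array. The sketch below assumes one read per dask chunk, which is a simplification (the client may issue more or fewer HTTP requests per read), and the shape and chunk sizes are made-up numbers rather than the ones from the notebook.

```python
import math


def count_chunks(shape, chunks):
    """Number of chunks needed to tile an array of `shape` with `chunks`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


shape = (8760, 100000)  # hypothetical 2-D dataset shape
small = (100, 100)      # hypothetical original chunk size
large = (200, 200)      # doubled in each dimension

n_small = count_chunks(shape, small)
n_large = count_chunks(shape, large)
print(n_small, n_large, n_small / n_large)
```

Under that assumption, doubling the chunk size in both dimensions of a 2-D array cuts the number of reads by roughly a factor of four.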
Sorry, still haven't had a chance to try out Dask yet.
@mrocklin - there's a beta for HSDS that you can join if you would like to experiment with Dask & HSDS. See: https://www.hdfgroup.org/solutions/hdf-cloud.
I would be surprised to see Dask send 1000s of concurrent connections. By default we only run as many tasks as there are logical cores on a machine. I recommend trying your service with multiple threads, perhaps using some standard library like concurrent.futures or multiprocessing.pool.ThreadPool and seeing how it works. You might also try setting dask to run in single-threaded mode:
import dask
dask.set_options(get=dask.local.get_sync)
Just to set expectations, all Dask is doing here is running computations like x[:1000, :1000] and x[1000:2000, :1000] in multiple threads. We're pretty low-tech when it comes to data ingestion. I recommend stress testing concurrent access from a single process.
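A stress test along those lines could look something like the sketch below: read the same set of blocks once sequentially and once from a thread pool, then compare the results block by block. FakeDataset, block_keys, and read_blocks are hypothetical names invented for this example; in a real test the stand-in would be swapped for the actual dataset object, and any mismatch would point at a concurrency problem in the client or the service.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeDataset:
    """In-memory stand-in for a 2-D dataset indexed as x[rows, cols]."""

    def __init__(self, nrows, ncols):
        self._data = [[r * ncols + c for c in range(ncols)] for r in range(nrows)]

    def __getitem__(self, key):
        row_slice, col_slice = key
        return [row[col_slice] for row in self._data[row_slice]]


def block_keys(nrows, ncols, step):
    """Yield (row_slice, col_slice) keys that tile the array."""
    for r in range(0, nrows, step):
        for c in range(0, ncols, step):
            yield (slice(r, r + step), slice(c, c + step))


def read_blocks(dataset, keys, max_workers):
    """Read every block from a pool of threads, preserving key order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(dataset.__getitem__, keys))


x = FakeDataset(40, 40)
keys = list(block_keys(40, 40, 10))

sequential = [x[k] for k in keys]               # reference result
threaded = read_blocks(x, keys, max_workers=8)  # concurrent reads

# Any key listed here returned different data under concurrency.
mismatches = [k for k, a, b in zip(keys, sequential, threaded) if a != b]
print(f"{len(keys)} blocks read, {len(mismatches)} mismatches")
```

Raising max_workers well above the number of logical cores is a cheap way to probe how the service behaves under more concurrency than dask would normally generate.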
Ok - I'll try out your suggestions. My plan is to devote some time in 2018Q1 to stress testing HSDS, so this course of action will fit in nicely with that.
As an FYI I'll be giving a talk about cloud-deployed Dask/XArray workloads at AMS on January 8th. If you make progress by then it would be interesting to discuss this as an option. https://ams.confex.com/ams/98Annual/webprogram/Paper337859.html
Although to be clear we're not just talking about a single machine reading in this case. We're talking about several machines on the cloud reading the same dataset simultaneously.
I'll see if I can cook something up. Would it be possible for you to send me a draft of your presentation?
Once I have such a draft, sure. I'm unlikely to have anything solid before the actual presentation though. I'll be talking about Dask, XArray, and HPC/Cloud. Some topics of interest are in this GitHub repository: https://github.com/pangeo-data/pangeo/issues