Try dask on top of h5pyd (h5pyd issue, open)

hdfgroup commented on September 26, 2024

Try dask on top of h5pyd

from h5pyd.

Comments (10)

rsignell-usgs commented on September 26, 2024

I tried a notebook example with dask on h5pyd here:
https://github.com/rsignell-usgs/hsds_examples/blob/dask/nrel/notebooks/nrel_dask_example.ipynb
and it mostly worked, but with some messages like:

WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: 52.25.101.15
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: 52.25.101.15
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: 52.25.101.15

The top of the plot also looks incorrect, with repeated values in rows.

I don't really know what I'm doing here, so likely doing something bad with dask or h5pyd or both.

cc: @mrocklin

mrocklin commented on September 26, 2024

Very cool. Some things to try:

Increase the chunk size in dask, perhaps doubling it in each dimension.
It looks like h5pyd might not like multiple concurrent connections; you might try the lock=True option to da.from_array.
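
To make the second suggestion concrete, here is a minimal sketch of what serializing reads with a lock amounts to. It uses only the standard library and makes no claim about Dask or h5pyd internals: `LockedReader` is a hypothetical helper, and a plain list stands in for the remote dataset.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedReader:
    """Hypothetical wrapper: only one thread may read from the source at a time."""

    def __init__(self, source):
        self._source = source
        self._lock = threading.Lock()
        self.active = 0        # readers currently inside __getitem__
        self.max_active = 0    # highest concurrency observed

    def __getitem__(self, key):
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            try:
                return self._source[key]
            finally:
                self.active -= 1


# Assumption: a plain list of 100 values stands in for the remote dataset.
reader = LockedReader(list(range(100)))

# Read ten slices of ten values each from four threads.
slices = [slice(i, i + 10) for i in range(0, 100, 10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    blocks = list(pool.map(reader.__getitem__, slices))

flat = [value for block in blocks for value in block]
```

With the lock in place the reads never overlap, which trades throughput for safety when the underlying source does not tolerate concurrent access.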

mrocklin commented on September 26, 2024

cc @martindurant @jjhelmus

jreadey commented on September 26, 2024

Increasing the chunk size fixed the connection pool problems. It looks like with the original chunk size, dask was sending thousands of HTTP requests to the server, which overwhelmed the HTTP connection pool.
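
A rough sketch of why chunk size matters here, assuming each chunk read turns into at least one HTTP request. The array and chunk sizes below are hypothetical and are not taken from the notebook; `count_chunks` is an illustrative helper.

```python
import math


def count_chunks(shape, chunks):
    """Number of blocks needed to tile an array of `shape` with blocks of size `chunks`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


# Hypothetical sizes, chosen only to illustrate the scaling.
shape = (8000, 8000)
small = (100, 100)
doubled = tuple(2 * c for c in small)

n_small = count_chunks(shape, small)      # 80 * 80 blocks
n_doubled = count_chunks(shape, doubled)  # 40 * 40 blocks
```

Doubling the chunk size in each of two dimensions cuts the number of blocks, and so the number of reads, by a factor of four.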

Still not getting the correct data display, though. I tried lock=True; it made the code run slower, but the data was still messed up.

I'll see if I can get a trace of the http requests.

jreadey commented on September 26, 2024

Sorry, still haven't had a chance to try out Dask yet.
@mrocklin - there's a beta for HSDS that you can join if you would like to experiment with Dask & HSDS. See: https://www.hdfgroup.org/solutions/hdf-cloud.

mrocklin commented on September 26, 2024

I would be surprised to see Dask send 1000s of concurrent connections. By default we only run as many tasks as there are logical cores on a machine. I recommend trying your service with multiple threads, perhaps using some standard library like concurrent.futures or multiprocessing.pool.ThreadPool and seeing how it works. You might also try setting dask to run in single-threaded mode:

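import dask
# API at the time of this thread; removed in later Dask releases:
dask.set_options(get=dask.local.get_sync)
# Newer Dask releases use: dask.config.set(scheduler="synchronous")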

Just to set expectations, all Dask is doing here is running computations like x[:1000, :1000] and x[1000:2000, :1000] in multiple threads. We're pretty low-tech when it comes to data ingestion. I recommend stress testing concurrent access from a single process.
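
The kind of stress test described above might look like the following sketch, which reads blocks such as x[:1000, :1000] from several threads and compares the result with a single-threaded run. `FakeDataset`, `block_bounds`, and `read_all` are hypothetical names, and the in-memory dataset is only a stand-in; in a real test the `read_block` call would slice an h5pyd dataset instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


class FakeDataset:
    """Hypothetical in-memory stand-in for a 2-D remote dataset."""

    def __init__(self, nrows, ncols):
        self.shape = (nrows, ncols)
        self._rows = [[r * ncols + c for c in range(ncols)] for r in range(nrows)]

    def read_block(self, r0, r1, c0, c1):
        # Equivalent in spirit to x[r0:r1, c0:c1] on an array-like object.
        return [row[c0:c1] for row in self._rows[r0:r1]]


def block_bounds(shape, block):
    """Yield (r0, r1, c0, c1) for every block that tiles the shape."""
    nrows, ncols = shape
    for r0, c0 in product(range(0, nrows, block), range(0, ncols, block)):
        yield r0, min(r0 + block, nrows), c0, min(c0 + block, ncols)


def read_all(dataset, block, workers):
    """Read every block using a thread pool and return {bounds: data}."""
    bounds = list(block_bounds(dataset.shape, block))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: dataset.read_block(*b), bounds)
        return dict(zip(bounds, results))


ds = FakeDataset(40, 40)
serial = read_all(ds, block=10, workers=1)
threaded = read_all(ds, block=10, workers=8)
```

If the serial and threaded results ever differ against a real service, that points at a concurrency problem in the client or server rather than in the scheduler driving the reads.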

jreadey commented on September 26, 2024

Ok - I'll try out your suggestions. My plan is to devote some time in 2018Q1 to stress testing HSDS, so this course of action will fit in nicely with that.

mrocklin commented on September 26, 2024

As an FYI I'll be giving a talk about cloud-deployed Dask/XArray workloads at AMS on January 8th. If you make progress by then it would be interesting to discuss this as an option. https://ams.confex.com/ams/98Annual/webprogram/Paper337859.html

Although to be clear we're not just talking about a single machine reading in this case. We're talking about several machines on the cloud reading the same dataset simultaneously.

jreadey commented on September 26, 2024

I'll see if I can cook something up. Would it be possible for you to send me a draft of your presentation?

mrocklin commented on September 26, 2024

Once I have such a draft, sure. I'm unlikely to have anything solid before the actual presentation though. I'll be talking about Dask, XArray, and HPC/Cloud. Some topics of interest are in this GitHub repository: https://github.com/pangeo-data/pangeo/issues
