
dask-cuda's Introduction

Dask CUDA

Various utilities to improve deployment and management of Dask workers on CUDA-enabled systems.

This library is experimental, and its API is subject to change at any time without notice.

Example

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
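
This starts one worker per visible GPU. A quick, optional way to confirm the workers came up, using standard dask.distributed APIs:

print(len(client.scheduler_info()["workers"]))  # expect one worker per GPU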

Documentation is available here.

What this is not

This library does not automatically convert your Dask code to run on GPUs.

It only helps with deployment and management of Dask workers in multi-GPU systems. Parallelizing GPU libraries like RAPIDS and CuPy with Dask is an ongoing effort; you can read more about it at blog.dask.org. Additional information about Dask-CUDA can also be found in the docs.

dask-cuda's People

Contributors

ajaythorve, ajschmidt8, ayodeawe, bdice, charlesbluca, dillon-cullinan, efajardo-nv, galipremsagar, gputester, jacobtomlinson, jakirkham, jameslamb, jjacobelli, jolorunyomi, madsbk, mike-wendt, mluukkainen, mrocklin, msadang, necaris, pentschev, quasiben, raydouglass, rjzamora, rlratzel, sean-frye, shwina, vfdev-5, vyasr, wence-

dask-cuda's Issues

[QST] Can dask-cuda-workers work with dask-yarn?

I'm able to use the dask-cuda-worker CLI command to start one worker per GPU, but am wondering if there's a way to expose them to a YARN resource manager?

Would this depend on the dask-yarn package?

In my environment, GPU resource consumption isn't actually monitored, but I would still like to register my dask cluster as a YARN application, so that dashboards and other YARN monitoring tools are aware of its existence.
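
For context, the dask-yarn package mentioned above starts workers as YARN containers through YarnCluster; a minimal sketch, with illustrative resource values and a packaged environment name that is an assumption rather than a real artifact:

from dask_yarn import YarnCluster
from dask.distributed import Client

# Each worker runs in its own YARN container. GPU-aware placement
# (one GPU per container) is exactly the open question in this issue
# and is not handled by this sketch.
cluster = YarnCluster(environment="environment.tar.gz",  # hypothetical archive
                      n_workers=4,
                      worker_vcores=2,
                      worker_memory="8GiB")
client = Client(cluster)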

[FEA] Document how to use dask-cuda with Kubernetes

dask-cuda's LocalCUDACluster Python API and dask-cuda-worker CLI utility are easy to use when manually starting clusters.

The natural question arises: how would I use this with Kubernetes?

The answer could be that dask-cuda is not necessary, and that users should:

  1. Specify, in a Kubernetes config, that each container should be allocated only one GPU
  2. Instantiate a "plain" Dask (non-dask-cuda) cluster in the application
  3. Import dask_cudf and use it; everything works as if running on a dask-cuda-worker-spawned cluster

That would be a fine answer.

However, dask-cuda, dask-cudf, cudf, or another RAPIDS project should have a step-by-step guide for how to set that up, pointing back to the relevant Kubernetes documentation for additional options.

Is there someone well qualified to write such a document soon?
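
For illustration, a minimal sketch of what the application side of options 2-3 above could look like, assuming each worker pod is limited to a single GPU in its Kubernetes spec and the scheduler is exposed as a (hypothetical) service named dask-scheduler:

from dask.distributed import Client
import dask_cudf

# Plain Dask workers in single-GPU pods only ever see one device each,
# so no dask-cuda machinery is needed on the client side.
client = Client("tcp://dask-scheduler:8786")   # hypothetical service address
df = dask_cudf.read_csv("data/*.csv")          # hypothetical data path
print(df.head())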

Run out-of-band function 'initialize_rmm_pool' / Event loop was unresponsive

Hi all, I am trying to configure RAPIDS + Dask in multi-GPU, multi-node mode. Here is what I am doing and the errors I got; could you please assist in fixing these?

IP of the primary compute node hosting the dask-scheduler: <ip_primary_node>
Number of GPUs in the primary compute node: 4

IP of the secondary compute node: <ip_secondary_node>
Number of GPUs in the secondary compute node: 4
Total GPUs: 8

Step 1: Launch the dask-scheduler on the primary compute node (which has 4 GPUs):
dask-scheduler --port=8888 --bokeh-port 8786
output:

distributed.scheduler - INFO - Remove client Client-0616844c-83ba-11e9-8225-246e96b3e316
distributed.scheduler - INFO - Close client connection: Client-0616844c-83ba-11e9-8225-246e96b3e316
distributed.scheduler - INFO - Receive client connection: Client-9ad22140-83bd-11e9-823c-246e96b3e316
distributed.core - INFO - Starting established connection

Step 2: Launch dask-cuda-worker to start workers on the same machine as the scheduler
dask-cuda-worker tcp://<ip_primary_node>:8888
output:
..... a bunch of messages with successful connection

Step 3: Launch dask-cuda-worker on the secondary compute node (with an additional 4 GPUs):
dask-cuda-worker tcp://<ip_primary_node>:8888
output:
..... a bunch of messages with successful connection

Step 4: Run the Client Python API (Jupyter notebook) on the secondary compute node (using all compute node GPUs in distributed mode):
client = Client('tcp://<ip_primary_node>:8888')
output:

Client
Scheduler: tcp://<ip_primary_node>:8888
Dashboard: http://<ip_primary_node>:8786/status
Cluster
Workers: 8
Cores: 8
Memory: 67.47 GB

Errors:
Primary compute node worker:

distributed.core - INFO - Starting established connection
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.core - INFO - Event loop was unresponsive in Worker for 5.24s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 5.28s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 5.29s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 5.34s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - WARNING -  Compute Failed
Function:  execute_task
args:      ((<function apply at 0x7f2aaead7510>, <function run_gpu_workflow at 0x7f2a66c06ea0>, [], (<class 'dict'>, [['quarter', 1], ['year', 2000], ['perf_file', '/home/rapids/data/perf/Performance_2000Q1.txt']])))
kwargs:    {}
Exception: FileNotFoundError()

distributed.worker - WARNING -  Compute Failed
Function:  execute_task
args:      ((<function apply at 0x7f2aaead7510>, <function run_gpu_workflow at 0x7f2a8c4f66a8>, [], (<class 'dict'>, [['quarter', 2], ['year', 2000], ['perf_file', '/home/rapids/data/perf/Performance_2000Q2.txt']])))
kwargs:    {}
Exception: FileNotFoundError()

distributed.worker - WARNING -  Compute Failed
Function:  execute_task
args:      ((<function apply at 0x7f2aaead7510>, <function run_gpu_workflow at 0x7f2a9c329598>, [], (<class 'dict'>, [['quarter', 3], ['year', 2000], ['perf_file', '/home/rapids/data/perf/Performance_2000Q3.txt']])))
kwargs:    {}
Exception: FileNotFoundError()

distributed.worker - WARNING -  Compute Failed
Function:  execute_task
args:      ((<function apply at 0x7f2aaead7510>, <function run_gpu_workflow at 0x7f2a8c6437b8>, [], (<class 'dict'>, [['quarter', 4], ['year', 2000], ['perf_file', '/home/rapids/data/perf/Performance_2000Q4.txt']])))
kwargs:    {}
Exception: FileNotFoundError()

distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
^Cdistributed.dask_worker - INFO - Exiting on signal 2
/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda-0.0.0.dev0-py3.7.egg/dask_cuda/dask_cuda_worker.py:263: UserWarning: Worker._close has moved to Worker.close
  yield [n._close(timeout=2) for n in nannies]
distributed.nanny - INFO - Closing Nanny at 'tcp://<ip_primary_node>:41254'
distributed.nanny - INFO - Closing Nanny at 'tcp://<ip_primary_node>:42505'
distributed.nanny - INFO - Closing Nanny at 'tcp://<ip_primary_node>:33813'
distributed.nanny - INFO - Closing Nanny at 'tcp://<ip_primary_node>:41822'
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-2, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-3, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-4, started daemon)>

Secondary compute node worker:

distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_pool'
distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'finalize_rmm'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
distributed.worker - INFO - Run out-of-band function 'initialize_rmm_no_pool'
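
The FileNotFoundError tasks above read node-local paths on whichever worker runs them; a quick way to check whether every node can actually see the data, sketched against the same (placeholder) scheduler address used above:

import os
from dask.distributed import Client

client = Client("tcp://<ip_primary_node>:8888")
path = "/home/rapids/data/perf/Performance_2000Q1.txt"
# Returns a dict keyed by worker address; any False value means that
# worker's node cannot see the file, which would explain the errors above.
print(client.run(os.path.exists, path))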

dask_cuda seems not using GPU(s)

I have code working with Dask and want to run it on GPU(s).
I imported the required libraries:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

Then initialized with:
if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)

But it seems it is still using the CPU(s).
I found #32; the devices are seen by the system, and with nvidia-smi I see Python processes on the GPUs:
[screenshot]

But GPU load remains at 0%:
[screenshot]

FYI, if I try TensorFlow I am able to see load on all GPU(s).
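
For what it's worth, LocalCUDACluster only assigns one GPU to each worker; the GPUs stay idle unless the submitted work itself runs on the device. A minimal sketch that should produce visible GPU load, assuming CuPy is installed:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cupy
import dask.array as da

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Back the Dask array with CuPy chunks so the computation runs on the GPUs.
    x = cupy.random.random((10000, 10000))
    d = da.from_array(x, chunks=(2500, 2500), asarray=False)
    print((d @ d.T).sum().compute())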

[BUG] Deprecation warning at E2E.ipynb

Describe the bug
Hi,

When running the E2E.ipynb mortgage demo, the following warning is displayed:
[screenshot of the deprecation warning]

Steps/Code to reproduce bug
Run the first two cells.

Expected behavior
No warning is displayed.

Environment details (please complete the following information):
DGX-1

Additional context

Spill quickly from device to host

I'm curious how fast we are at moving data back and forth between device and host memory. I suspect that currently we end up serializing with standard Dask serialization, which may not be aware of device memory. We should take a look at using the "cuda" serialization family within Dask, that was added as part of the UCX work.

import cupy
from distributed.protocol import serialize, deserialize

x = cupy.arange(100)
header, frames = serialize(x, serializers=("cuda", "dask", "pickle"))
y = deserialize(header, frames, deserializers=("cuda", "dask", "pickle", "error"))

(more here https://github.com/dask/distributed/tree/master/distributed/protocol/tests)

And then we should probably do a bit of profiling. I would focus this on just the DeviceHostDisk class, rather than engaging the full dask.distributed cluster machinery.

This is relevant for a couple of applications that people care about today

  1. Larger than memory image processing (a bit)
  2. Out-of-core sorts and joins (cc @randerzander @kkraus14) (more relevant)

@pentschev , you seem like the obvious person to work on this if you have some time. I think it'd be useful to know how fast we are today, then look at serialization with the new "cuda" Dask serializer, and then see how fast we can get things relative to theoretical hardware speeds.
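
A rough sketch of the kind of micro-benchmark this suggests, timing a round trip through the "cuda" serializer for a CuPy array (the array size and the decision to time a single round trip are arbitrary):

import time
import cupy
from distributed.protocol import serialize, deserialize

x = cupy.arange(50_000_000, dtype="float32")  # roughly 200 MB on the device

start = time.perf_counter()
header, frames = serialize(x, serializers=("cuda", "dask", "pickle"))
y = deserialize(header, frames, deserializers=("cuda", "dask", "pickle", "error"))
elapsed = time.perf_counter() - start

print("round trip: %.3fs for %.2f GB" % (elapsed, x.nbytes / 1e9))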

[FEA] Multi-machine example

Is your feature request related to a problem? Please describe.
I was trying to run RAPIDSAI on multiple machines, but I couldn't find a working example.

Describe the solution you'd like

  • The ports that are used by RAPIDSAI and dask-xgboost (Rabit, etc) could be better documented.
  • A best practice for using LocalCUDACluster with multiple machines.

Describe alternatives you've considered

  • I ended up using docker's host networking so I didn't have to think about port forwarding between the two machines. nvidia-docker run --rm -it --network host rapidsai/rapidsai
  • I ended up looking into LocalCUDACluster's source code to set up a remote worker and ended up with something like this:
from dask_cuda.local_cuda_cluster import cuda_visible_devices
from distributed import Nanny
from tornado.ioloop import IOLoop

# Pin the worker to GPU 0 and point it at the remote scheduler.
w = Nanny(scheduler_ip="172.31.23.94", scheduler_port=36531, ncores=1, env={"CUDA_VISIBLE_DEVICES": cuda_visible_devices(0)}, memory_limit=30e9)
w.start()
loop = IOLoop.current()
loop.start()

_ncores not found Exception

When I try running dask-cuda-worker, I get the following exception:

(rapids) root@dt07:/rapids/notebooks/distributed# dask-cuda-worker --scheduler-file=/home/nfs/pgali/newfile --no-bokeh --host=10.136.7.107
Traceback (most recent call last):
  File "/conda/envs/rapids/bin/dask-cuda-worker", line 11, in <module>
    load_entry_point('dask-cuda==0.9.0a0', 'console_scripts', 'dask-cuda-worker')()
  File "/conda/envs/rapids/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/conda/envs/rapids/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2843, in load_entry_point
    return ep.load()
  File "/conda/envs/rapids/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2434, in load
    return self.resolve()
  File "/conda/envs/rapids/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2440, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/dask_cuda_worker.py", line 17, in <module>
    from distributed.worker import _ncores, parse_memory_limit
ImportError: cannot import name '_ncores' from 'distributed.worker' (/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py)

LocalCUDACluster doesn't use multiple GPUs when cuDF is imported

Only the first GPU is used when importing cudf. If I remove that import, all GPUs get assigned work to do.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import numpy as np
import cupy
import dask
import dask.array as da
import cudf

if __name__ == '__main__':
    cluster = LocalCUDACluster()
    client = Client(cluster)

    x = cupy.random.random((100000, 1000))
    d = da.from_array(x, chunks=(10000, 1000), asarray=False)
    u, s, v = np.linalg.svd(d)
    s, v = dask.compute(s, v)

Obviously, this example is not really using cudf for anything, just serving for demonstration purposes.

The only workaround I found for this is to start dask-scheduler and dask-cuda-worker outside of my python script. Am I missing something or doing something wrong?
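
One thing worth trying, sketched below and not a guaranteed fix: delay the cudf import until after the cluster exists, so the parent process does not touch a CUDA context on device 0 before LocalCUDACluster assigns devices to its workers.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import numpy as np
import cupy
import dask
import dask.array as da

if __name__ == '__main__':
    cluster = LocalCUDACluster()
    client = Client(cluster)

    import cudf  # imported only after the workers have been created

    x = cupy.random.random((100000, 1000))
    d = da.from_array(x, chunks=(10000, 1000), asarray=False)
    u, s, v = np.linalg.svd(d)
    s, v = dask.compute(s, v)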

[FEA] Better CUDF/Nvstrings Spill over to Disk/Memory

We still have workflows that are limited by spill over with cudf, as it currently only works for limited workflows.

An example where spill over fails is: #65 (comment)

According to #65 (comment), we need more changes on cudf side to support this.

We now expose the device memory used with nvstrings: rapidsai/custrings#395

Can you please list the changes we still require, so that we can track them and get them completed ASAP to unblock these workflows and enable better spill over?

CC: @pentschev
CC: @randerzander

Takes very long to launch all the workers.

I am trying to launch one scheduler and 16 workers (one worker per GPU on a DGX-2, which has 16 NVIDIA V100 GPUs), but it takes minutes for this launch to finish.

  • To be more specific, I launched the scheduler with

dask-scheduler --scheduler-file ~/cluster.json

  • And I launched the workers with

mpirun -np 16 dask-mpi --no-nanny --nthreads 4 --no-scheduler --scheduler-file ~/cluster.json

I will paste the scheduler and worker outputs at the end, and in the worker output,

"distributed.core - INFO - Starting established connection"

appears only 13 times, and it took quite some additional time for this to get printed 16 times (I am assuming that this gets printed once per worker). Any clues?

  • Scheduler output

seunghwak@dgx202:~$ dask-scheduler --scheduler-file ~/cluster.json
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.22.1.27:8786
distributed.scheduler - INFO - bokeh at: :8787
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-qc55upke
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://172.22.1.27:34801
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:34801
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:43403
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:43403
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:36585
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:36585
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:34323
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:34323
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:41637
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:41637
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:39909
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:39909
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:41111
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:41111
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:43857
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:43857
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:34575
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:34575
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:32879
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:32879
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://172.22.1.27:33513
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.22.1.27:33513
distributed.core - INFO - Starting established connection

  • worker output

seunghwak@dgx202:~$ mpirun -np 16 dask-mpi --no-nanny --nthreads 4 --no-scheduler --scheduler-file ~/cluster.json
--------------------------------------------------------------------------
WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

Local host: dgx202
--------------------------------------------------------------------------
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-dn_tchgi', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-vg031frf', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-0odsrkvu', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-gtq1_xnv', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-l7fdve30', purging
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:34801
distributed.worker - INFO - Listening to: tcp://:34801
distributed.worker - INFO - bokeh at: :8789
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-4suamqwu
distributed.worker - INFO - -------------------------------------------------
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:43403
distributed.worker - INFO - Listening to: tcp://:43403
distributed.worker - INFO - bokeh at: :34033
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-c8xxwc02
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-v5_a8j9_', purging
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:36585
distributed.worker - INFO - Listening to: tcp://:36585
distributed.worker - INFO - bokeh at: :43451
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-jyyr2kzj
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-ledklclp', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-4qn86vr6', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-4v_k1uxo', purging
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:34323
distributed.worker - INFO - Listening to: tcp://:34323
distributed.worker - INFO - bokeh at: :44115
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-lhwaa73x
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:41637
distributed.worker - INFO - Listening to: tcp://:41637
distributed.worker - INFO - bokeh at: :38581
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-8f0b7jk0
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-qr5qf9fn', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-qhxtj56l', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-k1gsxecp', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-bby3v9z5', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-x1wv0iop', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/nfs/seunghwak/worker-m8_e_vhq', purging
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:39909
distributed.worker - INFO - Listening to: tcp://:39909
distributed.worker - INFO - bokeh at: :44549
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-_id7lb88
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:41111
distributed.worker - INFO - Listening to: tcp://:41111
distributed.worker - INFO - bokeh at: :42467
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-ojft0vl2
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:43857
distributed.worker - INFO - Listening to: tcp://:43857
distributed.worker - INFO - bokeh at: :45701
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-hsxqxrjq
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:34575
distributed.worker - INFO - Listening to: tcp://:34575
distributed.worker - INFO - bokeh at: :33713
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-aeydua5l
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
[dgx202:50083] 15 more processes have sent help message help-btl-vader.txt / cma-permission-denied
[dgx202:50083] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:32879
distributed.worker - INFO - Listening to: tcp://:32879
distributed.worker - INFO - bokeh at: :40603
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-4u2s31xy
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:33513
distributed.worker - INFO - Listening to: tcp://:33513
distributed.worker - INFO - bokeh at: :36567
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-5f8ston8
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:35155
distributed.worker - INFO - Listening to: tcp://:35155
distributed.worker - INFO - bokeh at: :46721
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-ggdh6nqc
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/cli/dask_mpi.py:102: UserWarning: The dask-mpi command line utility in the distributed package is deprecated. Please install the dask-mpi package instead. More information is available at https://mpi.dask.org
warn("The dask-mpi command line utility in the distributed "
/home/nfs/seunghwak/Program/anaconda3/envs/pagerank_demo_dgx2/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn('\n' + msg)
distributed.worker - INFO - Start worker at: tcp://172.22.1.27:39829
distributed.worker - INFO - Listening to: tcp://:39829
distributed.worker - INFO - bokeh at: :40967
distributed.worker - INFO - Waiting to connect to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 4
distributed.worker - INFO - Memory: 67.58 GB
distributed.worker - INFO - Local Directory: /home/nfs/seunghwak/worker-3v0cxd_6
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://172.22.1.27:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
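
To see exactly when all 16 workers have registered (rather than inferring it from the log lines), a small client-side check against the same scheduler file can help; a minimal sketch:

import os
import time
from dask.distributed import Client

client = Client(scheduler_file=os.path.expanduser("~/cluster.json"))

# Poll the scheduler until the expected number of workers has connected.
expected = 16
while len(client.scheduler_info()["workers"]) < expected:
    time.sleep(1)
print("all %d workers connected" % expected)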

Out of Memory Sort Fails even with Spill over

Out-of-memory sort still seems to be failing even with the device spill PR (#51) merged.

The memory still seems to grow linearly, which causes RuntimeError: parallel_for failed: out of memory.

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf, dask_cudf

# Use dask-cuda to start one worker per GPU on a single-node system
# When you shutdown this notebook kernel, the Dask cluster also shuts down.
cluster = LocalCUDACluster(ip='0.0.0.0',n_workers=1, device_memory_limit='10000 MiB')
client = Client(cluster)
# # print client info
print(client)

# Code to simulate_data

def generate_file(output_file,rows=100):
    with open(output_file, 'wb') as f:
        f.write(b'A,B,C,D,E,F,G,H,I,J,K\n')
        f.write(b'22,697,56,0.0,0.0,0.0,0.0,0.0,0.0,0,0\n23,697,56,0.0,0.0,0.0,0.0,0.0,0.0,0,0\n'*(rows//2))
        f.close()

# generate the test file 
output_file='test.csv'
# Uncomment below
generate_file(output_file,rows=100_000_000)

# reading it using dask_cudf
df = dask_cudf.read_csv(output_file,chunksize='100 MiB')
print(df.head(10).to_pandas())


# sorting it using dask_cudf
df = df.sort_values(['A','B','C'])

Error Trace :

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-62d876400539> in <module>
     30 
     31 # reading it using dask_cudf
---> 32 df = df.sort_values(['A','B','C'])

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/dask_cudf/core.py in sort_values(self, by, ignore_index)
    440         """
    441         parts = self.to_delayed()
--> 442         sorted_parts = batcher_sortnet.sort_delayed_frame(parts, by)
    443         return from_delayed(sorted_parts, meta=self._meta).reset_index(
    444             force=not ignore_index

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/dask_cudf/batcher_sortnet.py in sort_delayed_frame(parts, by)
    133         list(map(delayed(lambda x: int(x is not None)), parts[:valid]))
    134     )
--> 135     valid = compute(valid_ct)[0]
    136     validparts = parts[:valid]
    137     return validparts

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    396     keys = [x.__dask_keys__() for x in collections]
    397     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 398     results = schedule(dsk, keys, **kwargs)
    399     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    400 

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2566                     should_rejoin = False
   2567             try:
-> 2568                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2569             finally:
   2570                 for f in futures.values():

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1820                 direct=direct,
   1821                 local_worker=local_worker,
-> 1822                 asynchronous=asynchronous,
   1823             )
   1824 

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    751             return future
    752         else:
--> 753             return sync(self.loop, func, *args, **kwargs)
    754 
    755     def __repr__(self):

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    329             e.wait(10)
    330     if error[0]:
--> 331         six.reraise(*error[0])
    332     else:
    333         return result[0]

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/distributed/utils.py in f()
    314             if timeout is not None:
    315                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 316             result[0] = yield future
    317         except Exception as exc:
    318             error[0] = sys.exc_info()

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/tornado/gen.py in run(self)
    727 
    728                     try:
--> 729                         value = future.result()
    730                     except Exception:
    731                         exc_info = sys.exc_info()

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/tornado/gen.py in run(self)
    734                     if exc_info is not None:
    735                         try:
--> 736                             yielded = self.gen.throw(*exc_info)  # type: ignore
    737                         finally:
    738                             # Break up a reference to itself

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1651                             six.reraise(CancelledError, CancelledError(key), None)
   1652                         else:
-> 1653                             six.reraise(type(exception), exception, traceback)
   1654                     if errors == "skip":
   1655                         bad_keys.add(key)

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    690                 value = tp()
    691             if value.__traceback__ is not tb:
--> 692                 raise value.with_traceback(tb)
    693             raise value
    694         finally:

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/dask/compatibility.py in apply()
     91     def apply(func, args, kwargs=None):
     92         if kwargs:
---> 93             return func(*args, **kwargs)
     94         else:
     95             return func(*args)

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/dask_cudf/batcher_sortnet.py in _compare_frame()
     72     if a is not None and b is not None:
     73         joint = gd.concat([a, b])
---> 74         sorten = joint.sort_values(by=by)
     75         # Split the sorted frame using the *max_part_size*
     76         lhs, rhs = sorten[:max_part_size], sorten[max_part_size:]

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in sort_values()
   1279         return self._sort_by(self[by].argsort(
   1280             ascending=ascending,
-> 1281             na_position=na_position)
   1282         )
   1283 

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in _sort_by()
   1225         # Perform out = data[index] for all columns
   1226         for k in self.columns:
-> 1227             df[k] = self[k].take(sorted_indices.to_gpu_array())
   1228         df.index = self.index.take(sorted_indices.to_gpu_array())
   1229         return df

~/anaconda3/envs/py_36_rapids/lib/python3.6/site-packages/cudf/dataframe/series.py in take()
    324             return self[indices]
    325 
--> 326         col = cpp_copying.apply_gather_array(self.data.to_gpu_array(), indices)
    327 
    328         if self._column.mask:

cudf/bindings/copying.pyx in cudf.bindings.copying.apply_gather_array()

cudf/bindings/copying.pyx in cudf.bindings.copying.apply_gather_column()

cudf/bindings/copying.pyx in cudf.bindings.copying.apply_gather()

RuntimeError: parallel_for failed: out of memory

AttributeError: 'LocalCUDACluster' object has no attribute 'status' when using Visual Profiler

I would like to use the NVIDIA Visual Profiler (remotely) to analyze my code running on a server with multiple GPUs.
On the server I started "nvidia-cuda-mps-control -d" (tried both with the same user that runs the script inside the virtualenv and with sudo, same result).
On my workstation (following a guide) I am able to connect and get everything ready for the analysis (set to profile all processes).
Returning to the server and launching my app manually, I receive this error:

(tf_gpu) dask@server:~$ python zarr0.py
Traceback (most recent call last):
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 227, in initialize
    self.cuInit(0)
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 290, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 325, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [999] Call to cuInit results in CUDA_ERROR_UNKNOWN

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "zarr0.py", line 135, in <module>
    cluster = LocalCUDACluster(n_workers=8, threads_per_worker=8)
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/dask_cuda/local_cuda_cluster.py", line 75, in __init__
    self.device_memory_limit = get_device_total_memory(0)
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/dask_cuda/utils.py", line 22, in get_device_total_memory
    with cuda.gpus[index]:
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 40, in __getitem__
    return self.lst[devnum]
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/devices.py", line 26, in __getattr__
    numdev = driver.get_device_count()
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 336, in get_device_count
    self.cuDeviceGetCount(byref(count))
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 270, in __getattr__
    self.initialize()
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 230, in initialize
    raise CudaSupportError("Error at driver init: \n%s:" % e)
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init:
[999] Call to cuInit results in CUDA_ERROR_UNKNOWN:
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/dask/miniconda3/envs/tf_gpu/lib/python3.6/site-packages/distributed/deploy/spec.py", line 260, in __del__
AttributeError: 'LocalCUDACluster' object has no attribute 'status'

This error appears if /usr/bin/nvidia-cuda-mps-control is active, but to profile multi-process apps it must be started, right?

Any suggestion?

Extend memory spilling to multiple storage media

With #35 currently in the works, we will have the capability of spilling CUDA device memory to host, and host memory to disk. However, as pointed out by @kkraus14 here, it would be beneficial to allow spilling host memory to multiple user-defined storage media.

I think we could follow the same configuration structure of Alluxio, as suggested by @kkraus14. Based on the current structure suggested in #35 (still subject to change), it would look something like the following:

cuda.worker.dirs.path=/mnt/nvme,/mnt/ssd,/mnt/nfs
cuda.worker.dirs.quota=16GB,100GB,1000GB

@mrocklin FYI
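
For illustration only, a rough sketch of how such tiers could be chained with zict (the mapping toolkit distributed already uses for host-to-disk spilling); the mount points, quotas, and the assumption that values are raw bytes are all illustrative, not the proposed implementation:

from zict import Buffer, File

host = dict()                          # in-memory tier holding raw bytes
nvme = File("/mnt/nvme/dask-spill")    # hypothetical mount points
nfs = File("/mnt/nfs/dask-spill")

weight = lambda key, value: len(value)

# Overflow from NVMe spills to NFS, and overflow from host spills to that pair.
disk_tiers = Buffer(nvme, nfs, n=100e9, weight=weight)
storage = Buffer(host, disk_tiers, n=16e9, weight=weight)

storage["x"] = b"serialized frames go here"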

[BUG] Conda vs Pypi dask dependency issue

Since the conda dependency is dask-core but the PyPI one is dask, this prevents the conda build from working.

The conda build delegates to setup.py, and this process will attempt to download dask; conda-build errors on any setuptools downloads.

Workaround for dask default worker-space directory

When the E2E notebook is run from a read-only filesystem (e.g., a Singularity container image), dask does not start because it cannot write its dask-worker-space directory to the (read-only) current working directory.

The workaround is to modify the LocalCUDACluster call in cell 3 to add the local_dir parameter:

cluster = LocalCUDACluster(ip=IPADDR, local_dir='/tmp/dask_cuda')

Until this is fixed upstream, the workaround should be added to the notebook.

See dask/distributed#2496.

[FEA] LocalCUDACluster should support passing a list of GPU IDs to use as workers

Users often share the same server or workstation with multiple GPUs. If they want to use LocalCUDACluster but don't want to use all GPUs (e.g. they're nice enough to leave some for others), they can currently specify n_workers.

However, on a 4-GPU system, if no CUDA_VISIBLE_DEVICES environment variable is set manually and user1 creates a LocalCUDACluster with 2 workers, then user2 creates another LocalCUDACluster with 2 workers, both users will end up using the same first two GPUs.

It would be nice if users could pass a list of GPU IDs they want their LocalCUDACluster to use.
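
Until such a parameter exists, one workaround is to restrict the visible devices before the cluster is created, either in the shell (CUDA_VISIBLE_DEVICES=2,3 python script.py) or in Python; a minimal sketch where user2 takes the last two GPUs of a 4-GPU box:

import os

# Must be set before any CUDA context is created in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(n_workers=2)
client = Client(cluster)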

[BUG] Error importing LocalCUDACluster

Using the dask-cuda nightly conda package:

from dask_cuda import LocalCUDACluster

Result:

/usr/local/lib/python3.6/site-packages/dask_cuda/local_cuda_cluster.py in <module>()
      8 from distributed.nanny import Nanny
      9 from distributed.worker import Worker, TOTAL_MEMORY
---> 10 from distributed.utils import parse_bytes, warn_on_duration
     11 
     12 from .device_host_file import DeviceHostFile

ImportError: cannot import name 'warn_on_duration'

Multi-node CUDA cluster

As cuML is beginning the venture into algorithms that span multiple nodes, it would be very useful if we had a mechanism for starting, or attaching workers to, a multi-node Dask CUDA cluster.

cp.RawKernel and LocalCUDACluster

I have code running with Dask's LocalCUDACluster and CuPy.
I would like to improve speed using RawKernel from CuPy.
I have code with CuPy and RawKernel working, but I don't know why running it with LocalCUDACluster gives me wrong results (and/or errors).
Basically, I would like to use LocalCUDACluster because I have multiple GPUs and because the input data doesn't fit in RAM.
Some basic questions:
With cp.RawKernel, must the input matrix be square?
How do I manage the return value of a cp.RawKernel function in map_partitions? I think the value will be overwritten with every chunk processed.
Has anyone already tried something like that?
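
Not a full answer, but a minimal sketch of driving a cp.RawKernel from map_blocks (the array analogue of map_partitions) on a CuPy-backed Dask array. The element-wise kernel below is illustrative; inputs do not need to be square, and each chunk gets its own output array, so nothing is overwritten between chunks:

import cupy as cp
import dask.array as da

_SRC = r'''
extern "C" __global__
void square(const double* x, double* y, const long long n) {
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = x[i] * x[i];
    }
}
'''

def square_block(block):
    # Compiling inside the task keeps the function easy to serialize;
    # CuPy caches the compiled module, so repeated calls are cheap.
    kernel = cp.RawKernel(_SRC, "square")
    out = cp.empty_like(block)          # one output array per chunk
    n = block.size
    threads = 128
    blocks = (n + threads - 1) // threads
    kernel((blocks,), (threads,), (block.ravel(), out.ravel(), cp.int64(n)))
    return out

x = da.from_array(cp.random.random(1_000_000), chunks=250_000, asarray=False)
y = x.map_blocks(square_block, dtype=x.dtype)
print(float(y.sum().compute()))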

[QST] dask-cuda memory management

Hello. First of all thank you very much for the great work on this project, it's really very useful!

I'm currently exploring different options for fitting some simple (and later more complex) statistical models on GPUs, and one of the issues I was having concerned memory management.

When using Dask + CuPy, I often ran into OOM errors on the GPU. This can probably be explained by the fact that Dask has no idea how much memory is available on the GPU and thus how many chunks it can load at once. This is quite an issue, because it would require manually handling loading/unloading of chunks, which isn't very fun.

I came across this project and realized that you have implemented a two-level cache system to avoid filling the GPU's memory, which is exactly what I'm looking for. However, you also mentioned something else in #43 that confused me a little bit:

If using Yarn for one-worker-per-gpu workloads then I would probably just not use dask-cuda-worker, and would instead ask for a single GPU in your Yarn [...]

How would this solve the memory issue? If one worker loads too much data onto the GPU by creating too many CuPy arrays, then the problem will still be there.

As a more general question: while I understand this is still very much in development (and I would be very happy to contribute if necessary), are there some general guidelines for using Dask with GPUs?

And a last question: It seems to me that the only thing that I currently need from this project is the memory management offered by the CUDA worker. Would it be reasonable to only use this part of the library without using the whole CUDA cluster? This would be useful for deployment on a yarn cluster for instance.
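For reference, a minimal sketch of how the two spill levels can be configured through LocalCUDACluster today, assuming the device_memory_limit and memory_limit keywords are available and accept byte strings in the installed version:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# spill device -> host above ~14 GB per GPU, and host -> disk above ~48 GB per worker
cluster = LocalCUDACluster(device_memory_limit="14GB", memory_limit="48GB")
client = Client(cluster)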

Thanks!

nvprof only shows profiling data for GPU 0

It has been reported by @pradghos in #74 (comment) that nvprof only shows profiling data for GPU 0. Below follows a verbatim copy of the report:

if __name__ == '__main__':
    #cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)
    cluster = LocalCUDACluster()
    print("cluster status ",cluster.status)
    print("cluster infomarion ", cluster)
    client = Client(cluster)

    import cudf as gd
    import dask_cudf as dgd

...
...
    got = gd.concat(dask.compute(*delays))

nvidia-smi output shows that all 4 GPUs are getting used:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     74287      C   python                                       405MiB |
+-----------------------------------------------------------------------------+
Mon Jun 17 23:20:41 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.03    Driver Version: 418.40.03    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   44C    P0    67W / 300W |   1130MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   45C    P0    69W / 300W |    325MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   42C    P0    68W / 300W |    325MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   48C    P0    69W / 300W |    325MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     74287      C   python                                       805MiB |
|    0     74316      C   /opt/anaconda3/bin/python                    315MiB |
|    1     74314      C   /opt/anaconda3/bin/python                    315MiB |
|    2     74313      C   /opt/anaconda3/bin/python                    315MiB |
|    3     74315      C   /opt/anaconda3/bin/python                    315MiB |
+-----------------------------------------------------------------------------+

But when I use nvprof --print-summary-per-gpu --profile-child-processes python test_cluster2.py,

the result is the same as before:


==5018== Device "Tesla V100-SXM2-16GB (0)"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   42.34%  41.962ms       183  229.30us  1.4710us  4.2174ms  [CUDA memcpy DtoH]
                   41.68%  41.305ms       101  408.96us  1.0240us  6.8770ms  [CUDA memcpy HtoD]
                    9.50%  9.4118ms         1  9.4118ms  9.4118ms  9.4118ms  _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort14BlockSortAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEESL_EEbS5_S5_lS5_S5_SI_EEvT0_T1_T2_T3_T4_T5_T6_
                    1.31%  1.2946ms        12  107.88us  102.75us  122.37us  _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort10MergeAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEEEEbS5_S5_lS5_S5_SI_PllEEvT0_T1_T2_T3_T4_T5_T6_T7_T8_
                    1.30%  1.2842ms         5  256.84us  249.44us  270.91us  cudapy::cudf::utils::cudautils::gpu_gather$243(Array<__int64, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>)
                    0.71%  702.33us        15  46.822us  45.567us  50.720us  [CUDA memcpy DtoD]
                    0.62%  618.85us        23  26.906us  1.1840us  219.30us  void kernel_v_v<char, long, long, Equal>(int, char*, long*, long*)
                    0.61%  607.71us        23  26.422us  1.9520us  139.26us  void _GLOBAL__N__56_tmpxft_00002643_00000000_7_reductions_compute_70_cpp1_ii_c1104e96::gpu_reduction_op<cudf::detail::wrapper<char, gdf_dtype=7>, cudf::detail::wrapper<char, gdf_dtype=7>, cudf::DeviceMin, cudf::reductions::IdentityLoader>(char const *, unsigned char const *, int, gdf_dtype=7*, cudf::detail::wrapper<char, gdf_dtype=7>, unsigned char const *, cudf::detail::wrapper<char, gdf_dtype=7>)
                    0.58%  571.77us        10  57.177us  1.2160us  95.423us  cudapy::cudf::utils::cudautils::gpu_arange$241(__int64, __int64, __int64, Array<__int64, int=1, A, mutable, aligned>)
....
....

==5018== Device "Tesla V100-SXM2-16GB (1)"
No kernels were profiled.

==5018== Device "Tesla V100-SXM2-16GB (2)"
No kernels were profiled.

==5018== Device "Tesla V100-SXM2-16GB (3)"
No kernels were profiled.

I can see that only GPU 0 is getting profiled, and no kernels were profiled for the other three GPUs in the system.

Am I missing something on the nvprof usage side? Do I need to use any other option in nvprof to profile activity on all the GPUs?
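One thing that may be worth trying, though not verified here: nvprof's per-process output mode, which writes a separate profile per child process instead of folding everything into the parent's summary.

nvprof --profile-child-processes -o dask-worker-%p.nvvp python test_cluster2.py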

Thanks!

Experiment rechunking cupy array on DGX

Using the DGX branch and the tom-ucx distributed branch, I'm playing with rechunking a large 2D array from by-row to by-column chunks:

from dask_cuda import DGX
cluster = DGX(CUDA_VISIBLE_DEVICES=[0,1,2,3])
from dask.distributed import Client
client = Client(cluster)
import cupy, dask.array as da, numpy as np
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random((40000, 40000), chunks=(None, '1 GiB')).persist()
y = x.rechunk(('1 GiB', -1)).persist()

This is a fun experiment because it's a common operation, stresses UCX a bit, and is currently quite fast (when it works).

I've run into the following problems:

  1. Spilling to disk when I run out of device memory (I don't have any spill to disk things on at the moment)

  2. Sometimes I get this error from the dask comm ucx code

      File "/home/nfs/mrocklin/distributed/distributed/comm/ucx.py", line 134, in read
        nframes, = struct.unpack("Q", obj[:8])  # first eight bytes for number of frames
  3. Sometimes CURAND seems to dislike me

distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x10cupy.cuda.curand\x94\x8c\x0bCURANDError\x94\x93\x94\x8c!CURAND_STATUS_PREEXISTING_FAILURE\x94\x85\x94R\x94}\x94\x8c\x06status\x94K\xcasb.'
Traceback (most recent call last):
  File "/home/nfs/mrocklin/distributed/distributed/worker.py", line 3193, in apply_function
    result = function(*args, **kwargs)
  File "/home/nfs/mrocklin/dask/dask/array/random.py", line 411, in _apply_random
    return func(*args, size=size, **kwargs)
  File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 516, in random_sample
    out = self._random_sample_raw(size, dtype)
  File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 505, in _random_sample_raw
    func(self._generator, out.data.ptr, out.size)
  File "cupy/cuda/curand.pyx", line 155, in cupy.cuda.curand.generateUniformDouble
  File "cupy/cuda/curand.pyx", line 159, in cupy.cuda.curand.generateUniformDouble
  File "cupy/cuda/curand.pyx", line 83, in cupy.cuda.curand.check_status
cupy.cuda.curand.CURANDError: CURAND_STATUS_PREEXISTING_FAILURE

I don't plan to investigate these personally at the moment, but I wanted to record the experiment somewhere (and this seems to currently be the best place?). I think it might be useful to have someone like @madsbk or @pentschev look into this after the UCX and DGX work gets cleaned up a bit more.

Performance issue - High transfer costs

I am currently working on implementing a simple logistic regression model on a multi-GPU machine with dask-cuda and CuPy, but I'm running into some performance issues due to high communication costs (transfer-* tasks).

The setup

I'm running the code on an EC2 g3.8xlarge instance using a LocalCUDACluster with the default configuration:

cluster = LocalCUDACluster()
client = Client(cluster)

With

  • dask: 2.1.0
  • dask_cuda: latest 0.9.0
  • cupy: 6.1.0

The data

I use synthetic data, and persist the results (and wait until the cluster is done computing them!) to make sure that I'm only considering the computation cost of the gradient/loss evaluation:

rs = da.random.RandomState()

X = rs.normal(10, 1, size=(10**7, 99), chunks=(8*10**5, 99))
y = rs.normal(10, 1, size=(10**7,), chunks=(8*10**5,))
beta = rs.normal(10, 1, size=(X.shape[1],))

X = X.map_blocks(cp.array, dtype=np.float64)
y = y.map_blocks(cp.array, dtype=np.float64)
beta = beta.map_blocks(cp.array, dtype=np.float64)

X, y, beta = client.persist([X, y, beta])

The task

I'm currently investigating the cost of a single gradient / loss evaluation. The code for computing it is the following:

p = 1 / (1 + da.exp(-X.dot(beta)))
loss = -da.log(1.0 - y + (2.0*y - 1.0)*p).sum()
grad = (p - y).dot(X)

The problem

Most of the time needed for the computation described above is spent transferring data around, to the point that there is no performance gain between using this 2 GPU instance and a similar machine with 1 GPU (g3s.xlarge), and it's slower than using the same machine with a LocalCluster(n_workers=2).

The problem is obvious when looking at the task section of the dashboard:
[screenshots of the dashboard task stream]

My question

Is this a dask_cuda issue?
Am I doing anything wrong?

More importantly: what would you suggest I do or investigate to improve this?

Thanks!

Misc

  • Here's the optimized graph associated with the above computation:
    [image: optimized task graph]

  • Here's the task timeline when using 2 workers and 1 thread per worker:
    [image: task stream timeline with 2 CPU workers]

--

Build proof of concept of multi-node join computation on Kubernetes

It would be useful for the RAPIDS effort to have a multi-node join computation deployed from Kubernetes. Until UCX arrives this will likely be slow, but we can probably work on deployment and configuration issues in the meantime.

I suspect that this involves the following steps:

  1. Obtain access to a Kubernetes cluster with GPUs
  2. Use either dask-kubernetes or the Dask helm chart to deploy Dask workers onto that cluster, doing whatever is necessary to specify GPUs in the pod specification
  3. Run a computation similar to https://blog.dask.org/2019/01/29/cudf-joins , but presumably larger in scale
  4. Quantify the computational costs, possibly using the profile and task_stream diagnostic utilities from the client to capture information

I suspect that in going through this effort manually we will expose a number of small issues that we'll then have to fix.
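For step 2, a rough, untested sketch of requesting one GPU per worker pod with dask-kubernetes; the image tag and pod sizing below are placeholders to adapt to the actual cluster:

from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(
    image="rapidsai/rapidsai:latest",  # placeholder image
    extra_container_config={
        "resources": {"limits": {"nvidia.com/gpu": 1}}  # one GPU per worker pod
    },
)

cluster = KubeCluster(pod_spec)
cluster.scale(8)  # eight single-GPU workers
client = Client(cluster)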

Spill device memory to host memory or disk

Currently Dask spills memory to disk once it feels some pressure. Docs, Code

It would be useful to build another system like this that separately tracks device and host memory, and spills from one to the other and then to disk as needed. This is tricky because the worker can't special case different types ("oh, this is a cudf/cupy object, we should treat it as device memory") but will instead need to figure out some nice dispatching system.
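A very rough sketch of the kind of two-level arrangement being described, built from zict primitives; this is illustrative only (the limits, spill directory, and naive dispatch below are made up), not a proposal for the actual worker integration:

import pickle
import cupy as cp
from zict import Buffer, File, Func

# host -> disk: pickle values into a spill directory once the host budget is exceeded
disk = Func(pickle.dumps, pickle.loads, File("dask-worker-spill"))
host = Buffer(fast={}, slow=disk, n=64e9, weight=lambda k, v: v.nbytes)

# device -> host: copy CuPy arrays to NumPy once the device budget is exceeded
device_to_host = Func(cp.asnumpy, cp.asarray, host)
device = Buffer(fast={}, slow=device_to_host, n=16e9, weight=lambda k, v: v.nbytes)

device["x"] = cp.arange(10)  # lives on the GPU until ~16 GB worth of values is stored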

@sklam, I think you had a solution to this at one point. Can you share? It would be helpful to see other implementations.

Create dask-cuda-worker CLI utility

The logic in LocalCUDACluster might also be encoded into a command-line utility like dask-worker. This would presumably start many workers on the same machine, assigning environment variables as necessary.

This would be useful because it might compose nicely with the other vanilla deployment solutions like dask-yarn, dask-jobqueue, and dask-kubernetes because we could then simply have them point to dask-cuda-worker rather than dask-worker and hopefully reuse most of the existing code with minimal modification.
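Envisioned usage would presumably mirror dask-worker (the scheduler address below is a placeholder):

dask-scheduler
dask-cuda-worker tcp://scheduler-address:8786  # one worker process per GPU on this machine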

Implement dask_cuda.__version__

It'd be great for reproducibility if dask_cuda.__version__ was implemented.

import dask_cuda  # ; print('Dask CUDA Version:', dask_cuda.__version__)
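One minimal way to expose this (a sketch, not necessarily how the project will implement it) is to read the installed distribution's metadata in dask_cuda/__init__.py:

# dask_cuda/__init__.py (sketch)
import pkg_resources

__version__ = pkg_resources.get_distribution("dask-cuda").version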

[BUG] Error during launch of dask-cuda-worker

When I launch dask-cuda-worker from the command line using the following command, I get an error:

(rapids) root@dt07:/# dask-cuda-worker --scheduler-file=/home/nfs/pgali/del31 --no-bokeh --host=10.136.7.107
/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/nanny.py:138: UserWarning: The local_dir keyword has moved to local_directory
  warnings.warn("The local_dir keyword has moved to local_directory")
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/bin/dask-cuda-worker", line 11, in <module>
    load_entry_point('dask-cuda==0.9.0a0+20.g357ec2c.dirty', 'console_scripts', 'dask-cuda-worker')()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/dask_cuda_worker.py", line 323, in go
    main()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/dask_cuda_worker.py", line 314, in main
    loop.run_sync(run)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
    yielded = next(result)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/dask_cuda_worker.py", line 307, in run
    yield [n._start(addr) for n in nannies]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda/dask_cuda_worker.py", line 307, in <listcomp>
    yield [n._start(addr) for n in nannies]
AttributeError: 'Nanny' object has no attribute '_start'

[BUG] dask-cuda(0.7.0) is broken with latest distributed(2.0.1)

dask-cuda is not compatible with the latest distributed release.

Log:-

(base) builder@5ab0d442c86f:~$ dask-cuda-worker
Traceback (most recent call last):
  File "/opt/anaconda3/bin/dask-cuda-worker", line 11, in <module>
    load_entry_point('dask-cuda==0.7.0', 'console_scripts', 'dask-cuda-worker')()
  File "/opt/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/opt/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2843, in load_entry_point
    return ep.load()
  File "/opt/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2434, in load
    return self.resolve()
  File "/opt/anaconda3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2440, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/opt/anaconda3/lib/python3.6/site-packages/dask_cuda/dask_cuda_worker.py", line 12, in <module>
    from distributed.worker import _ncores
ImportError: cannot import name '_ncores'
(base) builder@5ab0d442c86f:~$

Conda package :-

(base) builder@5ab0d442c86f:~$ conda list | grep distributed
distributed               2.0.1                      py_0
(base) builder@5ab0d442c86f:~$

(base) builder@5ab0d442c86f:~$ conda list | grep dask
dask                      2.0.0                      py_0
dask-core                 2.0.0                      py_0
dask-cuda                 0.7.0           py36_489.g8bce79e    
dask-cudf                 0.9.0a          py36_493.g2167909    
(base) builder@5ab0d442c86f:~$

Explicitly scaling up LocalCUDACluster() creates Dask workers that don't use all available GPUs.

I am trying to set up a Dask multi-GPU cluster on my 48-core machine with 2 GPUs.

Using LocalCUDACluster() gives me 2 Dask workers (2 cores).

For my pipeline, I'd prefer to use more processes than threads to avoid the GIL issue. To work around this, I start more workers explicitly using cluster.start_worker(); Dask seems to start them, but these additional workers do not use the second GPU, i.e., only one GPU gets used.

Any ideas on how I can fix this?

[BUG] Attribute 'protocol' not found in LocalCUDACluster in 0.7

When running the very common

from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()

I get the error

AttributeError: 'LocalCUDACluster' object has no attribute 'protocol'

This call is used throughout many existing and upcoming notebooks. Can someone please help fix this? Thanks!

Query on LocalCUDACluster usage

Hi,

I want to create a local CUDA Dask cluster using LocalCUDACluster.

The Python script is below:

$ cat test_cluster.py
import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)

print("cluster status ",cluster.status)
print("cluster infomarion ", cluster)
client = Client(cluster)

print("client information ",client)

$

When I am using the Python interactive prompt, it works for me:

$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:34:02)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>>
>>> from dask.distributed import Client
>>> from dask_cuda import LocalCUDACluster

>>>
>>> cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)
>>> print("cluster status ",cluster.status)
cluster status  running
>>> print("cluster infomarion ", cluster)
cluster infomarion  LocalCUDACluster('tcp://127.0.0.1:12347', workers=2, ncores=2)
>>> client = Client(cluster)
>>> print("client information ",client)
client information  <Client: scheduler='tcp://127.0.0.1:12347' processes=2 cores=2>
>>>

However, when I try to run the script using python test_cluster.py:

$ python test_cluster.py
cluster status  running
cluster infomarion  LocalCUDACluster('tcp://127.0.0.1:12347', workers=0, ncores=0)
client information  <Client: scheduler='tcp://127.0.0.1:12347' processes=0 cores=0>
Traceback (most recent call last):
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
    _serve_one(s, listener, alive_r, old_handlers)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
    code = spawn._main(child_r)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 114, in _main
    prepare(preparation_data)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/home/pradghos/anaconda3/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/pradghos/anaconda3/lib/python3.6/runpy.py", line 96, in _run_module_code
Traceback (most recent call last):
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
    _serve_one(s, listener, alive_r, old_handlers)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
    mod_name, mod_spec, pkg_name, script_name)
  ....
  ....
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/utils.py", line 316, in f
    self.listener.start()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/comm/tcp.py", line 421, in start
    result[0] = yield future
    self.port, address=self.ip, backlog=backlog
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/tornado/netutil.py", line 163, in bind_sockets
    value = future.result()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/deploy/spec.py", line 158, in _start
    self.scheduler = await self.scheduler
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/scheduler.py", line 1239, in __await__
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
    self.start()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/scheduler.py", line 1200, in start
    self.listen(addr_or_port, listen_args=self.listen_args)
  File "home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/core.py", line 322, in listen
    self.listener.start()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/comm/tcp.py", line 421, in start
    self.port, address=self.ip, backlog=backlog
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/tornado/netutil.py", line 163, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
distributed.nanny - WARNING - Worker process 24873 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 24874 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker

Any pointers on what I might be missing? Thanks in advance!
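A guess based on the traceback, not verified on the reporter's machine: with the forkserver/spawn start methods, the script is re-imported by the worker processes, so cluster creation needs to be guarded. Something like the following usually avoids the "Address already in use" errors:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(scheduler_port=12347, n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    print("cluster status ", cluster.status)
    print("cluster information ", cluster)
    print("client information ", client)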

Move to dask org

This project is more closely tied to Dask projects than RAPIDS projects, particularly with dependencies and release cycles. Should we move it from the rapidsai organization to the dask org?

One challenge is that we still need GPU CI for testing PRs. Is that likely to cause an issue with this process?

Support for CUDA streams

When writing CUDA applications, an important aspect for keeping GPUs busy is the use of streams to enqueue operations asynchronously from the host.

Libraries such as Numba and CuPy offer support for CUDA streams, but today we don't know to what extent they're functional with Dask.

I believe CUDA streams will be beneficial for achieving higher performance; particularly in the case of many small operations, streams may help Dask keep dispatching work asynchronously while the GPUs do the work.

We should check what's the correct way of using them with Dask and how/if they provide performance improvements.
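For reference, the kind of per-task stream usage being discussed is already expressible with CuPy on its own; how (or whether) this composes with Dask workers is exactly the open question:

import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)
with stream:                  # work below is enqueued on this stream
    x = cp.random.random((1000, 1000))
    y = x @ x                 # asynchronous with respect to the host
stream.synchronize()          # wait before consuming the result on the host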

cc @mrocklin @jakirkham @jrhemstad

[BUG] LocalCUDACluster constructor fails with unexpected keyword "env"

Software:
Anaconda3 2018.12 (Python 3.7)
Rapids 0.5.0 - pip install of cudf-cuda100
dask-cuda - git commit e405194.
notebook: git commit dcbc0d9238e86a0a5383f6841b3f13223bd0bebb notebooks/dask/Dask_Hello_World.ipynb

import dask
from dask.delayed import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()

Exception:

The interesting part:

TypeError: __init__() got an unexpected keyword argument 'env'

/opt/anaconda3/lib/python3.7/site-packages/dask_cuda-0.0.0-py3.7.egg/dask_cuda/local_cuda_cluster.py in _start(self, ip, n_workers)
     77                 env={"CUDA_VISIBLE_DEVICES": cuda_visible_devices(i)},
     78             )
---> 79             for i in range(n_workers)
     80         ]
     81 

More of the code around the exception for context:


    @gen.coroutine
    def _start(self, ip=None, n_workers=0):
        """
        Start all cluster services.
        """
        if self.status == "running":
            return
        if (ip is None) and (not self.scheduler_port) and (not self.processes):
            # Use inproc transport for optimization
            scheduler_address = "inproc://"
        elif ip is not None and ip.startswith("tls://"):
            scheduler_address = "%s:%d" % (ip, self.scheduler_port)
        else:
            if ip is None:
                ip = "127.0.0.1"
            scheduler_address = (ip, self.scheduler_port)
        self.scheduler.start(scheduler_address)

        yield [
            self._start_worker(
                **self.worker_kwargs,
               env={"CUDA_VISIBLE_DEVICES": cuda_visible_devices(i)},
            )
            for i in range(n_workers)
        ]

The function signature for _start_worker doesn't have an env argument, though I'm not sure why it isn't absorbed by **kwargs. (Also, why call _start_worker when there is a public start_worker function?)

%pinfo dask.distributed.LocalCluster._start_worker
Signature:
dask.distributed.LocalCluster._start_worker(
    ['self', 'death_timeout=60', '**kwargs'],
)
Docstring: <no docstring>
File:      /opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py
Type:      function

The full exception:

tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 207, in _start_worker
    silence_logs=self.silence_logs, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 99, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'env'
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 207, in _start_worker
    silence_logs=self.silence_logs, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 99, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'env'
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 207, in _start_worker
    silence_logs=self.silence_logs, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 99, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'env'
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 207, in _start_worker
    silence_logs=self.silence_logs, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 99, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'env'
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 207, in _start_worker
    silence_logs=self.silence_logs, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 99, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'env'
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 207, in _start_worker
    silence_logs=self.silence_logs, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 99, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'env'
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 883, in callback
    result_list.append(f.result())
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py", line 207, in _start_worker
    silence_logs=self.silence_logs, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 99, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'env'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-aeeb11c1def4> in <module>
----> 1 cluster = LocalCUDACluster()

/opt/anaconda3/lib/python3.7/site-packages/dask_cuda-0.0.0-py3.7.egg/dask_cuda/local_cuda_cluster.py in __init__(self, n_workers, threads_per_worker, processes, memory_limit, **kwargs)
     51             threads_per_worker=threads_per_worker,
     52             memory_limit=memory_limit,
---> 53             **kwargs,
     54         )
     55 

/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py in __init__(self, n_workers, threads_per_worker, processes, loop, start, ip, scheduler_port, silence_logs, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, **worker_kwargs)
    139             self.worker_kwargs['security'] = security
    140 
--> 141         self.start(ip=ip, n_workers=n_workers)
    142 
    143         clusters_to_close.add(self)

/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py in start(self, **kwargs)
    169             self._started = self._start(**kwargs)
    170         else:
--> 171             self.sync(self._start, **kwargs)
    172 
    173     @gen.coroutine

/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py in sync(self, func, *args, **kwargs)
    162             return future
    163         else:
--> 164             return sync(self.loop, func, *args, **kwargs)
    165 
    166     def start(self, **kwargs):

/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

/opt/anaconda3/lib/python3.7/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
   1139                     if exc_info is not None:
   1140                         try:
-> 1141                             yielded = self.gen.throw(*exc_info)
   1142                         finally:
   1143                             # Break up a reference to itself

/opt/anaconda3/lib/python3.7/site-packages/dask_cuda-0.0.0-py3.7.egg/dask_cuda/local_cuda_cluster.py in _start(self, ip, n_workers)
     77                 env={"CUDA_VISIBLE_DEVICES": cuda_visible_devices(i)},
     78             )
---> 79             for i in range(n_workers)
     80         ]
     81 

/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in callback(f)
    881             for f in children:
    882                 try:
--> 883                     result_list.append(f.result())
    884                 except Exception as e:
    885                     if future.done():

/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in wrapper(*args, **kwargs)
    324                 try:
    325                     orig_stack_contexts = stack_context._state.contexts
--> 326                     yielded = next(result)
    327                     if stack_context._state.contexts is not orig_stack_contexts:
    328                         yielded = _create_future()

/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/local.py in _start_worker(self, death_timeout, **kwargs)
    205         w = W(self.scheduler.address, loop=self.loop,
    206               death_timeout=death_timeout,
--> 207               silence_logs=self.silence_logs, **kwargs)
    208         yield w._start()
    209 

/opt/anaconda3/lib/python3.7/site-packages/distributed/nanny.py in __init__(self, scheduler_ip, scheduler_port, scheduler_file, worker_port, ncores, loop, local_dir, services, name, memory_limit, reconnect, validate, quiet, resources, silence_logs, death_timeout, preload, preload_argv, security, contact_address, listen_address, worker_class, **kwargs)
     97         super(Nanny, self).__init__(handlers, io_loop=self.loop,
     98                                     connection_args=self.connection_args,
---> 99                                     **kwargs)
    100 
    101         if self.memory_limit:

TypeError: __init__() got an unexpected keyword argument 'env'

Worker processes fail to start?

I installed this package by doing a git clone then running pip install . in the git folder.
When I try to run

from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

I get:

distributed.nanny - WARNING - Worker process 694 exited with status 1
distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
distributed.nanny - ERROR - Failed to restart worker after its process exited
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 293, in _on_exit
    yield self.instantiate()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 228, in instantiate
    self.process.start()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda/lib/python3.6/site-packages/distributed/nanny.py", line 375, in start
    yield self.process.start()
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/conda/lib/python3.6/site-packages/distributed/process.py", line 35, in _call_and_set_future
    res = func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/distributed/process.py", line 184, in _start
    process.start()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 948, in reduce_connection
    df = reduction.DupFd(conn.fileno())
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 170, in fileno
    self._check_closed()
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 136, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed

TimeoutErrorTraceback (most recent call last)
<ipython-input-4-d604c0151a0d> in <module>
      1 from dask_cuda import LocalCUDACluster
      2 
----> 3 cluster = LocalCUDACluster()
      4 client = Client(cluster)

/opt/conda/lib/python3.6/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, n_workers, threads_per_worker, processes, memory_limit, **kwargs)
     51             threads_per_worker=threads_per_worker,
     52             memory_limit=memory_limit,
---> 53             **kwargs,
     54         )
     55 

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in __init__(self, n_workers, threads_per_worker, processes, loop, start, ip, scheduler_port, silence_logs, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, **worker_kwargs)
    140             self.worker_kwargs['security'] = security
    141 
--> 142         self.start(ip=ip, n_workers=n_workers)
    143 
    144         clusters_to_close.add(self)

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in start(self, **kwargs)
    177             self._started = self._start(**kwargs)
    178         else:
--> 179             self.sync(self._start, **kwargs)
    180 
    181     @gen.coroutine

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in sync(self, func, *args, **kwargs)
    170             return future
    171         else:
--> 172             return sync(self.loop, func, *args, **kwargs)
    173 
    174     def start(self, **kwargs):

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1139                     if exc_info is not None:
   1140                         try:
-> 1141                             yielded = self.gen.throw(*exc_info)
   1142                         finally:
   1143                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/dask_cuda/local_cuda_cluster.py in _start(self, ip, n_workers)
     77                 env={"CUDA_VISIBLE_DEVICES": cuda_visible_devices(i)},
     78             )
---> 79             for i in range(n_workers)
     80         ]
     81 

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in callback(f)
    881             for f in children:
    882                 try:
--> 883                     result_list.append(f.result())
    884                 except Exception as e:
    885                     if future.done():

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1145                             exc_info = None
   1146                     else:
-> 1147                         yielded = self.gen.send(value)
   1148 
   1149                     if stack_context._state.contexts is not orig_stack_contexts:

/opt/conda/lib/python3.6/site-packages/distributed/deploy/local.py in _start_worker(self, death_timeout, **kwargs)
    227         if w.status == 'closed' and self.scheduler.status == 'running':
    228             self.workers.remove(w)
--> 229             raise gen.TimeoutError("Worker failed to start")
    230 
    231         raise gen.Return(w)

TimeoutError: Worker failed to start

semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown in E2E.ipynb

Getting the following issues when running https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb with dask-cuda on a DGX-1:

In notebook:

distributed.nanny - WARNING - Worker process 47271 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 47270 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 47274 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 47268 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 47275 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 47272 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 47273 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 47269 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker

In logs:

/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/home/nfs/kkraus/anaconda3/envs/cudf_dev/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

The workers restart and work as expected afterwards, but this happens consistently.

Work Stealing with multi-GPU Systems

It is often the case, when running workloads, tests, etc., that Dask will engage in intermittent work stealing. The work stealing typically results in undefined behavior as a GPU processes more than its intended volume of work.

As a work-around, we've used

export DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=False
export DASK_DISTRIBUTED__SCHEDULER__BANDWIDTH=1

To trick the scheduler into disabling work-stealing, and to prevent it from shuffling data (which may cause a GPU to go out of memory).
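For completeness, the same workaround can be expressed through dask.config rather than environment variables (equivalent settings, set from Python before the cluster starts):

import dask

dask.config.set({
    "distributed.scheduler.work-stealing": False,
    "distributed.scheduler.bandwidth": 1,
})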

I'm wondering how we should address this in a more general way that does not require us (or a user) to set environment variables.

One thought I had would be to have Dask, when dealing with GPUs, simply not steal work or shuffle data unless explicitly directed by the user in the form of an API call.

[BUG] ImportError: cannot import name 'SpecCluster' from 'dask.distributed'

Describe the bug
Creating a LocalCUDACluster results in an ImportError.

Steps/Code to reproduce bug

import dask; print('Dask Version:', dask.__version__)
from dask.distributed import Client
# import dask_cuda; print('Dask CUDA Version:', dask_cuda.__version__)
from dask_cuda import LocalCUDACluster


# create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)
Dask Version: 1.2.2
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-d6c333133bab> in <module>
      2 from dask.distributed import Client
      3 # import dask_cuda; print('Dask CUDA Version:', dask_cuda.__version__)
----> 4 from dask_cuda import LocalCUDACluster
      5 
      6 

/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda-0.0.0.dev0-py3.7.egg/dask_cuda/__init__.py in <module>
      1 from .local_cuda_cluster import LocalCUDACluster
----> 2 from .dgx import DGX

/conda/envs/rapids/lib/python3.7/site-packages/dask_cuda-0.0.0.dev0-py3.7.egg/dask_cuda/dgx.py in <module>
      1 import os
      2 
----> 3 from dask.distributed import Nanny, SpecCluster, Scheduler
      4 from distributed.worker import TOTAL_MEMORY
      5 

ImportError: cannot import name 'SpecCluster' from 'dask.distributed' (/conda/envs/rapids/lib/python3.7/site-packages/dask/distributed.py)

Expected behavior
Successful creation of LocalCUDACluster.

Environment details:

  • Environment location: Docker
  • Method of cuDF install: Docker
    • If method of install is [Docker], provide docker pull & docker run commands used
  • Please run and attach the output of the cudf/print_env.sh script to gather relevant environment details
FROM rapidsai/rapidsai-nightly:0.8-cuda10.0-devel-ubuntu18.04-gcc7-py3.7

SHELL ["/bin/bash", "-c"]
RUN source activate rapids && conda install -y \
        matplotlib \
        scikit-learn \
        seaborn \
        python-louvain \
        jinja2 \
        && pip install graphistry mockito

RUN source activate rapids && conda install -c \
        nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c numba -c conda-forge -c defaults cugraph

RUN apt update &&\
    apt install -y graphviz &&\
    source activate rapids && pip install graphviz
        
# ToDo: let user supply kaggle creds
RUN source activate rapids && pip install kaggle

ADD data /data
RUN mkdir -p /rapids/notebooks/extended
# symlinked so users can browse the data directory inside JupyterLab
RUN ln -s /data /rapids/notebooks/extended

ADD . /rapids/notebooks/extended
#ADD cpu_comparisons /rapids/notebooks/extended/cpu_comparisons
#ADD tutorials /rapids/notebooks/extended/tutorials
#ADD cugraph_benchmark /rapids/notebooks/extended/cugraph_benchmark

WORKDIR /rapids/notebooks/extended
CMD source activate rapids && sh /rapids/notebooks/utils/start-jupyter.sh

Output of cudf/print_env.sh:

(rapids) root@8f57dc093a93:/rapids/cudf# bash print_env.sh 
**git***
commit 878f02e0de32cf4ca675fba28ce8dcce5d613344 (HEAD -> branch-0.8, origin/branch-0.8, origin/HEAD)
Merge: d91b50d6 76b505fd
Author: Keith Kraus <[email protected]>
Date:   Mon Jun 10 21:39:40 2019 -0400

    Merge pull request #1948 from devavret/fea-binary-fill-value
    
    [REVIEW] Binary Operator functions for Series and DataFrame
**git submodules***
 b165e1fb11eeea64ccf95053e40f2424312599cc thirdparty/cub (v1.7.1)
 5c792cef3aee54ad8b7000111c9dc1797f327b59 thirdparty/dlpack (v0.2-3-g5c792ce)
 c7c55b38333f5fa9ad5ec5e0804d5c9eedd14de2 thirdparty/jitify (heads/cudf)
 d704d10ec729437e0edc313c38bb7b4a1987b015 thirdparty/rmm (v0.7.0.dev0-54-gd704d10)
 37896cc9bfc6536a8c878a1e675835c22d827821 thirdparty/rmm/thirdparty/cnmem (v1.0.0-8-g37896cc)

***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Linux 8f57dc093a93 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Thu Jun 13 10:29:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   36C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:36:00.0 Off |                    0 |
| N/A   35C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   39C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   38C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   36C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   37C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
| N/A   34C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   38C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla V100-SXM3...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   36C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla V100-SXM3...  On   | 00000000:B9:00.0 Off |                    0 |
| N/A   37C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla V100-SXM3...  On   | 00000000:BC:00.0 Off |                    0 |
| N/A   38C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla V100-SXM3...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   39C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla V100-SXM3...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   34C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Tesla V100-SXM3...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   34C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Tesla V100-SXM3...  On   | 00000000:E5:00.0 Off |                    0 |
| N/A   38C    P0    54W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   37C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

***CPU***
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               80
On-line CPU(s) list:  0-39
Off-line CPU(s) list: 40-79
Thread(s) per core:   1
Core(s) per socket:   20
Socket(s):            2
NUMA node(s):         2
Vendor ID:            GenuineIntel
CPU family:           6
Model:                85
Model name:           Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:             4
CPU MHz:              1563.867
CPU max MHz:          3700.0000
CPU min MHz:          1000.0000
BogoMIPS:             4800.00
Virtualization:       VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:             1024K
L3 cache:             28160K
NUMA node0 CPU(s):    0-19
NUMA node1 CPU(s):    20-39
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke spec_ctrl intel_stibp

***CMake***
/conda/envs/rapids/bin/cmake
cmake version 3.14.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).

***g++***
/usr/bin/g++
g++ (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

***Python***
/conda/envs/rapids/bin/python
Python 3.7.3

***Environment Variables***
PATH                            : /conda/envs/rapids/bin:/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/conda/bin
LD_LIBRARY_PATH                 : /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
NUMBAPRO_NVVM                   : /usr/local/cuda/nvvm/lib64/libnvvm.so
NUMBAPRO_LIBDEVICE              : /usr/local/cuda/nvvm/libdevice
CONDA_PREFIX                    : /conda/envs/rapids
PYTHON_PATH                     : 

***conda packages***
/conda/condabin/conda
# packages in environment at /conda/envs/rapids:
#
# Name                    Version                   Build  Channel
arrow-cpp                 0.12.1           py37h0e61e49_0    conda-forge
atomicwrites              1.3.0                      py_0    conda-forge
attrs                     19.1.0                     py_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
blas                      1.0                         mkl  
bleach                    3.1.0                      py_0    conda-forge
bokeh                     1.2.0                    py37_0    conda-forge
boost                     1.68.0          py37h8619c78_1001    conda-forge
boost-cpp                 1.68.0            h11c811c_1000    conda-forge
bzip2                     1.0.6             h14c3975_1002    conda-forge
ca-certificates           2019.3.9             hecc5488_0    conda-forge
certifi                   2019.3.9                 py37_0    conda-forge
cffi                      1.11.5          py37h9745a5d_1001    conda-forge
chardet                   3.0.4                    pypi_0    pypi
click                     7.0                      pypi_0    pypi
cloudpickle               1.2.1                      py_0    conda-forge
cmake                     3.14.3               hf94ab9c_0    conda-forge
cmake_setuptools          0.1.3                      py_0    rapidsai-nightly/label/cuda10.0
cuda100                   1.0                           0    pytorch
cudatoolkit               10.0.130                      0  
cudf                      0.8.0a1+606.g878f02e0          pypi_0    pypi
cugraph                   0.8.0a0+266.gd70e3ab          pypi_0    pypi
cuml                      0.8.0a0+1068.g436b429d          pypi_0    pypi
curl                      7.64.1               hf8cf82a_0    conda-forge
cycler                    0.10.0                     py_1    conda-forge
cython                    0.29.10          py37he1b5a44_0    conda-forge
cytoolz                   0.9.0.1         py37h14c3975_1001    conda-forge
dask                      1.2.2                      py_3    conda-forge
dask-core                 1.2.2                      py_0    conda-forge
dask-cudf                 0.0.0.dev0               pypi_0    pypi
dask-cuml                 0.8.0a0                  pypi_0    pypi
dask-xgboost              0.1.5                    pypi_0    pypi
dbus                      1.13.6               he372182_0    conda-forge
decorator                 4.4.0                      py_0    conda-forge
defusedxml                0.5.0                      py_1    conda-forge
distributed               1.28.0                   py37_0    conda-forge
entrypoints               0.3                   py37_1000    conda-forge
expat                     2.2.5             hf484d3e_1002    conda-forge
faiss-gpu                 1.5.0           py37_cuda10.0_1  [cuda100]  pytorch
fontconfig                2.13.1            he4413a7_1000    conda-forge
freetype                  2.10.0               he983fc9_0    conda-forge
future                    0.17.1                   pypi_0    pypi
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
glib                      2.58.3            hf63aee3_1001    conda-forge
graphistry                0.9.67                   pypi_0    pypi
gst-plugins-base          1.14.5               h0935bb2_0    conda-forge
gstreamer                 1.14.5               h36ae1b5_0    conda-forge
heapdict                  1.0.0                 py37_1000    conda-forge
icu                       58.2              hf484d3e_1000    conda-forge
idna                      2.8                      pypi_0    pypi
importlib_metadata        0.17                     py37_1    conda-forge
intel-openmp              2019.4                      243  
ipykernel                 5.1.1            py37h24bf2e0_0    conda-forge
ipython                   7.3.0            py37h24bf2e0_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.13.3                   py37_0    conda-forge
jinja2                    2.10.1                   py37_0  
joblib                    0.13.2                     py_0    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
jsonschema                3.0.1                    py37_0    conda-forge
jupyter_client            5.2.4                      py_3    conda-forge
jupyter_core              4.4.0                      py_0    conda-forge
jupyterlab                0.35.6                   py37_0    conda-forge
jupyterlab_server         0.2.0                      py_0    conda-forge
kaggle                    1.5.4                    pypi_0    pypi
kiwisolver                1.1.0            py37hc9558a2_0    conda-forge
krb5                      1.16.3            h05b26f9_1001    conda-forge
libblas                   3.8.0                    10_mkl    conda-forge
libcblas                  3.8.0                    10_mkl    conda-forge
libclang                  8.0.0                h6bb024c_0    rapidsai/label/cuda10.0
libcudf                   0.7.2                cuda10.0_0    rapidsai/label/cuda10.0
libcugraph                0.7.0                cuda10.0_0    rapidsai/label/cuda10.0
libcumlmg                 0.0.0.dev0         cuda10.0_373    nvidia/label/cuda10.0
libcurl                   7.64.1               hda55be3_0    conda-forge
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
libiconv                  1.15              h516909a_1005    conda-forge
liblapack                 3.8.0                    10_mkl    conda-forge
liblapacke                3.8.0                    10_mkl    conda-forge
libnvstrings              0.7.0                cuda10.0_0    rapidsai/label/cuda10.0
libpng                    1.6.37               hed695b0_0    conda-forge
libprotobuf               3.6.1             hdbcaa40_1001    conda-forge
librmm                    0.7.0                cuda10.0_0    rapidsai/label/cuda10.0
librmm-cffi               0.8.0                    pypi_0    pypi
libsodium                 1.0.16            h14c3975_1001    conda-forge
libssh2                   1.8.2                h22169c7_2    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.0.10            h57b8799_1003    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libuv                     1.29.1               h516909a_0    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxml2                   2.9.9                h13577e0_0    conda-forge
llvmlite                  0.26.0          py37hdbcaa40_1000    conda-forge
locket                    0.2.0                      py_2    conda-forge
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
markupsafe                1.1.1            py37h14c3975_0    conda-forge
matplotlib                3.1.0            py37h5429711_0  
matplotlib-base           3.1.0            py37hfd891ef_1    conda-forge
mistune                   0.8.4           py37h14c3975_1000    conda-forge
mkl                       2019.4                      243  
mkl-service               2.0.2            py37h7b6447c_0  
mockito                   1.1.1                    pypi_0    pypi
more-itertools            4.3.0                 py37_1000    conda-forge
msgpack-python            0.6.1            py37h6bb024c_0    conda-forge
nbconvert                 5.5.0                      py_0    conda-forge
nbformat                  4.4.0                      py_1    conda-forge
ncurses                   6.1                  he6710b0_1  
networkx                  2.3                        py_0    conda-forge
notebook                  5.7.8                    py37_1    conda-forge
numba                     0.41.0          py37h637b7d7_1000    conda-forge
numpy                     1.16.2           py37h8b7e671_1    conda-forge
nvstrings                 0.7.0                    py37_0    rapidsai/label/cuda10.0
nvstrings-cuda100         0.0.0.dev0               pypi_0    pypi
olefile                   0.46                       py_0    conda-forge
openblas                  0.3.5             h9ac9557_1001    conda-forge
openssl                   1.1.1b               h14c3975_1    conda-forge
packaging                 19.0                       py_0    conda-forge
pandas                    0.23.4          py37h637b7d7_1000    conda-forge
pandoc                    2.7.2                         0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parquet-cpp               1.5.1                         4    conda-forge
parso                     0.4.0                      py_0    conda-forge
partd                     0.3.10                     py_1    conda-forge
patsy                     0.5.1                      py_0    conda-forge
pcre                      8.41              hf484d3e_1003    conda-forge
pexpect                   4.7.0                    py37_0    conda-forge
pickleshare               0.7.5                 py37_1000    conda-forge
pillow                    6.0.0            py37he7afcd5_0    conda-forge
pip                       19.1.1                   py37_0  
pluggy                    0.12.0                     py_0    conda-forge
prometheus_client         0.7.0                      py_0    conda-forge
prompt_toolkit            2.0.9                      py_0    conda-forge
protobuf                  3.8.0                    pypi_0    pypi
psutil                    5.6.2            py37h516909a_0    conda-forge
pthread-stubs             0.4               h14c3975_1001    conda-forge
ptyprocess                0.6.0                   py_1001    conda-forge
py                        1.8.0                      py_0    conda-forge
pyarrow                   0.12.1           py37hbbcf98d_0    conda-forge
pycparser                 2.19                     py37_1    conda-forge
pygments                  2.4.2                      py_0    conda-forge
pyparsing                 2.4.0                      py_0    conda-forge
pyqt                      5.9.2            py37hcca6a23_0    conda-forge
pyrsistent                0.15.2           py37h516909a_0    conda-forge
pytest                    4.6.2                    py37_0    conda-forge
python                    3.7.3                h0371630_0  
python-dateutil           2.8.0                      py_0    conda-forge
python-graphviz           0.11                     pypi_0    pypi
python-louvain            0.13                       py_0  
python-slugify            3.0.2                    pypi_0    pypi
pytz                      2019.1                     py_0    conda-forge
pyyaml                    5.1.1            py37h516909a_0    conda-forge
pyzmq                     18.0.1           py37hc4ba49a_1    conda-forge
qt                        5.9.7                h52cfd70_2    conda-forge
readline                  7.0                  h7b6447c_5  
requests                  2.22.0                   pypi_0    pypi
rhash                     1.3.6             h14c3975_1001    conda-forge
rmm                       0.7.0                    py37_0    rapidsai/label/cuda10.0
scikit-learn              0.21.2           py37hd81dba3_0  
scipy                     1.3.0            py37h921218d_0    conda-forge
seaborn                   0.9.0                    py37_0  
send2trash                1.5.0                      py_0    conda-forge
setuptools                41.0.1                   py37_0  
sip                       4.19.8          py37hf484d3e_1000    conda-forge
six                       1.12.0                py37_1000    conda-forge
sortedcontainers          2.1.0                      py_0    conda-forge
sqlite                    3.28.0               h7b6447c_0  
statsmodels               0.9.0           py37h3010b51_1000    conda-forge
tblib                     1.4.0                      py_0    conda-forge
terminado                 0.8.2                    py37_0    conda-forge
testpath                  0.4.2                   py_1001    conda-forge
text-unidecode            1.2                      pypi_0    pypi
thrift-cpp                0.12.0            h0a07b25_1002    conda-forge
tk                        8.6.9             hed695b0_1002    conda-forge
toolz                     0.9.0                    pypi_0    pypi
tornado                   6.0.2            py37h516909a_0    conda-forge
tqdm                      4.32.1                   pypi_0    pypi
traitlets                 4.3.2                 py37_1000    conda-forge
urllib3                   1.24.3                   pypi_0    pypi
wcwidth                   0.1.7                      py_1    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.33.4                   py37_0  
xgboost                   0.90.rapidsdev1          pypi_0    pypi
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xz                        5.2.4                h14c3975_4  
yaml                      0.1.7             h14c3975_1001    conda-forge
zeromq                    4.3.1             hf484d3e_1000    conda-forge
zict                      0.1.4                    pypi_0    pypi
zipp                      0.5.1                      py_0    conda-forge
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.0                h3b9ef0a_0    conda-forge

Additional context
This environment report comes from the nightly Docker builds.
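
For reference, a report like the one above can be regenerated inside the affected container. The following is a minimal Python sketch (assuming the standard nvidia-smi, lscpu, cmake, g++, nvcc, and conda command-line tools are on PATH; it is not the official RAPIDS environment-printing script, just an illustration):

```python
# Minimal sketch for collecting an environment report similar to the one above.
# Assumes nvidia-smi, lscpu, cmake, g++, nvcc, python, and conda are on PATH;
# this is NOT the official RAPIDS utility, only an illustrative reproduction.
import os
import shutil
import subprocess

SECTIONS = [
    ("GPU", ["nvidia-smi"]),
    ("CPU", ["lscpu"]),
    ("CMake", ["cmake", "--version"]),
    ("g++", ["g++", "--version"]),
    ("nvcc", ["nvcc", "--version"]),
    ("Python", ["python", "--version"]),
    ("conda packages", ["conda", "list"]),
]

ENV_VARS = [
    "PATH", "LD_LIBRARY_PATH", "NUMBAPRO_NVVM",
    "NUMBAPRO_LIBDEVICE", "CONDA_PREFIX", "PYTHON_PATH",
]


def run(cmd):
    """Run a command and return its combined stdout/stderr, or a note if it is missing."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]} not found"
    result = subprocess.run(cmd, capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()


def print_env():
    # Print each tool's output under a ***Section*** header, matching the report format.
    for title, cmd in SECTIONS:
        print(f"***{title}***")
        print(run(cmd))
        print()
    print("***Environment Variables***")
    for name in ENV_VARS:
        print(f"{name:<30}: {os.environ.get(name, '')}")


if __name__ == "__main__":
    print_env()
```

Attaching output like this to a bug report makes it much easier to reproduce container-specific issues such as this one.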
