
Comments (49)

mrocklin avatar mrocklin commented on August 15, 2024 1

I'm glad to hear it and thank you for tweaking directly.

My setup has evolved slightly. I've just uploaded a screencast to YouTube that might be interesting. It includes an actual notebook and execution, which is fun to watch.

In that video I lay out some questions about interactive use, particularly around launching many small jobs rather than a single large deployment. Having a herd of independent workers might be a decent way to handle interactive jobs within a batch infrastructure in the short term.
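Roughly, each independent worker could be a tiny single-node job along these lines (hypothetical resource values; it assumes a scheduler is already running and has written scheduler.json, as in the setup later in this thread), and submitting several copies adds one worker per job to the same cluster:

#!/bin/bash
#PBS -N dask-worker
#PBS -q regular
#PBS -l select=1:ncpus=4:mem=16G
#PBS -l walltime=00:20:00

source activate pangeo
dask-worker --scheduler-file scheduler.json --nthreads 4 --memory-limit 16e9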

from pangeo.

rabernat avatar rabernat commented on August 15, 2024

You can use our new allocation for any jobs on Cheyenne.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Previously I was piggy-backing on @jhamman 's allocation. I take it it's preferred to use the new one?

from pangeo.

rabernat avatar rabernat commented on August 15, 2024

Yes, use the new one.

from pangeo.

kmpaul avatar kmpaul commented on August 15, 2024

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

I'm copying some notes that @jhamman sent by e-mail a while ago

Joe's e-mail

Here are a few details on running a Jupyter Notebook server on Cheyenne

from pangeo.

rabernat avatar rabernat commented on August 15, 2024

I have started putting stuff up on the wiki of this repo. Not sure if this is a good long-term solution, but it's there and it works.

from pangeo.

davidedelvento avatar davidedelvento commented on August 15, 2024

Supported way to use Python on Cheyenne, though of course other options work and might be a better fit for this project.

Supported way to use Jupyter on Cheyenne (and Yellowstone, though the latter is EOL'ed)

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Wiki seems like a good idea

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

So far my approach is as follows:

Install Miniconda

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

Create Development environment

conda create -n pangeo python=3.6 dask distributed xarray jupyterlab mpi4py -c conda-forge

Activate environment

source activate pangeo

Install development version of dask.distributed for MPI deployment support

pip install git+https://github.com/mrocklin/distributed.git@cli-logic --upgrade

Create a job script

Mine looks like the following

#!/bin/bash
#PBS -N sample
#PBS -q regular
#PBS -A UCLB0022
#PBS -l select=9:ncpus=4:mem=16G
#PBS -l walltime=00:20:00
#PBS -j oe
#PBS -m abe

rm -f scheduler.json
mpirun --np 9 dask-mpi --nthreads 4 --memory-limit 16e9

And submit

qsub myscript.sh

Connect from Python

This writes connection information into a local file, scheduler.json. We can use this to connect

$ ipython

>>> from dask.distributed import Client
>>> client = Client(scheduler_file='scheduler.json')
>>> client
<Client: scheduler='tcp://10.148.3.189:8786' processes=8 cores=32>

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Current challenges:

  • How to clean up reliably. The job scheduler is sending SIGTERM to my processes, so normal cleanup processes fail to take over. Is there any way to get a polite SIGINT a few seconds before SIGTERM? Is there somewhere I can register cleanup code? Perhaps in my script? (this question might be for @davidedelvento , let me know if I should raise a ticket)

  • Is the way that I specify cores-per-process with PBS -l select=9:ncpus=4:mem=16G appropriate? This is important because we will want to play with larger and smaller workers for benchmarking.

  • What is the right way to set up a JupyterLab (or notebook) server and tunnel to it appropriately? It would be unfortunate to take over another interactive node just for this; instead we should probably put this on the same node running the scheduler. Then we'll need to grab that hostname and issue an informative ssh tunneling command like the following to the user:

      ssh -N -l username -L 8888:HOSTNAME:8888 cheyenne.ucar.edu

  • How do we want to handle the diagnostic dashboard? Presumably we add another port to the tunneling suggestion above. I wonder if we can get JupyterLab to embed an iframe for us.

from pangeo.

davidedelvento avatar davidedelvento commented on August 15, 2024

  • Cleanup: as far as I know, this is not possible, and the PBS User's Guide does not mention anything in this regard. LSF does exactly as you said, but we found that no user took advantage of that feature. Maybe a workaround could be to start an "at" command to send a SIGINT a minute before the scheduled end time? (A rough sketch of this idea follows the list.)

  • cores-per-process: it depends on what you are trying to achieve. See the documentation about it, which is all I know.

  • tunneling: I agree there is no need for an additional node just for that (though ssh may take a fair amount of CPU). You should be able to look at the default start-notebook script for a good starting point.
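Something along these lines in the job script could approximate the idea in the first bullet (a rough, untested sketch; it uses a background sleep in place of at, in case atd isn't available on the compute nodes, and assumes the 20-minute walltime from the example job script above):

mpirun --np 9 dask-mpi --nthreads 4 --memory-limit 16e9 &
MPIRUN_PID=$!
# send a polite SIGINT one minute before the walltime expires so cleanup code can run
( sleep $((19 * 60)) && kill -INT $MPIRUN_PID ) &
wait $MPIRUN_PID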

from pangeo.

darothen avatar darothen commented on August 15, 2024

Historically I've always made a custom modulefile to manage a miniconda installation I curate in $HOME on yellowstone. We're writing up documentation on this approach for GCPy, since it applies to most clusters. You just extend @mrocklin's steps by creating a modulefile that you save locally, e.g. $HOME/modulefiles/miniconda:

#%Module -*- tcl -*-

# 'Real' name of package, appears in help/display messages
set PKG_NAME      miniconda

# Path to package
set PKG_ROOT      $env(HOME)/miniconda

######################################################################

proc ModulesHelp { } {
    global PKG_ROOT
    global PKG_NAME
    puts stdout "Build:       $PKG_NAME"
    puts stdout "URL:         http://conda.pydata.org"
}

module-whatis "$PKG_NAME: streamlined conda-based Python package/env manager"

#
# Standard install locations
#
prepend-path PATH             $PKG_ROOT/bin
prepend-path MANPATH          $PKG_ROOT/share/man
prepend-path LD_LIBRARY_PATH  $PKG_ROOT/lib

or, for newer versions of Lmod:

help([[
Curated miniconda Python installation.
]])
whatis("Keywords: Python, analysis")
whatis("URL: http://conda.pydata.org")
whatis("Description: Simplified python environment and package manager")

local home = os.getenv("HOME")
prepend_path("PATH",            home .. "/path/to/miniconda/bin")
prepend_path("MANPATH",         home .. "/path/to/miniconda/share/man")
prepend_path("LD_LIBRARY_PATH", home .. "/path/to/miniconda/lib")

Then the path containing this (and other) modulefiles has to be made known to Lmod in whatever startup scripts you have,

export MODULEPATH=/home/<username>/modulefiles:$MODULEPATH

Then you can module load miniconda as part of any set of scripts which start/deploy your distributed environment.
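For example, a job or deployment script might start with something like this (a minimal sketch, assuming the modulefile above and the pangeo environment from earlier in the thread):

module use $HOME/modulefiles    # or rely on MODULEPATH exported in your startup files
module load miniconda
source activate pangeo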

WRT tunneling for jupyter notebooks and the dask dashboard, are there issues tunneling through the login nodes to compute nodes on Cheyenne? Some systems are hit or miss and may require you to tunnel from the compute node back to the login node if direct ssh access to compute nodes is disabled. I've never found clear documentation on exactly how to do this, and it would be useful if someone with more knowledge could pitch in.

from pangeo.

jhamman avatar jhamman commented on August 15, 2024

@davidedelvento

core-per-process: it depends what you are trying to achieve. See the documentation about it which is all I know

This is going to look something like what NCAR describes as a Hybrid OpenMP/MPI job. We want one MPI process per node and a dask-configurable number of tasks per node. We will certainly want to have access to all processors on each node.
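For example, ten such ranks on Cheyenne's 36-core nodes might be requested roughly like this (an untested sketch of the one-rank-per-node layout described above, not a confirmed recipe):

#PBS -l select=10:ncpus=36:mpiprocs=1:ompthreads=36

mpirun --np 10 dask-mpi --nthreads 36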

Do you know if there is someone else at CISL who would have a better idea of how to do this? I had a short call with @mrocklin this morning where we made some rapid progress. However, the remaining tasks of figuring out how to get the PBS scheduler to work probably need to be addressed by a sysadmin.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Yes, to be clear, we would like each MPI rank to be allocated a set of cores. I might want 10 ranks, each with 4 cores. I don't particularly care if they are on the same physical nodes or not.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Cleanup: as far as I know, this is not possible and PBS User's Guide does not mention anything in this regard. LSF does exactly as you said, but we found that no user took advantage of that feature. Maybe a workaround could be to start an "at" command to send a SIGINT a minute before the scheduled end time?

Most of my cleanup is to delete temporary files. Is there somewhere I can write temporary data that I know will also be cleaned up by the job scheduler?

Also, just checking, the compute nodes on Cheyenne don't have attached local storage, correct? Is there a fast/insecure place to write temporary data?

from pangeo.

davidedelvento avatar davidedelvento commented on August 15, 2024

I think you just use either or both of ncpus and mpiprocs and that should work. Give it a try and, if you encounter issues, send email to cislhelp. See also the provided examples, including the section on pinning, which is important if your workload seriously depends on the cache for performance.

from pangeo.

davidedelvento avatar davidedelvento commented on August 15, 2024

The scratch part of the distributed file system seems a natural choice to me. This will be cleaned automatically according to the purge policy.

If you want higher performance you can use the local tmp which is ramdisk, but very limited in size. This will be cleaned automatically by some PBS hooks.

If you need more (fast) disk and you don't need much memory (each node has 64 GB or 128 GB), we can probably set up a larger ramdisk, but I have to double check on that, and only after you present convincing arguments for why the previous two are not adequate.
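One hedged sketch of pointing worker spill files at that scratch space, using dask-mpi's --local-directory flag (shown in the helpstring later in this thread) and assuming /glade/scratch/$USER is the scratch area referred to above:

mpirun --np 9 dask-mpi --nthreads 4 --memory-limit 16e9 \
       --local-directory /glade/scratch/$USER/dask-workers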

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Regarding Jupyter servers (I prefer JLab, but I imagine that others may prefer the Jupyter notebook server) I do the following:

From the login node I quickly connect to the cluster, set up JLab to run on the scheduler process, get the hostname, and print out the appropriate ssh command

from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')

def start_jlab(dask_scheduler):
    import subprocess
    proc = subprocess.Popen(['jupyter', 'lab', '--ip', '*', '--no-browser'])
    dask_scheduler.jlab_proc = proc

client.run_on_scheduler(start_jlab)

import socket
hostname = client.run_on_scheduler(socket.gethostname)

print("ssh -N -L 8787:%s:8787 -L 8888:%s:8888 cheyenne.ucar.edu" % (host, host))
print("Navigate to http://localhost:8787 for the dask diagnostic dashboard")
print("Navigate to https://localhost:8888 for a JupyterLab server")

Previously I have set up a password for Jupyter in my home directory by running

jupyter notebook --generate-config

Then I edit the file ~/.jupyter/jupyter_notebook_config.py to include a password (search for password and instructions will be in the right place).
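Newer notebook releases (5.0 and up) can also generate and store the hashed password for you, which may be simpler than editing the config file by hand; a hedged alternative:

jupyter notebook password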

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

My desire for fast ephemeral storage is for writing excess data to disk when workers run out of memory. On commodity systems this is standard practice, but it is obviously a bit less natural on an HPC system. We want to strongly discourage users from depending on this, but in the course of interactive workloads they'll inevitably push up against memory limits. Having some disk lying around that we can spill to, even at an extreme performance penalty, is nicer than OOM-killing their jobs. (The dashboards will also go all red when they do this, so users get good feedback.)

Given what you've said above, it sounds like the scratch drive is the best choice we have. It'll be interesting to see if this generates excessive junk data on that drive.

Regarding planned shutdown, that's certainly possible, and we have good mechanisms to do this already. The challenge with interactive workloads is, I think, that people will likely overshoot their walltime significantly and then cancel jobs when they're done.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

I'm going to write up what I have in the wiki, and see if I can consolidate some of this into a script or something.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

First draft: https://github.com/pangeo-data/pangeo-discussion/wiki/Getting-Started-with-Dask-on-Cheyenne

I get the sense that everyone is busy; feedback and trial testers are welcome :)

from pangeo.

darothen avatar darothen commented on August 15, 2024

@mrocklin I walked through your wiki steps with a beginner-Python colleague just now. Worked great except for a small tweak to the ssh tunneling command (need to pass username or else my YubiKey tokens wouldn't work). I'll make an edit on the wiki.
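Presumably the tweak is along these lines (hypothetical; it just adds the username to the command printed above):

ssh -N -l username -L 8787:HOSTNAME:8787 -L 8888:HOSTNAME:8888 cheyenne.ucar.edu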

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

The dask-mpi executable is now on the released version of dask.distributed with packages on conda-forge.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

I've updated the wiki with the workflow from the YouTube video.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

OK, I think that the first basic pass here is complete. There is still almost certainly work to do, but we're hopefully at a point where we can set up some feedback cycles with basic users.

Thank you @darothen for starting this up. I would encourage others to walk through things as well and report where things break. @rabernat, if you can look things over and see whether we're at a point where we can start engaging others at your institution, that would be useful.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

@darothen also please let me know if there is anything I can do to engage you and your group more effectively on this. It would be great to collaborate further.

from pangeo.

guillaumeeb avatar guillaumeeb commented on August 15, 2024

(Edit: add another point on resource allocation and memory)

Thanks @mrocklin for pointing this discussion to me. I definitely need to try this on our cluster. I hope I will find the time soon. Just two quick remarks:

  1. On your PBS resource allocation, I feel there is something wrong in your example:
    #PBS -l select=2:ncpus=72:mpiprocs=6:ompthreads=6: with that, you're actually reserving 2 nodes with 72 cores each. This is visible in the qstat output, which shows 144 tasks. The PBS way is to select 2 nodes, with ncpus, mpiprocs, and any other options applied to each of them.
    You should also probably limit the memory you intend to use by adding :mem=24G at the end of the select line, e.g. #PBS -l select=2:ncpus=36:mpiprocs=6:ompthreads=6:mem=24G. This way you ensure that you'll have enough memory, and you also limit your use to 24 GB per selected resource.
  2. On the dynamic allocation. This can be very useful, but I don't quite believe this sentence:

However we seem to be able to get much faster response from the job scheduler if we launch many single-machine jobs. This allows us to get larger allocations faster (often immediately).

We will indeed be able to get a smaller cluster faster and then increase its size, but I don't believe we can get larger (final?) allocations faster. Maybe you observed this because of the problem with PBS resource allocation I mentioned above?

@subhasisb could you confirm both remarks if you have some time?

And finally, I also definitely need to promote the Pangeo initiative internally. I see the LEGOS lab is already involved! Thanks all for your work.

from pangeo.

guillaumeeb avatar guillaumeeb commented on August 15, 2024

(Edit for result with complete nodes)

Not sure if it is the right place to post...

I gave this solution a try this morning. It looks promising, but I ran into a problem: it seems like the Scheduler is started on several of my MPI procs. Maybe it is because I am selecting small resources which may run on the same host, but in that case I would expect to get the error for the Workers, not for the Scheduler. Our cluster has nodes with 24 cores and 128 GB. I am currently trying with a selection of complete nodes (ncpus=24), but I need to wait for some room on our cluster.
(Edit:) I am actually getting the same problem with fully reserved nodes. With resources #PBS -l select=2:ncpus=24:mpiprocs=4:ompthreads=6:mem=110G, I get two nodes, but the scheduler is launched 4 times on each node, so one correct start and 3 failures per node. The dask.distributed version is 1.18.3.

Here is my PBS script:

#!/bin/bash
#PBS -N sample_dask
#PBS -l select=4:ncpus=6:mpiprocs=1:ompthreads=6:mem=24G
#PBS -l walltime=01:00:00

# Qsub template for CNES HAL
# Scheduler: PBS

export PATH=/home/eh/eynardbg/miniconda3/bin:$PATH
source activate pangeo
module load openmpi/2.0.1
rm -f scheduler.json
mpirun --np 4 dask-mpi --nthreads 6 --memory-limit 24e9 --interface ib0

Here is the stack trace:

distributed.scheduler - INFO -   Scheduler at:   tcp://10.135.36.24:8786
distributed.scheduler - INFO -       bokeh at:         10.135.36.24:8787
distributed.scheduler - INFO -   Scheduler at:   tcp://10.135.36.23:8786
distributed.scheduler - INFO -       bokeh at:         10.135.36.23:8787
Traceback (most recent call last):
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/bin/dask-mpi", line 6, in <module>
    sys.exit(distributed.cli.dask_mpi.go())
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_mpi.py", line 85, in go
    main()
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_mpi.py", line 48, in main
    scheduler.start(addr)
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/scheduler.py", line 459, in start
    self.listen(addr_or_port, listen_args=self.listen_args)
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/core.py", line 216, in listen
    self.listener.start()
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 360, in start
    backlog=backlog)
  File "/home/eh/eynardbg/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/netutil.py", line 199, in bind_sockets
    sock.listen(backlog)
OSError: [Errno 98] Address already in use

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

You are probably calling the dask-mpi program four times somehow. The dask-mpi program in pseudocode looks like the following:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    start_scheduler()
else:
    start_worker()

This seems simple enough that I don't expect there to be a problem on the dask side. I suspect that the problem is in how you're calling the MPI program.
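One quick way to check this outside of dask (a sketch): if every process prints 0, mpi4py and the mpirun being used don't agree.

mpirun --np 4 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"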

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

We will indeed be able to have a smaller cluster faster, and then increase its size, but I don't believe we can get larger (final?) allocations faster. Maybe you observed this because of the problem on PBS resource allocation I mentionned above?

I don't know enough about the scheduling policies of the job scheduler to comment intelligently here. I'm just reporting my experience.

from pangeo.

guillaumeeb avatar guillaumeeb commented on August 15, 2024

Thanks @mrocklin for the pseudocode. I was able to test a simple Python script, and indeed the problem comes from the MPI program. It didn't work with openmpi (the rank variable was always 0), but it worked with the Intel version. After discussing with my coworkers, it seems mpi4py is tightly linked to one MPI implementation when it is installed. So it's kind of a tricky issue here: be careful about which MPI implementation is available in your environment when issuing:

conda create -n pangeo -c conda-forge \
    python=3.6 dask distributed xarray jupyterlab mpi4py
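A quick way to see which MPI library mpi4py actually linked against (a sketch; Get_library_version needs an MPI-3 implementation underneath):

python -c "from mpi4py import MPI; print(MPI.Get_library_version())"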

Now I am able to start dask:

In [1]: from dask.distributed import Client
   ...: client = Client(scheduler_file='scheduler.json')
   ...: client
   ...: 
Out[1]: <Client: scheduler='tcp://10.135.36.37:8786' processes=7 cores=42>

I have no time to go further yet, but it's already a really good result! Thanks again for the work.

One question: is it possible to choose in which folder we want to write the scheduler.json file using dask-mpi? And perhaps it could be written in the current submission folder by default?

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Yes, see the helpstring for dask-mpi

mrocklin@carbon:~$ dask-mpi --help
Usage: dask-mpi [OPTIONS]

Options:
  --scheduler-file TEXT         Filename to JSON encoded scheduler
                                information.
  --interface TEXT              Network interface like 'eth0' or 'ib0'
  --nthreads INTEGER            Number of threads per worker.
  --memory-limit TEXT           Number of bytes before spilling data to disk.
                                This can be an integer (nbytes) float
                                (fraction of total memory) or 'auto'
  --local-directory TEXT        Directory to place worker files
  --scheduler / --no-scheduler  Whether or not to include a scheduler. Use
                                --no-scheduler to increase an existing dask
                                cluster
  --help                        Show this message and exit.

You want the --scheduler-file keyword. It defaults to scheduler.json. Should it default to something else?
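For example, something like this keeps the file in the submission directory (a sketch; $PBS_O_WORKDIR is PBS's record of the directory qsub was run from):

mpirun --np 4 dask-mpi --nthreads 6 --scheduler-file $PBS_O_WORKDIR/scheduler.json

Then pass the same path to Client(scheduler_file=...) from the login node.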

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Hrm, can you think of any way to make mpi4py work more generically?

from pangeo.

guillaumeeb avatar guillaumeeb commented on August 15, 2024

For your first question, it may again be that I did not use mpirun correctly; I am not an MPI expert (more used to Hadoop and Spark). I was in a specific folder when I issued the qsub command, so I was expecting the scheduler.json file to be written to that folder (which should be the case if I understand correctly what you are saying), but it was written in my $HOME dir. I need to check my PBS script or the way mpirun is working; I may have to do a change-dir command in the PBS script or give some option to mpirun.
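If that is the case, something like this at the top of the PBS script should do it (untested on my side; $PBS_O_WORKDIR is the directory qsub was run from, and PBS batch jobs typically start in $HOME):

cd $PBS_O_WORKDIR
rm -f scheduler.json
mpirun --np 4 dask-mpi --nthreads 6 --memory-limit 24e9 --interface ib0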

For the second point, I was just repeating what one of my colleagues observed when installing Python and mpi4py on our cluster. It seems that during the module installation, the openmpi or intel library (whichever is available at the time) is statically linked into the mpi4py installation, with no way to change it afterwards. It appears some path to the library is written once and for all in some file. So if this mechanism is confirmed, I believe it should be changed.
But some warning may be enough, and I did not check the mpi4py page or source code to verify it, so take this carefully even though I completely trust my colleague.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

@jhamman any thoughts on why this might suddenly start failing?

import netCDF4 as nc4
filename = '/glade/u/home/jhamman/workdir/GARD_inputs/newman_ensemble/conus_ens_001.nc'
nc4.Dataset(filename)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-7-d6fa56ea26ea> in <module>()
      1 filename = '/glade/u/home/jhamman/workdir/GARD_inputs/newman_ensemble/conus_ens_001.nc'
----> 2 nc4.Dataset(filename)

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

OSError: NetCDF: Unknown file format

from pangeo.

darothen avatar darothen commented on August 15, 2024

@mrocklin I was actually just playing with this before lunch... it looks like conus_ens_001.nc and conus_ens_004.nc in that folder are empty files, and the NetCDF reader doesn't handle them gracefully.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

Oddly, it handled them fine yesterday. Perhaps these files have been removed or recently added?

from pangeo.

darothen avatar darothen commented on August 15, 2024

Looks like they were recently re-written (yesterday afternoon) -

(screenshot: directory listing showing the files were modified yesterday afternoon)

from pangeo.

jhamman avatar jhamman commented on August 15, 2024

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

No rush, was just curious

from pangeo.

jhamman avatar jhamman commented on August 15, 2024

@mrocklin - they're back and improved. I'm adding the other 95 ensemble members too. Note that I added the lowest level of zlib compression (level 1) to these files. Let me know if that causes any problems.

from pangeo.

jhamman avatar jhamman commented on August 15, 2024

@mrocklin and @darothen - The sample dataset has been revived and is now 100 ensemble members in size.

I think we can close this issue. I've successfully run xarray / dask.distributed / jupyter notebook on Cheyenne and on two other PBS systems. I also roped a few students from the University of Washington into walking through the wiki and setting up the system on their local clusters - without my help, they were able to do it successfully!

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

I'm glad to hear this. I suspect that we'll need to iterate in the future. Please speak up if you or anyone around you notices any opportunities for improvement.

from pangeo.

delgadom avatar delgadom commented on August 15, 2024

@rabernat @jhamman @mrocklin - thanks so much for this. We just used these notes to get set up on UC Berkeley's Savio SLURM cluster, connecting to compute nodes through their jupyterhub login nodes. Haven't figured out yet how to enable the dashboard, as they block SSH port forwarding, but we're working with the HPC's IT staff to find a solution. We can let you know how that goes and contribute notes/instructions on translating this to a SLURM environment if you're interested.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

I'm very glad to hear it @delgadom . You might consider trying nbserverproxy

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

pip install git+https://github.com/jupyterhub/nbserverproxy
jupyter serverextension enable --py nbserverproxy --sys-prefix

Then you should be able to navigate to /proxy/8787/status or something similar

See https://github.com/jupyterhub/nbserverproxy/ cc @yuvipanda

You may also want the master branch of distributed.

from pangeo.

rabernat avatar rabernat commented on August 15, 2024

Fantastic @delgadom!

At some point, we want to try to collect a list of all the local clusters where this has been deployed, along with any site-specific tweaks that are necessary. I'll follow up with you once we figure out how to organize that within the documentation.

from pangeo.

mrocklin avatar mrocklin commented on August 15, 2024

It would be nice for such a list to point to other active documentation on how to deploy these systems within various clusters. I suspect that having such a list of active deployments would provide examples for other groups to start themselves.

from pangeo.
