pangeo-data / pangeo Goto Github PK

View Code? Open in Web Editor NEW

698.0 94.0 188.0 10.81 MB

Pangeo website + discussion of general issues related to the project.

Home Page: http://pangeo.io

Shell 1.12% Python 0.79% Jupyter Notebook 98.09%

hacktoberfest

pangeo's Introduction

A Community Platform for Big Data Geoscience

This repository is home to the Pangeo website: pangeo.io. The website contains general information about the Pangeo project. This repo's issue tracker also serves as a general-purpose discussion forum. For news and updates about Pangeo, check out our Medium blog and our Twitter feed. To engage with the Pangeo community, head over to our Discourse forum or browse our GitHub repos.

What is Pangeo?

Pangeo is first and foremost a community promoting open, reproducible, and scalable science. This community provides documentation, develops and maintains software, and deploys computing infrastructure to make scientific research and programming easier. The Pangeo software ecosystem involves open source tools such as xarray, iris, dask, jupyter, and many other packages. There is no single software package called "pangeo"; rather, the Pangeo project serves as a coordination point between scientists, software, and computing infrastructure. Welcome to the Pangeo community!

Our Goals

Foster collaboration around the open source scientific python ecosystem for ocean / atmosphere / land / climate science.
Support the development with domain-specific geoscience packages.
Improve scalability of these tools to handle petabyte-scale datasets on HPC and cloud platforms.

pangeo's People

Contributors

Stargazers

Watchers

Forkers

rabernat jedwards4b aashish24 dgergel mrocklin jmunroe cloudmaven leej3 tjcrone jbusecke informatics-lab naomi-henderson jgerardsimcock pbranson tomaugspurger willirath eddienko friedrichknuth martindurant philippjfr leifdenby kaipak druidsmith duncanwp sridge wy2136 ocefpaf solomonvimal kazuakiyasunaga lesteve chiaral adekunleoajayi samanvp ahuang11 guillaumeeb raybellwaves mnlevy1981 kinow gccrowther dsludwig delandmeterp eladante nicwayand bartnijssen tam203 arokem amanda-tan carreau mwengren paraita gutmann raphaeldussin saviokay rmshur huikyole jianwuwang hskendall kleok davidbrochart brtkwr djhoese zihengsun scottyhq bhanujeet gavin971 leosiqueira anirban89 aaarendt apatlpo thomas-moore-creative dpeterk jukent aidanheerdegen fmaussion dougiesquire zvonkok eddebbar fanchic betolink rowhit zdgriffith henrryvargas daxsoule bradyrx maartaud cspencerjones dnowacki-usgs jsignell rhattersley 1solation sengkim123 chunli-dai kaedonkers shivanshgrover kn-neelalohitha cloudmelon lauren2590 kumaranupam2020 hari1603 joestricker

pangeo's Issues

update pangeo communication channels

We currently have many inconsistent "membership" lists / communication channels

Issue discussions in this repo
The pangeo google group
The people list from pangeo-data.github.io
The pangeo-data github organization members
Email threads involving the NSF-funded project participants

Let's simplify this so that we can communicate more effectively and eliminate stale channels.

Initial blogposts

I'm at a point where I would typically write up a developer blogpost about current work and current issues for Dask + HPC systems.

However I probably don't want to publish such a post until after a general announcement goes out. @rabernat has this happened? If not then is there anything I can do to accelerate such an announcement?

Use case for monitoring a running simulation (e.g. AMOC or TOA radiation)

Is there interest in looking at this use case?
I am currently having some problems with this notebook and might be able to get some help here.
Also, I'm looking to use dask-distributed on our HPC systems but I need to work with HPC support on some network issues I'm having.

The use case is described in the intro of the notebook:
https://github.com/guidov/cesm_pangeo/blob/master/amoc_serial.ipynb

Document using dask-distributed on Geyser/Caldera Systems

With Cheyenne's prolonged outage, I have been working on Geyser using a local cluster with a fair bit of success. I'll just share the workflow here so others can move forward in the short term.

log on to yellowstone
$ ssh yellowstone.ucar.edu
go to your notebook directory, for me that was:
$ cd ~/projects/pangeo/notebooks
launch a notebook server on a geyser node:
execgy notebook
follow the instructions for creating an ssh tunnel, something like:
ssh -N -l user_name -L 8888:geyserXX-ib:8888 yellowstone.ucar.edu

Open a notebook and connect to a dask distributed "local cluster" using:

from distributed import LocalCluster, Client
client = Client(LocalCluster(processes=True, n_workers=10, threads_per_worker=2))
client

cc @rabernat - this could be useful to you.
cc @mrocklin - any thoughts on how this should perform. Seems to be working well for me.

Autoscaling on GCE

To manage costs it would be useful to use autoscaling clusters on GCE. This is especially valuable for dask workers, whose usage will be highly variable. However, I'm somewhat concerned that user's juypter pods may be relocated up while scaling down. I'm curious if it's possible to use multiple node groups to do this, and to have certain pods prefer certain node groups.

Get accouts on JADE

The UK Met office's Jade project has invite only access to a JupyterHub + Dask + Iris deployment on their hardware. I think that the following XArray/JupyterHub/Dask developers might find this system interesting and informative to play with. @jacobtomlinson if you're able to can I ask you to give the following github usernames access?

My understanding is that they have a single large Dask cluster accessible from a modified Zero-to-JupyterHub deployment.

My apologies if anyone didn't want to be listed here. Hopefully unanticipated access is not troublesome to anyone. @jacobtomlinson seemed to suggest that they were pretty open to other people, so please feel free to nominate others.

Can't import netCDF4 on Cheyenne in pangeo environment

I followed the wiki instructions on Getting Started with Dask on Cheyenne. I created the pangeo environment as described, but I can't import netCDF4.

conda create -n pangeo -c conda-forge \
    python=3.6 dask distributed xarray jupyterlab mpi4py
source activate pangeo

When I try python -c 'import netCDF4' I get the error

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/glade/u/home/nnaik/miniconda3/envs/pangeo/lib/python3.6/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import *
ImportError: /glade/u/home/nnaik/miniconda3/envs/pangeo/bin/../lib/./libcom_err.so.3: symbol k5_strerror_r, version krb5support_0_MIT not defined in file libkrb5support.so.0 with link time reference

GCS FileSystem for JupyterLab

When manipulating data and notebooks on Google Cloud I often wish I could easily manipulate my GCS bucket as a file system in Jupyter Lab, for example by uploading/opening notebooks or moving files around by clicking, dragging, and double-clicking.

Apparently JLab has nice interfaces for this that have already been implemented for Google Drive and Github (read-only). It might be useful for us to do the same for GCS.

Google drive code: https://github.com/jupyterlab/jupyterlab-google-drive/blob/master/src/drive/contents.ts
Github drive code: https://github.com/jupyterlab/jupyterlab-github

This would benefit me personally (I would just maintain my notebooks in a GCS bucket) and I suspect that it would more significantly benefit less technical users.

cc @bollwyvl

Ask for allocation on inactive data-processing clusters

In the call yesterday people mentioned that some other clusters like Geyser may have large memory nodes that are almost always available. This cluster resource might be a better fit for us. Would it be possible to ask for an allocation there as well? I find myself waiting for quite a while if I ask Cheyenne for more than a couple nodes at a time.

I don't know the costs or politics behind asking for compute time allocations. Please excuse me if I've asked for something silly.

"cluster has unschedulable pods"

I tried to follow the instructions on the wiki for launching dask-kubernetes on our GCP project.

I got this error on the console

Use Case Notebook for "Convective Parameters for Understanding Severe Thunderstorms"

We will work on developing this notebook. In particular - after setting the notebook as mentioned in issue #1 we suggest to do the following:

calculate CAPE using a widely used fortran code developed by George H. Bryan, maybe wrap it in python
calculate CAPE using MetPy
compare performance, requirements and actual CAPE distributions

We suggest to do it for a subset of MERRA2 data, but if @dopplershift has some suggestion about which data to use (or maybe he has done this to some extent to test MetPy) we welcome his input.

We are at a very early stage of this, but will start working on it in the next days.

Use Case Notebook for "Atmospheric Moisture Budgets"

We need a preliminary notebook that uses Xarray to calculate something related to this use case. It doesn't have to be super long or complicated. It just has to express a real-world, scientifically relevant calculation on a large dataset.

Requirements:

Carefully explain the science behind the calculation in words and equations.
Clearly document the input dataset (preferably a standard product like ERA-interim or NCAR large ensemble)
Use Xarray to load the dataset and, as much as possible, to make the calculations
Where Xarray can't accomplish the necessary tasks, clearly document the other functions and / or outside packages that need to be employed
If the analysis has been implemented in other languages (e.g. MATLAB, Ingrid, etc.), provide a link to the code and, if possible, the output (which can be used as validation)

cc: @naomi-henderson

I would like to use this as a template for the other three use cases, so I would appreciate general feedback on the checklist above. Are these the right requirements?

Raise your hand if you'll be attending the AGU Fall Meeting in New Orleans

I'd like to coordinate a Pangeo happy hour during the AGU fall meeting. I'm thinking of Wednesday evening following "New Approaches to Analyze Big Geoscientific Data Sets" session. Please give a 👍 if you are going to be at AGU and are interested in a happy hour. Feel free to use this issue to bring up other relevant AGU topics (e.g. talks, meetings, etc.).

KubeCluster only using one worker

If I run the newmann_ensemble notebook and run the following code:

ds.t_mean.mean().load()

it only uses one worker, rather than all the workers of the cluster.

Can others reproduce this problem?

access to TACC wrangler

I would like to grant those interested access to our allocation on TACC wrangler.

If you already have XSEDE account, just tell me your username.

If not, you need to create and XSEDE account.

cc @mrocklin, @jhamman

Documentation for xarray / dask.distributed / jupyter notebook on Cheyenne

We need some internal documentation for how to get our "platform" running on Cheyenne. This documentation should make it straightforward for a user familiar with HPC platforms to

create an appropriate python environment (with modules, with conda, etc...whatever works)
start up a dask cluster
connect to the scheduler dashboard
start and connect to a Jupyter notebook
connect to the scheduler from the notebook
load an example dataset with Xarray

Some relevant active threads are dask/distributed#1367 and dask/distributed#1260.

cc: @jhamman, @kmpaul, @davidedelvento, @mrocklin

workflow for moving data to cloud

I am currently transferring a pretty large dataset (~11 TB) from a local server to gcs.
Here is an abridged version basic workflow:

# open dataset (about 80 x 133 GB netCDF files)
ds = xr.open_mfsdataset('*.nc', chunks={'time': 1, 'depth':1}) 

# configure gfs
import gcsfs
config_json = '~/.config/gcloud/legacy_credentials/[email protected]/adc.json'
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=config_json)
bucket = 'pangeo-data-private/path/to/data'
gcsmap = gcsfs.mapping.GCSMap(bucket, gcs=fs, check=True, create=True)

# set recommended compression
import zarr
compressor = zarr.Blosc(cname='zstd', clevel=5, shuffle=zarr.Blosc.AUTOSHUFFLE)
encoding = {v: {'compressor': compressor} for v in ds.data_vars}

# store
ds.to_zarr(store=gcsmap, mode='w', encoding=encoding)

Each chunk in the dataset has 2700 x 3600 elements (about 75 MB), and there are 292000 total chunks in the dataset.

I am doing this through dask.distributed using a single, multi-threaded worker (24 threads). I am watching the progress through the dashboard.

Once I call to_zarr, it takes a long time before anything happens (about 1 hour). I can't figure out what dask is doing during this time. At some point the client errors with the following exception: tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7fe371f58a58> exception was never retrieved. Nevertheless, the computation eventually hits the scheduler, and I can watch its progress.

I can see that there are over 1 million tasks. Most of the time is being spent in tasks called open_dataset-concatenate and store-concatenate. There are 315360 of each task, and each takes about ~20s. Doing the math, at this rate it will take a couple of days to upload the data, this is slower than scp by a factor of 2-5.

I'm not sure if it's possible to do better. Just raising this issue to start a discussion.

A command line utility to import netcdf directly to gcs/zarr would be a very useful tool to have.

Formal benchmarks

At this point we have several different platforms working and a few different use cases. It’s time to start thinking about organizing benchmarks more formally. We need to define

What we want to measure? Just runtime, or more fine-grained detail?
What parameters to vary? (n_workers, n_threads, etc.)
How to automate this? (Shell script, airspeedvelocity, etc.)

Thoughts?

Connections with the NSF EarthCube ICEBERG project

@hlynch and @shantenujha of the ICEBERG project got in touch with me last week to discuss our respective projects. Our development goals are aligned enough that I think we should stay in touch (at a minimum) as our projects progress. I'm posting this issue to make sure they see that e're using this repository to handle our project communications / documentation and to offer to answer any other questions they may have about Pangeo/Xarray/Dask here.

How to organize files on GCP

Now that we have our GCP credits, we can start moving some bigger data volumes to cloud storage. GCP has two basic categories of storage:

unified object storage, which is API compatible with S3
persistent disk, which can be attached to compute instances like a regular filesystem.

When the xarray zarr backend is working (see pydata/xarray#1528), it should be straightforward to plug it into the object storage.

For our file-based datasets, I think persistent disk is the way to go. I propose we create a unique persistent disk for each dataset in our collection. That will give us maximum flexibility in organizing and sharing the different datasets.

So, what are the datasets we want to load first? I would propose we start with

The data @jhamman worked on in #18
Some of the CMIP5 data @naomi-henderson is analyzing in #1

specify bokeh port for dashboard

Let me start of with a big THANKS. This stuff is amazing!
I have sucessfully launched a notebook using dask-mpi by using a custom port specified in the jupyterconfig file. I am still not able to connect to the dashboard though, suspecting that the default port 8787 might not be available on my hpc cluster. Is there a way to specify the bokeh-port to a custom value?

NetCDF4 import error

I have conda install netCDF4 from conda-forge. When I go to import I get the following error:

In [1]: import netCDF4
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-f731da2de255> in <module>()
----> 1 import netCDF4

~/miniconda3/envs/pangeo/lib/python3.6/site-packages/netCDF4/__init__.py in <module>()
      1 # init for netCDF4. package
      2 # Docstring comes from extension module _netCDF4.
----> 3 from ._netCDF4 import *
      4 # Need explicit imports for names beginning with underscores
      5 from ._netCDF4 import __doc__, __pdoc__

ImportError: /glade/u/home/mrocklin/miniconda3/envs/pangeo/bin/../lib/./libcom_err.so.3: symbol k5_strerror_r, version krb5support_0_MIT not defined in file libkrb5support.so.0 with link time reference

@jhamman have you seen this before?

What is WORKDIR in utilities/cheyenne/launch-dask.sh?

The WORKDIR environment variable is used in the utilities/cheyenne/launch-dask.sh script, but it is not set. As far as I know, this is not a standard environment variable on Cheyenne. Should we default it to something reasonable? Otherwise, the script will attempt to place the scheduler.json file in the / directory, and nobody but admins have permission to do that.

Big Data Mining with xarray/dask vs pyspark/MLlib

Hi all,
It's not too late to wish you a happy new year 2018 and a productive year !

I am currently working with pyspark/MLlib to conduct some big data mining on ocean datasets (eg unsupervised classification).
I'm also using xarray/dask, but to work with smaller datasets or to prepare data to feed my pyspark/MLlib workflow.

I would surely move away from pyspark toward xarray and its awesome interface. But I've looked around (eg dask/dask-ml, phausamann/sklearn-xarray) and I'm confused.

I have the impression that xarray/dask can only handle incremental methods so far.
Although @TomAugspurger noticed that

All the estimators in dask-ml will work in parallel on distributed arrays.

(dask/dask-ml#111), yet, no other methods than incremental ones are available.

So my questions are:

what xarray/dask is or will be able to achieve vs pyspark/MLlib in order to mine large datasets ?
does xarray/dask can make use of a distributed parallel environment to implement classic data mining methods, without relying on incremental methods only ?

I struggle to find the answer to these questions that are important to me simply because pyspark/MLlib provides full parallel implementation of machine learning methods and I'd like to settle on which framework to work with.

Thanks for letting me know if:

we're in the likely stage where such no-incremental methods are simply awaiting for community contributions to pangeo or dask/dask-ml
pangeo folks have this in mind and possibly will address this ?

How to host our documentation

A major goal of this project is basically producing a ton of documentation. Eventually I envision an online resource consisting of:

Documentation on how to deploy a "pangeo environment" (including distributed cluster) in various contexts such as
- different clusters (Cheyenne [already started], Gaea, Pleiades, etc.)
- generic local cluster
- cloud environment
Guides for system administrators about how to support pangeo environment
Domain-specific tutorials based on the use-cases (e.g. #1, #11)
Benchmarking results

Our wiki is useful for getting info online fast, but it might not be the best option for scaling up. It's not scriptable, PR-able, etc. And it can't easily import jupyter notebooks.

Some options for moving forward are:

Just keep the github wiki as the central pangeo documentation resource.
Use the Pangeo jeykll site (repo).
Develop a sphinx site for our documentation.

There are pros and cons to each. I'm curious to hear your thoughts.

ECMWF / Copernicus Climate Data Store

I just read an article about a new "climate data store" that is being developed by ECMWF

https://www.ecmwf.int/en/newsletter/151/meteorology/climate-service-develops-user-friendly-data-store

This looks quite ambitious and very complex:

Despite the highly customized architecture, there is an explicit mention of open-source and even xarray:

It was also decided that the CDS should be based on open source software where possible, so that other instances could be deployed if necessary. This is particularly important for the development of the toolbox: there is a vibrant community developing scientific libraries in Python, such as Numpy, Scipy, Pandas, xarray, dask, matplotlib etc. These libraries provide many of the algorithms required, and users from the weather and climate communities are already familiar with them. Making use of those libraries will therefore make it easier for users to contribute new additions to the toolbox.

We should keep this on our radar. Do any of the euro folks (e.g. @lesommer) have any connections to this group? It would be great to develop connections with ECMWF, as they are one of the largest providers of weather and climate data in the world.

Documentation for xarray / dask.distributed / Google Cloud Platform

We need some internal documentation for how to get our "platform" running on GCP. This documentation should make it straightforward for a user familiar with HPC platforms to

create an appropriate python environment (with modules, with conda, etc...whatever works)
start up a dask-kubernetes cluster
connect to the scheduler dashboard
start and connect to a Jupyter notebook
connect to the scheduler from the notebook
load an example dataset with Xarray (this will require setting up a persistant disk system and moving some sample data to GCP)

cc: @jhamman, @mrocklin, @rabernat

Question for @mrocklin. @rabernat and I have been thinking that the dask-kubernetes repo is a good starting point for launching the pangeo platform on GCP. At least for a first cut. Do you have any reservations to that end.

I was able to launch a small dask-kubernetes cluster this evening and open up a Jupyter notebook. Due to limitations in the free trial, I'll need to revisit this after our project has more GCP credits.

Webinar with EarthCube Technology and Architecture Committee

The Earthcube project coordinator sent me the following request

Members of the EarthCube community are eager to learn about your new project. The EarthCube Technology and Architecture Committee (TAC) will be hosting a series of webinars this fall to promote the new EarthCube projects. Would you or someone from the team be available to make a 15-20 min presentation (overview of your project) at the next TAC meeting on Thursday September 28 at 330 PM Central Time?

I'd also be happy to schedule a webinar for our science committee as well.
Thanks to let me know. If this date does not work, we could schedule in October (5th or 19th).

The proposed date falls right during my class. Would someone else from the team be willing to take this on? Maybe @jhamman or @chiaral?

Worker environments and image size

What software do we want on basic worker and notebook environments?

Currently we have something like the following on workers:

dask distributed 
numpy pandas numba nomkl 
fastparquet zict blosc cytoolz 
xarray netcdf4 scipy 
git

This results in a conda environment that is around 800MB large. This is fine, but does slow things down a little bit. Interestingly, if we drop scipy we also drop many other dependencies and go down to about 400MB. This is in the broader context of a 2GB image (there are other things we need to clean up)

we need a data catalog

At this point we have multiple datasets on multiple different systems.

It would be great to have some kind of catalog interface to avoid having to manually change the path names every time. For example.

from pangeo import catalog
ds = xr.open_mfdataset(**catalog.dataset('CM2.6'))

Soliciting Pangeo talks at the SEA conference

The 2018 SEA conference is accepting abstract submissions! The event will be from April 2nd to April 6th 2018 and will cover "Frontiers in Scientific Software", namely all aspects of state of the art scientific software. I think it will be fantastic if we could have one or more of the following focused on Pangeo:

talk(s)
tutorial
symposium

The more, the merrier, of course, if workload allows. Anything to add @kmpaul ?

For more information see: https://sea.ucar.edu/conference/2018

feedback on sphinx site

I have a skeleton of a sphinx site set up here:
https://pangeo-data.github.io/pangeo/

Right now the CSS is not working...a subdirectory issue I think. Once that is resolved, we can use this issue to give feedback and discuss how we actually want the site to look.

Use Case Notebooks: COSIMA Cookbook

I've been developing a set of notebooks to analyze fairly large MOM5 ocean model output for at 1, 0.25, and 0.1 degree resolutions. Under the hood, it is primarily using xarray/dask with the intention of building upon what is to be developed from within the Pangeo project. I'll put a link here in case it is of use to others:

COSIMA Cookbook

Most members of COSIMA (Consortium for Ocean-Sea Ice Modelling in Australia) use the Australian National Computing Infrastructure (NCI) on a machine called Raijin. One thing that I am doing is experimenting with some of the advances being made within xarray/dask that are currently being used on Cheyenne and applying them to Raijin as well.

Would there be value in developing a stand-alone 'use case notebook' that for Pangeo that shows analysis of MOM5 model output?

Store notebooks on GCS

If we get further into using Google compute engine for computation we'll need some place to store and manage notebooks. When googling around for this I found https://github.com/danielfrg/s3contents which seems to imply that we can teach Jupyter notebooks to manage notebooks in storage systems like S3 or GCS, which might be convenient.

cc @danielfrg

Set up adaptive deployment on Cheyenne

It might be convenient to set up a long-running dask-scheduler on Cheyenne that is capable of adaptively asking for more workers as needed and releasing those workers back to the job scheduler when not needed. This would be helpful in allowing us to scale up when necessary without incurring significant costs.

From a talk with @jhamman it may also be valuable to have a long-running scheduler that other people outside of the group can use without having to set anything up.

Network bandwidth

I'm getting about 1GB of bandwidth (simultaneous read and write) on Cheyenne.

I'm using the following code:

from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')

import dask.array as da
x = da.random.random((100000, 100000), chunks=(5000, 5000)).persist()

y = (x + x.T).persist()   # either of these will stress bandwidth
z = y.rechunk((100000, 250)).persist()   # either of these will stress bandwidth

I then measure bandwidth by looking at the dashboard for a single worker. We see peaks around 1GB/s and one full CPU being used (presumably the Tornado event loop thread).

I suspect that this is below what we should expect on Cheyenne's network for TCP over IB (cc @davidedelvento for confirmation).

I suspect that this is due to Tornado's byte handling.

Do we need a GLADE Project Space?

Lots of the datasets we want to work with are already on Cheyenne. But many are not. I personally would like to bring over a ~12TB dataset for one of my use cases. This is too big for my personal scratch space, and anyway, I need my scratch space for other projects.

It seems like we should request a project space:
https://www2.cisl.ucar.edu/resources/storage-and-file-systems/glade-file-spaces#projectspace

From their docs

Justifications for project spaces are scrutinized closely. Carefully document which parts of your data and data workflow cannot be accommodated within the home, scratch, or work file spaces and policies.

I think we can make this justification. Would @kmpaul or someone else care to make the request? Or I can do it, but I would like help writing it. The form is here:
https://www2.cisl.ucar.edu/resources/storage-and-file-systems/glade-file-spaces/request-for-project-space

cc @davidedelvento (perhaps you have insights on this question)

environment.yml no longer works?

I'm just curious if anyone else sees this error, but when I try to create a fresh conda environment using the pangeo/environment/environment.yml using:

conda env create -f environment/environment.yml

from the repo root directory, I get this:

Fetching package metadata .....

ResolvePackageNotFound: 
  - numba

I suspect that this must have to do with environment.yml being outdated, but is this true? Or is something more serious wrong?

Ask for pangeo name on github

This seems relatively unused: https://github.com/pangeo

Have we asked GitHub about taking over this name?

JupyterHub server for Cheyenne

To reduce startup costs it would be convenient to set up a JupyterHub server that would launch notebooks with easy access to Cheyenne. We might then make a small Python package that would launch Dask workers on PBS from a local notebook.

Issues trying to get running on TACC Wrangler

Both Cheyenne and our local HPC cluster are down at the moment, and pangeo-GCP is not working yet. I need to do a dask distrubted demo tonight, so I am scrambling to get a solution.

I remembered I have a startup allocation on the XSEDE system TACC Wrangler

Wrangler is the most powerful data analysis system allocated in XSEDE. The system is designed for large scale data transfer, analytics, and sharing and provides flexible support for a wide range of software stacks and workflows. Its scalable design allows for growth in the number of users and data applications

Wrangler is actually an ideal place to test out pangeo. (Wrangler user guide)

I am trying to get up and running, using the Cheyenne launch scripts as a template. I have encountered a number of issues along the way. I will post them here.

dask workers dying mysteriously on wrangler

I have a new notebook I'm trying to use on wrangler:
https://github.com/pangeo-data/pangeo/blob/master/notebooks/test_cm2-6_wrangler.ipynb

Wrangler seems fantastic due to the speed of the flash storage. It can also use this flash storage to spill to disk, which can be convenient (and is necessary for this use case).

However, I'm encountering some weird dask behavior. The calculation and the failures are described in the notebook.

@mrocklin, as soon as you get access to wrangler, I think you might enjoy playing around with it. It's an amazing system. If you want to take a look at this problem I'm having, I would be very grateful.

Formalize contract between XArray and the dask.distributed scheduler

XArray was designed long before the dask.distributed task scheduler. As a result newer ways of doing things, like asynchronous computing, persist, etc. either don't function well, or were hacked on in a less-than-optimal-way. We should improve this relationship so that XArray can take advantage of newer dask.distributed features today and also adhere to contracts so that it benefits from changes in the future.

There is conversation towards the end of dask/dask#1068 about what such a contract might look like. I think that @jcrist is planning to work on this on the Dask side some time in the next week or two.

Cool articles on "Cloud Native Geospatial"

There are some very interesting articles on medium by @cholmes of Planet Labs. They are thinking about GIS data (basically geoTIFF rasters), but many of the questions they are thinking about are the same as ours.

I thought I would post here for discussion.

NASA OceanWorks

Along similar lines as #40, NASA is developing their own custom big data portal for Oceanographic data. There is not much info online. I found this conference abstract

https://agu.confex.com/agu/os18/meetingapp.cgi/Paper/314599

OceanWorks focuses on technology integration, advancement and maturity by bringing together several previously funded open-source, big data projects. It includes cloud-based technologies for on-the-fly data analysis (NEXUS), anomaly detection (OceanXtremes), in situ to satellite matchup (DOMS), quality-screened subsetting (VQSS), search relevancy (MUDROD), and web-based visualization

And this presentation:
https://esto.nasa.gov/forum/estf2017/presentations/Huang_A7P6_2017%20ESTF2017.pdf

It appears to be mainly java. They have a github organization, but not much code up:
https://github.com/aist-oceanworks

Use Case Notebook for "Statistical Downscaling"

I will be working on developing a use case notebook that covers statistical downscaling of climate model datasets.

Summary

Excerpt from NSF Earthcube Proposal:

Downscaling is a procedure that relates information at large scales (ESMs) to local scales. With the common goal of addressing the issues described above, a range of downscaling methodologies have been developed that span a broad range of complexities from simple statistical tools to full dynamic atmosphere models. Ongoing research in the NCAR Water Systems Program (partially supported by NSF award AGS-0753581) is developing tools [22, 21] for understanding how methodological choices in the downscaling step impact the inferences end-users may draw from climate change impacts work.

The specific workflow to be implemented will enhance ongoing research in the NCAR Water Systems Program by developing a Python-based toolkit for circulation-based statistical downscaling. We will leverage the tools developed in this project to build an extensible and user-friendly interface for existing tools currently written in low level languages (C and Fortran), specifically the Generalized Analog Regression Downscaling (GARD) toolkit. Using this interface, GARD will be extended to support additional downscaling methodologies that utilize high-level machine learning packages available in Python (e.g., scikit-learn, PyMVPA) and to operate on larger datasets than is currently possible. The end product will be an open-source package that enables future downscaling work throughout the climate impacts community.

Workplan

My proposed work plan will follow these steps:

Implement traditional statistical downscaling methods in (linear/analog/regression, see GARD) in Python/Scipy/Xarray.
- TODO: decide if we should write the GARD functions in Python or write a Python wrapper around existing Fortran code.
Implement new machine learning downscaling method (artificial neural networks) in Python/Scikit-learn.
Apply these methods at individual grid points, use interactive Python/Jupyter notebooks to explore skill and parameter space.
Apply these methods to large spatial domains (CONUS), evaluating skill of downscaling method and computational performance

Data:

I will be using the WRF 50-km CONUS downscaled CMIP5 dataset developed here at NCAR. I can provide more details on the dataset if others are interested.

Requirements (for use case notebooks):

Carefully explain the science behind the calculation in words and equations.
Clearly document the input dataset (preferably a standard product like ERA-interim or NCAR large ensemble)
Use Xarray to load the dataset and, as much as possible, to make the calculations
Where Xarray can't accomplish the necessary tasks, clearly document the other functions and / or outside packages that need to be employed
If the analysis has been implemented in other languages (e.g. MATLAB, Ingrid, etc.), provide a link to the code and, if possible, the output (which can be used as validation)

cc @gutmann

Use Case related to PyReshaper

@kmpaul developed PyReshaper, "a tool for converting time-slice (or history-file or synoptically) formatted NetCDF files into time-series (or single-field) format."

This sort of tool is very useful for people who want to do timeseries analysis on climate models. It can be thought of a sort of on-disk rechunking of the data. I think it would be very valuable to compare the tradeoffs between using a tool like PyReshaper for on-disk rechunking vs xarray / dask for dynamic, in-memory rechunking.

A basic use case scenario would be to do something like compute the linear temporal trend of several different variables at each gridpoint on a GCM dataset. This type of thing comes up often on the xarray mailing list and has motivated some open xarray issues (e.g. pydata/xarray#585). I also found a nice simple example of this sort of analysis on someone's blog.

Could we develop such a scenario based on one of our datasets? I think we would learn a lot from it, and especially from the comparison to disk-basked workflows.

Dynamic deployments on the job scheduler

Currently I find that I can more easily get access to many nodes if I ask for them one at a time

$ qsub start-dask.sh      # only ask for one machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine
$ qsub add-one-worker.sh  # ask for one more machine

The workers find each other and connect up to form one logical job out of man independent ones. This is fine from Dask's perspective but raises some potential questions from an IT / infrastructure perspective:

Is this OK from an IT perspective (cc @davidedelvento)? I would not be surprised to learn that I was abusing some policy that bumps up the priority of small jobs
Is there some way that we should formalize this process? For example, I'm also entirely willing to be bumped off of some of these machines if someone else comes by. I can imagine specifying policies for this.

I suspect that this is beyond the scope of Cheyenne's existing job scheduler but it is something that we may want to consider long-term. I thought I'd bring it up now. This isn't any strong feature or service request. This is just here for conversation.

Near term pangeo.pydata.org goals

Thank you all for your work and involvement over the last couple weeks. http://pangeo.pydata.org is up and running, I encourage you to try it out.

There is still plenty of work to do. However now is probably a good time to take a step back and figure out where we want to take this near and mid term so that we can prioritize effort appropriately.

What do we want to do with this? Are people around for a video conversation about this? I'm particularly free Tuesday afternoon and all day Friday.

cc @yuvipanda @rabernat @jhamman @jacobtomlinson (feel free to ping others, I've tried to ping a small number of people who will know others that may or may not want to be involved with work short term on this)

Benchmark and tune FUSE on NetCDF

We're working a lot on Zarr right now, but we should also test the performance of gcsfuse on NetCDF data on GCS.

@jhamman has already placed a 4TB dataset in /pangeo-data/newmann-met-ensemble-netcdf

The use of FUSE is an open question for a number of groups using cloud systems (cc @niallrobinson) and we should probably consider it a baseline to alternatives.

https://github.com/GoogleCloudPlatform/gcsfuse/