pangeo-data / foss4g-2022

Pangeo tutorial at FOSS4G 2022

Home Page: https://pangeo-data.github.io/foss4g-2022

License: Other

Languages: Jupyter Notebook 99.95%, TeX 0.02%, Python 0.02%

Topics: data-analysis, hvplot, pangeo, time-series, xarray

foss4g-2022's People

Contributors

acocac, allcontributors[bot], guillaumeeb, j34ni, pl-marasco, tinaok


foss4g-2022's Issues

Adding a Docker image build that we could use on EOSC or any other place

I guess it would be nice to have a Docker image we could just pull to start a Dask cluster on the EOSC infrastructure, on any other Kubernetes JupyterHub, or even on a laptop, to reproduce the tutorial prepared here.

I'm not sure whether this has already been done by @tinaok or any of you in some other repo? Should it go here or in the Pangeo-eosc repo? (I would say here, because the environment is tied to the FOSS4G application.)

I've done something similar in https://github.com/guillaumeeb/pangeo-docker by extending the Pangeo Docker images. I can work on this if you think it is useful and you haven't done it yet.
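As a rough sketch, such an image could then be selected from a notebook through dask-gateway. This assumes the deployment exposes an "image" cluster option (deployment-specific), and the image tag below is a hypothetical placeholder:

```python
# Sketch only: assumes a dask-gateway deployment that exposes an "image"
# cluster option; the image tag is a hypothetical placeholder.
from dask_gateway import Gateway

gateway = Gateway()  # uses the gateway address configured on the platform

options = gateway.cluster_options()
options.image = "pangeo/foss4g-2022-tutorial:latest"  # hypothetical tag

cluster = gateway.new_cluster(options)  # workers run in the chosen image
cluster.scale(4)  # request 4 workers

client = cluster.get_client()
print(client.dashboard_link)  # verify workers start with the expected environment
```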

Update the content of chunking_introduction.ipynb

The goal of this issue is to discuss what we should put in the chunking introduction notebook.

This comes from comments and discussions in #45.

I believe we should do the following, but this needs discussion (especially with @tinaok):

  • Either remove the word compression or talk about it somewhere; a short chapter should be fine. Some subjects to cover:
    • Generalities (why compression matters, how it saves processing time despite the overhead of compressing -- loading big files into memory is slow), the different algorithms currently used in scientific domains, Python blosc, etc.
    • A section showing the difference between a NetCDF or Zarr store with and without compression, and the resulting sizes (see the sketch after this list).
    • Xarray defaults / the various data formats (is NetCDF compression activated by default?).
    • Compression is often applied per chunk for optimized access.
  • Add more context about Xarray and chunking before introducing kerchunk. Begin with open_dataset and the chunks keyword argument, introduce Dask arrays, and talk about laziness and sequential processing. Also cover the native chunks in files here, if possible without kerchunk (with h5py maybe?).
  • Kerchunk part: introduce it by discussing the difficulty of reading a dataset composed of several files in an optimal way. Then explain how it reads the file metadata, detects chunks, and constructs a Zarr-compatible metadata file that allows optimized access to the whole dataset with Xarray. Should we keep the main metadata file creation in the notebook? I'm not sure about that.
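As a possible starting point for the compression and chunking parts, here is a minimal sketch; the file path, the "time" dimension and the chunk sizes are hypothetical placeholders for whatever dataset the notebook ends up using:

```python
# Sketch only: "data/input.nc" and the "time" dimension are placeholders.
import numcodecs
import xarray as xr

# Passing chunks= to open_dataset returns Dask arrays: the data is loaded
# lazily, chunk by chunk, only when a computation actually needs it.
ds = xr.open_dataset("data/input.nc", chunks={"time": 100})
print(ds.chunks)

# Write the same dataset twice and compare the on-disk sizes
# (e.g. with `du -sh *.zarr` in a terminal).
# Zarr compresses by default, so explicitly disable it for the baseline store.
ds.to_zarr("uncompressed.zarr", mode="w",
           encoding={var: {"compressor": None} for var in ds.data_vars})

# Then with Blosc/zstd compression applied per chunk.
compressor = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}
ds.to_zarr("compressed.zarr", mode="w", encoding=encoding)
```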

Update content of dask_introduction.ipynb

Here are some propositions, as discussed in #45.

Please indicate whether this is OK for you (especially @tinaok):

  • Add a short part on Dask clusters: talk about the Client, the Scheduler, Workers, dask-gateway, Kubernetes, and HPC (only a quick word on the last two). Also talk about the Dask Dashboard (which needs to be accessible on the platforms we propose).
    • Maybe we want to do a short demonstration of Dask and its Dashboard on some code independent of our dataset? (See the sketch after this list.)
  • Is the call to optimize on the Dask graphs really important? I find it odd; you don't need this with DataFrames.
  • Remove the part about installing packages on worker pods. I think this is an advanced topic and should not be needed once we fix pangeo-data/pangeo-eosc#3.
  • Add some code at the end with the global scaling of some processing, as announced at the beginning of the notebook.
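For the dataset-independent Dashboard demonstration, a minimal sketch could look like the following; a local cluster is used here only so the demo does not depend on any deployment, and the array sizes are illustrative:

```python
# Sketch only: a LocalCluster keeps the demo independent of the dataset and
# of the deployment; on EOSC the client would instead come from dask-gateway.
import dask.array as da
from dask.distributed import Client

client = Client()  # starts a LocalCluster by default
print(client.dashboard_link)  # open this URL to watch the Dashboard

# A computation large enough to show tasks streaming through the Dashboard.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x + x.T).mean(axis=0).compute()
print(result.shape)
```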

Contributors

Summary
Issue to add contributor information with @allcontributors.

What needs to be done?
Write a comment with the contribution information, using the following nomenclature:
@all-contributors please add @<username> for <contributions>

See the available Emoji Key ✨ here.

Feel free to list the contribution types you find most relevant to this repo.

Further info about the all-contributors bot is available here.

Who can help?
Anyone

Where should the Notebooks go?

Currently, there are two main folders: NOTEBOOKS (I'm not fond of the upper-case style) and tutorial, which contains a notebooks subfolder.

I'm under the impression that all the notebooks workshop participants are meant to see should be in tutorial/notebooks.

If so, we should move the important content from the NOTEBOOKS folder to tutorial/notebooks, and either remove the NOTEBOOKS folder or rename it in a way that makes clear its content is not essential for learning about Pangeo.

Anyway, I think we should clarify things here.

What can I do to help?

Hi all,

It seems you are all working on some part of this repo: building use cases for @pl-marasco and @acocac, working on the docs/Jupyter Book for @annefou and @acocac, and integrating things with Dask distributed and EOSC for @tinaok.

I'd like to contribute some simple things to help, but I'm not sure which part to dive into, beyond reviewing content/notebooks if asked.

A few things I noticed when opening the repo:

  • It lacks some information/links about FOSS4G and the Pangeo session schedule.
  • It lacks some basic information on how to try things out: do you run the notebooks locally, do you use the EOSC Pangeo Hub, or something else? Maybe this could be added to the README for the different situations.
  • It lacks a link to the generated Jupyter Book.
  • It lacks some kind of Binder/repo2docker setup for reproducibility (or just the documentation mentioned above?).

Should you/we open issues for the various tasks to do, and see who wants to work on them?

Or do you have ideas of simple things I could do? (Unfortunately, I cannot commit to lengthy tasks.)

Best,

Do we need a Setup chapter in the tutorial notebooks?

We already have a main setup page and an environment.yaml file.

Is it really important that each of our notebooks lists the main libraries it needs?

How do we choose which libraries to list there: those imported in cells? Those whose objects we use in one way or another?

I understand that we may want each notebook to contain all the information needed to run it, but this section is really hard to maintain; maybe we should just link to the setup page and the environment.yaml?

Split data access and discovery sections

I'd propose having a separate episode for data discovery after Parallel computing with Dask. We can keep a short episode introducing remote access with s3fs before Data chunking.
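For the short remote-access episode, a minimal sketch could look like this; the endpoint URL, bucket name, and object key are hypothetical placeholders:

```python
# Sketch only: the endpoint, bucket and object key are placeholders.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(
    anon=True,  # anonymous, read-only access to a public bucket
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
)

print(fs.ls("example-bucket"))  # discovery: list what is available

# Open a single NetCDF object remotely, without downloading it first.
with fs.open("example-bucket/example-file.nc") as f:
    ds = xr.open_dataset(f)
    print(ds)
```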

Timeline

Add the timeline and names (who is teaching what).
