pangeo-data / foss4g-2022

Pangeo tutorial at FOSS4G 2022

Home Page: https://pangeo-data.github.io/foss4g-2022

License: Other

Languages: Jupyter Notebook 99.95%, TeX 0.02%, Python 0.02%

Topics: data-analysis, hvplot, pangeo, time-series, xarray

foss4g-2022's People

Contributors

acocac, allcontributors[bot], guillaumeeb, j34ni, pl-marasco, tinaok


foss4g-2022's Issues

Adding a Docker image build that we could use on EOSC or any other place

I guess it would be nice to have a Docker image we could just pull to start a Dask cluster on the EOSC infrastructure, on any other Kubernetes JupyterHub, or even on a laptop, to reproduce the tutorial prepared here.

I'm not sure whether this has already been done by @tinaok or any of you in some other repo? Should it go here or in the Pangeo-eosc repo? (I would say here, because the environment is tied to the FOSS4G application.)

I've done something similar in https://github.com/guillaumeeb/pangeo-docker by extending the Pangeo Docker images. I can work on this if you think it is useful and you haven't done it yet.
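As a rough sketch, such an image could then be selected from a notebook through dask-gateway. This assumes the deployment exposes an "image" cluster option (deployment-specific), and the image tag below is a hypothetical placeholder:

```python
# Sketch only: assumes a dask-gateway deployment that exposes an "image"
# cluster option; the image tag is a hypothetical placeholder.
from dask_gateway import Gateway

gateway = Gateway()  # uses the gateway address configured on the platform

options = gateway.cluster_options()
options.image = "pangeo/foss4g-2022-tutorial:latest"  # hypothetical tag

cluster = gateway.new_cluster(options)  # workers run in the chosen image
cluster.scale(4)  # request 4 workers

client = cluster.get_client()
print(client.dashboard_link)  # verify workers start with the expected environment
```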

Update the content of chunking_introduction.ipynb

The goal of this issue is to discuss what we should put in the chunking introduction notebook.

This comes from comments and discussions in #45.

I believe we should do the following, but this needs discussion (especially with @tinaok):

  • Either remove the word compression or talk about it somewhere; a short chapter should be fine. Some subjects to cover:
    • Generalities (why compression matters, how it saves processing time despite the overhead of compressing -- loading big files into memory is slow), the different algorithms currently used in scientific domains, Python blosc, etc.
    • A section showing the difference between a NetCDF or Zarr store with and without compression, and the resulting sizes (see the sketch after this list).
    • Xarray defaults / the various data formats (is NetCDF compression activated by default?).
    • Compression is often applied per chunk for optimized access.
  • Add more context about Xarray and chunking before introducing kerchunk. Begin with open_dataset and the chunks keyword argument, introduce Dask arrays, and talk about laziness and sequential processing. Also cover the native chunks in files here, if possible without kerchunk (with h5py maybe?).
  • Kerchunk part: introduce it by discussing the difficulty of reading a dataset composed of several files in an optimal way. Then explain how it reads the file metadata, detects chunks, and constructs a Zarr-compatible metadata file that allows optimized access to the whole dataset with Xarray. Should we keep the main metadata file creation in the notebook? I'm not sure about that.
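As a possible starting point for the compression and chunking parts, here is a minimal sketch; the file path, the "time" dimension and the chunk sizes are hypothetical placeholders for whatever dataset the notebook ends up using:

```python
# Sketch only: "data/input.nc" and the "time" dimension are placeholders.
import numcodecs
import xarray as xr

# Passing chunks= to open_dataset returns Dask arrays: the data is loaded
# lazily, chunk by chunk, only when a computation actually needs it.
ds = xr.open_dataset("data/input.nc", chunks={"time": 100})
print(ds.chunks)

# Write the same dataset twice and compare the on-disk sizes
# (e.g. with `du -sh *.zarr` in a terminal).
# Zarr compresses by default, so explicitly disable it for the baseline store.
ds.to_zarr("uncompressed.zarr", mode="w",
           encoding={var: {"compressor": None} for var in ds.data_vars})

# Then with Blosc/zstd compression applied per chunk.
compressor = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}
ds.to_zarr("compressed.zarr", mode="w", encoding=encoding)
```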

Update content of dask_introduction.ipynb

Here are some propositions, as discussed in #45.

Please indicate whether this is OK for you (especially @tinaok):

  • Add a short part on Dask clusters: talk about the Client, the Scheduler, Workers, dask-gateway, Kubernetes, and HPC (only a quick word on the last two). Also talk about the Dask Dashboard (which needs to be accessible on the platforms we propose).
    • Maybe we want to do a short demonstration of Dask and its Dashboard on some code independent of our dataset? (See the sketch after this list.)
  • Is the call to optimize on the Dask graphs really important? I find it odd; you don't need this with DataFrames.
  • Remove the part about installing packages on worker pods. I think this is an advanced topic and should not be needed once we fix pangeo-data/pangeo-eosc#3.
  • Add some code at the end with the global scaling of some processing, as announced at the beginning of the notebook.
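For the dataset-independent Dashboard demonstration, a minimal sketch could look like the following; a local cluster is used here only so the demo does not depend on any deployment, and the array sizes are illustrative:

```python
# Sketch only: a LocalCluster keeps the demo independent of the dataset and
# of the deployment; on EOSC the client would instead come from dask-gateway.
import dask.array as da
from dask.distributed import Client

client = Client()  # starts a LocalCluster by default
print(client.dashboard_link)  # open this URL to watch the Dashboard

# A computation large enough to show tasks streaming through the Dashboard.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x + x.T).mean(axis=0).compute()
print(result.shape)
```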

Contributors

Summary
Issue to add contributor information with @allcontributors.

What needs to be done?
Write a comment with the contribution information, using the following nomenclature:
@all-contributors please add @<username> for <contributions>

See the available Emoji Key ✨ here.

Feel free to list the contribution types you find most relevant to this repo.

Further info about the all-contributors bot is available here.

Who can help?
Anyone

Where should the Notebooks go?

Currently, there are two main folders: NOTEBOOKS (I'm not fond of the upper-case style) and tutorial, which contains a notebooks subfolder.

I'm under the impression that all the notebooks workshop participants are meant to see should be in tutorial/notebooks.

If so, we should move the important content from the NOTEBOOKS folder to tutorial/notebooks, and either remove the NOTEBOOKS folder or rename it in a way that makes clear its content is not essential for learning about Pangeo.

Anyway, I think we should clarify things here.

What can I do to help?

Hi all,

It seems you are all working on some part of this repo: building use cases for @pl-marasco and @acocac, working on the docs/Jupyter Book for @annefou and @acocac, and integrating things with Dask distributed and EOSC for @tinaok.

I'd like to contribute some simple things to help, but I'm not sure which part to dive into, beyond reviewing content/notebooks if asked.

A few things I noticed when opening the repo:

  • It lacks some information/links about FOSS4G and the Pangeo session schedule.
  • It lacks some basic information on how to try things out: do you run the notebooks locally, do you use the EOSC Pangeo Hub, or something else? Maybe this could be added to the README for the different situations.
  • It lacks a link to the generated Jupyter Book.
  • It lacks some kind of Binder/repo2docker setup for reproducibility (or just the documentation mentioned above?).

Should you/we open issues for the various tasks to do, and see who wants to work on them?

Or do you have ideas of simple things I could do? (Unfortunately, I cannot commit to lengthy tasks.)

Best,

Do we need a Setup chapter in the tutorial notebooks?

We already have a main setup page and an environment.yaml file.

Is it really important that each of our notebooks lists the main libraries it needs?

How do we choose which libraries to list there: those imported in cells? Those whose objects we use in one way or another?

I understand that we may want each notebook to contain all the information needed to run it, but this section is really hard to maintain; maybe we should just link to the setup page and the environment.yaml?

Split data access and discovery sections

I'd propose having a separate episode for data discovery after Parallel computing with Dask. We can keep a short episode introducing remote access with s3fs before Data chunking.
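For the short remote-access episode, a minimal sketch could look like this; the endpoint URL, bucket name, and object key are hypothetical placeholders:

```python
# Sketch only: the endpoint, bucket and object key are placeholders.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(
    anon=True,  # anonymous, read-only access to a public bucket
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
)

print(fs.ls("example-bucket"))  # discovery: list what is available

# Open a single NetCDF object remotely, without downloading it first.
with fs.open("example-bucket/example-file.nc") as f:
    ds = xr.open_dataset(f)
    print(ds)
```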

Timeline

Add the timeline and names (who is teaching what).
