shauryashaurya / learn-data-munging

Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.

License: MIT License

Jupyter Notebook 99.98% CSS 0.01% Rust 0.01%
dask data-engineering jupyter pandas pyspark ray spark arrow dask-distributed datafusion

learn-data-munging's Introduction

Data Munging Using *X* in Python, Rust & Julia

Data Engineering Workshops on some of the more popular libraries, frameworks and tech circa 2023-2024.

Data Wrangling with Python, Rust and Julia, Image © Shaurya Agarwal, created using Dalle and GIMP

Data Engineers working with Python, Rust and Julia :P


Notebooks

00 Python Collections

This set of notebooks works through examples of how some pretty sophisticated data engineering can be done using Python Collections, Itertools and Functools. It uses the small MovieLens dataset.

  • Basic Collections and the Collections Module: Notebook also Open In Colab
  • NumPy vs Python Collections: Notebook also Open In Colab
  • Wrangling MovieLens with Pandas - Part 1: Getting Started, Load the MovieLens dataset: Notebook also Open In Colab
  • Wrangling MovieLens with Pandas - Part 2: Playing with the Movies and Ratings data: Notebook also Open In Colab
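
As a rough illustration of the kind of wrangling the Collections, Itertools and Functools notebooks work through, here is a minimal sketch using only the standard library. The rows are made up for this example and only mimic the shape of a MovieLens ratings file (userId, movieId, rating); they are not taken from the dataset, and the variable names are hypothetical rather than the notebooks' own.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Hypothetical rows shaped like MovieLens ratings: (userId, movieId, rating).
ratings = [
    (1, 10, 4.0),
    (1, 20, 3.0),
    (2, 10, 5.0),
    (2, 30, 2.0),
    (3, 10, 3.0),
]

# Counter: how many ratings each movie received.
ratings_per_movie = Counter(movie_id for _, movie_id, _ in ratings)

# defaultdict: collect every rating value under its movie.
by_movie = defaultdict(list)
for _, movie_id, rating in ratings:
    by_movie[movie_id].append(rating)

# functools.reduce: sum each list, then divide by its length for the mean.
mean_rating = {
    movie_id: reduce(lambda a, b: a + b, values) / len(values)
    for movie_id, values in by_movie.items()
}

# itertools.groupby: the input must be sorted by the grouping key (userId).
by_user = itemgetter(0)
ratings_per_user = {
    user_id: len(list(rows))
    for user_id, rows in groupby(sorted(ratings, key=by_user), key=by_user)
}

print(ratings_per_movie.most_common(1))
print(mean_rating)
print(ratings_per_user)
```

Note that `groupby` only merges adjacent rows with the same key, which is why the rows are sorted first; the `defaultdict` approach does not need sorted input.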

01 - Toy introduction to the basics

  • 01 - Setting up Spark locally (on Windows): Notebook also Open In Colab

  • 02 - How to run Apache Spark based notebooks in Google Colab: Notebook also Open In Colab

02 - A set of notebooks exploring data wrangling in depth using the MovieLens dataset

  • Part 01: Overview, Starting Spark and Loading the data: Notebook or Open In Colab

  • Part 02: Data Analysis basics using tags.csv from the MovieLens dataset: Notebook or Open In Colab

04 Dask

  • Distributed Data Analysis with Dask - Part 1: Getting Started, Load the MovieLens dataset: Notebook also Open In Colab
  • Distributed Data Analysis with Dask - Part 2: Playing with the Movies data: Notebook also Open In Colab
  • Polars with the MovieLens dataset - Getting Started, Load the MovieLens dataset, A quick look at Arrow, and some analysis: Notebook also Open In Colab
  • 01 - 10+ minutes to Arrow+DataFusion+Ballista [WIP]: Notebook also Open In Colab

07 Ray

  • [WIP]

99 Static: The TPC Benchmark Queries

  • [WIP]

Note

The "10+ minutes to XX" notebooks are references only and are not meant to be run as workshop material. They carry the toy examples found on the "getting started" pages for XX. I have tried to ensure there is a 10+ minutes notebook for each data engineering library or framework considered here. While it may be useful to go through these to quickly refresh the syntax and other idiosyncrasies, the actual data munging happens in the other notebooks.

References

04 Dask

The approach is different: Dask focuses on task scheduling, whereas Spark follows a Map-Reduce model.
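
To make the "task scheduling" idea a little more concrete, here is a toy sketch in plain Python. It is not Dask's scheduler and does not use the Dask API: the `graph` dict, its key names and the `get` helper are all hypothetical, written only to show what it means to describe a computation as a graph of small tasks and then resolve dependencies on demand.

```python
from operator import add, mul

# A task graph as a plain dict (illustrative only):
# each key maps to a literal value or to (function, *keys of its inputs).
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "product": (mul, "sum", "y"),
}


def get(graph, key, cache=None):
    """Compute `key` by recursively computing the tasks it depends on."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *deps = task
        result = func(*(get(graph, dep, cache) for dep in deps))
    else:
        result = task  # a literal value, nothing to run
    cache[key] = result  # each task runs at most once per call tree
    return result


print(get(graph, "product"))
```

Only the tasks that "product" depends on are executed. A real scheduler would additionally decide where and when each task runs, and could run independent tasks in parallel.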

07 Ray

Future State / Miscellany

Datasets we use:

There are a lot of interesting (to me, at least) tools, datasets and papers out there.
When there's time or need, we'll get to them as well.

MOAR GIMME MOAR LINKS!!!

Kitchen sink of all other references I've found useful (or wonderful). There's so much to learn I tell you!

learn-data-munging's People

Contributors

dependabot[bot], shauryashaurya

learn-data-munging's Issues

Memory management for Arrow

How does (py)Arrow manage memory when operations like concat or dropping a column happen? Is it zero-copy, or is it just a manipulation of pointers internally?
