sravi2421 / nyc-transport

This project forked from r-shekhar/nyc-transport

A Unified Database of NYC transport (subway, taxi/Uber, and citibike) data.

License: BSD 3-Clause "New" or "Revised" License

Shell 0.03% Python 0.06% Jupyter Notebook 99.92%

NYC-Transport Readme

This is a combined repository of all publicly available New York City transit datasets.

  • Taxi and Limousine Commission (TLC) taxi trip data
  • FOIA-requested Uber trip data for portions of 2013-2015
  • Subway turnstile data from the Metropolitan Transportation Authority (MTA)
  • Citibike system data

This repository contains code to download all the data, clean it, remove corrupted data, and produce a set of pandas dataframes, which are written to Parquet format files using Dask and Fastparquet.

These Parquet files are repartitioned on disk with PySpark, and the resulting files are queried with PySpark SQL and Dask to produce data science results in Jupyter notebooks.

Requirements

  • Python 3.4+

  • Beautiful Soup 4

  • Bokeh

  • Dask Distributed

  • FastParquet

  • Geopandas

  • Jupyter

  • Numba 0.29+

  • Palettable

  • PyArrow

  • PySpark 2.0.2+

  • Python-Snappy

  • Scikit-Learn

  • Seaborn

A tutorial on my blog shows how to set up an environment compatible with this analysis on Ubuntu. This tutorial has been tested locally and on Amazon EC2.
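However the environment is set up, a quick check that the required modules can be imported may save time before starting the long-running steps. The following is a minimal sketch, not part of the repository: the function name `missing_modules` is hypothetical, and the import names are assumptions mapped from the package names listed above (for example, Beautiful Soup 4 imports as `bs4`, Python-Snappy as `snappy`, Scikit-Learn as `sklearn`, and Jupyter is checked via `jupyter_core`).

```python
from importlib.util import find_spec

# Import names assumed for the requirements listed above. They differ from
# the package names in a few cases (e.g. beautifulsoup4 -> bs4).
REQUIRED_MODULES = [
    "bs4", "bokeh", "distributed", "fastparquet", "geopandas",
    "jupyter_core", "numba", "palettable", "pyarrow", "pyspark",
    "snappy", "sklearn", "seaborn",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only confirms that each module can be located; it does not check versions such as Numba 0.29+ or PySpark 2.0.2+.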

If you want to skip obtaining and processing the raw Taxi/Uber data into Parquet format, the processed dataset is available on Academic Torrents here.

Steps

  1. Set up your conda environment with the modules above.

    conda install -c conda-forge \
        beautifulsoup4 bokeh distributed fastparquet geopandas \
        jupyter numba palettable pyarrow python-snappy  \
        scikit-learn seaborn
    conda install -c quasiben spark
  2. Download the data using the scripts in the 00_download_scripts directory.

    • ./make_directories.sh -- Alternatively you can create a raw_data directory elsewhere and symlink it.
    • python download-subway-data.py (~ 10 GB)
    • ./download-bike-data.sh (~7 GB)
    • ./download-taxi-data.sh (~250 GB)
    • ./download-uber-data.sh (~5 GB)
    • ./decompress.sh
  3. Convert the data to Parquet format using the scripts in 05_raw_to_dataframe. Times given are for a 4 GHz i5-3570K (4 cores) with a fast SSD and 16 GB of memory.

    • Adjust config.json to have correct input and output paths for your system
    • python convert_bike_csv_to_parquet.py (~2 hours)
    • python convert_subway_to_parquet.py (~2 hours)
    • python convert_taxi_to_parquet.py (~32 hours)
  4. Repartition and recompress the Parquet files with PySpark in 06_repartition for efficient access. This is especially useful for later stages, where queries run on Amazon EC2 with a distributed Spark engine reading files from S3.
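Because the downloads are large and the conversions are slow, it can help to total the figures from steps 2 and 3 and compare them against free disk space before starting. The sketch below is illustrative only and is not part of the repository: the helper names (`total`, `free_gb`, `has_room`) are hypothetical, and the sizes and times are the approximate values quoted in the steps above. Decompression and the Parquet output need additional space that this README does not quantify, so treat the download total as a lower bound.

```python
import shutil

# Approximate download sizes in GB, copied from step 2 above.
DOWNLOAD_GB = {"subway": 10, "bike": 7, "taxi": 250, "uber": 5}

# Approximate conversion times in hours, copied from step 3 above.
CONVERT_HOURS = {"bike": 2, "subway": 2, "taxi": 32}


def total(values):
    """Sum the values of a name -> number mapping."""
    return sum(values.values())


def free_gb(path="."):
    """Free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", needed_gb=None):
    """True if `path` has at least `needed_gb` free (default: all downloads)."""
    if needed_gb is None:
        needed_gb = total(DOWNLOAD_GB)
    return free_gb(path) >= needed_gb


if __name__ == "__main__":
    print("Downloads need about", total(DOWNLOAD_GB), "GB")
    print("Conversions take about", total(CONVERT_HOURS), "hours")
    print("Enough free space here for the downloads:", has_room("."))
```

Pass the path of the raw_data directory (or its symlink target) to `has_room` to check the filesystem where the data will actually be stored.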

Analysis

Analysis scripts and notebooks live in the 15_dataframe_analysis directory. Some require PySpark 2+, but most simply require Dask and Jupyter.
