chaddling / bikeshareto_data

Cleaned/augmented datasets of Bike Share Toronto trip events from 2016 (partial) to 2017, as well as Python scripts used to perform the data engineering.

License: Other

Python 100.00%
bikeshare-data transportation bikeshare

Bike Share Toronto Data

This repository contains the cleaned/augmented datasets of Bike Share Toronto trip events from 2016 (partial) to 2017, as well as the Python scripts used to perform the data engineering. Each start/end station pair is additionally assigned its latitude/longitude coordinates and the trip distance between the two stations.
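The text above does not say how the trip distance is computed. As a minimal sketch, assuming a great-circle (haversine) distance between the two stations' coordinates, it could look like the following. The function name `haversine_km` and the choice of kilometres are assumptions for illustration, not taken from the repository's scripts.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

A straight-line distance like this underestimates the distance actually ridden, since cyclists follow the street grid.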

Inspired by the UCI Machine Learning Bike Sharing Dataset, I also generate hourly usage datasets containing local weather information. Cleaned datasets are located in this folder.

Go here for an analysis of the data.

Data sources

From Open Data Catalogue published by the city of Toronto:

Additional sources:

Comments on the data

Each raw dataset contains important columns such as:

  • trip_duration: duration of the trip, along with the timestamp trip_start_time. I dropped the trip_start_time timestamp and added a column wkday for the day of the week.

  • from_station_name, to_station_name and their IDs from_station_id, to_station_id, which can be used to look up their corresponding latitude and longitude coordinates in Station Information. The IDs are missing in some datasets.

  • user_type: whether the user is a Bike Share Toronto member or not.

  • See readme.txt for information on the cleaned data columns.
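Replacing trip_start_time with wkday can be sketched as below. This is an illustration under assumptions: the timestamp format string, the helper name `add_wkday`, and the encoding 0 = Monday are not taken from the repository (see readme.txt for the encoding actually used).

```python
from datetime import datetime

def add_wkday(row, fmt="%Y-%m-%d %H:%M"):
    """Return a copy of row with trip_start_time replaced by wkday (0 = Monday, assumed)."""
    start = datetime.strptime(row["trip_start_time"], fmt)
    cleaned = {key: value for key, value in row.items() if key != "trip_start_time"}
    cleaned["wkday"] = start.weekday()
    return cleaned

# 1 January 2017 was a Sunday, so wkday is 6 under this encoding.
print(add_wkday({"trip_start_time": "2017-01-01 00:05", "trip_duration": 300}))
```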

Cleaning the data

Some datasets only contain station names, which is not great for identification/location purposes. For such datasets, I use the station name to look up an ID as well as its latitude/longitude in Station Information. This is tricky due to differences in spelling between the two sources.

I parsed the station names into street names, ignoring designations such as E/W and other extra symbols. Using the parsed names in Station Information, I built a dictionary intersection_lookup that can be searched as follows:

intersection_lookup[name1][name2] = station_name

For example, station_name = "Fort York Blvd / Capreol Crt" is parsed into, and can be looked up by, name1 = "Fort York Blvd" and name2 = "Capreol Crt". Values of the name columns in the Bike Share Ridership Data are parsed into name1, name2 and looked up to find their names, identifiers, coordinates, etc. in the JSON file. Station names with a single street name/string (e.g. "424 Wellington St 1", "Union Station") are looked up with name2 = ''.
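The parsing and lookup described above can be sketched as follows. This is not the repository's code: the helper names `parse_station_name` and `build_intersection_lookup`, the set of ignored designations, and the rule for stripping symbols are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Directional designations to ignore (assumed set).
DIRECTIONS = {"e", "w", "n", "s"}

def parse_station_name(station_name):
    """Split 'A St W / B Ave' into ('A St', 'B Ave'); single names give (name, '')."""
    cleaned = []
    for part in station_name.split("/"):
        # Replace punctuation with spaces, then drop directional tokens.
        tokens = re.sub(r"[^\w\s]", " ", part).split()
        tokens = [t for t in tokens if t.lower() not in DIRECTIONS]
        cleaned.append(" ".join(tokens))
    name1 = cleaned[0]
    name2 = cleaned[1] if len(cleaned) > 1 else ""
    return name1, name2

def build_intersection_lookup(station_names):
    """Build intersection_lookup[name1][name2] = station_name."""
    lookup = defaultdict(dict)
    for station_name in station_names:
        name1, name2 = parse_station_name(station_name)
        lookup[name1][name2] = station_name
    return lookup

intersection_lookup = build_intersection_lookup(
    ["Fort York Blvd / Capreol Crt", "Union Station"]
)
# A differently spaced spelling from the ridership data still resolves.
name1, name2 = parse_station_name("Fort York Blvd/Capreol Crt")
print(intersection_lookup[name1][name2])
```

A nested dictionary keeps the lookup exact on each street name. Names that still fail to match would fall through to the exceptions handled separately.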

Exceptions (in data_exceptions.py)

  • Removal of data points: these include rows whose station name columns contain NaN, and names ("Base Station", "Fringe Next Stage - 7219") which cannot be tied to a location.
  • Names which are not found in Station Information are either mapped to their nearest station, handled as spelling exceptions, or renamed to match the correct entry in the JSON file.
  • Huge error in the data (!!): in the 2017 data, "Bay St / Bloor St W" has id = 7029, but in the JSON file this id is associated with "St. James Park (King St. E.)". The id is kept as is (since St. James Park does not show up in any of the datasets) and the coordinates are put in by hand to ensure distances are calculated correctly. Given the centrality of the Bay St / Bloor St intersection, this should not be overlooked.
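The three kinds of exception above could be organised roughly as follows. This is a hypothetical sketch, not the contents of data_exceptions.py: the helper names, the `renames` and `overrides` tables, and the coordinates in the usage lines are placeholders for illustration only.

```python
import math

# Names from the list above that cannot be tied to a location.
UNLOCATABLE = {"Base Station", "Fringe Next Stage - 7219"}

def is_missing(name):
    """True for None or a float NaN, as produced by an empty cell."""
    return name is None or (isinstance(name, float) and math.isnan(name))

def resolve_station_name(name, renames):
    """Return the corrected station name, or None if the row should be dropped."""
    if is_missing(name) or name in UNLOCATABLE:
        return None
    # renames maps a raw spelling to the spelling used in Station Information.
    return renames.get(name, name)

def station_coords(station_id, station_info, overrides):
    """Prefer hand-entered coordinates over those from Station Information."""
    return overrides.get(station_id, station_info.get(station_id))

# Hypothetical usage: the coordinates below are rough placeholders, not real data.
station_info = {7029: (0.0, 0.0)}      # entry that belongs to another station
overrides = {7029: (43.67, -79.39)}    # approximate, illustrative only
print(station_coords(7029, station_info, overrides))
```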

Generating hourly usage data

Using the cleaned data tables containing all events within a year, as well as weather information, I tallied the total hourly usage along with a breakdown into casual vs. member usage. Each row in the hourly-usage table is assigned a temperature and a weather condition.
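A minimal sketch of the hourly tally, assuming each trip event is available as a (start time, user type) pair and that user_type takes the values "Member" and "Casual". The function name `hourly_usage` and the output layout are assumptions; the weather columns are left out.

```python
from collections import defaultdict
from datetime import datetime

def hourly_usage(trips):
    """Count trips per hour, split by user type.

    trips: iterable of (start: datetime, user_type: str) pairs.
    """
    counts = defaultdict(lambda: {"casual": 0, "member": 0, "total": 0})
    for start, user_type in trips:
        # Truncate the start time to the hour it falls in.
        hour = start.replace(minute=0, second=0, microsecond=0)
        key = "member" if user_type.lower() == "member" else "casual"
        counts[hour][key] += 1
        counts[hour]["total"] += 1
    return dict(counts)

trips = [
    (datetime(2017, 1, 1, 8, 5), "Member"),
    (datetime(2017, 1, 1, 8, 50), "Casual"),
    (datetime(2017, 1, 1, 9, 0), "Member"),
]
for hour, row in sorted(hourly_usage(trips).items()):
    print(hour, row)
```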

Adjusted temperature adjtemp is calculated either as wind chill (when the temperature is below 5 degrees Celsius) or as humidex, using additional information such as wind speed and dew point temperature from the weather data. Numerical data columns are normalized (see readme.txt).
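The exact formulas are not given here. As a sketch, assuming the standard Environment Canada wind chill and humidex formulas (temperature and dew point in degrees Celsius, wind speed in km/h), adjtemp before normalization could be computed as below. The function names, the cutoff for light winds, and the floor at the air temperature are assumptions.

```python
import math

def wind_chill(temp_c, wind_kmh):
    """Wind chill index; the formula is intended for winds of at least 5 km/h."""
    if wind_kmh < 5:
        return temp_c  # simplification for light winds (assumed)
    v = wind_kmh ** 0.16
    return 13.12 + 0.6215 * temp_c - 11.37 * v + 0.3965 * temp_c * v

def humidex(temp_c, dew_point_c):
    """Humidex from air temperature and dew point."""
    vapour = 6.11 * math.exp(5417.7530 * (1 / 273.16 - 1 / (273.15 + dew_point_c)))
    # Never report a value below the air temperature (assumed convention).
    return max(temp_c, temp_c + 0.5555 * (vapour - 10.0))

def adjtemp(temp_c, wind_kmh, dew_point_c):
    """Wind chill below 5 degrees Celsius, humidex otherwise."""
    if temp_c < 5:
        return wind_chill(temp_c, wind_kmh)
    return humidex(temp_c, dew_point_c)

print(round(adjtemp(-10, 20, -15), 1))  # cold day: wind chill branch
print(round(adjtemp(30, 10, 15), 1))    # warm day: humidex branch
```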

In the raw weather data, the weather condition is recorded at 3-hour intervals starting from 1:00 am, with more frequent in-between recordings if the condition changes. Thus, empty recordings are assigned the last observed condition(s). The condition column in the hourly-usage tables is human-annotated and is assigned an integer from 1 (good weather condition) to 4 (severe weather condition). See readme.txt for how this value was assigned.
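Filling empty recordings with the last observed condition is a forward fill. A minimal sketch, assuming the hourly conditions arrive as a list in time order and that empty recordings appear as None or an empty string (the function name `forward_fill` is hypothetical):

```python
def forward_fill(conditions):
    """Replace empty entries with the most recent non-empty observation."""
    filled = []
    last = None  # stays None until the first observation is seen
    for cond in conditions:
        if cond not in (None, ""):
            last = cond
        filled.append(last)
    return filled

print(forward_fill(["Clear", None, "", "Rain", None]))
```

Entries before the first observation stay None, since there is nothing earlier to carry forward.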
