This repository contains the cleaned/augmented datasets of Bike Share Toronto trip events from 2016 (partial) to 2017, as well as the Python scripts used to perform the data engineering. Each start/end station pair is additionally assigned latitude/longitude coordinates and the distance of a trip between the pair.
Inspired by the UCI Machine Learning Bike Sharing Dataset, I also generate hourly usage datasets containing local weather information. Cleaned datasets are located in this folder.
Go here for an analysis of the data.
From the Open Data Catalogue published by the City of Toronto:
Additional sources:
- Historical weather data from Environment Canada
- Open Source Routing Machine (OSRM) API
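As a sketch of how trip distances between station pairs can be obtained from OSRM, the helper below builds a request URL for the HTTP route service. The demo host and the `bike` profile name are assumptions; a self-hosted OSRM instance may use a different host and profile.

```python
def osrm_route_url(lat1, lon1, lat2, lon2,
                   host="http://router.project-osrm.org"):
    """Build an OSRM /route request URL for a station pair.

    OSRM expects coordinates in lon,lat order; the response JSON
    carries the routed distance in metres in routes[0]["distance"].
    """
    return (f"{host}/route/v1/bike/"
            f"{lon1},{lat1};{lon2},{lat2}?overview=false")
```

Fetching this URL (e.g. with `requests.get(url).json()`) and reading `routes[0]["distance"]` gives the routed trip distance.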
Each raw dataset contains important columns such as:
- `trip_duration`: duration of travel, along with the timestamp `trip_start_time`. I dropped the `trip_start_time` timestamp and added a column `wkday` for the day of the week.
- `from_station_name`, `to_station_name` and their IDs `from_station_id`, `to_station_id`, which can be used to look up the corresponding latitude and longitude coordinates in Station Information. The IDs are missing in some datasets.
- `user_type`: whether the user is a Bike Share Toronto member or not.
See readme.txt for information on the cleaned data columns.
Some datasets only contain station names, which is not great for identification/location purposes. For such datasets, I use the station name to look up an ID as well as its latitude/longitude in Station Information. This is tricky due to differences in spelling between the two sources.
I parsed the station names into street names, ignoring designations such as E/W and other extra symbols. Using the parsed names in Station Information, I built a dictionary `intersection_lookup` that can be searched as follows:

`intersection_lookup[name1][name2] = station_name`

For example, `station_name = "Fort York Blvd / Capreol Crt"` is parsed into, and can be looked up by, `name1 = "Fort York Blvd"` and `name2 = "Capreol Crt"`. Values of the name columns in the Bike Share Ridership Data are parsed into `name1`, `name2` and looked up to find their names, identifiers, coordinates, etc. in the JSON file. Station names consisting of a single street name/string (e.g. "424 Wellington St 1", "Union Station") are looked up with `name2 = ''`.
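A minimal sketch of how such a parse-and-lookup dictionary could be built. The exact cleaning rules in the repository's scripts may differ; the regexes here are assumptions:

```python
import re
from collections import defaultdict

def parse_station_name(name):
    """Split a name like 'Fort York Blvd / Capreol Crt' into (name1, name2),
    dropping trailing E/W/N/S designations and stray symbols (assumed rules)."""
    clean = []
    for part in name.split("/"):
        part = part.strip()
        part = re.sub(r"\s+[EWNS]$", "", part)       # drop 'Bloor St W' -> 'Bloor St'
        part = re.sub(r"[^\w\s.']", "", part).strip()  # drop extra symbols
        clean.append(part)
    name1 = clean[0]
    name2 = clean[1] if len(clean) > 1 else ""  # single-string names use ''
    return name1, name2

def build_intersection_lookup(station_names):
    """Map parsed (name1, name2) pairs back to the full station name."""
    lookup = defaultdict(dict)
    for full_name in station_names:
        name1, name2 = parse_station_name(full_name)
        lookup[name1][name2] = full_name
    return lookup
```

With this, `lookup["Fort York Blvd"]["Capreol Crt"]` recovers the full station name, and a single-string station such as "Union Station" is found under `lookup["Union Station"][""]`.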
- Removal of data points: rows whose station name columns contain NaN, and name instances ("Base Station", "Fringe Next Stage - 7219") that cannot be identified with a location.
- Names not found in Station Information are either mapped to their nearest station, handled as spelling exceptions, or renamed to match the correct entry in the JSON file.
- Huge error in the data (!!): in the 2017 data, "Bay St / Bloor St W" has `id = 7029`, but in the JSON file this ID is associated with "St. James Park (King St. E.)". This `id` is kept consistent (since St. James Park does not show up in any of the datasets) and its coordinates are entered by hand to ensure distances are calculated correctly. Given the centrality of the Bay St / Bloor St intersection, this error is too important to overlook.
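Mapping an unmatched name to its nearest station can be sketched with a great-circle (haversine) distance. The function names and the dict layout here are illustrative, not the repository's actual code:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius in km

def nearest_station(lat, lon, stations):
    """stations: dict mapping station name -> (lat, lon)."""
    return min(stations, key=lambda name: haversine_km(lat, lon, *stations[name]))
```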
Using the cleaned data tables containing all events within a year, together with weather information, I tallied total hourly usage as well as a breakdown of casual vs. member usage. Each row in the hourly-usage table is assigned a temperature and a weather condition. The adjusted temperature `adjtemp` is calculated either as wind chill (when the temperature is below 5 degrees Celsius) or as humidex, using additional information such as wind speed and dew point temperature from the weather data. Numerical data columns are normalized (see readme.txt).
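The two adjustments correspond to Environment Canada's standard wind chill and humidex formulas. A sketch, where the 5 °C switchover follows the description above and the exact constants used in the repository's scripts may differ:

```python
from math import exp

def wind_chill(temp_c, wind_kmh):
    """Environment Canada wind chill index (nominally valid for
    temp <= 10 C and wind >= 4.8 km/h)."""
    v = wind_kmh ** 0.16
    return 13.12 + 0.6215 * temp_c - 11.37 * v + 0.3965 * temp_c * v

def humidex(temp_c, dew_point_c):
    """Humidex from air temperature and dew point temperature."""
    # Vapour pressure from the dew point (hPa)
    e = 6.11 * exp(5417.7530 * (1 / 273.16 - 1 / (273.15 + dew_point_c)))
    return temp_c + 0.5555 * (e - 10.0)

def adjtemp(temp_c, wind_kmh, dew_point_c):
    # Below 5 C use wind chill; otherwise humidex (per the description above)
    return wind_chill(temp_c, wind_kmh) if temp_c < 5 else humidex(temp_c, dew_point_c)
```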
In the raw weather data, the weather condition is recorded at 3-hour intervals starting from 1:00 am, with more frequent in-between recordings when the condition changes. Empty recordings are therefore assigned the last observed condition(s). The `condition` column in the hourly-usage tables is human-annotated and assigned an integer from 1 (good weather condition) to 4 (severe weather condition). See readme.txt for how this value was assigned.
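Carrying the last observed condition forward into empty hourly slots can be sketched as follows; this is a plain-Python stand-in for a pandas-style forward fill, and the actual column handling in the scripts may differ:

```python
def fill_conditions(hourly_conditions):
    """Assign empty (None) recordings the last observed weather condition."""
    filled, last = [], None
    for cond in hourly_conditions:
        if cond is not None:
            last = cond  # a new observation replaces the carried value
        filled.append(last)
    return filled
```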