Author: Sunil Veeravalli
This is a case study to accurately find out the assigned gate for an aeroplane from sensor readings that will make the airport more efficient and improve its billing application.
The challenge problem involves an airport. The airport has four gates and seven sensors placed around the airport. The airport would like to know which gate the aircraft went to given the sensor readings. Being able to accurately establish the assigned gate from sensor readings will make the airport more efficient and improve its billing application.
The airport has provided 2,000 examples of a gate assignment along with the readings of seven sensors at each assignment. The challenge is to build a model that will accurately predict the gate based on the sensor readings available.
The following data files are available:
- master.csv โ Data on the gate assignments for incoming flights
- assignment - The key of a gate assignment event provided by the airport
- gate - The gate assigned when the aircraft opened its doors
- stream[1-7].csv โ Data from each sensor is provided in a separate file
- assignment - The key of a gate assignment event provided by the airport
- s[1-7] - A reading from the sensor at the event time
- n[1-7] - A reading from the sensor at the event time
Upon extracting the zipped folder you received, the file structure you see will be as follows:
- Folder: 'Scripts'
- Contains jupyter notebook files.
- Start reading with file
01_Importing and Cleaning.ipynb
to07_Final.ipynb
in the same order.
- Folder: 'Raw_Data'
- Contains all instruction and data files sent by you
- Folder: 'Clean_Data'
- Contains files and folders I created during the project, so that one need not run all scripts to create an object. Import the object from this 'Clean_Data' folder and start from there.
- Folders: 'Scenario[1-5]'
- Each folder contains data files in pickle and numpy format
- Folder: 'Final'
- Contains the final tables for the metrics: precision, recall and fscore generated from all models and all scenarios
- Folder: 'Images'
- Contains images and snapshots used in creating this
Readme.md
file
- Contains images and snapshots used in creating this
- Folder: '.git'
- Used a git version control system to track and manage file creations and changes.
- I answered the question in five different scenarios and the work flow is as follows:
- Script for each scenario is saved as an independent file in the folder Scripts
-
'Jupyter notebook' package should be installed either using pip or conda
-
On Windows computer after pip installation of jupyter notebook:
- Open
cmd
prompt and change to path to your 'Cirium Project' directory - Type
jupyter notebook
and press enter
- Open
- Imported data from 'Raw_Data' folder into jupyter notebook and then performed EDA analysis
- Rows with extreme outliers were dropped
- All other outliers were flagged using a combination of ZScores and Isolation forest methods
- Feature scaling, dimensionality reduction, modelling and validation
- Rows with extreme outliers were dropped
- Rows with at least one outlier (based on ZScores) were dropped
- Feature scaling, dimensionality reduction, modelling and validation
- s/n ratio for each sensor was calculated and dropped independent features: s1, n1, ..., n7
- Rows with extreme outliers were dropped
- Rows with at least one outlier (based on ZScores) were dropped
- Feature scaling, dimensionality reduction, modelling and validation
- s minus n values are calculated and dropped independent features: s1, n1, ..., n7
- Rows with extreme outliers were dropped
- Rows with at least one outlier (based on ZScores) were dropped
- Feature scaling, dimensionality reduction, modelling and validation
- Features 's6', 'n6', 's7' and 'n7' were dropped
- Rows with extreme outliers were dropped
- All other outliers were flagged using a combination of ZScores and Isolation forest methods
- Feature scaling, dimensionality reduction, modelling and validation
- All the metrics (precision, recall, fscore) from each scenario saved as pickle file in their respective 'Clean_Data' folders were pulled and merged together
- Top five models based on their average score were populated and used as a basis to get to the following conclusion.