Simple project to demo AutoML on NYC taxi trip duration prediction.
While explaining Optuna to a client in the context of hyperparameter tuning, and researching the topic further, I came across AutoGluon, which offers "AutoML for images, text, and tabular data". After a quick scan of the documentation, I decided to try it on a simple project and see how it performs.
I have always loved the Kaggle competition NYC Taxi Trip Duration: the data combines spatial, temporal, and traditional attribute information, which makes it a great dataset for testing models and feature engineering techniques.
I used a local Apache Spark instance (as it is my go-to ETL engine) to perform some feature engineering before letting AutoGluon do its magic. For a quick proof of concept, the results are quite impressive. Below are the steps to reproduce the project and visualize the result.
git clone https://github.com/mraad/auto_ml_nyc.git
cd auto_ml_nyc
conda env create
conda activate auto_ml_nyc
- Download the train.csv file from Kaggle and place it in the same folder as this project, so it can be locally referenced.
- Download Apache Spark 3.4.2. In my case, I placed the unzipped folder in my home folder.
./auto_ml_nyc.sh
- In your browser, navigate to http://localhost:8989/lab
- Load and run the auto_ml_nyc.ipynb notebook.
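To give a feel for the kind of spatial and temporal feature engineering mentioned earlier, here is a minimal sketch in plain Python. It is not the code from the notebook, which uses Spark. The column names are the ones in Kaggle's train.csv, while the function names, the chosen features, and the sample row are assumptions made for illustration only.

```python
import math
from datetime import datetime

# Mean Earth radius in kilometers (assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    lam1, phi1, lam2, phi2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (
        math.sin((phi2 - phi1) / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def engineer_features(row):
    """Derive a few illustrative features from one raw trip record (a dict of strings).

    Hypothetical helper: the notebook may compute different features.
    """
    pickup = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "pickup_hour": pickup.hour,           # 0..23
        "pickup_weekday": pickup.weekday(),   # Monday is 0
        "passenger_count": int(row["passenger_count"]),
        "distance_km": haversine_km(
            float(row["pickup_longitude"]),
            float(row["pickup_latitude"]),
            float(row["dropoff_longitude"]),
            float(row["dropoff_latitude"]),
        ),
    }


# Made-up sample row using the train.csv column names.
sample = {
    "pickup_datetime": "2016-03-14 17:24:55",
    "passenger_count": "1",
    "pickup_longitude": "-73.98",
    "pickup_latitude": "40.75",
    "dropoff_longitude": "-73.98",
    "dropoff_latitude": "40.76",
}
features = engineer_features(sample)
print(features)
```

Features like these would then be handed to AutoGluon together with the trip_duration label. In the notebook the same idea is expressed with Spark, so that it scales to the full training file.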
Happy AutoMLing, and if you find this project informative, consider giving it a ⭐ on GitHub!