MLOps Zoomcamp Final Project - Predict Airline Satisfaction

A project that builds a pipeline applying the MLOps technologies covered in the DataTalks.Club MLOps Zoomcamp.

Problem Statement

This project builds an MLOps pipeline to predict airline customer satisfaction from a handful of metrics tracked by the airline. The planned stack covers environment creation (Docker), infrastructure as code (IaC) (Terraform), workflow orchestration (Prefect), cloud (AWS), experiment tracking (MLflow), and monitoring (Evidently); not every component was implemented, as noted under Data Pipeline. After EDA and a general model comparison performed in flight-satisfaction-prediction-scratch.ipynb, a CatBoost model was selected for use in the pipeline and for further hyperparameter tuning.

Data

The dataset can be sourced from Kaggle.
It contains an airline passenger satisfaction survey with the following fields:
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)
Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Airline satisfaction level (Satisfaction, neutral or dissatisfaction)
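
The rating fields in the list above can be checked for malformed values before training. The following is a minimal sketch, not code from this repository: `RATING_COLUMNS` lists only a hypothetical subset of the fields, `invalid_ratings` is an illustrative helper name, and the 0 to 5 range for fields other than the wifi rating is an assumption based on the wifi entry.

```python
# Hypothetical subset of the survey rating fields listed above.
RATING_COLUMNS = [
    "Inflight wifi service",
    "Seat comfort",
    "Cleanliness",
]


def invalid_ratings(row: dict) -> list[str]:
    """Return the rating columns whose value is missing or not an integer in 0..5."""
    bad = []
    for column in RATING_COLUMNS:
        value = row.get(column)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 5:
            bad.append(column)
    return bad
```

Calling `invalid_ratings(row)` on each record and dropping or logging rows with a non-empty result is one simple way to keep out-of-range survey answers out of the training set.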

Chosen model

From EDA and a rough comparison of off-the-shelf scikit-learn models, I chose to implement a CatBoost model in the final data pipeline. The model is further optimized using Optuna to find the hyperparameters that yield the best accuracy score. This process is tracked in MLflow.
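
The tuning loop described above could look roughly like the following. This is a minimal sketch rather than the project's actual code: the search ranges, the 80/20 validation split, the run name, and the function names `suggest_params` and `tune` are assumptions for illustration, and `y` is assumed to be encoded as 0/1.

```python
def suggest_params(trial):
    """Search space for one Optuna trial (ranges are illustrative assumptions)."""
    return {
        "iterations": trial.suggest_int("iterations", 100, 1000),
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }


def tune(X, y, cat_features, n_trials=20):
    """Run an Optuna study that maximizes validation accuracy, logging each trial to MLflow."""
    # Imported here so suggest_params stays usable without the ML stack installed.
    import mlflow
    import optuna
    from catboost import CatBoostClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    def objective(trial):
        params = suggest_params(trial)
        # Each trial becomes a child run of the parent run opened below.
        with mlflow.start_run(nested=True):
            model = CatBoostClassifier(**params, cat_features=cat_features, verbose=0)
            model.fit(X_train, y_train)
            accuracy = accuracy_score(y_val, model.predict(X_val))
            mlflow.log_params(params)
            mlflow.log_metric("accuracy", accuracy)
        return accuracy

    study = optuna.create_study(direction="maximize")
    with mlflow.start_run(run_name="catboost-optuna"):
        study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```

Keeping the search space in its own function makes it easy to review or change the ranges without touching the training loop.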

Overview of CatBoost: CatBoost is an open-source gradient boosting library designed for machine learning on tabular data. Developed by Yandex, CatBoost handles categorical features without explicit preprocessing, which makes it easy to use. It relies on techniques such as ordered boosting and oblivious trees to speed up training and reduce overfitting, and it supports both classification and regression tasks.

Data Pipeline

Note, due to time constraints, I was not able to implement Docker/EvidentlyAI, CI/CD, and a few other elements I had originally intended to build.

Docker

Docker is a technology that lets you package and run applications in isolated, consistent environments called containers. Containers are lightweight, portable, and efficient, making it easy to develop, deploy, and scale applications across different systems. It ensures applications work the same way in any environment, from development to production, improving efficiency and collaboration between teams.

Terraform

Terraform is an open-source infrastructure as code (IaC) tool used to build, manage and version infrastructure resources across multiple cloud providers, including AWS, Azure, and Google Cloud Platform. It allows you to define your infrastructure in code, and then provision and manage it using simple and repeatable workflows. With Terraform, you can automate the deployment and scaling of infrastructure resources, ensuring consistency and reducing the risk of errors.

Prefect

Prefect is a Python-based open-source workflow automation tool used to build, schedule, and monitor data workflows. It provides a flexible and scalable platform for creating workflows that can run on your local machine or in the cloud. With Prefect, you can define your workflows as code, and then execute them on any infrastructure, making it easy to scale and integrate with your existing systems.
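
As a rough illustration of "workflows as code", the sketch below wraps two plain Python functions as Prefect tasks and composes them in a flow named `main`. It is an assumption-laden toy, not the contents of this project's `prefect/main_flow.py`: the functions `load_rows`, `satisfaction_rate`, and `build_flow`, the sample rows, and the flow name are all hypothetical.

```python
def load_rows():
    """Placeholder ingest step; a real pipeline would read the survey data."""
    return [
        {"Seat comfort": 4, "satisfied": 1},
        {"Seat comfort": 1, "satisfied": 0},
    ]


def satisfaction_rate(rows):
    """Share of rows labeled as satisfied."""
    return sum(row["satisfied"] for row in rows) / len(rows)


def build_flow():
    """Wrap the plain functions above as Prefect tasks and return a flow."""
    # Imported here so the plain functions can be tested without Prefect installed.
    from prefect import flow, task

    load_task = task(load_rows)
    rate_task = task(satisfaction_rate)

    @flow(name="example-flow")
    def main():
        rows = load_task()
        return rate_task(rows)

    return main
```

With Prefect installed, `build_flow()()` runs the flow locally; keeping the step logic in plain functions also lets it be unit tested without an orchestrator.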

Airflow is also an open-source workflow automation tool, but it focuses on data processing and has a strong emphasis on scheduling and task dependencies. It allows you to define workflows using Python code, and then schedule and monitor their execution using a web-based interface. Airflow has a large community and many plugins available, making it a popular choice for data engineering and data science teams.

The main difference between Prefect and Airflow is their approach to workflow execution. Prefect is designed to be more flexible and can run workflows on any infrastructure, while Airflow has a strong focus on scheduling and task dependencies. Additionally, Prefect has a modern architecture that allows for easier customization and debugging, while Airflow has a more established ecosystem and is known for its stability and reliability. Ultimately, the choice between Prefect and Airflow will depend on your specific use case and requirements.

Data Cloud Storage and Warehouse

Amazon Web Services (AWS) is a comprehensive cloud computing platform offered by Amazon. It provides a wide range of services that enable individuals, businesses, and organizations to build and manage various applications and services without the need for physical hardware. Some of the most popular AWS tools and services include:

Amazon EC2 (Elastic Compute Cloud): Virtual servers in the cloud, allowing users to scale computing resources up or down as needed.
Amazon S3 (Simple Storage Service): Scalable storage solution for storing and retrieving data, suitable for various applications like backups, static website hosting, and data archiving.
Amazon RDS (Relational Database Service): Managed database service supporting various database engines like MySQL, PostgreSQL, SQL Server, and more.
AWS Lambda: Serverless compute service that lets you run code in response to events without provisioning or managing servers.
Amazon SageMaker: Fully managed service for building, training, and deploying machine learning models.
Amazon Redshift: Data warehousing service for running complex queries and analytics on large datasets.

These are just a few examples of the extensive range of services offered by AWS. For this project, we will only be utilizing EC2 and S3 services.

MLflow

MLflow is a comprehensive platform that simplifies the management of the end-to-end machine learning lifecycle. It provides tools for tracking, sharing, and deploying machine learning models in a collaborative environment. With MLflow, data scientists can easily experiment with different algorithms, track model versions, and organize their work. It also supports model deployment and monitoring, enabling seamless transition from experimentation to production. MLflow streamlines the development process, fosters collaboration among teams, and enhances the reproducibility and scalability of machine learning projects.

EvidentlyAI

EvidentlyAI is a platform designed to enhance the development and deployment of machine learning models. It offers tools for model analysis, interpretation, and monitoring, helping data scientists and engineers ensure their models are transparent, reliable, and performing as expected. With EvidentlyAI, you can gain insights into model behavior, identify potential issues, and maintain the quality of machine learning systems throughout their lifecycle. This platform promotes better understanding and trust in AI models, fostering improved decision-making and accountability.

Project setup

Either follow the detailed instructions below, or modify the Makefile with the appropriate directory and run the available commands. Note that these instructions provide ample detail to run the project locally, but may not be detailed enough to run the code entirely on AWS as intended.

Requirements

Setup Environment

Create a virtual environment using conda

conda create --name py35 python=3.10

Install libraries

pip install -r ~/mlops-zoomcamp-project/requirements.txt

API Keys

The data is available in the ~\data directory, but you can also source it by creating a Kaggle account and downloading your credentials in JSON format for use with the Kaggle Python library.
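
Before calling the Kaggle library, it can help to confirm that the credentials file is in place. The sketch below is an illustration, not part of this repository: `load_kaggle_credentials` is a hypothetical helper, and it assumes the conventional token location `~/.kaggle/kaggle.json` with `username` and `key` fields.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path=None):
    """Read a Kaggle API token file and check that it has the expected fields."""
    # Assumed default: the location the Kaggle client conventionally reads.
    path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        raise FileNotFoundError(f"Kaggle token not found at {path}")
    credentials = json.loads(path.read_text())
    missing = {"username", "key"} - set(credentials)
    if missing:
        raise ValueError(f"{path} is missing fields: {sorted(missing)}")
    return credentials
```

Failing early with a clear message is usually easier to debug than an authentication error raised later during the download.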

Jupyter Notebook

At this point, feel free to explore the Jupyter notebook provided (flight-satisfaction-prediction.ipynb). It contains EDA of the data and a comparison of several scikit-learn classification models.

Setup AWS

Create a public and private key pair for use with EC2: ssh-keygen -f <Local directory to store private keys>/mlops_key_pair2 -t rsa -b 4096

To run the Jupyter notebook or another service on the EC2 instance, connect through its public IP address, available on the AWS dashboard. The -L option forwards local port 8888 to port 8888 on the instance: ssh -i <Location of private key on local machine> -L localhost:8888:localhost:8888 ubuntu@<EC2 IP Address>

Setup Terraform

Change to the Terraform directory, then initialize, plan, and apply:

cd ~/mlops-zoomcamp-project/terraform

terraform init

terraform plan

terraform apply

Setup Prefect

To run locally, start the Prefect server:

prefect server start

Change to the project directory, then build and apply the deployment and start a Prefect agent:

cd ~/mlops-zoomcamp-project

conda activate prefect-env

prefect deployment build prefect/main_flow.py:main -n Example_flow

prefect deployment apply main-deployment.yaml

prefect agent start -q 'default'

To run on an EC2 instance, follow the additional instructions below:

prefect config set PREFECT_API_URL=http://<EC2 Instance Public IP>:4200/api

MLflow

To run locally:

mlflow ui --host localhost --port 5000 --backend-store-uri sqlite:///mlflow.db

To run on an EC2 instance:

mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://mlflow:mlflowadmin@...../mlflow_db --default-artifact-root s3://mflow-remote

Result

After setting up the infrastructure and running Prefect, you should see the final optimized CatBoost model in MLflow, including:

All the CatBoost models optimized with Optuna
The final model, with the hyperparameters that yield the highest accuracy score of 0.989

Note that the same setup detailed in the instructions above is also provided in the Makefile.

Repository: evanhofmeister/mlops-zoomcamp-project on GitHub
