This is a project for the Basics MLE module of a course. All .py scripts were tested on macOS; if you run into any issues on a different OS, please let me know.
The project does not require additional setup and can be run as is once cloned.
Upon completion of intermediate steps, the scripts print small and (sometimes) informative logs.
As a result, you will receive a .csv file with predictions and a .pth file of the latest trained model.
epam_hometask
├── data # Data files used for training and inference, including the complete dataset
│ ├── raw_data.csv
│ ├── inference_iris.csv
│ └── train_iris.csv
├── data_prep # Scripts used for data uploading and splitting into training and inference parts
│ ├── data_prep.py
│ └── __init__.py
├── inference # Scripts and Dockerfiles used for inference
│ ├── Dockerfile
│ ├── inference.py
│ └── __init__.py
├── models # Folder where trained models are stored
│ └── various model files
├── training # Scripts and Dockerfiles used for training
│ ├── Dockerfile
│ ├── train.py
│ └── __init__.py
├── results # Folder where final model and results are stored
│ ├── Outputs.csv
│ └── model files
├── utils.py # Utility functions and classes that are used in scripts
├── requirements.txt # All requirements for the project
├── settings.json # All configurable parameters and settings
└── README.md
To run training, you should first create an image for train.py. To create the image, run:
docker build -t train_img -f training/Dockerfile .
Then run this command to execute training:
docker run -it train_img /bin/bash
This will create a Docker image with the data copied in for training and output a trained model.
To run inference, you should first create an image for inference.py. To create the image, run:
docker build -t inference_img -f inference/Dockerfile .
Then run this command to execute inference:
docker run -it inference_img /bin/bash
This will create a Docker image with the data copied in for inference and output predictions. The script also runs automatically when the Docker container is created.
Alternatively, you can simply run the Python scripts to ensure that everything works as intended. The scripts should be run in the order demonstrated below to successfully build the model without errors:
- Run data_prep.py
- Run train.py
- Run inference.py
A successful run of data_prep.py is indicated by the creation of the data directory and three files inside it.
A successful run of train.py is indicated by the creation of the models directory with a model file, a checkpoint file, and a decoder file inside it.
A successful run of inference.py is indicated by the creation of the results directory with an outputs file inside it.
Running the data_prep.py script performs the following:
- Downloads data from the webpage;
- Saves the full dataset into the data directory as a .csv file. If the data directory does not exist, it is created;
- Splits the dataset into training and inference parts according to the test_size parameter in settings.json;
- Saves the training and inference datasets into the data directory as .csv files with the names specified in settings.json.
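The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual data_prep.py: the settings names and values are assumptions, and the Iris dataset bundled with scikit-learn stands in for the web download so the sketch runs offline.

```python
from pathlib import Path
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Hypothetical settings mirroring settings.json (real keys/values may differ)
settings = {
    "test_size": 0.2,
    "train_name": "train_iris.csv",
    "inference_name": "inference_iris.csv",
}

# Create the data directory if it does not exist
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# The real script downloads the dataset from a webpage; here we load the
# bundled Iris dataset instead so the sketch has no network dependency.
df = load_iris(as_frame=True).frame
df.to_csv(data_dir / "raw_data.csv", index=False)

# Split into training and inference parts according to test_size
train_df, inference_df = train_test_split(
    df, test_size=settings["test_size"], random_state=42
)
train_df.to_csv(data_dir / settings["train_name"], index=False)
inference_df.to_csv(data_dir / settings["inference_name"], index=False)
```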
Running the train.py script performs the following:
- The training file from the data directory is preprocessed for modelling:
  - The target column is label encoded; the decoder is saved in the models directory for future use;
  - The data is split into training and validation parts;
  - The train and validation datasets are converted to DataLoaders;
- The model is trained and validated on the created DataLoaders;
- The model is saved in the models directory;
- The model checkpoint with the best performance is saved into the models directory;
- The F1 score of the best-performing checkpoint is printed out.
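A compressed sketch of this flow is below. It is not the project's train.py: the network architecture, file names, and hyperparameters are all placeholders, and the bundled Iris data stands in for the file produced by data_prep.py.

```python
import pickle
from pathlib import Path
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

models_dir = Path("models")
models_dir.mkdir(exist_ok=True)

# Load data (the real script reads the training .csv from the data directory)
frame = load_iris(as_frame=True).frame
X = frame.drop(columns="target").values.astype("float32")

# Label-encode the target and save the decoder for inference
encoder = LabelEncoder()
y = encoder.fit_transform(frame["target"])
with open(models_dir / "decoder.pkl", "wb") as f:
    pickle.dump(encoder, f)

# Split into training and validation parts and wrap them in DataLoaders
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
train_dl = DataLoader(
    TensorDataset(torch.tensor(X_tr), torch.tensor(y_tr)), batch_size=16, shuffle=True
)
val_dl = DataLoader(TensorDataset(torch.tensor(X_val), torch.tensor(y_val)), batch_size=16)

# Placeholder architecture; the real model is defined in utils.py / train.py
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    for xb, yb in train_dl:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()

# Validate and report the F1 score, then save the model weights
model.eval()
with torch.no_grad():
    preds = torch.cat([model(xb).argmax(1) for xb, _ in val_dl])
print("Validation F1:", f1_score(y_val, preds.numpy(), average="macro"))
torch.save(model.state_dict(), models_dir / "model.pth")
```

The real script additionally keeps a separate checkpoint of the best-performing epoch; this sketch saves only the final weights.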
Running the inference.py script performs the following:
- The inference file from the data directory is preprocessed for predictions;
- The model and the checkpoint with the best performance are loaded from the models directory;
- The inference data is passed into the model and the outputs are saved in the results directory as a .csv file.
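The load-and-predict pattern can be sketched as follows. Again, this is illustrative rather than the project's inference.py: the architecture and file names are assumptions, and the block first writes stand-in artifacts (normally produced by train.py) so it can run on its own.

```python
import pickle
from pathlib import Path
import torch
from torch import nn
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

models_dir = Path("models")
models_dir.mkdir(exist_ok=True)
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)


def make_model() -> nn.Module:
    # Placeholder architecture; must match the one saved by training
    return nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))


# Stand-in artifacts so this sketch is self-contained; in the project
# these files come from train.py.
torch.save(make_model().state_dict(), models_dir / "model.pth")
with open(models_dir / "decoder.pkl", "wb") as f:
    pickle.dump(LabelEncoder().fit(["setosa", "versicolor", "virginica"]), f)

# --- inference proper ---
model = make_model()
model.load_state_dict(torch.load(models_dir / "model.pth"))
model.eval()
with open(models_dir / "decoder.pkl", "rb") as f:
    decoder = pickle.load(f)

# The real script reads the inference .csv from the data directory;
# the bundled Iris features stand in for it here.
features = load_iris(as_frame=True).frame.drop(columns="target")
with torch.no_grad():
    preds = model(torch.tensor(features.values, dtype=torch.float32)).argmax(1).numpy()

# Decode predicted class indices back to labels and save the outputs
outputs = features.copy()
outputs["prediction"] = decoder.inverse_transform(preds)
outputs.to_csv(results_dir / "Outputs.csv", index=False)
```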