
CDM Group Project

The purpose of this project is to build an ML model able to infer the presence or absence of the Higgs boson from CERN measurements. The data were obtained from the Kaggle competition platform (https://www.kaggle.com/c/epfml18-higgs).

This README gives general information about the methods; more detailed documentation is provided within the functions.

Getting Started

The provided code was tested with Python 3.6.5. The following libraries are used within the script:

Computational:

numpy (as np)

Graphical:

seaborn (as sns)
matplotlib (as plt)

Prerequisites

The folder structure has to be the following:

.
├── Data                    # Data files, in .csv
│   ├── train.csv
│   └── test.csv
├── src                     # Source files
└── README.md

All the scripts are in src; in run.py you can find the code that generates our predictions.

Implementation details

run.py

This script produces a csv file, Kaggle_CDM_submission.csv, containing the predictions. The following steps are executed:

a) loading the data

b) data processing:

    - applying log transformations with translation
    - imputing the missing values with the median
    - normalizing the data
    - splitting the variable `num_jet` into 4 categorical variables

c) polynomial extension of the data to degree 4

d) interactions between the categorical variables and the continuous features

e) training a Ridge regression model, using cross_validation to determine the hyper-parameter lambda

f) training the Ridge regression model on the whole training data set with the selected lambda to obtain the weight vector w

g) computing the predictions and creating the .csv file

The data preprocessing applies a log transformation to a specific set of features, after translating some of them. It then imputes the missing values with the median and removes the phi and eta features.
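A minimal numpy sketch of these preprocessing and feature-engineering steps is shown below. The column indices, helper names and exact ordering are illustrative; the actual implementation is in src.

```python
import numpy as np

def preprocess(x, log_cols, jet_col):
    """Illustrative preprocessing; `log_cols` index the matrix after `num_jet` is removed."""
    x = x.astype(float).copy()
    # split `num_jet` (values 0..3) into 4 one-hot categorical columns
    jets = np.eye(4)[x[:, jet_col].astype(int)]
    x = np.delete(x, jet_col, axis=1)
    # mark missing values (encoded as -999 in the Higgs data) so they do not bias the transforms
    x[x == -999] = np.nan
    # log transformation with translation for the selected skewed features
    x[:, log_cols] = np.log1p(x[:, log_cols] - np.nanmin(x[:, log_cols], axis=0))
    # impute the remaining missing values with the column median
    x = np.where(np.isnan(x), np.nanmedian(x, axis=0), x)
    # normalize (zero mean, unit variance)
    return (x - x.mean(axis=0)) / x.std(axis=0), jets

def build_poly(x, degree=4):
    """Polynomial extension: [x, x**2, ..., x**degree]."""
    return np.concatenate([x ** d for d in range(1, degree + 1)], axis=1)

def add_interactions(x, jets):
    """Interactions between each `num_jet` indicator and the continuous features."""
    return np.concatenate([x * jets[:, [j]] for j in range(jets.shape[1])], axis=1)
```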

Implementation of the class methods

For the sake of automation, we added a pred keyword argument (kwarg) to all our model functions. It is False by default; if set to True, the function returns as its first output a pointer to the function to use to obtain predictions for that model.
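For illustration, the pred pattern on a least-squares solver could look like the following sketch (simplified, not the exact code in src):

```python
import numpy as np

def least_squares(y, tx, pred=False, **kwargs):
    """Ordinary least squares via the normal equations (simplified sketch)."""
    w = np.linalg.solve(tx.T @ tx, tx.T @ y)
    loss = np.mean(np.sign(tx @ w) != y)  # misclassification ratio (see ATTENTION below)
    if pred:
        # first output: the function to call to get predictions for this model
        return (lambda x: np.sign(x @ w)), w, loss
    return w, loss

# predictor, w, loss = least_squares(y, tx, pred=True)
# y_hat = predictor(tx_test)
```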

All functions using the gradient descent algorithm additionally have the two kwargs printing and all_step, both False by default. If printing=True, at every GD step the current MSE value and the first two components of w are printed to the shell. If all_step=True, the function returns all the computed weight vectors and errors (by default they are not stored and only the last values are returned).
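A minimal sketch of these two kwargs on a plain MSE gradient descent loop (again illustrative rather than the actual implementation in src):

```python
import numpy as np

def gradient_descent(y, tx, initial_w, max_iters=100, gamma=0.1,
                     printing=False, all_step=False):
    """MSE gradient descent with the `printing` and `all_step` kwargs (sketch)."""
    w, ws, losses = initial_w, [], []
    for n_iter in range(max_iters):
        grad = -tx.T @ (y - tx @ w) / len(y)  # gradient of the MSE
        w = w - gamma * grad
        e = y - tx @ w
        loss = e @ e / (2 * len(y))           # MSE after the update
        if printing:
            print(f"step {n_iter}: mse={loss:.4f}, w0={w[0]:.4f}, w1={w[1]:.4f}")
        if all_step:
            ws.append(w)
            losses.append(loss)
    return (ws, losses) if all_step else (w, loss)
```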

The following functions were implemented:

| Function | Arguments |
| --- | --- |
| `least_squares_GD` | `y, tx, initial_w[, max_iters, gamma, *args, **kwargs]` |
| `least_squares_SGD` | `y, tx, initial_w[, batch_size, max_iters, gamma, *args, **kwargs]` |
| `least_squares` | `y, tx[, **kwargs]` |
| `ridge_regression` | `y, tx, lambda_[, **kwargs]` |
| `logistic_regression` | `y, x[, w, max_iters, gamma, **kwargs]` |
| `reg_logistic_regression` | `y, x, lambda_[, initial_w, max_iters, gamma, **kwargs]` |

The default values were chosen so that the GD algorithm converges.
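As an example of one of these functions, a minimal closed-form ridge solver might look like the sketch below. The 2 * N * lambda_ scaling is a common convention and, like the misclassification loss shown here, may differ from the exact code in src.

```python
import numpy as np

def ridge_regression(y, tx, lambda_, **kwargs):
    """Sketch: solve (X^T X + lambda' I) w = X^T y with lambda' = 2 * N * lambda_."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    w = np.linalg.solve(a, tx.T @ y)
    loss = np.mean(np.sign(tx @ w) != y)  # misclassification ratio (see ATTENTION below)
    return w, loss
```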

ATTENTION:

Since our goal is to find a classification model, all functions compute the error vector err as categorical: if y_hat is the vector of estimated categories and y the vector of true categories, the j-th coordinate of err is err[j] = 1{y[j] = y_hat[j]}, where 1 is the indicator function. Furthermore, the loss value returned is the misclassification ratio (i.e. the number of wrong predictions over the total number of predictions).

If one wants to use our functions for different tasks, it is enough to set the two global functions err_f and loss_f to the desired ones.

| Possible `loss_f` | Possible `err_f` |
| --- | --- |
| `calculate_mae` (MAE) | `error` |
| `calculate_mse` (MSE) | `category_error` |
| `calculate_rmse` (RMSE) | |

They can be set as follows:

err_f = error #For continuous estimation.
loss_f = calculate_mse #For mean squared error loss.
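For concreteness, the error functions and the MSE loss could be defined along the following lines (an illustrative sketch matching the definitions above; the actual implementations are in src):

```python
import numpy as np

def error(y, y_hat):
    """Continuous residuals, for regression-style estimation."""
    return y - y_hat

def category_error(y, y_hat):
    """err[j] = 1 if the j-th prediction matches the true category, 0 otherwise."""
    return (y == y_hat).astype(float)

def calculate_mse(e):
    """Mean squared error of an error vector."""
    return np.mean(e ** 2) / 2
```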

Notes on cross_validation and multi_cross_validation

These are the two main functions implemented to choose our model and, in particular, to estimate the prediction error.

  • cross_validation(y, tx, k_fold, method, *args_method[, k_indices, seed]) computes the k-fold cross validation for the estimation of y, using the method function passed (as a pointer) in the argument method. The arguments needed by the method are passed freely after method. It returns predictor, w, loss_tr, loss_te, which are, in order, the predicting function, the mean of the trained weights, the mean of the train error and the estimated test error. A simplified sketch is given after this list.

  • multi_cross_validation(y, x, k_fold[, transformations=[[id, []]], methods=[[least_squares, []]], seed=1, only_best=True]) automatically performs cross validation on all combinations of the transformations in the transformations list (their parameters have to be passed as a list coupled with each transformation) and of the methods in the methods list (here the coupled list has to be a list of tuples of parameter combinations to test). It then plots the estimated losses (on both train and test) and outputs predictor, weight, losses_tr, losses_te, transformations_list, methods_list. If only_best=True, these correspond to the lowest test-error estimate; otherwise they contain the values computed at each step. A usage example can be found in the documentation.
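The core of cross_validation can be sketched as follows. This simplified version assumes the passed method returns (w, loss) and that predictions are taken as sign(x @ w); the real function additionally returns the predictor and the averaged weights (see src).

```python
import numpy as np

def build_k_indices(y, k_fold, seed=1):
    """Shuffle the sample indices and split them into k_fold groups."""
    np.random.seed(seed)
    return np.array_split(np.random.permutation(len(y)), k_fold)

def cross_validation_sketch(y, tx, k_fold, method, *args_method, seed=1):
    """Average the train loss and the held-out misclassification over the k folds."""
    k_indices = build_k_indices(y, k_fold, seed)
    losses_tr, losses_te = [], []
    for k in range(k_fold):
        te = k_indices[k]
        tr = np.concatenate([k_indices[i] for i in range(k_fold) if i != k])
        w, loss_tr = method(y[tr], tx[tr], *args_method)
        losses_tr.append(loss_tr)
        losses_te.append(np.mean(np.sign(tx[te] @ w) != y[te]))
    return np.mean(losses_tr), np.mean(losses_te)
```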

Authors

  • William Cappelletti
  • Charles Dufour
  • Marie Sadler


Known issues

Cross-validation

Cross validation is currently too specific: it only works for ridge regression (and for OLS by setting lambda_ = 0). We should make it more general, so that it can be used with any model.
