covid-logistic-bfgs

Logistic regression on COVID-19 data using BFGS algorithm

This repository contains a Python project I developed for the Optimization exam in the AI Master Course I'm attending. The aim of the project is to provide an implementation of the BFGS algorithm and to apply it to a particular case of study: I decided to perform a logistic regression on the official COVID-19 data (provided by Protezione Civile and available on GitHub at this link). I worked on data referring to the whole country, included in folder dati-andamento-nazionale: I reuploaded the folder on this repository for ease of use.

Model

I modelled the early evolution of COVID-19 as a logistic function with 6 parameters:

where σ(x,θ) represents the number of positive cases at x-th day after the 24th of February.
I exploited BFGS algorithm to find the optimal parameters θ^*, i.e. those that minimize the objective (loss) function:

where y are the data points.

To implement the algorithm, I was inspired by an article on aria42's blog. Though, the code included in this repository is entirely written by me.

NOTE: To compute the optimal parameters I mapped the original data to the range [0,1] for the purpose of computability: the order of magnitude of the positive cases quickly reaches 10⁴ and goes up to 10⁵, which leads to significant effort in computations mainly due to the exponential.
Thus the data points actually employed in the computations were obtained through:

As a consequence, the values of σ(x,θ) and loss(x,θ) reported in the following will refer to rescaled data. On the other hand, after performing the fit on the data obtaining the optimal parameters, I scaled back the resulting predictions according to the inverse transformation to produce the final result, so that I could compare real data to the predictions made by the model.

Implementation

Parameters

The algorithm is implemented as BFGS_algorithm(obj_fun, theta0, max_iter=2e04, epsilon=0), where obj_fun is the objective function to be minimized (in our case loss(x,θ)), theta0 is the initial guess for the parameters (a numpy array), max_iter is the maximum number of iterations performed and epsilon is the minimum value the loss function may have for the purpose of convergence.

Algorithm

The function takes advantage of the library scipy.optimize, deploying functions BFGS and line_search, as I discuss in the following.
Firstly, a BFGS object is created invoking the constructor: this object allows to efficiently store the (inverse) approximate Hessian matrix of the function and perform computations (such as the dot product with a vector), thus keeping track of the updates.
Then, the recursive loop does the job:

at each iteration n, the initial guess θ_n is stored along with the gradient g_n, computed at its position;
the search direction d is computed as dot product between H_n^-1 and g_n, where the former is the current approximation of the inverse Hessian matrix;
the step-size α is computed through line_search, which is scipy.optimize's implementation of a backtracking line search algorithm: essentially, it computes the non-negative parameter α that minimizes the expression f(x_n - αd);
accordingly to these updates, the array of the parameters θ_n+1 is computed along with the gradient g_n+1, computed at its position; the new approximation of the Hessian is obtained via the method BFGS.update, which takes as inputs the differences θ_n+1 - θ_n and g_n+1 - g_n;
the loop is stopped whenever the value of loss(x,θ) is below epsilon or when max_iter is reached.

Return

The function returns a tuple theta_history, cost_history, niter, success in which the following is stored: the parameters history, the loss function values history, the number of iterations performed, information about the convergence (i.e. whether the loss function of θ^* is below epsilon or not).

Results

These are the results I obtained choosing as input parameters firstly max_iter = 20000 and then max_iter = 30000, and epsilon = 0 in both cases (i.e. I ended up performing no check for convergence: empirically, I observed it never dropping below ~1.747*10^-5). The Logistic fit graphs show the data points as orange dots and the logistic regression as a blue dashed line: at a glance, the curve fits quite well the data. As you can immediately notice from the Cost history graphs, there is no meaningful improvement in terms of value of the loss function after ~15000 iterations.

Results:

Number of iterations:  20000
Theta:  [ 0.06900987  0.05576653  5.52850234  0.17732944  1.34652566 -0.01234024]
Loss:  104709.8770002242
Loss (rescaled data):  1.7472241874665985e-05
Converged: False, max_iter reached.

Results:

Number of iterations:  30000
Theta:  [ 0.06879371  0.05563397  5.53465311  0.17751231  1.34667285 -0.01218171]
Loss:  104697.39794599885
Loss (rescaled data):  1.747015957775154e-05
Converged: False, max_iter reached.

michimichiamo / covid-logistic-bfgs Goto Github PK