
amacaluso / machine-learning-and-deep-learning-an-application-in-bioinformatics


Machine Learning and Deep Learning in Bioinformatics - Master's thesis repository

License: Apache License 2.0

Python 100.00%
applied-machine-learning applied-statistics applied-data-science-with-python bioinformatics hypothesis-testing


Machine Learning in Bioinformatics (Master's thesis)

This project analyses bioinformatics data and was carried out as part of my master's thesis.

Machine Learning and Deep Learning: an application in bioinformatics

Abstract:

In the big data era, transforming biomedical data into useful knowledge is one of the most important challenges in bioinformatics. To understand the structure of a cell and how it functions, the key information is the amount of mRNA produced by each gene of the cell's DNA (also referred to as gene expression). Microarray technologies make it possible to measure simultaneously, in a single experiment, up to 30 thousand genes on each cell-line, thus gathering a huge amount of data.

To evaluate the efficacy of a new drug, the experiment measures the gene expression of a tumor cell-line before the drug is administered and then uses it to predict the cell-line's response to the treatment. The analysis presented in this thesis is based on data about 18916 genes measured on 464 cell-lines of different tumors (13 different types of cancer), collected in several experiments at the Mario Negri Institute for Pharmacological Research. The efficacy of each treatment is evaluated on the basis of two quantitative variables, AUC and IC50, representing respectively the area under the curve of the dose-response plot and the drug concentration required to cause 50% inhibition of the biological activity of cancer cells.

From a statistical point of view, given the numeric nature of the variables of interest, the problem can be described as a regression model in which each gene expression acts as a predictor of drug efficacy. However, the huge number of genes and the high cost of collecting each record create many practical complications, so classical regression methods cannot be applied without suitable preprocessing of the data. This thesis develops a computationally efficient method, using parallel computing and optimized Python and R libraries, to assess the relationship between gene expression and drug response in the framework described above.
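The two response variables can be illustrated on a single dose-response curve. The sketch below is not taken from the thesis: the function name `dose_response_summary`, the use of a log-dose axis, the normalisation of the AUC to the range [0, 1], and the linear interpolation for IC50 are all assumptions made for illustration.

```python
import numpy as np


def dose_response_summary(concentrations, viability):
    """Return (AUC, IC50) for one dose-response curve (illustrative only).

    concentrations: strictly increasing, positive drug doses
    viability: fraction of biological activity left at each dose (1.0 = no inhibition)
    """
    conc = np.asarray(concentrations, dtype=float)
    resp = np.asarray(viability, dtype=float)
    log_conc = np.log10(conc)  # assumption: doses are compared on a log scale

    # Trapezoidal area under the curve, divided by the dose range so that
    # a curve with no inhibition at all gives AUC = 1.0.
    widths = np.diff(log_conc)
    area = np.sum(widths * (resp[:-1] + resp[1:]) / 2.0)
    auc = area / (log_conc[-1] - log_conc[0])

    # IC50: first dose at which activity drops to 50%, found by linear
    # interpolation between the two measurements that bracket the crossing.
    below = np.nonzero(resp <= 0.5)[0]
    if below.size == 0:
        return auc, np.nan  # 50% inhibition never reached in the tested range
    j = below[0]
    if j == 0:
        return auc, conc[0]
    frac = (resp[j - 1] - 0.5) / (resp[j - 1] - resp[j])
    log_ic50 = log_conc[j - 1] + frac * (log_conc[j] - log_conc[j - 1])
    return auc, 10.0 ** log_ic50


# Made-up measurements for one cell-line and one drug.
auc, ic50 = dose_response_summary(
    [0.01, 0.1, 1.0, 10.0, 100.0],
    [1.0, 0.9, 0.5, 0.1, 0.0],
)
print(auc, ic50)  # approximately 0.5 and 1.0
```

A lower AUC or a lower IC50 both indicate a more effective drug under these conventions; in practice a fitted sigmoid curve is often used instead of linear interpolation.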

First, the responses to drugs across the 13 cancer-type cell-lines are compared by means of both a Kruskal-Wallis test and multiple pairwise Wilcoxon-Mann-Whitney tests with Bonferroni correction. Moreover, given the enormous number of predictors, a Principal Component Analysis is performed before applying the Machine Learning algorithms in order to reduce the problem's dimensionality while losing as little information as possible. As a result, the first 300 components are kept, explaining more than 95% of the total variance.
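A minimal sketch of this first step, assuming SciPy and scikit-learn are available. The function names `compare_groups` and `components_for_variance` are hypothetical, and the thesis code may organise the tests differently (for example in R).

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu
from sklearn.decomposition import PCA


def compare_groups(response, labels, alpha=0.05):
    """Global Kruskal-Wallis test, then pairwise Mann-Whitney tests
    with Bonferroni-adjusted p-values."""
    response = np.asarray(response, dtype=float)
    labels = np.asarray(labels)
    groups = {g: response[labels == g] for g in np.unique(labels).tolist()}

    # One test across all cancer types at once.
    _, p_global = kruskal(*groups.values())

    # One two-sided test per pair of cancer types.
    pairs = list(combinations(groups, 2))
    pairwise = {}
    for a, b in pairs:
        _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        p_adj = min(1.0, p * len(pairs))  # Bonferroni: multiply by number of tests
        pairwise[(a, b)] = (p_adj, p_adj < alpha)
    return p_global, pairwise


def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)
```

With 13 cancer types there are 78 pairwise comparisons, so the Bonferroni correction multiplies each raw p-value by 78 before comparing it with the significance level.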

Second, two Machine Learning methods (Linear Regression and Support Vector Machine) are adopted to estimate the drug response using the PCA components as predictors. In particular, the independence of the drug response from the cancer type is investigated first by using all the cell-line types as the training set. Next, the drug response of each cell-line type is predicted from all the other cancer types. Lastly, blood cell-lines are used as baseline predictors for estimating the drug response of each kind of tumor. Validation and testing are then conducted using k-fold cross-validation in order to exploit all available information in each step of the analysis.
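The first two experiments described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the thesis code: the helper names `make_model` and `evaluate`, the SVR hyperparameters, and the choice to refit the scaler and PCA inside every fold are all assumptions. The third experiment (blood cell-lines as the only training set) is not covered here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_model(regressor, n_components):
    # Standardise genes, project onto principal components, then regress.
    return make_pipeline(StandardScaler(), PCA(n_components=n_components), regressor)


def evaluate(X, y, cancer_type, n_components=300, n_splits=5, seed=0):
    """Mean squared error of each model under two validation schemes."""
    regressors = {
        "linear": LinearRegression(),
        "svr_rbf": SVR(kernel="rbf", C=1.0),  # hypothetical hyperparameters
    }
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    leave_type_out = LeaveOneGroupOut()

    scores = {}
    for name, regressor in regressors.items():
        model = make_model(regressor, n_components)
        # Experiment 1: all cancer types pooled, plain k-fold cross-validation.
        pooled = cross_val_score(
            model, X, y, cv=kfold, scoring="neg_mean_squared_error"
        )
        # Experiment 2: predict each cancer type from all the other types.
        held_out = cross_val_score(
            model, X, y, groups=cancer_type, cv=leave_type_out,
            scoring="neg_mean_squared_error",
        )
        scores[name] = {
            "kfold_mse": float(-pooled.mean()),
            "leave_one_type_out_mse": float(-held_out.mean()),
        }
    return scores
```

Putting the PCA inside the pipeline means it is fitted on the training folds only, which avoids leaking information from the held-out cell-lines. `n_components` must not exceed the number of training samples in any fold.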

Third, in order to capture the non-linear relationship between gene expression and drug response, suggested by the choice of a non-linear kernel in the validation phase of the SVM, a Deep Learning algorithm, namely the multilayer feed-forward neural network, is also explored using several configurations in the validation step. Finally, the results of the Machine Learning and Deep Learning approaches are compared. Additionally, after a brief discussion of some possible alternatives for reducing the computational effort, the entire analysis is repeated on CINECA's supercomputer MARCONI, exploiting the advantages of Graphics Processing Unit (GPU) parallel computing instead of classical multithreading parallelism.
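Trying several network configurations in the validation step can be sketched as a cross-validated grid search. The example below swaps in scikit-learn's CPU-only `MLPRegressor` to stay self-contained; the thesis most likely relies on a GPU-capable framework. The function name `select_network` and every value in the grid are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def select_network(X, y, seed=0):
    """Pick the feed-forward configuration with the lowest cross-validated MSE."""
    pipeline = make_pipeline(
        StandardScaler(),
        MLPRegressor(max_iter=500, random_state=seed),
    )
    # Hypothetical search space: depth/width, activation, L2 penalty.
    grid = {
        "mlpregressor__hidden_layer_sizes": [(16,), (32, 16)],
        "mlpregressor__activation": ["relu", "tanh"],
        "mlpregressor__alpha": [1e-4, 1e-2],
    }
    search = GridSearchCV(
        pipeline,
        grid,
        cv=KFold(n_splits=3, shuffle=True, random_state=seed),
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    return search.best_params_, float(-search.best_score_)
```

Here `X` would hold the PCA components and `y` one of the response variables (AUC or IC50). The returned error can be compared directly with the linear regression and SVM errors as long as the same folds and metric are used.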

Maintainer

Antonio Macaluso
