nebojsa55 / computational-genomics_midterm-project

Project for the course Computational Genomics

License: GNU General Public License v3.0

Jupyter Notebook 100.00%
computational-genomics preterm-birth genes machine-learning sklearn

computational-genomics_midterm-project's Introduction

Preterm Birth Prediction based on Gene Expression

This project was completed as part of the Computational Genomics course at the University of Belgrade, School of Electrical Engineering.

The goal was to predict gestational age in pregnant women from gene expression data using regression models.

Data

The datasets used for the analyses described in this project were contributed by Wayne State University School of Medicine Perinatal Initiative and by the Perinatology Research Branch, Division of Obstetrics and Maternal-Fetal Medicine, Division of Intramural Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, U.S. Department of Health and Human Services (NICHD/NIH/DHHS); and, in part, with Federal funds from NICHD/NIH/DHHS under Contract No. HHSN275201300006C. They were obtained as part of the DREAM Preterm Birth Prediction Challenge through Synapse (syn18380862), managed by Sage Bionetworks.


Data stats:

Total number of samples | Train samples | Test samples | Number of features
735 | 367 | 368 | 32 830

DISCLAIMER: The challenge ended in 2019, and this implementation was not an active part of it.

Requirements

To install the necessary libraries, type in the terminal:

pip install -r requirements.txt 

Table of contents and results

The project is divided into 3 Jupyter notebooks, which follow the thought process that went into building the model. The main regression metric was RMSE (root mean square error).
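For reference, RMSE is the square root of the mean squared difference between the true and the predicted values. The snippet below is a minimal sketch using only the standard library; the function name `rmse` and the sample values are illustrative and do not come from the notebooks.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true and predicted values, for illustration only.
example_true = [30.0, 25.0, 38.0]
example_pred = [28.0, 27.0, 37.0]
print(f"RMSE = {rmse(example_true, example_pred):.4f}")
```

A lower RMSE means the predictions are closer to the true values, and the score is expressed in the same units as the target.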

  1. Basic-regression-models.ipynb

    • Samples were standard-scaled within the batch they belong to
    • PCA was performed to find the minimum number of components that account for 95% of the variance
    • Random Forest Regressor and Support Vector Regressor were tested, as they are among the most common ML models used in the literature
    • Hyperparameter search and 10-fold cross-validation were performed in order to get the best model possible
    • RMSE(RFR) = 7.5441 ; RMSE(SVR) = 8.4081
  2. Better-model.ipynb

    • Samples were standard-scaled within the batch they belong to
    • Instead of PCA, features were selected according to the f_regression score, using the SelectKBest class from the sklearn.feature_selection module
    • The 32 samples from the set 'GSE113966' were dropped, as they appear to be outliers
    • Random Forest Regressor and Support Vector Regressor, with the parameters found in notebook 1, were tested across different values of the parameter K to find the optimal number of features to use. After that, 10-fold cross-validation was performed
    • RMSE(RFR) = 5.9324 ; RMSE(SVR) = 8.0756
  3. Linear-regression.ipynb

    • Samples were standard-scaled within the batch they belong to
    • Only the linear ElasticNet regressor was considered, as results from the previous notebook suggest that linear regressors are the most suitable for this dataset (the f_regression score is a univariate linear regression test)
    • 10-fold cross-validation was performed to find the optimal parameters for the ElasticNet regressor, and the parameter K in SelectKBest was also chosen by cross-validation
    • RMSE(EN) = 4.9283 ✔️
  4. Gene-importance.ipynb

    • The top 10 features (genes) were plotted, and the top 5 are presented in the table below, with their respective gene symbol, description (acquired from https://www.genecards.org/) and K-score:

Feature label | Gene symbol | Description | K-score
199675_at | MCEMP1 | This gene encodes a single-pass transmembrane protein. Based on its expression pattern, it is speculated to be involved in regulating mast cell differentiation or immune responses | 72.28
2359_at | FPR3 | FPR3 (Formyl Peptide Receptor 3) is a Protein Coding gene. Diseases associated with FPR3 include Rubeosis Iridis. Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and N-formyl peptide receptor activity | 62.60
3507_at | IGHM | IGHM (Immunoglobulin Heavy Constant Mu) is a Protein Coding gene. Diseases associated with IGHM include Agammaglobulinemia 1, Autosomal Recessive and Agammaglobulinemia, Non-Bruton Type. Gene Ontology (GO) annotations related to this gene include single-stranded DNA binding and phosphatidylcholine binding | 57.93
9619_at | ABCG1 | The protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the White subfamily. It is involved in macrophage cholesterol and phospholipids transport, and may regulate cellular lipid homeostasis in other cell types. Six alternative splice variants have been identified. | 57.83
6689_at | SPIB | The protein encoded by this gene is a transcriptional activator that binds to the PU-box (5'-GAGGAA-3') and acts as a lymphoid-specific enhancer. Four transcript variants encoding different isoforms have been found for this gene | 52.16
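The dimensionality-reduction and model-tuning steps listed for the notebooks above can be sketched roughly as follows. This is not the notebooks' code: the data is synthetic, the parameter grid is an arbitrary assumption, and names such as `pipe`, `param_grid` and `top5` are illustrative. It only uses scikit-learn classes named in the list (PCA, SelectKBest, f_regression, ElasticNet) together with GridSearchCV for the 10-fold search.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the expression matrix (the real data has far more features).
rng = np.random.default_rng(0)
n_samples, n_features = 120, 200
X = rng.normal(size=(n_samples, n_features))
# Hypothetical target: linear in the first five features, plus noise.
weights = np.array([3.0, -2.0, 1.5, 1.0, -1.0])
y = 25.0 + X[:, :5] @ weights + rng.normal(scale=0.5, size=n_samples)

# PCA route: a float n_components keeps the fewest components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full").fit(X)
n_components_95 = pca.n_components_
print("Components needed for 95% variance:", n_components_95)

# SelectKBest route: univariate F-test selection placed inside a pipeline,
# so that in this sketch the selection is re-fitted on every training fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", ElasticNet(max_iter=10_000)),
])
param_grid = {  # assumed grid, not the values used in the notebooks
    "select__k": [5, 20, 50],
    "model__alpha": [0.01, 0.1, 1.0],
    "model__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
cv_rmse = -search.best_score_  # the scorer returns negated RMSE
print("Best parameters:", search.best_params_)
print(f"10-fold CV RMSE: {cv_rmse:.4f}")

# Rank features by their F-score, analogous to the gene table above.
f_scores = search.best_estimator_.named_steps["select"].scores_
top5 = np.argsort(f_scores)[::-1][:5]
print("Top 5 feature indices by F-score:", top5.tolist())
```

Tuning K together with the ElasticNet parameters in one grid is a design choice of this sketch; the notebooks may have tuned them in separate steps.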

EDIT 5. Final-model.ipynb

  • This notebook is the same as notebook #3, except that the model was tested on the whole testing set (368 samples) to check whether the model is viable.
  • RMSE(EN) = 5.0400 ✔️
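A held-out evaluation of this kind, including the batch-wise standard scaling used in the earlier notebooks, might look like the sketch below. Everything here is an assumption made for illustration: the data is synthetic, `scale_within_batches` is a hypothetical helper, the ElasticNet parameters are arbitrary, and the train and test sets are scaled separately using their own batch labels, which may differ from what the notebook does.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error


def scale_within_batches(X, batches):
    """Standardize every feature to zero mean and unit variance inside each batch."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    X_scaled = np.empty_like(X)
    for batch in np.unique(batches):
        rows = batches == batch
        mean = X[rows].mean(axis=0)
        std = X[rows].std(axis=0)
        std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
        X_scaled[rows] = (X[rows] - mean) / std
    return X_scaled


# Synthetic data: two batches, the second one shifted by an additive batch effect.
rng = np.random.default_rng(0)
n_samples, n_features = 400, 10
batches = np.repeat(["batch_A", "batch_B"], n_samples // 2)
signal = rng.normal(size=(n_samples, n_features))
y = 30.0 + 2.0 * signal[:, 0] - signal[:, 1] + rng.normal(scale=0.3, size=n_samples)
X = signal.copy()
X[batches == "batch_B"] += 5.0

# Alternate samples between train and test so both sets contain both batches.
train = np.arange(n_samples) % 2 == 0
test = ~train

X_train = scale_within_batches(X[train], batches[train])
X_test = scale_within_batches(X[test], batches[test])

# Fit on the training set only, then score once on the untouched test set.
model = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X_train, y[train])
test_rmse = float(np.sqrt(mean_squared_error(y[test], model.predict(X_test))))
print(f"Held-out RMSE: {test_rmse:.4f}")
```

A test RMSE close to the cross-validated RMSE, as with the 5.0400 and 4.9283 reported here, suggests that the model generalizes rather than overfitting the training folds.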
