phylogrok / analyzebloodwork


Data handling, linear regressions, plots, and classification for clinical bloodwork and gene expression data.

License: GNU General Public License v3.0

R 1.18% Jupyter Notebook 98.82%

analyzebloodwork's Introduction

AnalyzeBloodwork '1.5'

Biomarker discovery and predictive diagnostics with Machine Learning in R and Python

PI: Jeffrey Robinson, MS, PhD


A. Intro

The repository includes scripts for analyzing biomedical datasets from Robinson's research. It also serves as a codebase for undergraduate courses in biostatistics, R and Python coding, and machine learning in UMBC's TLST program. Courses have included BTEC350 (Biostatistics), BTEC330 (Software Applications in Biotechnology), BTEC423 (Machine Learning with Bioinformatics Applications), and BTEC495 (Independent Student Research), taught by Robinson (Fall 2019 - Spring 2022).

Datasets and analyses address clinical and molecular data associated with metabolic syndrome, chronic inflammation, and immune responses. They currently include the Akimel O'otham (Pima Indians) diabetes dataset, a Kaggle cardiology dataset, and an NIH IBS/gastrointestinal disorders dataset that includes Nanostring expression data.

Additional computational resources are provided by NSF Extreme Science and Engineering Discovery Environment (XSEDE) through an educational allocation awarded to Robinson: “Bioinformatics Training for Applications in Translational and Molecular Biosciences”, under NSF grant number ACI-1548562.

Student projects

Predictive Diagnosis for Diabetes with Machine Learning Approach - Akimel O'otham (NIDDK Pima Diabetes dataset), with Python, Scikit-learn, and Jupyter Notebooks. Brandon Lamotte (BTEC423, Machine Learning with Bioinformatics Applications, Spring 2022)

Linear Regression in R with Examples. R-script and examples for single and multiple linear regressions. (BTEC330 Biostatistics, Fall 2019)

Algorithm selection for Biomarker Discovery from clinical and molecular expression data (ML) - NIH IBS dataset. Compare performance of Machine Learning algorithms with R and Caret package. (Acknowledgements to Daniel Gidron, BTEC495 intern, Summer 2021)

B. Code

Current scripts and template analyses:

AnalyzeBloodwork.R

  1. Loads required packages and sample data,
  2. Generates histograms for all variables,
  3. Generates linear regression models and diagnostic plots for single and multiple regressions for BMI, Complete Blood Count (CBC), and inflammation markers.
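The single-regression step can be illustrated with a minimal Python sketch. This is not the code in AnalyzeBloodwork.R; the function names and the BMI/CRP values are hypothetical and made up for illustration. It fits an ordinary least squares line by hand and computes the residuals that a diagnostic plot would display.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Fraction of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only (not taken from the repository's data).
bmi = [21.0, 24.5, 27.0, 30.5, 33.0, 36.5]
crp = [0.8, 1.1, 1.9, 2.6, 3.4, 4.1]

intercept, slope = fit_simple_regression(bmi, crp)

# Residuals are what a residuals-vs-fitted diagnostic plot would show.
residuals = [c - (intercept + slope * b) for b, c in zip(bmi, crp)]

print(f"CRP ~ {intercept:.3f} + {slope:.3f} * BMI")
print(f"R^2 = {r_squared(bmi, crp, intercept, slope):.3f}")
```

In R the same fit is typically a one-line `lm()` call; the hand-written version here only makes the arithmetic visible.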

IBSclassification.R

  1. Imputes data for columns with missing (NA) values,
  2. Balances unequally-sized sample groups,
  3. Generates box-and-whisker and scatterplot-matrix plots for WBC distributions,
  4. Tests the performance of machine learning classification algorithms for IBS diagnosis using CBC-WBC count data.
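The imputation, balancing, and evaluation steps can be sketched in plain Python as follows. This is an illustrative stand-in, not the repository's R workflow: the median imputation, random downsampling, and nearest-centroid classifier are assumed choices, and all function names and sample values are hypothetical.

```python
import random
from statistics import median


def impute_median(rows, columns):
    """Return copies of rows with None in each listed column replaced by the column median."""
    filled = [dict(r) for r in rows]
    for col in columns:
        fill = median(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled


def downsample_balance(rows, label, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[label], []).append(r)
    n = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, n))
    return balanced


def nearest_centroid_loo_accuracy(rows, features, label):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, held_out in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        centroids = {}
        for cls in {r[label] for r in train}:
            members = [r for r in train if r[label] == cls]
            centroids[cls] = [sum(r[f] for r in members) / len(members) for f in features]
        # Predict the class whose centroid is closest in squared Euclidean distance.
        predicted = min(
            centroids,
            key=lambda cls: sum((held_out[f] - v) ** 2 for f, v in zip(features, centroids[cls])),
        )
        correct += predicted == held_out[label]
    return correct / len(rows)


# Made-up WBC counts for illustration only; None marks a missing value.
samples = [
    {"neutrophils": 3.1, "lymphocytes": 2.0, "group": "control"},
    {"neutrophils": 3.4, "lymphocytes": None, "group": "control"},
    {"neutrophils": 2.9, "lymphocytes": 2.2, "group": "control"},
    {"neutrophils": 3.3, "lymphocytes": 2.1, "group": "control"},
    {"neutrophils": 3.0, "lymphocytes": 1.9, "group": "control"},
    {"neutrophils": 5.2, "lymphocytes": 1.4, "group": "IBS"},
    {"neutrophils": None, "lymphocytes": 1.5, "group": "IBS"},
    {"neutrophils": 5.0, "lymphocytes": 1.3, "group": "IBS"},
]
features = ["neutrophils", "lymphocytes"]

imputed = impute_median(samples, features)
balanced = downsample_balance(imputed, "group")
accuracy = nearest_centroid_loo_accuracy(balanced, features, "group")
print(f"{len(balanced)} balanced samples, leave-one-out accuracy = {accuracy:.2f}")
```

Imputing on the full dataset before cross-validation, as this sketch does for brevity, lets information from held-out samples leak into training; a more careful pipeline would impute inside each fold.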

[DiabetesClassification]

C. Sample Datasets:

1. Akimel O'otham (NIDDK Pima Indians Diabetes) data set:

UCI Kaggle repository: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

Data in the Akimel O'otham dataset were collected in an NIH clinical study of the relationships between obesity, BMI, blood glucose, diastolic blood pressure, triceps skinfold thickness, serum insulin levels, family history of diabetes, number of times pregnant, age, and whether a person is diagnosed with type 2 diabetes. The study followed Akimel O'otham people on the Gila River Indian Reservation in Arizona between the early 1960s and mid 1990s. It is a popular dataset for demonstrating ML-based predictive diagnostics. (Smith et al. 1988)
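A minimal loading sketch for this dataset is shown below, under two assumptions that are not stated in this README: the column names follow the Kaggle CSV header, and a zero in the physiological measurement columns is treated as a missing value, a convention commonly applied to this dataset. The inline rows are illustrative values in the file's layout, and `load_rows` is a hypothetical helper.

```python
import csv
import io

# Header as assumed from the Kaggle CSV; the rows are illustrative values only.
CSV_TEXT = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

# Assumption: a zero in these columns means "not measured", not a true zero.
ZERO_MEANS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def load_rows(text):
    """Parse CSV text into dicts of floats, mapping implausible zeros to None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {name: float(value) for name, value in raw.items()}
        for name in ZERO_MEANS_MISSING:
            if row[name] == 0:
                row[name] = None
        row["Outcome"] = int(row["Outcome"])  # 1 = diagnosed with diabetes, 0 = not
        rows.append(row)
    return rows


rows = load_rows(CSV_TEXT)
missing_counts = {name: sum(r[name] is None for r in rows) for name in ZERO_MEANS_MISSING}
print(missing_counts)
```

Counting missing values per column before modeling shows which variables need imputation.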

2. NIH CBC with WBC RNA expression in IBS dataset:

Human buffy coat gene expression, custom 250-plex Nanostring panel. GSE124549. 2019.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124549.

ImmunoGC custom Nanostring probe panel. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL25996.

Data in this sample dataset were collected during an NIH natural history clinical study of the relationship between obesity, inflammation, stress, and gastrointestinal disorders. The data and probe panels are openly available in the NCBI GEO database (Robinson 2021, Robinson 2019, Robinson et al. 2019).

The standard CBC parameters provide point-of-care physicians with powerful diagnostic capabilities using:

  1. White Blood Cell counts (absolute and relative counts for Monocytes, Lymphocytes, Neutrophils, Basophils, and Eosinophils),
  2. Red blood cell and hemoglobin parameters (RBC count, Hematocrit (HCT), Mean Corpuscular Hemoglobin (MCH), Erythrocyte Sedimentation Rate (ESR)),
  3. Platelet parameters (Platelet Counts, Mean Platelet Volume (MPV))
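The relationship between absolute and relative white blood cell counts can be shown with a short sketch. It assumes the total WBC count is the sum of the five cell types; the function name and the counts are hypothetical values for illustration.

```python
def relative_counts(absolute):
    """Convert absolute counts per cell type into percentages of the total."""
    total = sum(absolute.values())
    return {cell: 100.0 * count / total for cell, count in absolute.items()}


# Made-up absolute counts in thousands of cells per microliter.
absolute = {
    "Neutrophils": 4.2,
    "Lymphocytes": 2.1,
    "Monocytes": 0.5,
    "Eosinophils": 0.15,
    "Basophils": 0.05,
}

relative = relative_counts(absolute)
for cell, percent in relative.items():
    print(f"{cell}: {percent:.1f}%")
```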

Additional parameters for obesity, inflammation, and GI-associated pain:

  1. Body Mass Index (BMI),
  2. Stress hormones: Cortisol and ACTH,
  3. Inflammation markers: C-Reactive Protein (CRP), sCD14, Lipopolysaccharide Binding Protein (LBP),
  4. Clinical diagnoses of subtypes of Irritable Bowel Syndrome (IBS)
  5. Nanostring White Blood Cell RNA expression data: an associated 250-gene panel of Nanostring RNA expression data (links in citations below).

3. Heart Disease Data Set, UC Irvine Machine Learning Repository.

https://archive.ics.uci.edu/ml/datasets/Heart+Disease. (Detrano et al. 1989)

D. Dataset Citations (See project pages for project-specific citations)

Detrano, R., et al. 1989. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol. 64(5):304-10. doi: https://doi.org/10.1016/0002-9149(89)90524-9

Robinson, J. 2021. Predictive Classification of IBS-subtype: Performance of a 250-gene RNA expression panel vs. Complete Blood Count (CBC) profiles under a Random Forest model. medRxiv. doi: https://doi.org/10.1101/2021.08.31.21262766.

Robinson, JM. et al. 2019. Complete blood count with differential: An effective diagnostic for IBS subtype in the context of BMI? BioRxiv. doi: https://doi.org/10.1101/608208.

Robinson, J. 2019. Differential Gene Expression Associated with BMI, Gender, and IBS-subtype in Human White Blood Cells: Results from a Custom 250-plex Nanostring Probe Panel. Preprints. 2019120180. doi: https://doi.org/10.20944/preprints201912.0180.v1.

Smith, JW., et al. 1988. Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus. Proceedings of the Annual Symposium on Computer Application in Medical Care, 261–265.

Funding:

NSF XSEDE Educational Allocation: PI Dr. Jeffrey Robinson “Bioinformatics Training for Applications in Translational and Molecular Biosciences”. Extreme Science and Engineering Discovery Environment (XSEDE), supported by National Science Foundation grant number ACI-1548562.

Adapted R source code:

STHDA: ggplot2 histogram: easy histogram with ggplot2 R package

STHDA: Scatterplot3d: 3D graphics - R software and data visualization

Quick R by DataCamp (StatMethods.net): Scatterplots

MachineLearningMastery.com: Machine Learning in R Step-by-Step

R-Bloggers.com: Regression analysis essentials for machine learning

R-Pubs: Residuals Analysis
