Coder Social home page Coder Social logo

genopheno's Introduction

Summary

This project applies machine learning to publicly available genomic data to predict phenotype and identify Single Nucleotide Polymorphisms (SNPs) that are most influential. The method requires no domain knowledge and can be used with any binary phenotype classification.

As a proof of concept, this application was used to predict eye color using 436 users from self-reported data on openSNP. Using the default settings, the model achieved a 92% accuracy and 0.95 AUC for the ROC curve.

The model correctly identified HERC2 as the most influential gene and identified the polygenic relationship between genes HERC2 and OCA2.

Prerequisites

Python Packages

It is recommended to install this application in a virtual environment. To create and activate the virtual environment run:

# install virtualenv if it is not already installed
pip install virtualenv

# create virtualenv
virtualenv -p <python2.7 path> genopheno_ve

Activate the virtualenv on macOS and Linux.

source genopheno_ve/bin/activate

Activate the virtualenv on Windows.

genopheno_ve\Scripts\activate

From the repository directory, install the genopheno requirements.

pip install -r requirements.txt

Running with Sample Data

Out of the box, the application comes with sample data to predict eye color.

The commands in this section assume the virtual environment from the prerequisites step is activated and that the current directory genopheno/genopheno (the root package directory). Execute the following commands to use the sample data and default model settings.

# Preprocess the data
python preprocess.py

# Build the model
python model.py

# Use the model to predict phenotypes
python predict.py

Note, there is not enough data to produce a well trained model. The data is included as a quick way to understand how to run the application.

Running the Application

The application is broken down into three steps, each with their own command-line interface.

  1. Preprocessing the genomic data. This step converts user genotypes at each SNP to a mutation count.
  2. Building the model. This step uses the preprocessed data to build a model to predict phenotype.
  3. Using the model. This step uses the model from the previous step to predict phenotype for additional users.

The commands in this section assume the virtual environment from the prerequisites step is activated and that the current directory genopheno/genopheno (the root package directory).

Preprocessing

The application includes SNP data from dbSNP that will be used for SNP information by default. To run the application with a custom user set run:

python preprocess.py --user-geno mydata/users --known-phenos mydata/phenos.csv

Additionally, to use custom SNP data run:

python preprocess.py --user-geno mydata/users --known-phenos mydata/phenos.csv --snp mydata/snp

See the CLI for additional flags or refer to the table below.

python preprocess.py --help
Argument Short Description
--usergeno -u The directory containing user genomic data. Each file contains data for one user. 23andMe and Ancestry.com data formats are supported.
--known-phenos -p The file path to the file that contains the known phenotypes. This is used to train the model. This must be a CSV file with the following format with columns user_id and phenotype.
--snp -s The directory containing the SNP data for each genome. The supported file format is VCF.
--output -o The directory that the out files should be written to. This will include all files required for the machine learning input.

Custom Input Data

User genomic file names must start with the numeric user ID followed by an underscore and end with either 23andme.txt or ancestry.txt (i.e. user44_file19_yearofbirth_1970_sex_XY.23andme.txt).

The known phenotypes file must contain columns "user_id,phenotype". The user_id must be the numeric id that matches the numeric id in the genomic file name (i.e. 44). The phenotype column can contain any binary phenotype classification (i.e. blue_green/brown for eye color).

Building the Model

To build the model, use the output of the preprocessed step. If no preprocessed directory is supplied it uses the default output directory of the preprocess step.

SNPs are features in the model and are selected by identifying significant mutation differences between the negative and positive phenotype classifications. The model includes main effects for each selected SNP and optionally interaction effects between each selected SNP.

To build the model with a custom threshold for SNP selection and threshold for missing SNP data run:

python model.py --absolute-diff-thresh 35 --invalid-snp-thresh 50

There are more flags to customize the model. See the CLI help.

python model.py --help
Argument Short Description
--preprocessed -p The directory containing the output data from the initialization phase. Default: resources/full_data/preprocessed
--invalid-snp-thresh -it The maximum percentage of missing or invalid user observations a SNP can have before it is not considered as a feature in the model
--invalid-user-thresh -iu The maximum percentage of missing or invalid SNP observations a user can have before it is not considered as a valid example in the model.
--absolute-diff-thresh -adt The difference threshold required for the SNP to be selected, in percentage points.
--relative-diff-thresh -rdt The relative difference threshold. This is the absolute difference in mutation percentage divided by the minimum mutation value out of the two phenotypes.
--split s The percentage of users to use as test data.
--no-interactions -ni If set then interactions will not be included in the model.
--negative -n The phenotype value that should be considered as the negative case.
--max-snps -ms The maximum number of SNPs to include in the model

Output

After the model is built, a ROC curve and confusion matrix will be written to the output directory (roc.png and confusion_matrix.txt) to evaluate the model.

The model features, sorted by influence, will also be written to the output (features.csv). For example:

intercept: [104.24800058]

feature,coefficient
rs693:rs223192,-62.9458112451
rs693:rs306967,-55.0389140668
rs693:rs557993,-54.4003899739
rs693:rs382266,-54.4003899739

Using the Model

To build the model, use the output of the preprocessing and modeling steps. If no directories are supplied it uses the default output directories of each step.

To prediction phenotypes for new users run:

python predict.py --users-dir mydata/new_users

See the CLI for additional flags.

python predict.py --help
Argument Short Description
--users-dir -u The directory that contains the users genomic data to predict the phenotypes for. Default: resources/data/users
--init-dir -i The directory that the preprocessed files are in. Default: resources/full_data/preprocessed
--model-dir -m The directory that the model files are in. Default: resources/data/model
--output -o The directory that the output files should be written to. Default: resources/data/prediction

Output

The predictions are written to the output directory (predictions.csv). For example:

user_id,prediction
1892,Blue_Green
189,Brown
200,Blue_Green
2064,Blue_Green
2074,Blue_Green

genopheno's People

Contributors

lakarbatti avatar pfrancislyon avatar mbaybay avatar robreeves avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.