This project applies machine learning to publicly available genomic data to predict phenotype and identify Single Nucleotide Polymorphisms (SNPs) that are most influential. The method requires no domain knowledge and can be used with any binary phenotype classification.
As a proof of concept, this application was used to predict eye color using 436 users from self-reported data on openSNP. Using the default settings, the model achieved a 92% accuracy and 0.95 AUC for the ROC curve.
The model correctly identified HERC2 as the most influential gene and identified the polygenic relationship between genes HERC2 and OCA2.
It is recommended to install this application in a virtual environment. To create and activate the virtual environment run:
# install virtualenv if it is not already installed
pip install virtualenv
# create virtualenv
virtualenv -p <python2.7 path> genopheno_ve
Activate the virtualenv on macOS and Linux.
source genopheno_ve/bin/activate
Activate the virtualenv on Windows.
genopheno_ve\Scripts\activate
From the repository directory, install the genopheno requirements.
pip install -r requirements.txt
Out of the box, the application comes with sample data to predict eye color.
The commands in this section assume the virtual environment from the prerequisites step is activated and that the current directory genopheno/genopheno (the root package directory). Execute the following commands to use the sample data and default model settings.
# Preprocess the data
python preprocess.py
# Build the model
python model.py
# Use the model to predict phenotypes
python predict.py
Note, there is not enough data to produce a well trained model. The data is included as a quick way to understand how to run the application.
The application is broken down into three steps, each with their own command-line interface.
- Preprocessing the genomic data. This step converts user genotypes at each SNP to a mutation count.
- Building the model. This step uses the preprocessed data to build a model to predict phenotype.
- Using the model. This step uses the model from the previous step to predict phenotype for additional users.
The commands in this section assume the virtual environment from the prerequisites step is activated and that the current directory genopheno/genopheno (the root package directory).
The application includes SNP data from dbSNP that will be used for SNP information by default. To run the application with a custom user set run:
python preprocess.py --user-geno mydata/users --known-phenos mydata/phenos.csv
Additionally, to use custom SNP data run:
python preprocess.py --user-geno mydata/users --known-phenos mydata/phenos.csv --snp mydata/snp
See the CLI for additional flags or refer to the table below.
python preprocess.py --help
Argument | Short | Description |
---|---|---|
--usergeno | -u | The directory containing user genomic data. Each file contains data for one user. 23andMe and Ancestry.com data formats are supported. |
--known-phenos | -p | The file path to the file that contains the known phenotypes. This is used to train the model. This must be a CSV file with the following format with columns user_id and phenotype. |
--snp | -s | The directory containing the SNP data for each genome. The supported file format is VCF. |
--output | -o | The directory that the out files should be written to. This will include all files required for the machine learning input. |
User genomic file names must start with the numeric user ID followed by an underscore and end with either 23andme.txt or ancestry.txt (i.e. user44_file19_yearofbirth_1970_sex_XY.23andme.txt).
The known phenotypes file must contain columns "user_id,phenotype". The user_id must be the numeric id that matches the numeric id in the genomic file name (i.e. 44). The phenotype column can contain any binary phenotype classification (i.e. blue_green/brown for eye color).
To build the model, use the output of the preprocessed step. If no preprocessed directory is supplied it uses the default output directory of the preprocess step.
SNPs are features in the model and are selected by identifying significant mutation differences between the negative and positive phenotype classifications. The model includes main effects for each selected SNP and optionally interaction effects between each selected SNP.
To build the model with a custom threshold for SNP selection and threshold for missing SNP data run:
python model.py --absolute-diff-thresh 35 --invalid-snp-thresh 50
There are more flags to customize the model. See the CLI help.
python model.py --help
Argument | Short | Description |
---|---|---|
--preprocessed | -p | The directory containing the output data from the initialization phase. Default: resources/full_data/preprocessed |
--invalid-snp-thresh | -it | The maximum percentage of missing or invalid user observations a SNP can have before it is not considered as a feature in the model |
--invalid-user-thresh | -iu | The maximum percentage of missing or invalid SNP observations a user can have before it is not considered as a valid example in the model. |
--absolute-diff-thresh | -adt | The difference threshold required for the SNP to be selected, in percentage points. |
--relative-diff-thresh | -rdt | The relative difference threshold. This is the absolute difference in mutation percentage divided by the minimum mutation value out of the two phenotypes. |
--split | s | The percentage of users to use as test data. |
--no-interactions | -ni | If set then interactions will not be included in the model. |
--negative | -n | The phenotype value that should be considered as the negative case. |
--max-snps | -ms | The maximum number of SNPs to include in the model |
After the model is built, a ROC curve and confusion matrix will be written to the output directory (roc.png and confusion_matrix.txt) to evaluate the model.
The model features, sorted by influence, will also be written to the output (features.csv). For example:
intercept: [104.24800058]
feature,coefficient
rs693:rs223192,-62.9458112451
rs693:rs306967,-55.0389140668
rs693:rs557993,-54.4003899739
rs693:rs382266,-54.4003899739
To build the model, use the output of the preprocessing and modeling steps. If no directories are supplied it uses the default output directories of each step.
To prediction phenotypes for new users run:
python predict.py --users-dir mydata/new_users
See the CLI for additional flags.
python predict.py --help
Argument | Short | Description |
---|---|---|
--users-dir | -u | The directory that contains the users genomic data to predict the phenotypes for. Default: resources/data/users |
--init-dir | -i | The directory that the preprocessed files are in. Default: resources/full_data/preprocessed |
--model-dir | -m | The directory that the model files are in. Default: resources/data/model |
--output | -o | The directory that the output files should be written to. Default: resources/data/prediction |
The predictions are written to the output directory (predictions.csv). For example:
user_id,prediction
1892,Blue_Green
189,Brown
200,Blue_Green
2064,Blue_Green
2074,Blue_Green