
gnomix's Introduction

High Resolution Ancestry Deconvolution for Next Generation Genomic Data


Visualization of the process

This repository includes a Python implementation of G-Nomix, a fast, scalable, and accurate local ancestry method. See demo.

G-Nomix can be used in two ways:

  • training a model from scratch using reference training data or
  • loading a pre-trained G-Nomix model (see Pre-Trained Models below)

In both cases, the models are used to infer local ancestry on the provided query data. The query data must already be phased (using a program like beagle, shapeit, or eagle) and pre-processed to contain the same sites as the reference training samples, on the same strand. If a pre-trained model is used instead, see Pre-Trained Models below for its requirements.

Installation and Dependencies

To install the software, navigate to the desired folder and enter the following in the command line interface:

git clone https://github.com/AI-sandbox/gnomix
cd gnomix

The dependencies are listed in requirements.txt. Assuming pip is already installed, they can be installed via

$ pip install -r requirements.txt

The combined runtime for the cloning and the dependency installation should be around 2 minutes on a normal laptop.

The software has been tested in Python 3.7.4 on the following operating systems:

  • Linux: Ubuntu 18.04.5
  • macOS: Monterey (12.0.1)

Usage

When Using Pre-Trained Models

gnomix.py loads and uses a pre-trained G-Nomix model to predict the ancestry for a given <query_file> and a chromosome.

To execute the program with a pre-trained model run:

$ python3 gnomix.py <query_file> <output_folder> <chr_nr> <phase> <path_to_model> 

where

  • <query_file> is a .vcf or .vcf.gz file containing the query haplotypes which are to be analyzed (see example in the demo/data/ folder)
  • <output_folder> is where the results will be written (see details in Output below and an example in the demo/data/ folder)
  • <chr_nr> is the chromosome number
  • <phase> is either True or False corresponding to the intent of using the predicted ancestry for phasing correction (see details in Phasing below and in the gnofix/ folder). Note that initial phasing (using a program like beagle, shapeit, or eagle) must still have been performed first.
  • <path_to_model> is a path to the model used for predictions (see Pre-trained Models below)

Downloading pre-trained models

In order to incorporate our pre-trained models into your pipeline, please use the following command to download pre-trained models for the whole human genome. The SNPs used for our pre-trained models are also included in the form of a plink .bim file for every chromosome.

sh download_pretrained_models.sh

This creates a folder called pretrained_gnomix_models. For each chromosome, we publish a default_model.pkl which can be used as a pre-trained model in the <path_to_model> field and a .bim file as explained above.
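
To inspect a downloaded model outside of gnomix.py, here is a minimal sketch (assuming the .pkl files are plain Python pickles of the Gnomix model object, as the extension suggests; run it from the repository root so the src/ classes referenced by the pickle can be imported):

import pickle

# Hypothetical path; substitute the chromosome you downloaded
with open("pretrained_gnomix_models/chr22/default_model.pkl", "rb") as f:
    model = pickle.load(f)

print(type(model))  # should be the Gnomix model class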

When making predictions, the input to the model is the intersection of the pre-trained model SNP positions and the SNP positions in the <query_file>. Positions that appear only in the original training input used to create the model (and not in the query samples) are encoded as missing, while positions that appear only in the <query_file> are discarded. We suggest that your query samples include as many model SNPs (listed in the .bim files) as possible (never less than 80%, for sufficient accuracy). When the script is executed, it logs the intersection ratio between the model SNPs and the SNPs in your query samples, since ancestry inference performance depends on how many of the model's SNP positions are missing from your query samples. If the intersection is low, you must either impute your query samples to match the full set of SNPs present in the pre-trained model, or train your own new model using references that contain all the SNPs in your query samples.

N.B. Your query samples must have SNPs defined on the same strand as in the model. You can use the included model .bim files as a reference to find, and then flip, any SNPs in your query samples that are defined on the opposite strand. (If this step is not performed, your query samples will appear to contain variation unseen during the model's training and will be assigned unpredictable ancestries.)
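
To estimate the intersection ratio yourself before running inference, here is a minimal sketch (file paths are placeholders; .bim files are tab-separated with columns chromosome, variant ID, genetic position, physical position, allele 1, allele 2):

import pandas as pd

bim = pd.read_csv(
    "pretrained_gnomix_models/chr22/chr22.bim", sep="\t", header=None,
    names=["chm", "id", "cm", "pos", "a1", "a2"],
)
model_pos = set(bim["pos"])

# Collect query positions from an uncompressed VCF (use gzip.open for .vcf.gz)
query_pos = set()
with open("query_chr22.vcf") as f:
    for line in f:
        if not line.startswith("#"):
            query_pos.add(int(line.split("\t", 2)[1]))

ratio = len(model_pos & query_pos) / len(model_pos)
print(f"Model SNPs covered by query: {ratio:.1%}")  # aim for well above 80%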

The models named default_model.pkl are trained on hg build 37 references from the following biogeographic regions: Sub-Saharan African (AFR), East Asian (EAS), European (EUR), Native American (NAT), Oceanian (OCE), South Asian (SAS), and West Asian (WAS); the model labels and predicts them as 0, 1, ..., 6, respectively. The populations used to train these ancestries are given in the supplementary section of the reference provided at the bottom of this readme.

When Training a Model From Scratch

To execute the program when training a model run:

$ python3 gnomix.py <query_file> <output_folder> <chr_nr> <phase> <genetic_map_file> <reference_file> <sample_map_file>

where the first 4 arguments are described above in the pre-trained setting and

  • <genetic_map_file> is the genetic map file. It is a .tsv file with 3 columns: chromosome number, SNP physical position, and SNP genetic position. There should be no header lines unless they start with "#". See example in the demo/data/ folder.
  • <reference_file> is a .vcf or .vcf.gz file containing the reference haplotypes (in any order)
  • <sample_map_file> is a sample map file matching reference samples to their respective reference populations

The program uses the reference file and the sample map file as input to our simulation algorithm (see pyadmix/) to create training data for the model. Also, note that when running inference on the trained models, the <query_file> needs to have the same build as the genetic map used to train the model (for instance, in the case of humans, build 37 or build 38).
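
For the sample map, consult demo/data/1000g.smap for the exact layout; as a minimal sketch, assuming a two-column, tab-separated format mapping sample IDs to population labels:

# Hypothetical sample IDs; use IDs that appear in your reference VCF
samples = {
    "HG00096": "EUR",
    "NA18501": "AFR",
    "HG00403": "EAS",
}
with open("my_reference.smap", "w") as f:
    for sample_id, population in samples.items():
        f.write(f"{sample_id}\t{population}\n")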

Demo

After downloading our pre-trained models, one can demo the software in inference mode by running:

python3 gnomix.py demo/data/small_query_chr22.vcf.gz demo_output 22 True pretrained_gnomix_models/chr22/model_chm_22.pkl

This small query file contains only 9 samples of European, East Asian, and African ancestry. The execution should take around a minute on a standard laptop. The inference can then be analyzed, for example in the file demo_output/query_results.msp, where we expect to see those three ancestries being inferred. For more details on this analysis, see the section on Output below.

For more demos with training and larger datasets, see the demo notebook demo.ipynb.

Advanced Options

More advanced configuration settings can be found in config.yaml. They include general settings, simulation settings, and model settings. More details are given in the file itself. If training a model from scratch, you can also pass an alternative config file as the last argument:

$ python3 gnomix.py <query_file> <output_folder> <chr_nr> <phase> <genetic_map_file> <reference_file> <sample_map_file> <config_file>

If no config is given, the program uses the default (config.yaml). The config file has advanced training options. Some of the parameters are listed below (a sketch of editing them programmatically follows the list):

  • verbose (bool) - verbosity (default True)
  • simulation:
    • run: (bool) - whether to run simulation or not, can be skipped if previously done (default True)
    • path: (path) - where to store the simulated data; if run is False, this is where the simulated data will be sought (default: <output_folder>/generated_data/)
    • r_admixed (float, positive) - the number of simulated admixed individuals generated for training is r_admixed x the size of the sample map (number of reference samples). The default is 1. Set it lower if memory is an issue. (To overcome memory constraints, a minor allele frequency filter can also be used to remove very rare variants.)
    • splits: must contain proportions for train1, train2 and optionally validation. If the validation ratio is 0, validation is not performed.
    • generations - the range of generations since admixture to simulate; not critical
    • rm_data (bool) - whether to remove the simulated data after training (to conserve disk space). It is set to False if run is False. Default False.
  • model:
    • name (string) - model's name: default is "model"
    • inference (string) - 4 possible options - best / fast / large / default. "best" uses a random string kernel base + xgboost smoother and is recommended for array data. "fast" uses a logistic regression base + crf smoother. "large" uses logistic regression + a convolutional smoother and is good for large datasets where memory requirements are an issue. "default" uses a logistic regression base + xgboost smoother and on whole-genome data has nearly the same accuracy as "best," but with a much faster runtime.
    • window_size_cM (float, positive) - size of window in centiMorgans; use larger windows if SNP density is lower, e.g., genotype array data vs. sequence data (default 0.5)
    • smooth_size (int, positive) - number of windows to be taken as context for the smoother (default 75)
    • context_ratio (float between 0 and 1) - context of base model windows (default 0.5)
    • retrain_base (bool) - once the smoother is trained, retrain the base models using both train1 and train2 (and validation data, if any) to obtain a final base model (default True)
    • calibrate (bool) - applies calibration to output probabilities (default False)
    • n_cores (int, positive) - how many CPU cores to use (default is the maximum available); reduce this if you are on a shared cluster and using only a subset of nodes
  • inference:
    • bed_file_output: generate files for each individual that show the run length encoding of their ancestry segments (default False)
    • snp_level_inference: output ancestry inference for each marker of the query file (default False)
    • visualize_inference: create pictures showing the ancestry segments colored along each individual's chromosomes using Tagore (default False)
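
As a minimal sketch of adjusting these settings programmatically (assuming the nesting shown in the list above; PyYAML is already among the dependencies):

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Example tweaks: a smaller simulated dataset and SNP-level output
config["simulation"]["r_admixed"] = 0.5
config["inference"]["snp_level_inference"] = True

with open("my_config.yaml", "w") as f:
    yaml.safe_dump(config, f)

The resulting my_config.yaml can then be passed as the optional last argument shown above.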

More model combinations

For more base + smoother combinations one can edit the gnomix.py file in the following way:

import the base model of choice from src/Base/models.py, e.g.,

from src.Base.models import LogisticRegressionBase

import the smoother of choice from src/Smooth/models.py, e.g.,

from src.Smooth.models import XGB_Smoother

and then, in the train_model() function, initialize the Gnomix object with the imported models:

model = Gnomix(
	...,
	base = LogisticRegressionBase,
	smooth = XGB_Smoother,
	...
)

Output

The results (including predictions, trained models and analysis) are stored in the <output_folder>.

Inference

The inference is written to two files: one with a single ancestry estimate for each marker (query_results.msp) and one with probability estimates for each ancestry at each marker (query_results.fb). Below, we describe both files in more detail.

query_results.msp

In the query_results.msp file, the first line is a comment line that specifies the order and encoding of populations, e.g.: #Sub_population order/code: golden_retriever=0 labrador_retriever=1 poodle poodle_small=2

The second line specifies the column names, and every following line marks an interval on the genome.

The first 6 columns specify

  • the chromosome
  • interval of the genetic marker's physical position in base pair units (one column for the start point and one for the end point)
  • interval of the genetic position in centiMorgans (one column for the start point and one for the end point)
  • number of <query_file> SNP positions included in the interval

The remaining columns give the predicted reference panel population for the given interval. Each genotype consists of two haplotypes, so there are 2*(number of genotypes) prediction columns, and the total number of columns in the file is 6 + 2*(number of genotypes).
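
A minimal sketch for loading the file with pandas (the separator and skiprows values follow the layout described above; exact column names come from the file's second line):

import pandas as pd

# Line 1: population code map; line 2: column names; rest: one row per interval
msp = pd.read_csv("demo_output/query_results.msp", sep="\t", skiprows=1)

intervals = msp.iloc[:, :6]    # chromosome, bp interval, cM interval, #snps
predictions = msp.iloc[:, 6:]  # two haplotype columns per genotype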

query_results.fb

In the query_results.fb file, the first line is a comment line that specifies the order of the populations, e.g.: #reference_panel_population: AFR EUR NAT

The second line specifies the column names, and every following line marks an interval on the genome.

The first 4 columns specify

  • the chromosome
  • mean of genetic marker's physical position in base pair units
  • mean of genetic position in centiMorgans
  • genetic marker index

The remaining columns represent the query haplotypes and reference panel populations, and each entry is the estimated probability of the given genome position coming from the given population. Each genotype consists of two haplotypes, so there are 2*(number of genotypes)*(number of reference populations) prediction columns, and the total number of columns in the file is 4 + 2*(number of genotypes)*(number of reference populations).
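
A minimal sketch for estimating rough global ancestry proportions per haplotype by averaging these probabilities across markers (same loading assumptions as for the .msp file above):

import pandas as pd

fb = pd.read_csv("demo_output/query_results.fb", sep="\t", skiprows=1)

probs = fb.iloc[:, 4:]             # haplotype-by-population probability columns
global_proportions = probs.mean()  # genome-wide average per column
print(global_proportions.head())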

query_results.lai (BETA)

The query_results.lai is an optional output that includes the inferred ancestry label for each marker in the query file. Please note that this feature is in beta stage and therefore the program does not export this file unless snp_level_inference is set to True in the config.yaml file.

The first line of the output file is a comment line that specifies the order and encoding of populations, e.g.: #Sub_population order/code: golden_retriever=0 labrador_retriever=1 poodle poodle_small=2, just like in the msp file.

The second line specifies the column names, and every following line marks a genome position.

The first column indicates the physical position of the SNP, and the remaining columns give the predicted reference panel population for that position. Each genotype consists of two haplotypes, so there are 2*(number of genotypes) prediction columns, and the total number of columns in the file is 1 + 2*(number of genotypes).

query_file_phased.vcf

When using Gnofix for phasing error correction (see Phasing below), the inference above is performed on the query haplotypes as phased by Gnofix. These phased haplotypes are then also exported to query_file_phased.vcf in the <output_folder>/ folder.

Visualization

To visualize the local ancestry output along the chromosome using tagore for plotting, set the visualize_inference option to True in the config file.

Model

When training a model, the resulting model will be stored in <output_folder>/models. That way it can be re-used for analyzing another dataset. The model's estimated accuracy is logged along with a confusion matrix which is stored in <output_folder>/models/analysis.

Simulated data

The program simulates training data and stores it in <output_folder>/generated_data. To automatically remove the created data when training is done, set rm_simulated_data to True in config.yaml. Note that in some cases, the simulated data can be re-used for training with similar settings. In those cases, keeping the data and then setting run_simulation to False will re-use the previously simulated data, which can save a lot of time and computation.

Phasing

Depiction of the process

Accurate phasing of genomic data is crucial for human demographic modeling and identity-by-descent analyses. It has been shown that leveraging information about an individual’s genomic ancestry improves performance of current phasing algorithms. Gnofix is a method that uses local ancestry inference to do exactly that. If you suspect your data might have phasing errors (generally the case unless trio phasing was possible), we recommend setting the <phase> option to True. See the gnofix/ folder if you are interested in more details on the algorithm.

Local Ancestry for Phasing Error Correction Sequenced haplotypes phased with a phasing software (left). LAI is used to label haplotypes with ancestry predictions and phasing errors become evident (center). Phasing error correction using LAI is applied to correct phasing errors (right). Small numbers of phasing errors do not, however, impact the correct association of a variant with an ancestry, and so are typically only a visual nuisance.

Calibration

To ensure that G-Nomix outputs probability estimates that reflect its true confidence and accuracy, we recommend using calibration. We use isotonic regression to map the predicted probabilities to calibrated probabilities, so that predictions made with X% confidence are correct approximately X% of the time in practice.
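
As a toy illustration of the idea (not G-Nomix's internal code), using scikit-learn's IsotonicRegression to map raw confidences to calibrated ones on made-up data:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw_conf = rng.uniform(0.5, 1.0, size=1000)       # model's reported confidence
correct = rng.uniform(size=1000) < raw_conf ** 2  # toy labels: model is overconfident

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_conf, correct.astype(float))

calibrated = iso.predict(raw_conf)  # confidence now closer to empirical accuracy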

License

NOTICE: This software is available for use free of charge for academic research use only. Commercial users, for profit companies or consultants, and non-profit institutions not qualifying as "academic research" must contact the Stanford Office of Technology Licensing for a separate license. This applies to this repository directly and any other repository that includes source, executables, or git commands that pull/clone this repository as part of its function. Such repositories, whether ours or others, must include this notice. Academic users may fork this repository and modify and improve to suit their research needs, but also inherit these terms and must include a licensing notice to this effect.

Cite

When using this software, please cite:

Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante, Daniel Mas Montserrat, Alexander G Ioannidis: "High Resolution Ancestry Deconvolution for Next Generation Genomic Data"

https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1

@article {Hilmarsson2021.09.19.460980,
	author = {Hilmarsson, Helgi and Kumar, Arvind S and Rastogi, Richa and Bustamante, Carlos D and Mas Montserrat, Daniel and Ioannidis, Alexander G},
	title = {High Resolution Ancestry Deconvolution for Next Generation Genomic Data},
	elocation-id = {2021.09.19.460980},
	year = {2021},
	doi = {10.1101/2021.09.19.460980},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {As genome-wide association studies and genetic risk prediction models are extended to globally diverse and admixed cohorts, ancestry deconvolution has become an increasingly important tool. Also known as local ancestry inference (LAI), this technique identifies the ancestry of each region of an individual{\textquoteright}s genome, thus permitting downstream analyses to account for genetic effects that vary between ancestries. Since existing LAI methods were developed before the rise of massive, whole genome biobanks, they are computationally burdened by these large next generation datasets. Current LAI algorithms also fail to harness the potential of whole genome sequences, falling well short of the accuracy that such high variant densities can enable. Here we introduce G-Nomix, a set of algorithms that address each of these points, achieving higher accuracy and swifter computational performance than any existing LAI method, while also enabling portable models that are particularly useful when training data are not shareable due to privacy or other restrictions. We demonstrate G-Nomix (and its swift phase correction counterpart Gnofix) on worldwide whole-genome data from both humans and canids and utilize its high resolution accuracy to identify the location of ancient New World haplotypes in the Xoloitzcuintle, dating back over 100 generations. Code is available at https://github.com/AI-sandbox/gnomixCompeting Interest StatementCDB is the founder and CEO of Galatea Bio Inc and on the boards of Genomics PLC and Etalon.},
	URL = {https://www.biorxiv.org/content/early/2021/09/21/2021.09.19.460980},
	eprint = {https://www.biorxiv.org/content/early/2021/09/21/2021.09.19.460980.full.pdf},
	journal = {bioRxiv}
}

gnomix's People

Contributors

alexioannidis, arvind0422, dmasmont, weekend37


gnomix's Issues

Duplicated header in vcf output

After phasing, G-nomix spits out a phased vcf file.
Some of the header information comes once from the source (query) vcf and once from the G-nomix vcf writer:

[screenshot: duplicated header lines in the output VCF]

This causes issues for downstream tasks.

Most of the input headers should probably be overwritten by the vcf writer.

Can't download pre-trained models

Hello dear developers,

I'm trying to download pretrained models but it doesn't seem to work as it's forming an empty cookie file.

Here's the content I'm getting :

# Netscape HTTP Cookie File
# https://curl.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

Thanks for putting this tool together!
Have a good day!

Jean-Christophe

Versioned releases

Could you please start creating versioned releases for ease of ensuring reproducible results?

error of "struct.error: 'i' format requires -2147483648 <= number <= 2147483647"

I am getting the error below for most, but not all chromosomes. I am fitting my own model from scratch, using a subset of samples from the 1K Genomes 30x as a reference panel, with the variants reduced to match those in my array-based query file.

I have two different query files, on different arrays, and they both throw this error on the same chromosomes, so I suppose the error is coming about from the reference panel. I've looked into the solution mentioned in issue #40; it wasn't the problem. I'm currently trying the reference panel without dropping any variants, but that runs painfully slowly so it may be a while before I know.

Thanks for any suggestions.
Steve

Reading data...
Building model...
Training base models...
100%|████████████████████████████████████████| 660/660 [10:10:54<00:00, 55.54s/it]
Training smoother...
100%|████████████████████████████████████████| 660/660 [02:38<00:00, 4.17it/s]

Traceback (most recent call last):
  File "gnomix/gnomix.py", line 397, in <module>
    model = train_model(config, data_path, verbose=verbose)
  File "gnomix/gnomix.py", line 195, in train_model
    model.train(data=data, retrain_base=retrain_base, evaluate=True, verbose=verbose)
  File "gnomix/src/model.py", line 116, in train
    B_t2 = self.base.predict_proba(X_t2)
  File "gnomix/src/Base/base.py", line 139, in predict_proba
    return self.predict_proba_vectorized(X)
  File "gnomix/src/Base/base.py", line 172, in predict_proba_vectorized
    B = np.array(pool.starmap(self.predict_proba_base_model, base_args))
  File "/usr/local/Cellar/python@3.7/3.7.16/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/local/Cellar/python@3.7/3.7.16/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/usr/local/Cellar/python@3.7/3.7.16/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
    put(task)
  File "/usr/local/Cellar/python@3.7/3.7.16/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/Cellar/python@3.7/3.7.16/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Can pre-trained models be used on GRCh38 data?

I am interested in using GNomix for local ancestry inference on my phased WES data.
I understand that the pre-trained models used GRCh37 data for training. My data is in GRCh38.
What I'm trying to understand is whether GNomix uses any positional information for the predictions in a way that would make it dependent on a genome build? Is it necessary to retrain the model from scratch using GRCh38 reference in order to use it on my data, or can I use the pre-trained models for my GRCh38 data out of the box? Any advice is highly appreciated.

Error when training model from scratch

Hello, I'm following the gnomix demo tutorial. I'm running the line below, adapted from the tutorial. I am using a Conda environment through a Mac terminal.

python3 /share/hennlab/progs/gnomix/gnomix.py /share/genomes/1000genomes/phase3/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz /share/hennlab/progs/gnomix/demo/data/allchrs.b37.gmap /share/hennlab/progs/gnomix/demo/data/output 22 False /share/hennlab/progs/gnomix/demo/data/reference_1000g.vcf /share/hennlab/progs/gnomix/demo/data/1000g.smap > /share/hennlab/progs/gnomix/demo/training_log.txt

I keep getting this error:
FileNotFoundError: [Errno 2] No such file or directory: 'False'

Is gnomix interpreting False as a file? How can I troubleshoot this error?
I'm using python 3.7.4 and I downloaded all required packages from the requirements.txt file

This is the python script that I used to create the gnomix call above:

import numpy as np
import os
import pandas as pd

data_path = "/share/hennlab/progs/gnomix/demo/data"
query_file = "/share/genomes/1000genomes/phase3/" + "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
reference_file = "/share/hennlab/progs/gnomix/demo/data/" + "reference_1000g.vcf"
sample_map_file = "/share/hennlab/progs/gnomix/demo/data/" + "1000g.smap"
genetic_map_file = "/share/hennlab/progs/gnomix/demo/data/" + "chr22.b37.gmap"
chm_nr = "22"
phase = "False"
output_basename = "/share/hennlabprogs/gnomix/demo/data/output"

train_cmd = " ".join(["python3 /share/hennlab/progs/gnomix/gnomix.py",
query_file, genetic_map_file, output_basename, chm, phase, reference_file, sample_map_file]) +
" > /share/hennlab/progs/gnomix/demo/training_log.txt"
print("Running in command line: \n\t", train_cmd)
os.system(train_cmd)

Error: Too few founders

Hello. I am trying to run Gnomix, but I am getting the following error message. I checked the input files and I cannot find the problem. Could you give me some idea of what I should be looking for? I have three reference populations, with 10 individuals in each.

Thank you in advance,

Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Traceback (most recent call last):
  File "/sw/bioinfo/G-Nomix/2022-09-18-de952a2/rackham/bin/gnomix.py", line 394, in <module>
    simulate_splits(base_args, config, data_path) # will create the simulation_output folder
  File "/sw/bioinfo/G-Nomix/2022-09-18-de952a2/rackham/bin/gnomix.py", line 296, in simulate_splits
    laidataset.simulate(num_outs[split],
  File "/sw/bioinfo/G-Nomix/2022-09-18-de952a2/src/gnomix-main/src/laidataset.py", line 412, in simulate
    maternal = admix(founders,founders_weight,gens[i],self.breakpoint_prob,self.num_snps,self.morgans)
  File "/sw/bioinfo/G-Nomix/2022-09-18-de952a2/src/gnomix-main/src/laidataset.py", line 133, in admix
    assert len(founders) >= 2, "Too few founders!!!"
AssertionError: Too few founders!!!

Incorrect window count in training set #2

We ran G-Nomix on array data from all autosomes using the default config file for arrays. All chromosomes completed successfully except chromosome 7, which died while training the smoother. The error message was:

Traceback (most recent call last):
  File "gnomix.py", line 396, in <module>
    model = train_model(config, data_path, verbose=verbose)
  File "gnomix.py", line 195, in train_model
    model.train(data=data, retrain_base=retrain_base, evaluate=True, verbose=verbose)
  File "/tmp/tmp.jstlaojJjT/src/model.py", line 117, in train
    self.smooth.train(B_t2,y_t2)
  File "/tmp/tmp.jstlaojJjT/src/Smooth/smooth.py", line 35, in train
    B_s, y_s = self.process_base_proba(B, y)
  File "/tmp/tmp.jstlaojJjT/src/Smooth/models.py", line 23, in process_base_proba
    B_slide, y_slide = slide_window(B, self.S, y)
  File "/tmp/tmp.jstlaojJjT/src/Smooth/utils.py", line 27, in slide_window
    y_slide = None if y is None else y.reshape(N*W)
ValueError: cannot reshape array of size 1609674 into shape (1607992,)

In the final line above, y enters as a 2d object with size 1682*957, but is expected to be reshaped to size N*W=1682*956. We are unsure where the discrepancy in window counts arises, but after changing the value for model: window_size_cM to 0.4 from 0.2 to test if a different window count might work, the program is able to pass the previous point of failure and we confirmed that the new dimensions of y (1682*478) are compatible with the new values of N (1682) and W (478).

Issue in training mode : Qt platform plugin xcb

Hello dear support,

I tried to run my own training set but it's crashing at the re-training step.

Here's the command line I ran on a cluster machine :

python3 ~/bin/gnomix/gnomix.py hap_chr22.vcf chr22.output 22 False genetic_map_chr22_combined_b37.forGnomix.txt ~/References/1000G/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.noCNV.snv.vcf.gz ~/References/1000G/samples.merged.superpop.map

Here are the logs :

std err :

qt.qpa.xcb: could not connect to display localhost:24.0
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, webgl, xcb.

stdout :

Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Training...
Reading data...
Building model...
Training base models...
Training smoother...
[14:50:58] WARNING: xgboost/src/learner.cc:480:
Parameters: { use_label_encoder } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Evaluating model...
Re-training base models...

I also have some warnings due to xgboost, but it seems to run alright even with these:

~/virtualenv_python_3.9_gnomix/lib/python3.9/site-packages/xgboost/compat.py:93: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index

Thanks for your help!

Jean-Christophe

Pre-Trained Models Calibration Issue

When trying to run the "python3 gnomix.py demo/data/small_query_chr22.vcf.gz demo_output 14 True pretrained_gnomix_models/chr14/model_chm_14.pkl > ./demo/pretrained_log.txt" command, I keep getting this error message:
ModuleNotFoundError: No module named 'calibration'

AttributeError: 'XGBClassifier' object has no attribute 'enable_categorical'

I appreciate the quick work on #8, but after pulling the latest version I've graduated to another problem:

Inferring ancestry on query data...
/usr/local/lib/python3.9/site-packages/sklearn/base.py:209: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
  warnings.warn('From version 0.24, get_params will raise an '
Traceback (most recent call last):
  File "gnomix.py", line 383, in <module>
    run_inference(base_args, model, 
  File "gnomix.py", line 55, in run_inference
    y_proba_query = model.smooth.predict_proba(B_query)
  File "src/Smooth/smooth.py", line 46, in predict_proba
    proba = self.model.predict_proba(B_s)
  File "/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py", line 1348, in predict_proba
    class_probs = super().predict(
  File "/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py", line 879, in predict
    if self._can_use_inplace_predict():
  File "/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py", line 813, in _can_use_inplace_predict
    not self.enable_categorical
AttributeError: 'XGBClassifier' object has no attribute 'enable_categorical'

PerformanceWarning message

Hi,

I am not sure if this is actually an issue that would compromise the analysis, but there is a message at the end of the analysis as follows:



--------------------------------------------------------------------------------
-----------------------------------  Gnomix  -----------------------------------
--------------------------------------------------------------------------------
When using this software, please cite: 
Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante, 
Daniel Mas Montserrat, Alexander G Ioannidis: 
"High Resolution Ancestry Deconvolution for Next Generation Genomic Data" 
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Training...
Reading data...
Building model...
Training base models...
100%|████████████████████████████████████████| 709/709 [01:04<00:00, 10.95it/s]                                      
Training smoother...
Fitting calibrator...
Evaluating model...
Re-training base models...
100%|████████████████████████████████████████| 709/709 [01:20<00:00,  8.83it/s]                                                            
Analyzing model performance...
Estimated train accuracy: 98.42%
Estimated val accuracy: 98.32%
Model, info and analysis saved at chr15/models/model_chm_chr15
--------------------------------------------------------------------------------
Launching inference...
Loading and processing query file...
- Number of SNPs from model: 1087854
- Number of SNPs from file: 30954
- Number of intersecting SNPs: 20036
- Percentage of model SNPs covered by query file: 1.8399999999999999%
Inferring ancestry on query data...
Phasing individual 493/493
Writing phased SNPs to disc...
/Users/debortoli/Downloads/arquivos_compactados/gnomix-main/src/utils.py:316: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  df[data_samples[i]] = genotype_person
Saving results...


Specifically, it is the PerformanceWarning

I am getting a really weird result when running gnomix...
I have an admixed sample (Native American, African and European ancestries), with a global ancestry estimate around 70% EUR, 25% AFR and 5% NAM...
The Gnomix output .msp is assigning (I've tested only on one chromosome) AFR for all windows in all samples, which is not correct...

Not sure if this PerformanceWarning might be the culprit of this...

Thanks,

Strange results on 23 and me genotype data and pretrained models

I'm testing Gnomix on 23andme genotype data using the pretrained models, and the results I get seem completely incorrect. The individual is of Eastern European and Jewish ancestry, but the predictions (a) consist of short segments, and (b) most of the segments, at least on Chr 22, are predicted to be African.

The statistics in pretrained_log file (for Chr 22) are:
Loading and processing query file...

  • Number of SNPs from model: 317408
  • Number of SNPs from file: 2127
  • Number of intersecting SNPs: 2034
  • Percentage of model SNPs covered by query file: 0.64%

What could be the source of the error?

Error from predict_proba call

I'm afraid I barely know any python, so I can't debug this, but I'm getting the following error when using 1KG phased data.

Steve Buyske

python3  gnomix/gnomix.py ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz gnomix/demo/data/allchrs.b37.gmap test_chr_22 22 False gnomix/demo/model_chm_22.pkl.gz 
--------------------------------------------------------------------------------
-----------------------------------  Gnomix  -----------------------------------
--------------------------------------------------------------------------------
When using this software, please cite: 
Kumar, A., Montserrat, D.M., Bustamante, C. and Ioannidis, A. 
"XGMix: Local-Ancestry Inference With Stacked XGBoost" 
International Conference on Learning Representations Workshops 
ICLR, 2020, Workshop AI4AH 
https://www.biorxiv.org/content/10.1101/2020.04.21.053876v1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Launching in pre-trained mode...
Loading model...
Launching inference...
Loading and processing query file...
- Number of SNPs from model: 1103547
- Number of SNPs from file: 1103547
- Number of intersecting SNPs: 1102381
- Percentage of model SNPs covered by query file: 99.89%
Inferring ancestry on query data...
Traceback (most recent call last):
  File "/Users/buyske/Box/brant/gnomix/gnomix.py", line 372, in <module>
    run_inference(base_args, model, verbose=True)
  File "/Users/buyske/Box/brant/gnomix/gnomix.py", line 56, in run_inference
    B_query = model.base.predict_proba(X_query)
  File "/Users/buyske/Box/brant/gnomix/src/Base/base.py", line 143, in predict_proba
    B[:,i,self.models[i].classes_] = self.models[i].predict_proba(X_w)
KeyError: 0

Stuck at analyzing model performance

Hi,

When I run gnomix, it gets stuck at this stage. Seemingly no CPU usage...

Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Training...
Reading data...
Building model...
Training base models...
100%|████████████████████████████████████████| 373/373 [00:02<00:00, 131.65it/s]
Training smoother...
/home/install/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Evaluating model...
Re-training base models...
100%|████████████████████████████████████████| 373/373 [00:03<00:00, 123.36it/s]
Analyzing model performance...

Insane memory usage

I have just had a gnomix run die after attempting to use more than 1.4 TB of memory. Yes, terabytes. These are unimputed GSA microarray data phased using eagle. I am fitting the model myself using the suggested microarray configuration, and for now, I am only calculating local ancestry on chromosome 17. Based on reference overlap even after generating the local model, it looks like I will either need to filter the reference before model generation or impute the data.

I suspect the issue is partially sample size. I have 31,705 samples in that cohort. I am also running it on a GDA cohort (10,859 samples) and another cohort of 13 samples, and it did not die on the smallest cohort. I have a couple of questions on how to optimize this:

Firstly, it appears that the model generation only uses the reference dataset and not the sample to which it will be applied. I wrote a script to compare the models generated with different datasets and they appear to be identical. Is that the case? I ran without calibration, so is that the case with calibration?

Secondly, is there any problem with first generating the models, then applying them to all of the different datasets? Do you have a recommendation for a minimal dataset to use for that for the model generation to happen as fast as possible?

Confused by implausible results on array data

I'm very confused by the implausible results I'm getting on two African American cohorts, one genotyped on an Axiom array and one on an Omni array. For each, I phased them using the Michigan server and ran gnomix against the pretrained model with the default settings other than setting "inference: best" in config.yaml

However, I am getting means of .26 for the proportion of EUR ancestry, .25 for NAT, and .44 for AHG in one cohort, and .15, .18, and .65 respectively in the other cohort. There's not a lot of individual variability; for example, in the first cohort, the min and max for AHG are .41 and .47.

I get very similar results regardless of whether I have gnomix correct the phasing or whether I restrict to SNPs that are in the pretrained model. I'm sure I'm doing something wrong, but I'm baffled. I could use some guidance on how to debug this.

Inquire regarding Phasing Correction and LAI results

Hi,

I have been doing some LAI tests and I have some inquires regarding the outputs that I need some help understanding:

I have performed a Gnomix run with the TRUE option for the phasing correction enabled, as well as with FALSE. I have also turned on the option to generate the .lai file, which gives the ancestry per marker.

When I open the .lai file from the inference done with the TRUE option enabled for the phasing correction, I can see that the markers are randomly sorted. Why are they sorted that way?
Also, when I take this .lai file from above and sort the positions to reflect the same order as in the .lai file from the FALSE run (for phasing correction), I noticed that some markers are assigned a completely different ancestry. I was expecting no change in the assigned ancestry, just a switch of the phase, when turning the TRUE option on.

I don't know if that is something related to the fact that when I ran with the TRUE option on, I see that the .msp file has fixed windows, while for FALSE, the windows are different sizes...

Thanks,

Guilherme

sklearn version for pre-trained model for demo is different than requirements.txt

The requirements.txt installs scikit-learn version 0.23.1. However, when the demo evaluates gnomix.py with the pretrained model (./demo/model_chm_22.pkl.gz), the code crashes when looking for an attribute that seems to be unavailable in that version.

UserWarning: Trying to unpickle estimator LogisticRegression from version 0.24.1 when using version 0.23.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/home/eardil/.pyenv/versions/3.6.9/lib/python3.6/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator IsotonicRegression from version 0.24.1 when using version 0.23.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
Traceback (most recent call last):
  File "gnomix.py", line 365, in <module>
    run_inference(base_args, model, visualize=config["inference"]["visualize_inference"], verbose=True)
  File "gnomix.py", line 55, in run_inference
    y_proba_query = model.smooth.predict_proba(B_query)
  File "/home/eardil/git/gnomix/src/Smooth/smooth.py", line 52, in predict_proba
    proba = self.calibrator.transform(proba)
  File "/home/eardil/git/gnomix/src/Smooth/Calibration.py", line 67, in transform
    iso_prob[:,i] = self.models[i].transform(proba_flatten[:,i])
  File "/home/eardil/.pyenv/versions/3.6.9/lib/python3.6/site-packages/sklearn/isotonic.py", line 358, in transform
    res = self.f_(T)
AttributeError: 'IsotonicRegression' object has no attribute 'f_'

Installing scikit-learn==0.24.1 solves this issue. If this is all it takes then maybe a change to requirements.txt would be enough. I'll create a pull request in case it is useful.

Issue with the requirement.txt

Hi,

I am trying to install the dependencies and I am having a hard time with it.

$ pip install -r requirements.txt 
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Requirement already satisfied: cycler==0.10.0 in /usr/local/lib/python3.9/site-packages (from -r requirements.txt (line 1)) (0.10.0)
Collecting Cython==0.29.21
  Using cached Cython-0.29.21-py2.py3-none-any.whl (974 kB)
Collecting dask==2.20.0
  Using cached dask-2.20.0-py3-none-any.whl (826 kB)
ERROR: Could not find a version that satisfies the requirement dataclasses==0.8 (from versions: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
ERROR: No matching distribution found for dataclasses==0.8

I can't find a way to install this dataclasses version 0.8...

Also when I try to run gnomix.py I get this error:


$ python3 gnomix.py 
Traceback (most recent call last):
  File "gnomix.py", line 16, in <module>
    from src.model import Gnomix
  File "/Users/debortoli/Dropbox/Postdoc/brazil/hgdp/gnomix-main/src/model.py", line 8, in <module>
    from src.Smooth.models import XGB_Smoother, CRF_Smoother
  File "/Users/debortoli/Dropbox/Postdoc/brazil/hgdp/gnomix-main/src/Smooth/models.py", line 6, in <module>
    from src.Smooth.crf import CRF
  File "/Users/debortoli/Dropbox/Postdoc/brazil/hgdp/gnomix-main/src/Smooth/crf.py", line 3, in <module>
    import sklearn_crfsuite
ModuleNotFoundError: No module named 'sklearn_crfsuite'

Not sure if it's related to my python or because I am using a Mac...

Any help is helpful,

Thank you.

demo.ipynb has a bad script

Hello gnomix team,

I was following the demo.ipynb file to train a model and I realized that it has a wrong script:

train_cmd = " ".join(["python3 gnomix.py", query_file, genetic_map_file, output_basename, chm, phase, reference_file, sample_map_file]) + \ " > ./demo/training_log.txt" print("Running in command line: \n\t", train_cmd) os.system(train_cmd)

The order is not the same as the one you suggest in the README.md, which is:

$ python3 gnomix.py <query_file> <output_folder> <chr_nr> <phase> <genetic_map_file> <reference_file> <sample_map_file>

If you run the script as it is in demo.ipynb it won't work. But if you change the order as stated in the README.md, it will work.

Thanks for your attention!

"Probabilities are not non-negative" error when training the model from scratch

Hi, I was trying to follow the demo and train the model from scratch for 1000G data ( after subsetting 1000 samples) and I get the following error:

Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Traceback (most recent call last):
  File "gnomix.py", line 392, in <module>
    simulate_splits(base_args, config, data_path) # will create the simulation_output folder
  File "gnomix.py", line 298, in simulate_splits
    return_out=False)
  File "/data/pshakya/COLORPBWT/gnomix/src/laidataset.py", line 410, in simulate
    maternal = admix(founders,founders_weight,gens[i],self.breakpoint_prob,self.num_snps,self.morgans)
  File "/data/pshakya/COLORPBWT/gnomix/src/laidataset.py", line 159, in admix
    p=breakpoint_probability)
  File "mtrand.pyx", line 931, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative

Any ideas on how to fix this ?

Error while training

I have been testing gnomix in training mode, and got the following output:

...
--------------------------------------------------------------------------------
-----------------------------------  Gnomix  -----------------------------------
--------------------------------------------------------------------------------
When using this software, please cite: 
Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante, 
Daniel Mas Montserrat, Alexander G Ioannidis: 
"High Resolution Ancestry Deconvolution for Next Generation Genomic Data" 
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "gnomix.py", line 358, in <module>
    config = yaml.load(file, Loader=yaml.UnsafeLoader)
  File "/cm/shared/apps/anaconda3/2021.05/envs/amarin37/lib/python3.7/site-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/cm/shared/apps/anaconda3/2021.05/envs/amarin37/lib/python3.7/site-packages/yaml/constructor.py", line 41, in get_single_data
    node = self.get_single_node()
  File "/cm/shared/apps/anaconda3/2021.05/envs/amarin37/lib/python3.7/site-packages/yaml/composer.py", line 35, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/cm/shared/apps/anaconda3/2021.05/envs/amarin37/lib/python3.7/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/cm/shared/apps/anaconda3/2021.05/envs/amarin37/lib/python3.7/site-packages/yaml/parser.py", line 143, in parse_implicit_document_start
    StreamEndToken):
  File "/cm/shared/apps/anaconda3/2021.05/envs/amarin37/lib/python3.7/site-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/cm/shared/apps/anaconda3/2021.05/envs/amarin37/lib/python3.7/site-packages/yaml/scanner.py", line 260, in fetch_more_tokens
    self.get_mark())
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "/path/training.smap", line 1, column 8

Which seems really weird to me. Indeed, the training.smap file does not contain a '\t' string, and it clearly doesn't have more than two columns.

AttributeError: 'LogisticRegressionBase' object has no attribute 'vectorize'

I'm getting stuck with the error "AttributeError: 'LogisticRegressionBase' object has no attribute 'vectorize'".

Here is my command and the output:

gnomix % python gnomix.py ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz demo/data/allchrs.b37.gmap chr_22 22 False pretrained_gnomix_models/chr22/model_chm_22.pkl

--------------------------------------------------------------------------------
-----------------------------------  Gnomix  -----------------------------------
--------------------------------------------------------------------------------
When using this software, please cite: 
Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante, 
Daniel Mas Montserrat, Alexander G Ioannidis: 
"High Resolution Ancestry Deconvolution for Next Generation Genomic Data" 
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Launching in pre-trained mode...
Loading model...
Launching inference...
Loading and processing query file...
- Number of SNPs from model: 317408
- Number of SNPs from file: 1103547
- Number of intersecting SNPs: 317361
- Percentage of model SNPs covered by query file: 99.99%
- Found 8 (0.0025%) different reference variants. Adjusting...
Inferring ancestry on query data...
Traceback (most recent call last):
  File "gnomix.py", line 372, in <module>
    run_inference(base_args, model, 
  File "gnomix.py", line 53, in run_inference
    B_query = model.base.predict_proba(X_query)
  File "src/Base/base.py", line 136, in predict_proba
    if self.vectorize:
AttributeError: 'LogisticRegressionBase' object has no attribute 'vectorize'

And here are the details of my configuration

Macintosh OS 11.6 "Big Sur"
python 3.9.7

matplotlib 3.4.3
numpy 1.21.3
pandas 1.3.4
PyYAML 6.0
scikit-allel 1.3.5
scikit-learn 0.23.2
scipy 1.7.1
seaborn 0.11.2
sklearn 0.0
sklearn-crfsuite 0.3.6
tqdm 4.62.3
uncertainty-calibration 0.0.9
xgboost 1.5.0

Thanks for any help.
Steve

Gnofix standalone functionality

Hello! I am very interested in trying out your Gnofix phasing correction method on existing local ancestry calls (in RFMix2 format), but I saw in another closed issue that the standalone functionality is disabled for now (#15). Are there plans to enable this functionality in the near future?

Preserve header and add ALT allele to query_file_phased.vcf

Feature request:

Hi all, I have two feature requests,

  1. The first is if it would be possible to preserve the header of the input VCF file in the phasing error correction file?
  2. The current column output for the query_file_phased.vcf is:
    POS ID REF VAR QUAL FILTER INFO FORMAT

with all the values under VAR as NA

would it be possible to change it to
POS ID REF ALT QUAL FILTER INFO FORMAT

and report the alternative allele?

This would make it easier for downstream analysis with plink, bcftools, etc.

Thanks!

Gnofix

Hi! Thank you for open sourcing this great code base. I was interested in trying out your phasing correction method, gnofix, but I noticed that it has been disabled / commented out. Do you have plans to release this or are you working on different approaches? Thank you!
