
gross's Introduction

GRoSS

Graph-aware Retrieval of Selective Sweeps

Here, I introduce a method to detect selective sweeps across the genome when using many populations that are related via a complex admixture graph. I made some slight modifications to the Q_B statistic from Racimo, Berg and Pickrell (2018), which was originally meant to detect polygenic adaptation using admixture graphs (see https://github.com/FerRacimo/PolyGraph). The new statistic, which I call S_B, does not need GWAS data and works with allele frequency data alone. It can be used both to scan the genome for regions under strong single-locus positive selection and to pinpoint where in the graph the selective event most likely took place. See the file detecting-single-locus.pdf for an explanation of how the statistic works.

If you end up using it, please cite the following paper: https://genome.cshlp.org/content/29/9/1506

Free preprint version: https://www.biorxiv.org/content/10.1101/453092v2

Required R Libraries

install.packages("msm")
install.packages("reshape2")
install.packages("pscl")
# "parallel" ships with base R and does not need to be installed
install.packages("data.table")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("devtools")
install.packages("readr")
install.packages("qqman")
install.packages("optparse")
install.packages("admixturegraph",repos=unique(c(getOption("repos"),repos="https://cran.microsoft.com/snapshot/2019-04-01/")))
devtools::install_github("mailund/graphparse")
# biomaRt is installed from Bioconductor via BiocManager (biocLite is deprecated)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt")

Examples

As a visual example of what GRoSS can do, here are two graphs showing -log10(p-values) for the S_B statistic computed on SNPs along the genome. The first graph was made using populations from the 1000 Genomes Project Phase 3. The second graph was made using populations from Lazaridis et al. (2014), after imputation on the 1000 Genomes data.

Running GRoSS

The main R script is GRoSS.R; an example command for generating results like those above is given further below. The script requires the user to specify:

  • an input file (*.txt) specified with the -e option
    • The first column of this file is the chromosome name, the second column is the position, and the third column is a SNP identifier. The identifier can simply be "[chromosome]-[position]" if no SNP ID is available, but it must be unique to each SNP. All the other columns contain the number of reference and alternative alleles in each population, separated by a comma (e.g. 5,8 if there are 5 reference alleles and 8 alternative alleles in a given population at a given SNP).
    • NOTE: GRoSS will ignore any line where one or more population panels have missing data ("0,0").
  • a graph file describing the topology of the graph. This can be:
    • in the same format as the graph file that is used as input for qpGraph (Patterson et al. 2012) (note that the value of the fitted admixture weights must be correctly stated), specified with the -r option OR
    • in dotfile format (outputted from qpGraph after fitting), specified with the -d option
    • NOTE: do not add comments (lines starting with "#") to either of these files
  • an output file on which to write the results, specified with the -o option
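The input-file rules above (unique SNP identifiers, comma-separated allele counts, "0,0" for missing data) can be checked before running GRoSS. The following is only a rough sketch and is not part of GRoSS: it assumes whitespace-separated columns and a single header line, and the file names and population names (POP1 to POP3) are made up for illustration.

```shell
# Toy input file. Assumptions: whitespace-separated columns, one header line,
# and hypothetical population names POP1..POP3.
cat > toy_popfile.txt <<'EOF'
CHROM POS SNPID POP1 POP2 POP3
22 16050075 22-16050075 5,8 10,2 7,7
22 16050115 22-16050115 0,0 12,0 3,11
22 16050213 22-16050213 4,4 6,6 9,1
22 16050319 22-16050213 2,6 8,4 5,5
EOF

# Report SNP identifiers (column 3) that occur more than once.
# Each duplicated identifier is printed a single time.
awk 'NR > 1 && seen[$3]++ == 1 { print $3 }' toy_popfile.txt > duplicate_ids.txt

# List SNPs with missing data ("0,0") in at least one population column;
# per the note above, GRoSS would ignore these lines.
awk 'NR > 1 { for (i = 4; i <= NF; i++) if ($i == "0,0") { print $3; next } }' \
    toy_popfile.txt > missing_data_ids.txt
```

With this toy file, duplicate_ids.txt contains 22-16050213 and missing_data_ids.txt contains 22-16050115.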

We will use the same 1000 Genomes populations as in the example above, but limiting ourselves to chr22. First, unpack the input file KG_popfile_chr22.txt.gz.

gunzip -c KG_popfile_chr22.txt.gz > KG_popfile_chr22.txt

Then, run the program:

Rscript GRoSS.R -e KG_popfile_chr22.txt -r 1KG_MSL_ESN_CDX_JPT_CEU_TSI_CHB.graph -o SNPstat_1KG_MSL_ESN_CDX_JPT_CEU_TSI_CHB_chr22.tsv

The output contains the chi-squared statistics and the corresponding p-values for each branch of the admixture graph.

The Plotting_and_Windowing.txt file contains some R commands to optionally merge SNPs into windows and plot the output.

For an example of how to create a graph file with admixture, see: https://github.com/DReichLab/AdmixTools/blob/master/examples.qpGraph/gr1x

A guide for converting a VCF file into a GRoSS input file

If you have a VCF file and would like to convert it to a GRoSS input file (e.g. as in KG_popfile_chr22.txt), then you can use this handy guide created by Gabriel Renaud: https://github.com/FerRacimo/GRoSS/blob/master/VCFtoGRoSS.md

New feature: correction for low sample sizes

We've recently implemented a modified version of the Q_S statistic for cases of low sample sizes. A full description of it can be found here. The modified statistic uses a Normal approximation to the binomial distribution that accounts for the increased variance in sample allele frequencies, relative to population allele frequencies, as a consequence of finite sample sizes. We recommend using this feature over the "vanilla" version of GRoSS, especially when the number of (diploid) individuals in at least one of the populations is less than 20. It can be run with the "-s" option:

Rscript GRoSS.R -e KG_popfile_chr22.txt -r 1KG_MSL_ESN_CDX_JPT_CEU_TSI_CHB.graph -o SNPstat_1KG_MSL_ESN_CDX_JPT_CEU_TSI_CHB_chr22.tsv -s
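To see why small samples matter, it may help to look at the textbook binomial sampling variance of a sample allele frequency, p(1-p)/n, where n is the number of sampled chromosomes. The sketch below is only an illustration of that quantity, not the correction GRoSS itself applies. The counts are made up, and treating the second number of each "ref,alt" pair as the allele whose frequency is tracked is an assumption.

```shell
# Each line: a label and "ref,alt" allele counts for one population at one SNP.
# The counts are made up; both samples have the same sample allele frequency.
printf '%s\n' 'small 5,8' 'large 50,80' |
awk '{
    split($2, c, ",")   # c[1] = reference count, c[2] = alternative count
    n = c[1] + c[2]     # number of sampled chromosomes
    p = c[2] / n        # sample frequency of the alternative allele (assumption)
    # Binomial sampling variance of the sample frequency: p(1-p)/n
    printf "%s n=%d p=%.4f var=%.5f\n", $1, n, p, p * (1 - p) / n
}' > sampling_variance.txt
cat sampling_variance.txt
```

This prints "small n=13 p=0.6154 var=0.01821" and "large n=130 p=0.6154 var=0.00182": the same frequency estimate is ten times noisier in the sample that is ten times smaller, which is the kind of extra variance the low-sample-size option is meant to account for.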

gross's Issues

Can we interpret significant gross result as "natural selection favored derived allele"?

Hi,
I notice that the GRoSS result is derived from chi-squared statistics and is non-negative, so I'm a bit puzzled about the direction of the detected positive selection. For a SNP with a significant GRoSS result, can we state that the derived allele of this SNP was favored by natural selection, and that its frequency was increasing along the tested branch? Thanks for your help.

vcf file conversion

Is there a way to convert vcf files to the format required for the GRoSS package? Thank you!

Dependence on Mailund's matchbox

Hi,
I would like to use GRoSS, but I am not able to install matchbox (which is among the dependencies). Thomas Mailund says that he knows about the problem but can't fix it now. Is there some way to get around the use of matchbox?
