Coder Social home page Coder Social logo

drifttest-evaluation's Introduction

########################
# drifttest-Evaluation #
########################

drifttest-Evaluation is a set of R scripts to evaluate the performance of the temporal genome scan implemented in drifttest (doi:10.5281/zenodo.1194663). It simulates (via SLiM 1.8) population genetic data from a population sampled twice at two different times and runs drifttest on the simulated sample.

These scripts are distributed for replicability of the results from the work "Power and limits of selection genome scans on temporal data from a selfing population" by Miguel Navascués, Arnaud Becheler, Laurène Gay, Joëlle Ronfort, Karine Loridon and Renaud Vitalis doi:10.1101/2020.05.06.080895. However, note that these scripts use SLiM version 1.8. More recent versions of SLiM (>=2.0) include a scripting language that allows to do, in a more efficient way, some of the simulation features used in drifttest-Evoluation. For uses other than the replication of results from our article we strongly recommend using the latest version of SLiM.

Requirements
============

* Executable file for SLiM 1.8
* Executable file for drifttest (available at Zenodo; doi:10.5281/zenodo.1194663)
* R packages: batch, qqman, qvalue

How to run
==========

To run with the default options:

$ R --no-save < main.R

In order to run with other options use flag --args plus name_variable value_variable:

$ R --no-save --args simID "scenario1" sigma 0.5 < main.R


Options
=======

seed4random (default 1623656) seed for random number generator
$ R --no-save --args seed4random 1623656 < main.R

simID (default "test") name for the simulation, folder output takes this name
$ R --no-save --args simID "test" < main.R

sigma (default 0.95) selfing rate
$ R --no-save --args sigma 0.95 < main.R

u (default 1e-8) mutation rate per bp
$ R --no-save --args u 1e-8 < main.R

r (default 1e-8) recombination rate per bp
$ R --no-save --args r 1e-8 < main.R

genome_length (default 5e8) genome size in bp
$ R --no-save --args genome_length 5e8 < main.R

N (default 500) population size  in number of diploid individuals
$ R --no-save --args N 500 < main.R

sel_coef (default 0.5) selection coefficient of the beneficial allele
$ R --no-save --args sel_coef 9.5 < main.R

dominance_coef (default 1) dominance coefficient of the beneficial  allele
$ R --no-save --args dominance_coef 1 < main.R

number_of_times (default 20) number_of_times x N are the number of generations run to assure mutation drift equilibrium before sampling the population
$ R --no-save --args number_of_times 20 < main.R

selection_mode (default "NM"; possible values "NM", "SV") type of selective scenario: adaptation from new mutation, "NM", or from standing variation "SV"
$ R --no-save --args selection_mode "SV" < main.R

initial_frequency (default 0.8) Option valid only for selection_mode="SV". Initial allele frequency of the beneficial  allele. Values <=0 or >=1 indicate that an initial allele frequency is not present and a random locus is chosen among all polymorphic sites.
$ R --no-save --args selection_mode "SV" initial_frequency -9 < main.R

fixed_advantageous_allele (default "derived"; possible values "derived", "ancestral") this variable indicates which allele of the locus under selection is beneficial. This option is valid only for selection_mode="SV" and 0<initial_frequency<1.
$ R --no-save --args selection_mode "SV" initial_frequency 0.5 fixed_advantageous_allele "ancestral" < main.R

selection_period_duration (default 25) Number of generations between the two samples
$ R --no-save --args selection_period_duration 8 < main.R

sample_size (default "c(50,30)") Number of diploid individuals sampled at each of the two time samples.
$ R --no-save --args sample_size "c(100,100)" < main.R

sample_size_loci (default 10000) Number of loci genotyped in the sample
$ R --no-save --args sample_size_loci 3000 < main.R

MAF_threshold (default 0.05) Threshold for filtering loci with lower global minor allele frequency (MAF)
$ R --no-save --args MAF_threshold 0.01 < main.R

number_of_replicates (default 2) Number of times the simulaed scenario will be replicated
$ R --no-save --args number_of_replicates 100 < main.R

quiet (default "c(F)"; possible values "c(FALSE)", "c(TRUE)") Set to true for less progress messages
$ R --no-save --args quiet "c(T)" < main.R

remove_files (default "c(F)"; possible values "c(FALSE)", "c(TRUE)") Set to true to automatically remove some intermediate files, such as the input file for drifttest
$ R --no-save --args remove_files "c(T)" < main.R

num_of_threads (default 48) number of cores to use (by drifttest) if parallel computing is possible 
$ R --no-save --args num_of_threads 1 < main.R


Output
======

Output files are saved in folder Results/simID/ for summary results over all replicates, and Results/simID/replictaes for results from each replicate.

File Results/simID/simID_params.RData is a R binary file with contains the values of all the options used to run the simulations and analysis.

File Results/simID/simID_log.txt is a text file with some log information about the running of the simulations and analysis.

File Results/simID/simID_results.RData is a R binary file with contains the following objects:
* data.frame fst_with_distance contains estimates of Fst (and Ne) from loci at windows at increasinf distances from the locus under selection
* data.frame results contains, for each replicate, estimates of Fst, Fis and Ne from all loci, estimates of Fst and Ne from loci in the neutralk chromosome and estimates of Fst and Ne from loci in the selected chromosome
* data.frame results_per_locus with results for each locus for every replicate, including estimate of Fst, p-value, q-value, distance to the locus under selection (if in the same chromosome, in bp and in centimorgan), selective category (neutral or not), and whether it is among the 50 top SNPs (i.e. the 50 SNPs with lower p-value among their replicate).
* data.frame ROC with the values for the Receiver operating characteristic curve (FPR-false positive rate vs. TPR true positive rate for the whole range of possible significance thresholds)
* matrix SelectionFootprint_thresholdPvalue0.001 with estimate of positive rate in function of distance to the locus under selection when using a criterium for significance a threshold of 0.001 for the p-value
* matrix SelectionFootprint_thresholdqvalue0.05 with estimate of positive rate in function of distance to the locus under selection when using a criterium for significance a threshold of 0.05 for the q-value
* matrix SelectionFootprint_top50Pvalue with estimate of positive rate in function of distance to the locus under selection when using a criterium for significance being among the 50 loci with lower p-value
* num FPR_thresholdPvalue0.001 with estimate of false positive rate when using a criterium for significance a threshold of 0.001 for the p-value
* num FPR_thresholdqvalue0.05 with estimate of false positive rate when using a criterium for significance a threshold of 0.05 for the q-value
* num FPT_top50Pvalue with estimate of false positive rate when using a criterium for significance being among the 50 loci with lower p-value
* num POWER_thresholdPvalue0.001 with estimate of false positive rate when using a criterium for significance a threshold of 0.001 for the p-value
* num POWER_thresholdqvalue0.05 with estimate of false positive rate when using a criterium for significance a threshold of 0.05 for the q-value
* num POWER_top50Pvalue with estimate of false positive rate when using a criterium for significance being among the 50 loci with lower p-value

Files Results/simID/*.pdf, plots made from the values in file simID_results.RData.

File Results/simID/populationATequilibrium is a text file poroiduced by SLiM that contains a snapshot of the population state after 20N generations. It can be used to repeat the analysis without having to repeat the initial simulation phase to reach equilibrium.


File Results/simID/replicates/simID_i_data.RData (where i is a number that identifies the replicate) is a R binary file that contains data for the sampled loci and individuals in the two sampling generations. It contains the follwing objects:
* data.frame m2 contains information on the mutation under selection (position x, selection coefficient s, dominance coefficient h, time of appareacne, number of copies at time 1 n_pop.t1 and at time 2 n_pop.t2)
* list SNP_data contains: data.frame SNP_table with the same information as m2 for all loci in the sample (including the locus under selection); matrix genpotype_data with genotypes coded as required by drifttest input file (first column an individual identifier, subsequent columns number of copies of the reference allele for each locus); matrices sampled_haplotypes_1 and sampled_haplotypes_2 contain the first and seconf haplotypes of the genome of each individual in the sample.
* num trajectory with number of copies fo the allele under selection at each generation of the selection period

File Results/simID/replicates/simID_i_whole_pop.RData (where i is a number that identifies the replicate) is a R binary file that contains data for all loci and all individuals in the two sampling generations. It contains the follwing objects:
* data.frame m2 contains information on the mutation under selection (position x, selection coefficient s, dominance coefficient h, time of appareacne, number of copies at time 1 n_pop.t1 and at time 2 n_pop.t2)
* list whole_pop_data contains: data.frame SNP_table with the same information as m2 for all polymorphic loci in the population (including the locus under selection); matrix genpotype_data with genotypes for all individuals of the population coded as required by drifttest input file (first column an individual identifier, subsequent columns number of copies of the reference allele for each locus); matrices sampled_haplotypes_1 and sampled_haplotypes_2 contain the first and seconf haplotypes of the genome of each individual in the population.

Input and output files for drifttest: simID_i_drifttest_input, simID_i_drifttest_out_multilocus, simID_i_drifttest_out_locus_by_locus

drifttest-evaluation's People

Contributors

mnavascues avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.