popsim-consortium / stdpopsim Goto Github PK

View Code? Open in Web Editor NEW

121.0 121.0 86.0 3.4 MB

A library of standard population genetic models

License: GNU General Public License v3.0

Python 99.25% Jupyter Notebook 0.75%

stdpopsim's People

Stargazers

Watchers

Forkers

pmckenz1 mcveanlab andrewkern lohmuellerlab fbaumdicker jgallowa07 carjed luntergroup awohns ferracimo ivan-krukov agladstein grahamgower gtsambos chris1221 noscode arundurvasula vsbuffalo ndukler schnappi-wkl patience111 gibsonmatt iechevarriaz elifesciences-publications jeffspence petrelharp chookity-pokk mtomasini hyanwong apragsdale tatdattran hsiehph dduchen lindaongaro jeromekelleher tsackton lauterbur peterdfields igronau manolofperez freitas-lucas mrtnj attrna janaobsteter pbeerli evolgen brandonlind jeanrjc gregorgorjanc vitorpavinato aays ausseriridische02 percyfal vcaudill jradrion stickle-back-in-time viabard alipearson mysterioussonofgod izabelcavassim pa-gagnaire dani-davenport aabiddanda leonardoiasi bitarellolab ibreider88 xin-huang saurabhbelsare silastittes mufernando rwaples nspope kellenriall haono lntran26 kuhlwilm chriscrsmith ornobalam aaronruben minesrebollo ryangutenkunst a-ignatieva christianhuber ckyriazis jiseonmin

stdpopsim's Issues

Genome-wide recombination rate for `genome` class

Add a genome-wide recombination rate to the genome class. I'm happy to implement this today. The only question is, should it be created and stored as an attribute at the instantiation of a genome object, or, should there simply be a method which computes this when needed.

Add population metadata

Tskit allows arbitrary metadata to be assocated with populations. Msprime 0.7.1 just added the facility to use metdata with PopulationConfiguration objects, so we now can add some descriptive stucture to each population. This should really help with debugging.

I suggest we add something simple like, e.g.

metadata={
    "name": "AFR", 
    "description": "The contemporary African population"
}

Any thoughts @ndukler, @andrewkern?

Tracking list of published demographic models to include

Working on a big PR with several additional demographic models from the literature (mostly in human). Opening this issue to track progress. Here's the list I'm working on so far:

Human:

Gazave et al., 2014
Chen et al., 2015 (basically a revised Gazave model extended to include African population)
Browning et al., 2018 (admixture in American populations)
Ragsdale & Gravel, 2018 (modified Gutenkunst 3-pop model + Neanderthal and ghost archaic branches)
Gao & Keinan, 2016 (maybe) (this models recent growth in Europeans as faster-than-exponential, but I'm not sure msprime can handle this)

Chimp:

de Manuel et al., 2016 (4 chimp populations + bonobo outgroup)

Mutation rates/recombination rates required for humans

In homo_sapiens.py we have the length of each chromosome and placeholders for mutation and recombination rate data. Ideally, we'd like to have a mean rate that has been estimated for each chromosome. I'm not sure where we'd get this information for mutation rates.

Equality test for models

We want a way to compare two different stdpopsim demographic models for equality, as discussed in #12.

'recombination map not found' error for drosophila and arabidopsis models

I've been trying to run the Arabidopsis model with the snakemake pipeline and have been getting the error:

UserWarning: Warning: recombination map not found for chromosome: 'chr5' on map: 'Salome2012', substituting a zero-recombination map.

even though I have no problem downloading the maps otherwise. I noticed this error also occurs with the Drosophila model but not the human or e. coli models.

Are others able to reproduce this error or is it something on my end? I'm wondering if it may be caused by some of the recent updates to recombination map behavior since I did not run into this error previously.

General question about dev.

@jeromekelleher Is there an easy way to do the flake8 tests on code before it's committed?
I know that in .git/hooks/ there is a pre-commit.sample but simply changing the name doesn't achieve the same testing as circleCI.

I think we need something along the line of exec flake8 --max-line-length 89 ../../stdpopsim ../../tests within this executable, but I'm not sure.

could you maybe give a short explanation on how to do the CircleCI tests before committing?

Add --years option to CLI

Add an option to the CLI to rescale time into years in the output tree sequence using the scaling factors given in #73 and #74.

This is probably easiest done by dumping the tables and multiplying all the times by this factor.

If this option is used, there should be a reference printed out saying where the generation_time estimate came from and encouraging users to cite it.

Tennessen model QC

Just opening this up as a forum for the QC of the Tennessen two population OOA model as per the earlier discussion.

Output masking feature

On the conference call today, we decided to implement a feature to mask the tree sequences that we simulate from genetic maps.

Provide an option for the user to specify a set of genomic intervals (as a bed file, I guess?), which we "delete" (see below) from the simulated tree sequence.
By default, when we simulate from a genetic map we compute a simple mask based on regions of zero recombination within this map. We will derive some heuristics for this which may or may not be organism/chromosome specific.

A planned feature for the next version of stdpopsim would be to curate a set of masks and provide methods to download and cache these transparently, like we do for genetic maps.

The 'delete' (or maybe 'dice' - we're still discussing what it should be called) operation will take the tree sequence simulated under the raw recombination map, and remove all topology and site data from the masked regions. Therefore, any statistics computed on this will not be affected by the zero recombination rate sections. Anything plotted against genome coordinates will show big gaps (which is what we want, right?). This is being developed at tskit-dev/tskit#261

Change humans default generation time to 27 years

Currently 25, but

I've taken 27 years as a better default (reflecting long-term averages, maybe) for humans:
http://dx.doi.org/10.1002/ajpa.20188I've taken 27 years as a better default (reflecting long-term averages, maybe) for humans:
http://dx.doi.org/10.1002/ajpa.20188

Name abbreviation scheme

As discussed in the meeting on June 04, part of the clunkiness of the CLI can be attributed to the lack of concise names. It would be nice to have a reasonably consistent scheme for describing what things are.

Things we need to name:

Species

Should be easy: h_sap, d_mel etc? There must be well established notation for this somewhere that we can borrow?

Population models

It would be nice for the abbreviation to describe what the model is. Perhaps the first part would be the number of populations and the second some memorable abbreviation (3pop_ooa == Gutenkunst et al, 2pop_ooa == Tennyson)? Then perhaps if we have multiple 3pop_ooa models we could add a date of publication (3pop_ooa_2009), defaulting the shorter name to be synonymous with the more recent model?

Maybe the 'pop' bit in here is redundant: would ooa_3 be better? I.e., first part is a descriptive abbreviation and the second part is the number of populations?

Recombination maps

We could abbreviate HapMapIIGrch37 to hm_37?

License is actually GPL

The license for the package is currently marked as MIT, but after thinking about this a bit I think it's not correct. The stdpopsim package imports msprime, which is GPL, and stdpopsim is therefore also GPL. This is annoying, but can't be helped.

@molpopgen, do you agree with this?

Developing a standard demographic model documentation format

To make model documentation easier I would like to standardize the format. The current sections I am thinking of are:

Model description: describes general scenario
Population indices: lists populations in model and their indices
Source paper citation: provides a citation of the key paper for the model

Are there any other sections that we should have? Do we want to go into more detail about the model construction, etc. ?

QC for fly models

Hey @ndukler-- here are the paper's associated with the fly models:

Sheehan and Song:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004845

Li and Stephan:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020166

The Li and Stephan model is a two population case. In that case index 0 is the African population and index 1 is the european popn

Add Appveyor tests

API documentation broken?

Guys does the API documentation look broken to you? @ndukler @jeromekelleher

https://stdpopsim.readthedocs.io/en/latest/

It doesn't seem to be rendering correctly for some reason.

Improve rebasing documentation

Need more detail on how to do this.

QC for Browning et al., 2018

I'll try to get this done this weekend

Species.genome not rendering correctly on RTD

The documentation for the species.genome variable is supposed to give details abut the variable, but is giving details about the class instead. This is possibly due to RTD using an older version of sphinx.

One easy workaround is to put the genome documentation directly into the RST file rather than as source comments (probably less brittle as well, given the non-standard #: comment format required).

Output SLiM code for population models

The population models are current defined in terms of the msprime APIs for defining populations, migration matrices and demographic events. Ideally, we would like to make these apply to SLiM also, so that we don't have to write the population models twice. Is it possible, even in principle, to output fragments of Eidos code which would implement the intended demography?

Concretely, we'd have something like

import string
import stdpopsim
from stdpopsim import homo_sapiens

chrom = homo_sapiens.genome.chromosomes["chr22"]
model = homo_sapiens.GutenkunstThreePopOutOfAfrica()

slim_template = string.Template("""
initialize() {
    initializeTreeSeq();
    initializeMutationRate(0);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 1e8-1);
    initializeRecombinationRate($recombination_rate);
}

$demographic_model

5000late() {
     sim.treeSeqOutput("ex1_TS.trees");
}""")
slim_program = slim_template.substitute(
    recombination_rate=chrom.mean_recombination_rate,
    demographic_model=model.as_slim())
ts = pyslim.run(slim_program)

Any thoughts @petrelharp, @BenHaller?

QC for Sheehan-Song Three Epoch Model (Drosophila)

When you have a chance can you just throw in the paper of reference and the population id indices for the production msprime implementation @andrewkern. Thanks!

surgery branch?

I see we have a second branch in the repo called surgery. do we still need this?

Generic models

#105 adds a ZigZag model to the list of human models. But, it's really a generic model. @andrewkern had the elegant idea that we could define some mixin models that we can combine with basic parameters for various organisms to get basic parameters for generation time, Ne etc. This seems feasible.

I suggest we define two more generic models so that we're not generalising on the basis of one case. Can we get some suggestions for what these should be please?

Allow ``species.genome["chr22"]``?

Currently we use

from stdpopsim import homo_sapiens
chrom = homo_sapiens.genome.chromosomes["chr22"]

to access a particular chromosome for a species. This is a bit clunky. We would set it up so that

from stdpopsim import homo_sapiens
chrom = homo_sapiens.genome["chr22"]

does the same thing (but keep the .chromosomes as well, for clarity. We could also add synonyms for the chromosomes, such that homo_sapiens.genome["chr22"] and homo_sapiens.genome["22"] return the same thing.

Any thoughts?

QC for Lapierre constant model (E. coli)

I'll handle this really quickly :)

Generation times as an attribute of model class

I raised this issue over in analysis, but I probably should have posted it here. Andy and I noticed that generation times might be better implemented as a class attribute for each of the models. Currently these times are hard-coded into some of the models, however they are provided to some of the n_t analysis programs as user-specified arguments, leading to potential misspecification issues.

recombination map doing the right thing?

trying to get to the bottom of some generic model issues and ran in to the following case.

Why am I getting the same number of trees out here when the maps are so different? @jeromekelleher @petrelharp

"""
test of genetic maps
"""
import stdpopsim as stp
import msprime as msp

seed = 12345
sample_size = 10
population = 0
mod = stp.homo_sapiens.GenericConstantSize()
chrStr = "chr20"
chrom = stp.homo_sapiens.genome.chromosomes[chrStr]
samples = [msp.Sample(population=population, time=0)] * sample_size
map1 = chrom.recombination_map()
ts1 = msp.simulate(
        samples=samples,
        recombination_map=map1,
        mutation_rate=chrom.default_mutation_rate,
        random_seed=seed,
        **mod.asdict())

map2 = msp.RecombinationMap.uniform_map(chrom.length,
                                        chrom.default_recombination_rate)
ts2 = msp.simulate(
        samples=samples,
        recombination_map=map2,
        mutation_rate=chrom.default_mutation_rate,
        random_seed=seed,
        **mod.asdict())
print(map1.get_rates()[0:10])
print(map2.get_rates()[0:10])

print(ts1.num_trees)
print(ts2.num_trees)

# returns
# [0.0, 7.357160000000001e-09, 7.34243e-09, 7.34147e-09, 7.34513e-09, 7.42335e-09, # 7.42618e-09, 7.46463e-09, 7.19975e-08, 7.220007e-08]
# [1.7178269031631664e-08, 0.0]
# 
# 96538
# 96538

Add Travis CI tests

Where to locate standard simulation scripts (not the models)?

If I understand the repo layout correctly,

stdpopsim holds models that can be used for generating simulations.
The example scripts in top-level folder are example scripts and don't create standardized simulations (which is why they don't dump the tree sequences at the end)

Where do the scripts for generating standard simulations belong? In stdpopsim or analysis?

A. thalina QC

Does anyone have QC code for the Arabidopsis models? I can also do it myself if there isn't any. As soon as the QC goes up I can add some documentation.

Formalise QC process

Opening this thread to get discussion going on how we can formalise the QC process. Once we've reached a consensus here we can add this to the developer documentation.

Here's a draft to get things going:

Developer A submits PR to add a new model. This must include basic unit tests and full documentation (i.e., the documentation that will be rendered on rtd). The code is checked for any obvious problems/style issues etc by maintainer M and merged when it meets these basic standards. The model is considered 'preliminary' and not linked from the online documentation and not included in the CLI.
Maintainer M creates an issue created tracking the QC for the model, and assigns a developer B (after looking for volunteers) to do a blind implementation of the model. M also creates an issue for the CLI implementation of the model.
Developer B creates a blind implementation of the model as PR within the unit testing module, and adds the automatic checking of this model for equality with the production model to the suite of unit tests. All being good, this PR is merged and the QC issue is closed.
Someone then makes a PR updating the CLI, checking that the documentation, citations etc all work properly, and adds the model to the list of documented models.

There's no mention of an analysis notebook here though: do we need then, or simply implementing the model blind enough?

Arbitration

When developers A and B disagree on the model implementation, the process is to

Try to hash out the details between them on the original issue thread
If this fails, contact the authors of the original publication to resolve ambiguities.

Refactor chromosome definitions

PR #45 adds the ability to reproducibly run standard simulations from the command line, which is a real win in terms of usability. During the process of working through this change, it's become clear that the way that genome/chromosome/genetic map information is laid out is really quite poor and we should refactor it.

One possibility I've tried out is to have a 'factory function' which would return a chromosome with the appropriate instance variables set. See a rough version here. Clearly we'd need something that would work for different species as well, but I guess it wouldn't be the end-of-the-world to define a chromosome_factory function in each of the species modules. So, from a users perspective, they call

chrom = homo_sapiens.chromosome_factory("chr22", "HapMapII_GRCh37")

and the returned object will have a recombination map created from the appropriate genetic map and the default mutation rate for this chromosome. If the user calls:

chrom = homo_sapiens.chromosome_factory("chr12")

They'll get back a chromosome object with the length appropriate for chr12, the default mutation rate and a flat recombination map with the default recombination rate.

One thing that turned out to be pretty useful for testing the CLI is to add an option to return a fraction of a chromosome. So, if we did

chrom = homo_sapiens.chromosome_factory("chr12", length_multiplier=0.1)

which will return a chromosome that's 1/10th of the length of chr12. I think this is generally a useful thing to be able to do --- sometimes there's really no point in simulating a whole chromosome.

Does this refactoring sound OK to everyone? Any thoughts on what else we'd need to include in the API, or a better naming scheme?

QC for Li-Stephans model (Drosophila)

When you have a chance can you just throw in the paper of reference and the population id indices for the production msprime implementation @andrewkern. Thanks!

Add Tennyson et. al model

Port the description from here. Should include documentation and analysis notebook.

Drosophila two population msprime simulations slow with recombination map

I've been trying to add drosophila for the two population analysis, however it seems that incorporating the recombination map into the msprime simulations is slowing it down a lot (ex. I have a simulation with 2 samples per population running and has been running for >5 days).

Is this expected or is there an error on my part? See zipped folder with a standalone script (dmel_two_pop_recomb_example.py) and scripts needed to run snakemake (edited to just run msprime).

dmel_two_pop_recomb_test.zip

Let me know if there is any other information that may be useful.

Update:
It took 5 to 6 days for the msprime simulations with two samples per-population to finish.

appdirs dependency not correctly defined?

I installed stdpopsim from my cloned repo and tried to import it but got the following error message:

ModuleNotFoundError: No module named 'appdirs'

I'm not really familiar with python but I'd imagine this indicates that something is missing in the dependency description for stdpopsim?

Documentation needed for GutenkunstThreePopOutOfAfrica

Need to fill out the documentation for the class, which is served here

"true" Ne

Over at tskit-dev/msprime#779 I have implemented a method that calculates the "true" Ne at a particular time - as the mean TMRCA - for comparison with methods that infer a single Ne (as opposed to a coal rate as a function of time). When that gets merged to msprime, shall we add them to @ckyriazis's plots?

Analysis notebook needed for GutenkunstThreePopOutOfAfrica model

We need some analysis of the GutenkunstThreePopOutOfAfrica model to verify that it is correctly implemented. This should be in the form of a Jupyter notebook (stored in a new directory, say notebooks). This can serve as a example for our model QC process.

spikes in the SFS when simulating with genetic map

Hey all--

@jradrion and I have been looking at the SFS produced by human models in an effort to debug stairwayplot a bit. We've noticed that simulating under a genetic map instead of a constant rates leads to strange spikes in the SFS.

Presumably these spikes are caused by low recombination regions of the simulated chromosome? As an empiricist, one would want to drop such regions from downstream analysis- should we consider doing the same here?

A minimum working example of this comparison can be found here. Run it a few times if you don't see a bigger spike in the genetic map case in the first run or two.

representative output looks like this

$ python testSFS.py
SFS with genetic map
[0.         0.41105418 0.15667424 0.10971423 0.10006515 0.03928301
 0.10844646 0.02769708 0.02285493 0.02421073]
SFS with constant rate
[0.         0.43990996 0.17313905 0.10304832 0.07146539 0.05412743
 0.04816482 0.03857653 0.0367551  0.03481339]

notice the freq=6 bin with the genetic map

thoughts?

include species.generation_time, in each module

While doing some analysis with stdpopsim I realized it might be nice (mostly for generalized code) to have default generation time associated with each model. Thoughts @jeromekelleher @andrewkern @petrelharp?

Add older human genetic maps

As discussed in meeting May 7th. Are you leading this one @andrewkern?

No `genetic_map_GRCh37_chrY.txt` in HapmapII_GRCh37_RecombinationHotspots.tar.gz

My lack of bioInformatics knowledge leads me to believe leaving out the Y chromosome is intentional. There also happens to be 3 files for the X Chromosome.

Either way,

180         map_file = os.path.join(self.map_cache_dir, self.file_pattern.format(name=name))
181         return msprime.RecombinationMap.read_hapmap(map_file)

in stdpopsim/genetic_maps heuristic won't quite work here (I think).

It will throw a FileNotFoundError if you do the following

import msprime
from stdpopsim import homo_sapiens

for chrom in homo_sapiens.genome.chromosomes:
    chrom = homo_sapiens.genome.chromosomes[chrom]
    chrom.recombination_map()

How can non-python programmers help?

I don't program in Python, and seems that this project is going to be Python based. Any recommendations on how I can help?

Change model.equals to 'verify' and raise an exception

Copied from #52.

@jeromekelleher As I kinda expected, my implementation of the model did not match the production version. Would it be a good idea to have a diff() function that would report model differences per epoch in a format similar to the DemographyDebugger() program? Otherwise I'm just comparing two DemographyDebugger() outputs which is doable, but feels clumsy.

What we could do is change the definition of Model.equals so that it raises an exception if models differ rather than returning true/false; this would at least say where the difference occurs). So, we'd change the name to check_equal, which would raise a ValueError with meaningful messages telling us which bits of the models differ. The existing tests can be rejigged to use this easily enough.

Would this be useful for the QC process @carjed, @ndukler?

Docs/CLI should list recombination maps in reverse time

If a user needs to use a recombination map usually what they want is the most recently defined one. So, documentation should list the maps so that the most recently defined map is listed first, and prominently display the year it was defined. This also goes for the help in the CLI, which should list available recombination maps with a short description of what they are, and their year.

Download/cache genetic maps

I've made some initial infrastructure to transparently download genetic maps on demand, and store them in a cache for future use. The code is here. The idea is that you define a subclass of the genetic map which gives a URL where it can downloaded. When you ask for a specific genetic map for a chromosome, it first checks the cache to see if the genetic map has already been downloaded. If so, you load the map directly from the cache. If not, you download the genetic map from its URL. I've implemented this for the HapMapII genetic map in humans, and it seems to work pretty well. In use, it looks like

genetic_map = stdpopsim.get_genetic_map("h_sapiens", "HapmapII_GRCh37")
recomb_map = genetic_map.get_chromosome_map("chr20")
ts = msprime.simulate(10, recombination_map=recomb_map)

There's some more work to be done in figuring out how this interacts with the chromosome definitions (which is currently clunky), but I think the basic infrastructure for downloading and managing the maps is good. It would be good to get some opinions on whether this is a good idea before I go any further with it: any opinions @popgensims/all?

The reason we need this sort of caching infrastructure is because the maps are too big to bundle with the code. The gzipped HapMap genetic map is 35M, which is already too much to bundle with a Python package. Multiply this by several different maps across multiple species and it's definitely way too much.

If we follow this approach, it might be worth thinking about putting all the maps that we use in one location --- this would surely be an easy thing to convince Amazon to store as a public dataset.

Ragsdale QC

Just checking @apragsdale , but N_0 is the population size of the archaic populations?

Renaming does not work properly if several threads are downloading maps concurrently.

The code for figuring out when to download a genetic map depends on checking for the corresponding files. So does substituting default zero maps, as introduced in #33. This means we are always returning zero maps at the moment.

(HT to @petrelharp here; I wouldn't have spotted this except for your warning!)