popsim-consortium / stdpopsim Goto Github PK
View Code? Open in Web Editor NEWA library of standard population genetic models
License: GNU General Public License v3.0
A library of standard population genetic models
License: GNU General Public License v3.0
Add a genome-wide recombination rate to the genome
class. I'm happy to implement this today. The only question is, should it be created and stored as an attribute at the instantiation of a genome
object, or, should there simply be a method which computes this when needed.
Tskit allows arbitrary metadata to be assocated with populations. Msprime 0.7.1 just added the facility to use metdata with PopulationConfiguration objects, so we now can add some descriptive stucture to each population. This should really help with debugging.
I suggest we add something simple like, e.g.
metadata={
"name": "AFR",
"description": "The contemporary African population"
}
Any thoughts @ndukler, @andrewkern?
Working on a big PR with several additional demographic models from the literature (mostly in human). Opening this issue to track progress. Here's the list I'm working on so far:
Human:
msprime
can handle this)Chimp:
In homo_sapiens.py we have the length of each chromosome and placeholders for mutation and recombination rate data. Ideally, we'd like to have a mean rate that has been estimated for each chromosome. I'm not sure where we'd get this information for mutation rates.
We want a way to compare two different stdpopsim demographic models for equality, as discussed in #12.
I've been trying to run the Arabidopsis model with the snakemake pipeline and have been getting the error:
UserWarning: Warning: recombination map not found for chromosome: 'chr5' on map: 'Salome2012', substituting a zero-recombination map.
even though I have no problem downloading the maps otherwise. I noticed this error also occurs with the Drosophila model but not the human or e. coli models.
Are others able to reproduce this error or is it something on my end? I'm wondering if it may be caused by some of the recent updates to recombination map behavior since I did not run into this error previously.
@jeromekelleher Is there an easy way to do the flake8 tests on code before it's committed?
I know that in .git/hooks/
there is a pre-commit.sample
but simply changing the name doesn't achieve the same testing as circleCI.
I think we need something along the line of exec flake8 --max-line-length 89 ../../stdpopsim ../../tests
within this executable, but I'm not sure.
could you maybe give a short explanation on how to do the CircleCI tests before committing?
Add an option to the CLI to rescale time into years in the output tree sequence using the scaling factors given in #73 and #74.
This is probably easiest done by dumping the tables and multiplying all the times by this factor.
If this option is used, there should be a reference printed out saying where the generation_time
estimate came from and encouraging users to cite it.
Just opening this up as a forum for the QC of the Tennessen two population OOA model as per the earlier discussion.
On the conference call today, we decided to implement a feature to mask the tree sequences that we simulate from genetic maps.
A planned feature for the next version of stdpopsim would be to curate a set of masks and provide methods to download and cache these transparently, like we do for genetic maps.
The 'delete' (or maybe 'dice' - we're still discussing what it should be called) operation will take the tree sequence simulated under the raw recombination map, and remove all topology and site data from the masked regions. Therefore, any statistics computed on this will not be affected by the zero recombination rate sections. Anything plotted against genome coordinates will show big gaps (which is what we want, right?). This is being developed at tskit-dev/tskit#261
Currently 25, but
I've taken 27 years as a better default (reflecting long-term averages, maybe) for humans:
http://dx.doi.org/10.1002/ajpa.20188I've taken 27 years as a better default (reflecting long-term averages, maybe) for humans:
http://dx.doi.org/10.1002/ajpa.20188
As discussed in the meeting on June 04, part of the clunkiness of the CLI can be attributed to the lack of concise names. It would be nice to have a reasonably consistent scheme for describing what things are.
Things we need to name:
Should be easy: h_sap
, d_mel
etc? There must be well established notation for this somewhere that we can borrow?
It would be nice for the abbreviation to describe what the model is. Perhaps the first part would be the number of populations and the second some memorable abbreviation (3pop_ooa == Gutenkunst et al, 2pop_ooa == Tennyson)? Then perhaps if we have multiple 3pop_ooa models we could add a date of publication (3pop_ooa_2009), defaulting the shorter name to be synonymous with the more recent model?
Maybe the 'pop' bit in here is redundant: would ooa_3 be better? I.e., first part is a descriptive abbreviation and the second part is the number of populations?
We could abbreviate HapMapIIGrch37 to hm_37
?
The license for the package is currently marked as MIT, but after thinking about this a bit I think it's not correct. The stdpopsim package imports msprime, which is GPL, and stdpopsim is therefore also GPL. This is annoying, but can't be helped.
@molpopgen, do you agree with this?
To make model documentation easier I would like to standardize the format. The current sections I am thinking of are:
Are there any other sections that we should have? Do we want to go into more detail about the model construction, etc. ?
Hey @ndukler-- here are the paper's associated with the fly models:
Sheehan and Song:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004845
Li and Stephan:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020166
The Li and Stephan model is a two population case. In that case index 0 is the African population and index 1 is the european popn
Guys does the API documentation look broken to you? @ndukler @jeromekelleher
https://stdpopsim.readthedocs.io/en/latest/
It doesn't seem to be rendering correctly for some reason.
Need more detail on how to do this.
I'll try to get this done this weekend
The documentation for the species.genome
variable is supposed to give details abut the variable, but is giving details about the class instead. This is possibly due to RTD using an older version of sphinx.
One easy workaround is to put the genome documentation directly into the RST file rather than as source comments (probably less brittle as well, given the non-standard #: comment format required).
The population models are current defined in terms of the msprime APIs for defining populations, migration matrices and demographic events. Ideally, we would like to make these apply to SLiM also, so that we don't have to write the population models twice. Is it possible, even in principle, to output fragments of Eidos code which would implement the intended demography?
Concretely, we'd have something like
import string
import stdpopsim
from stdpopsim import homo_sapiens
chrom = homo_sapiens.genome.chromosomes["chr22"]
model = homo_sapiens.GutenkunstThreePopOutOfAfrica()
slim_template = string.Template("""
initialize() {
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e8-1);
initializeRecombinationRate($recombination_rate);
}
$demographic_model
5000late() {
sim.treeSeqOutput("ex1_TS.trees");
}""")
slim_program = slim_template.substitute(
recombination_rate=chrom.mean_recombination_rate,
demographic_model=model.as_slim())
ts = pyslim.run(slim_program)
Any thoughts @petrelharp, @BenHaller?
When you have a chance can you just throw in the paper of reference and the population id indices for the production msprime implementation @andrewkern. Thanks!
I see we have a second branch in the repo called surgery
. do we still need this?
#105 adds a ZigZag model to the list of human models. But, it's really a generic model. @andrewkern had the elegant idea that we could define some mixin models that we can combine with basic parameters for various organisms to get basic parameters for generation time, Ne etc. This seems feasible.
I suggest we define two more generic models so that we're not generalising on the basis of one case. Can we get some suggestions for what these should be please?
Currently we use
from stdpopsim import homo_sapiens
chrom = homo_sapiens.genome.chromosomes["chr22"]
to access a particular chromosome for a species. This is a bit clunky. We would set it up so that
from stdpopsim import homo_sapiens
chrom = homo_sapiens.genome["chr22"]
does the same thing (but keep the .chromosomes
as well, for clarity. We could also add synonyms for the chromosomes, such that homo_sapiens.genome["chr22"]
and homo_sapiens.genome["22"]
return the same thing.
Any thoughts?
I'll handle this really quickly :)
I raised this issue over in analysis, but I probably should have posted it here. Andy and I noticed that generation times might be better implemented as a class attribute for each of the models. Currently these times are hard-coded into some of the models, however they are provided to some of the n_t analysis programs as user-specified arguments, leading to potential misspecification issues.
trying to get to the bottom of some generic model issues and ran in to the following case.
Why am I getting the same number of trees out here when the maps are so different? @jeromekelleher @petrelharp
"""
test of genetic maps
"""
import stdpopsim as stp
import msprime as msp
seed = 12345
sample_size = 10
population = 0
mod = stp.homo_sapiens.GenericConstantSize()
chrStr = "chr20"
chrom = stp.homo_sapiens.genome.chromosomes[chrStr]
samples = [msp.Sample(population=population, time=0)] * sample_size
map1 = chrom.recombination_map()
ts1 = msp.simulate(
samples=samples,
recombination_map=map1,
mutation_rate=chrom.default_mutation_rate,
random_seed=seed,
**mod.asdict())
map2 = msp.RecombinationMap.uniform_map(chrom.length,
chrom.default_recombination_rate)
ts2 = msp.simulate(
samples=samples,
recombination_map=map2,
mutation_rate=chrom.default_mutation_rate,
random_seed=seed,
**mod.asdict())
print(map1.get_rates()[0:10])
print(map2.get_rates()[0:10])
print(ts1.num_trees)
print(ts2.num_trees)
# returns
# [0.0, 7.357160000000001e-09, 7.34243e-09, 7.34147e-09, 7.34513e-09, 7.42335e-09, # 7.42618e-09, 7.46463e-09, 7.19975e-08, 7.220007e-08]
# [1.7178269031631664e-08, 0.0]
#
# 96538
# 96538
If I understand the repo layout correctly,
stdpopsim
holds models that can be used for generating simulations.Where do the scripts for generating standard simulations belong? In stdpopsim
or analysis
?
Does anyone have QC code for the Arabidopsis models? I can also do it myself if there isn't any. As soon as the QC goes up I can add some documentation.
Opening this thread to get discussion going on how we can formalise the QC process. Once we've reached a consensus here we can add this to the developer documentation.
Here's a draft to get things going:
There's no mention of an analysis notebook here though: do we need then, or simply implementing the model blind enough?
When developers A and B disagree on the model implementation, the process is to
PR #45 adds the ability to reproducibly run standard simulations from the command line, which is a real win in terms of usability. During the process of working through this change, it's become clear that the way that genome/chromosome/genetic map information is laid out is really quite poor and we should refactor it.
One possibility I've tried out is to have a 'factory function' which would return a chromosome with the appropriate instance variables set. See a rough version here. Clearly we'd need something that would work for different species as well, but I guess it wouldn't be the end-of-the-world to define a chromosome_factory
function in each of the species modules. So, from a users perspective, they call
chrom = homo_sapiens.chromosome_factory("chr22", "HapMapII_GRCh37")
and the returned object will have a recombination map created from the appropriate genetic map and the default mutation rate for this chromosome. If the user calls:
chrom = homo_sapiens.chromosome_factory("chr12")
They'll get back a chromosome object with the length appropriate for chr12, the default mutation rate and a flat recombination map with the default recombination rate.
One thing that turned out to be pretty useful for testing the CLI is to add an option to return a fraction of a chromosome. So, if we did
chrom = homo_sapiens.chromosome_factory("chr12", length_multiplier=0.1)
which will return a chromosome that's 1/10th of the length of chr12. I think this is generally a useful thing to be able to do --- sometimes there's really no point in simulating a whole chromosome.
Does this refactoring sound OK to everyone? Any thoughts on what else we'd need to include in the API, or a better naming scheme?
When you have a chance can you just throw in the paper of reference and the population id indices for the production msprime implementation @andrewkern. Thanks!
I've been trying to add drosophila for the two population analysis, however it seems that incorporating the recombination map into the msprime simulations is slowing it down a lot (ex. I have a simulation with 2 samples per population running and has been running for >5 days).
Is this expected or is there an error on my part? See zipped folder with a standalone script (dmel_two_pop_recomb_example.py) and scripts needed to run snakemake (edited to just run msprime).
Let me know if there is any other information that may be useful.
Update:
It took 5 to 6 days for the msprime simulations with two samples per-population to finish.
I installed stdpopsim from my cloned repo and tried to import it but got the following error message:
ModuleNotFoundError: No module named 'appdirs'
I'm not really familiar with python but I'd imagine this indicates that something is missing in the dependency description for stdpopsim?
Need to fill out the documentation for the class, which is served here
Over at tskit-dev/msprime#779 I have implemented a method that calculates the "true" Ne at a particular time - as the mean TMRCA - for comparison with methods that infer a single Ne (as opposed to a coal rate as a function of time). When that gets merged to msprime, shall we add them to @ckyriazis's plots?
We need some analysis of the GutenkunstThreePopOutOfAfrica model to verify that it is correctly implemented. This should be in the form of a Jupyter notebook (stored in a new directory, say notebooks
). This can serve as a example for our model QC process.
Hey all--
@jradrion and I have been looking at the SFS produced by human models in an effort to debug stairwayplot a bit. We've noticed that simulating under a genetic map instead of a constant rates leads to strange spikes in the SFS.
Presumably these spikes are caused by low recombination regions of the simulated chromosome? As an empiricist, one would want to drop such regions from downstream analysis- should we consider doing the same here?
A minimum working example of this comparison can be found here. Run it a few times if you don't see a bigger spike in the genetic map case in the first run or two.
representative output looks like this
$ python testSFS.py
SFS with genetic map
[0. 0.41105418 0.15667424 0.10971423 0.10006515 0.03928301
0.10844646 0.02769708 0.02285493 0.02421073]
SFS with constant rate
[0. 0.43990996 0.17313905 0.10304832 0.07146539 0.05412743
0.04816482 0.03857653 0.0367551 0.03481339]
notice the freq=6 bin with the genetic map
thoughts?
While doing some analysis with stdpopsim
I realized it might be nice (mostly for generalized code) to have default generation time associated with each model. Thoughts @jeromekelleher @andrewkern @petrelharp?
As discussed in meeting May 7th. Are you leading this one @andrewkern?
My lack of bioInformatics knowledge leads me to believe leaving out the Y chromosome is intentional. There also happens to be 3 files for the X Chromosome.
Either way,
180 map_file = os.path.join(self.map_cache_dir, self.file_pattern.format(name=name))
181 return msprime.RecombinationMap.read_hapmap(map_file)
in stdpopsim/genetic_maps
heuristic won't quite work here (I think).
It will throw a FileNotFoundError
if you do the following
import msprime
from stdpopsim import homo_sapiens
for chrom in homo_sapiens.genome.chromosomes:
chrom = homo_sapiens.genome.chromosomes[chrom]
chrom.recombination_map()
I don't program in Python, and seems that this project is going to be Python based. Any recommendations on how I can help?
Copied from #52.
@jeromekelleher As I kinda expected, my implementation of the model did not match the production version. Would it be a good idea to have a diff() function that would report model differences per epoch in a format similar to the DemographyDebugger() program? Otherwise I'm just comparing two DemographyDebugger() outputs which is doable, but feels clumsy.
What we could do is change the definition of Model.equals
so that it raises an exception if models differ rather than returning true/false; this would at least say where the difference occurs). So, we'd change the name to check_equal
, which would raise a ValueError with meaningful messages telling us which bits of the models differ. The existing tests can be rejigged to use this easily enough.
If a user needs to use a recombination map usually what they want is the most recently defined one. So, documentation should list the maps so that the most recently defined map is listed first, and prominently display the year it was defined. This also goes for the help in the CLI, which should list available recombination maps with a short description of what they are, and their year.
I've made some initial infrastructure to transparently download genetic maps on demand, and store them in a cache for future use. The code is here. The idea is that you define a subclass of the genetic map which gives a URL where it can downloaded. When you ask for a specific genetic map for a chromosome, it first checks the cache to see if the genetic map has already been downloaded. If so, you load the map directly from the cache. If not, you download the genetic map from its URL. I've implemented this for the HapMapII genetic map in humans, and it seems to work pretty well. In use, it looks like
genetic_map = stdpopsim.get_genetic_map("h_sapiens", "HapmapII_GRCh37")
recomb_map = genetic_map.get_chromosome_map("chr20")
ts = msprime.simulate(10, recombination_map=recomb_map)
There's some more work to be done in figuring out how this interacts with the chromosome definitions (which is currently clunky), but I think the basic infrastructure for downloading and managing the maps is good. It would be good to get some opinions on whether this is a good idea before I go any further with it: any opinions @popgensims/all?
The reason we need this sort of caching infrastructure is because the maps are too big to bundle with the code. The gzipped HapMap genetic map is 35M, which is already too much to bundle with a Python package. Multiply this by several different maps across multiple species and it's definitely way too much.
If we follow this approach, it might be worth thinking about putting all the maps that we use in one location --- this would surely be an easy thing to convince Amazon to store as a public dataset.
Just checking @apragsdale , but N_0 is the population size of the archaic populations?
The code for figuring out when to download a genetic map depends on checking for the corresponding files. So does substituting default zero maps, as introduced in #33. This means we are always returning zero maps at the moment.
(HT to @petrelharp here; I wouldn't have spotted this except for your warning!)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.