Coder Social home page Coder Social logo

stdpopsim's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

stdpopsim's Issues

Genome-wide recombination rate for `genome` class

Add a genome-wide recombination rate to the genome class. I'm happy to implement this today. The only question is, should it be created and stored as an attribute at the instantiation of a genome object, or, should there simply be a method which computes this when needed.

Add population metadata

Tskit allows arbitrary metadata to be assocated with populations. Msprime 0.7.1 just added the facility to use metdata with PopulationConfiguration objects, so we now can add some descriptive stucture to each population. This should really help with debugging.

I suggest we add something simple like, e.g.

metadata={
    "name": "AFR", 
    "description": "The contemporary African population"
}

Any thoughts @ndukler, @andrewkern?

Tracking list of published demographic models to include

Working on a big PR with several additional demographic models from the literature (mostly in human). Opening this issue to track progress. Here's the list I'm working on so far:

Human:

Chimp:

Mutation rates/recombination rates required for humans

In homo_sapiens.py we have the length of each chromosome and placeholders for mutation and recombination rate data. Ideally, we'd like to have a mean rate that has been estimated for each chromosome. I'm not sure where we'd get this information for mutation rates.

'recombination map not found' error for drosophila and arabidopsis models

I've been trying to run the Arabidopsis model with the snakemake pipeline and have been getting the error:

UserWarning: Warning: recombination map not found for chromosome: 'chr5' on map: 'Salome2012', substituting a zero-recombination map.

even though I have no problem downloading the maps otherwise. I noticed this error also occurs with the Drosophila model but not the human or e. coli models.

Are others able to reproduce this error or is it something on my end? I'm wondering if it may be caused by some of the recent updates to recombination map behavior since I did not run into this error previously.

General question about dev.

@jeromekelleher Is there an easy way to do the flake8 tests on code before it's committed?
I know that in .git/hooks/ there is a pre-commit.sample but simply changing the name doesn't achieve the same testing as circleCI.

I think we need something along the line of exec flake8 --max-line-length 89 ../../stdpopsim ../../tests within this executable, but I'm not sure.

could you maybe give a short explanation on how to do the CircleCI tests before committing?

Add --years option to CLI

Add an option to the CLI to rescale time into years in the output tree sequence using the scaling factors given in #73 and #74.

This is probably easiest done by dumping the tables and multiplying all the times by this factor.

If this option is used, there should be a reference printed out saying where the generation_time estimate came from and encouraging users to cite it.

Tennessen model QC

Just opening this up as a forum for the QC of the Tennessen two population OOA model as per the earlier discussion.

Output masking feature

On the conference call today, we decided to implement a feature to mask the tree sequences that we simulate from genetic maps.

  1. Provide an option for the user to specify a set of genomic intervals (as a bed file, I guess?), which we "delete" (see below) from the simulated tree sequence.
  2. By default, when we simulate from a genetic map we compute a simple mask based on regions of zero recombination within this map. We will derive some heuristics for this which may or may not be organism/chromosome specific.

A planned feature for the next version of stdpopsim would be to curate a set of masks and provide methods to download and cache these transparently, like we do for genetic maps.

The 'delete' (or maybe 'dice' - we're still discussing what it should be called) operation will take the tree sequence simulated under the raw recombination map, and remove all topology and site data from the masked regions. Therefore, any statistics computed on this will not be affected by the zero recombination rate sections. Anything plotted against genome coordinates will show big gaps (which is what we want, right?). This is being developed at tskit-dev/tskit#261

Name abbreviation scheme

As discussed in the meeting on June 04, part of the clunkiness of the CLI can be attributed to the lack of concise names. It would be nice to have a reasonably consistent scheme for describing what things are.

Things we need to name:

Species

Should be easy: h_sap, d_mel etc? There must be well established notation for this somewhere that we can borrow?

Population models

It would be nice for the abbreviation to describe what the model is. Perhaps the first part would be the number of populations and the second some memorable abbreviation (3pop_ooa == Gutenkunst et al, 2pop_ooa == Tennyson)? Then perhaps if we have multiple 3pop_ooa models we could add a date of publication (3pop_ooa_2009), defaulting the shorter name to be synonymous with the more recent model?

Maybe the 'pop' bit in here is redundant: would ooa_3 be better? I.e., first part is a descriptive abbreviation and the second part is the number of populations?

Recombination maps

We could abbreviate HapMapIIGrch37 to hm_37?

License is actually GPL

The license for the package is currently marked as MIT, but after thinking about this a bit I think it's not correct. The stdpopsim package imports msprime, which is GPL, and stdpopsim is therefore also GPL. This is annoying, but can't be helped.

@molpopgen, do you agree with this?

Developing a standard demographic model documentation format

To make model documentation easier I would like to standardize the format. The current sections I am thinking of are:

  • Model description: describes general scenario
  • Population indices: lists populations in model and their indices
  • Source paper citation: provides a citation of the key paper for the model

Are there any other sections that we should have? Do we want to go into more detail about the model construction, etc. ?

Species.genome not rendering correctly on RTD

The documentation for the species.genome variable is supposed to give details abut the variable, but is giving details about the class instead. This is possibly due to RTD using an older version of sphinx.

One easy workaround is to put the genome documentation directly into the RST file rather than as source comments (probably less brittle as well, given the non-standard #: comment format required).

Output SLiM code for population models

The population models are current defined in terms of the msprime APIs for defining populations, migration matrices and demographic events. Ideally, we would like to make these apply to SLiM also, so that we don't have to write the population models twice. Is it possible, even in principle, to output fragments of Eidos code which would implement the intended demography?

Concretely, we'd have something like

import string
import stdpopsim
from stdpopsim import homo_sapiens

chrom = homo_sapiens.genome.chromosomes["chr22"]
model = homo_sapiens.GutenkunstThreePopOutOfAfrica()

slim_template = string.Template("""
initialize() {
    initializeTreeSeq();
    initializeMutationRate(0);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 1e8-1);
    initializeRecombinationRate($recombination_rate);
}

$demographic_model

5000late() {
     sim.treeSeqOutput("ex1_TS.trees");
}""")
slim_program = slim_template.substitute(
    recombination_rate=chrom.mean_recombination_rate,
    demographic_model=model.as_slim())
ts = pyslim.run(slim_program)

Any thoughts @petrelharp, @BenHaller?

surgery branch?

I see we have a second branch in the repo called surgery. do we still need this?

Generic models

#105 adds a ZigZag model to the list of human models. But, it's really a generic model. @andrewkern had the elegant idea that we could define some mixin models that we can combine with basic parameters for various organisms to get basic parameters for generation time, Ne etc. This seems feasible.

I suggest we define two more generic models so that we're not generalising on the basis of one case. Can we get some suggestions for what these should be please?

Allow ``species.genome["chr22"]``?

Currently we use

from stdpopsim import homo_sapiens
chrom = homo_sapiens.genome.chromosomes["chr22"]

to access a particular chromosome for a species. This is a bit clunky. We would set it up so that

from stdpopsim import homo_sapiens
chrom = homo_sapiens.genome["chr22"]

does the same thing (but keep the .chromosomes as well, for clarity. We could also add synonyms for the chromosomes, such that homo_sapiens.genome["chr22"] and homo_sapiens.genome["22"] return the same thing.

Any thoughts?

Generation times as an attribute of model class

I raised this issue over in analysis, but I probably should have posted it here. Andy and I noticed that generation times might be better implemented as a class attribute for each of the models. Currently these times are hard-coded into some of the models, however they are provided to some of the n_t analysis programs as user-specified arguments, leading to potential misspecification issues.

recombination map doing the right thing?

trying to get to the bottom of some generic model issues and ran in to the following case.

Why am I getting the same number of trees out here when the maps are so different? @jeromekelleher @petrelharp

"""
test of genetic maps
"""
import stdpopsim as stp
import msprime as msp

seed = 12345
sample_size = 10
population = 0
mod = stp.homo_sapiens.GenericConstantSize()
chrStr = "chr20"
chrom = stp.homo_sapiens.genome.chromosomes[chrStr]
samples = [msp.Sample(population=population, time=0)] * sample_size
map1 = chrom.recombination_map()
ts1 = msp.simulate(
        samples=samples,
        recombination_map=map1,
        mutation_rate=chrom.default_mutation_rate,
        random_seed=seed,
        **mod.asdict())

map2 = msp.RecombinationMap.uniform_map(chrom.length,
                                        chrom.default_recombination_rate)
ts2 = msp.simulate(
        samples=samples,
        recombination_map=map2,
        mutation_rate=chrom.default_mutation_rate,
        random_seed=seed,
        **mod.asdict())
print(map1.get_rates()[0:10])
print(map2.get_rates()[0:10])

print(ts1.num_trees)
print(ts2.num_trees)

# returns
# [0.0, 7.357160000000001e-09, 7.34243e-09, 7.34147e-09, 7.34513e-09, 7.42335e-09, # 7.42618e-09, 7.46463e-09, 7.19975e-08, 7.220007e-08]
# [1.7178269031631664e-08, 0.0]
# 
# 96538
# 96538

Where to locate standard simulation scripts (not the models)?

If I understand the repo layout correctly,

  1. stdpopsim holds models that can be used for generating simulations.
  2. The example scripts in top-level folder are example scripts and don't create standardized simulations (which is why they don't dump the tree sequences at the end)

Where do the scripts for generating standard simulations belong? In stdpopsim or analysis?

A. thalina QC

Does anyone have QC code for the Arabidopsis models? I can also do it myself if there isn't any. As soon as the QC goes up I can add some documentation.

Formalise QC process

Opening this thread to get discussion going on how we can formalise the QC process. Once we've reached a consensus here we can add this to the developer documentation.

Here's a draft to get things going:

  1. Developer A submits PR to add a new model. This must include basic unit tests and full documentation (i.e., the documentation that will be rendered on rtd). The code is checked for any obvious problems/style issues etc by maintainer M and merged when it meets these basic standards. The model is considered 'preliminary' and not linked from the online documentation and not included in the CLI.
  2. Maintainer M creates an issue created tracking the QC for the model, and assigns a developer B (after looking for volunteers) to do a blind implementation of the model. M also creates an issue for the CLI implementation of the model.
  3. Developer B creates a blind implementation of the model as PR within the unit testing module, and adds the automatic checking of this model for equality with the production model to the suite of unit tests. All being good, this PR is merged and the QC issue is closed.
  4. Someone then makes a PR updating the CLI, checking that the documentation, citations etc all work properly, and adds the model to the list of documented models.

There's no mention of an analysis notebook here though: do we need then, or simply implementing the model blind enough?

Arbitration

When developers A and B disagree on the model implementation, the process is to

  1. Try to hash out the details between them on the original issue thread
  2. If this fails, contact the authors of the original publication to resolve ambiguities.

Refactor chromosome definitions

PR #45 adds the ability to reproducibly run standard simulations from the command line, which is a real win in terms of usability. During the process of working through this change, it's become clear that the way that genome/chromosome/genetic map information is laid out is really quite poor and we should refactor it.

One possibility I've tried out is to have a 'factory function' which would return a chromosome with the appropriate instance variables set. See a rough version here. Clearly we'd need something that would work for different species as well, but I guess it wouldn't be the end-of-the-world to define a chromosome_factory function in each of the species modules. So, from a users perspective, they call

chrom = homo_sapiens.chromosome_factory("chr22", "HapMapII_GRCh37")

and the returned object will have a recombination map created from the appropriate genetic map and the default mutation rate for this chromosome. If the user calls:

chrom = homo_sapiens.chromosome_factory("chr12")

They'll get back a chromosome object with the length appropriate for chr12, the default mutation rate and a flat recombination map with the default recombination rate.

One thing that turned out to be pretty useful for testing the CLI is to add an option to return a fraction of a chromosome. So, if we did

chrom = homo_sapiens.chromosome_factory("chr12", length_multiplier=0.1)

which will return a chromosome that's 1/10th of the length of chr12. I think this is generally a useful thing to be able to do --- sometimes there's really no point in simulating a whole chromosome.

Does this refactoring sound OK to everyone? Any thoughts on what else we'd need to include in the API, or a better naming scheme?

Drosophila two population msprime simulations slow with recombination map

I've been trying to add drosophila for the two population analysis, however it seems that incorporating the recombination map into the msprime simulations is slowing it down a lot (ex. I have a simulation with 2 samples per population running and has been running for >5 days).

Is this expected or is there an error on my part? See zipped folder with a standalone script (dmel_two_pop_recomb_example.py) and scripts needed to run snakemake (edited to just run msprime).

dmel_two_pop_recomb_test.zip

Let me know if there is any other information that may be useful.

Update:
It took 5 to 6 days for the msprime simulations with two samples per-population to finish.

appdirs dependency not correctly defined?

I installed stdpopsim from my cloned repo and tried to import it but got the following error message:

ModuleNotFoundError: No module named 'appdirs'

I'm not really familiar with python but I'd imagine this indicates that something is missing in the dependency description for stdpopsim?

"true" Ne

Over at tskit-dev/msprime#779 I have implemented a method that calculates the "true" Ne at a particular time - as the mean TMRCA - for comparison with methods that infer a single Ne (as opposed to a coal rate as a function of time). When that gets merged to msprime, shall we add them to @ckyriazis's plots?

Analysis notebook needed for GutenkunstThreePopOutOfAfrica model

We need some analysis of the GutenkunstThreePopOutOfAfrica model to verify that it is correctly implemented. This should be in the form of a Jupyter notebook (stored in a new directory, say notebooks). This can serve as a example for our model QC process.

spikes in the SFS when simulating with genetic map

Hey all--

@jradrion and I have been looking at the SFS produced by human models in an effort to debug stairwayplot a bit. We've noticed that simulating under a genetic map instead of a constant rates leads to strange spikes in the SFS.

Presumably these spikes are caused by low recombination regions of the simulated chromosome? As an empiricist, one would want to drop such regions from downstream analysis- should we consider doing the same here?

A minimum working example of this comparison can be found here. Run it a few times if you don't see a bigger spike in the genetic map case in the first run or two.

representative output looks like this

$ python testSFS.py
SFS with genetic map
[0.         0.41105418 0.15667424 0.10971423 0.10006515 0.03928301
 0.10844646 0.02769708 0.02285493 0.02421073]
SFS with constant rate
[0.         0.43990996 0.17313905 0.10304832 0.07146539 0.05412743
 0.04816482 0.03857653 0.0367551  0.03481339]

notice the freq=6 bin with the genetic map

thoughts?

No `genetic_map_GRCh37_chrY.txt` in HapmapII_GRCh37_RecombinationHotspots.tar.gz

My lack of bioInformatics knowledge leads me to believe leaving out the Y chromosome is intentional. There also happens to be 3 files for the X Chromosome.

Either way,

180         map_file = os.path.join(self.map_cache_dir, self.file_pattern.format(name=name))
181         return msprime.RecombinationMap.read_hapmap(map_file)

in stdpopsim/genetic_maps heuristic won't quite work here (I think).

It will throw a FileNotFoundError if you do the following

import msprime
from stdpopsim import homo_sapiens

for chrom in homo_sapiens.genome.chromosomes:
    chrom = homo_sapiens.genome.chromosomes[chrom]
    chrom.recombination_map()

Change model.equals to 'verify' and raise an exception

Copied from #52.

@jeromekelleher As I kinda expected, my implementation of the model did not match the production version. Would it be a good idea to have a diff() function that would report model differences per epoch in a format similar to the DemographyDebugger() program? Otherwise I'm just comparing two DemographyDebugger() outputs which is doable, but feels clumsy.

What we could do is change the definition of Model.equals so that it raises an exception if models differ rather than returning true/false; this would at least say where the difference occurs). So, we'd change the name to check_equal, which would raise a ValueError with meaningful messages telling us which bits of the models differ. The existing tests can be rejigged to use this easily enough.

Would this be useful for the QC process @carjed, @ndukler?

Docs/CLI should list recombination maps in reverse time

If a user needs to use a recombination map usually what they want is the most recently defined one. So, documentation should list the maps so that the most recently defined map is listed first, and prominently display the year it was defined. This also goes for the help in the CLI, which should list available recombination maps with a short description of what they are, and their year.

Download/cache genetic maps

I've made some initial infrastructure to transparently download genetic maps on demand, and store them in a cache for future use. The code is here. The idea is that you define a subclass of the genetic map which gives a URL where it can downloaded. When you ask for a specific genetic map for a chromosome, it first checks the cache to see if the genetic map has already been downloaded. If so, you load the map directly from the cache. If not, you download the genetic map from its URL. I've implemented this for the HapMapII genetic map in humans, and it seems to work pretty well. In use, it looks like

genetic_map = stdpopsim.get_genetic_map("h_sapiens", "HapmapII_GRCh37")
recomb_map = genetic_map.get_chromosome_map("chr20")
ts = msprime.simulate(10, recombination_map=recomb_map)

There's some more work to be done in figuring out how this interacts with the chromosome definitions (which is currently clunky), but I think the basic infrastructure for downloading and managing the maps is good. It would be good to get some opinions on whether this is a good idea before I go any further with it: any opinions @popgensims/all?

The reason we need this sort of caching infrastructure is because the maps are too big to bundle with the code. The gzipped HapMap genetic map is 35M, which is already too much to bundle with a Python package. Multiply this by several different maps across multiple species and it's definitely way too much.

If we follow this approach, it might be worth thinking about putting all the maps that we use in one location --- this would surely be an easy thing to convince Amazon to store as a public dataset.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.