tskit-dev / msprime-1.0-paper

Publication describing msprime 1.0

Makefile 0.98% TeX 76.56% Python 8.77% C 13.70%

msprime-1.0-paper's Issues

Performance figure - is Ne*L correct?

@petrelharp @jeromekelleher I am looking at the performance figure https://github.com/tskit-dev/msprime-1.0-paper/blob/main/figures/ancestry-perf.pdf and I wonder how we get the vertical line for Homo sapiens at 25K. If, say, Ne is 10K and L is ~30, then Ne * L would be 10K * 30 = 300K, but the vertical line is at 25K (an order of magnitude off).
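
The arithmetic in question can be checked with a short sketch. The values of Ne and L are the ones quoted above, the variable names are illustrative, and nothing here reflects how the figure script actually positions the line.

```python
# Values quoted in the question above; the names are illustrative only.
ne = 10_000          # assumed effective population size (Ne)
genome_length = 30   # L as quoted above, in the units used by the figure
plotted = 25_000     # position of the vertical line in the figure

product = ne * genome_length   # Ne * L
ratio = product / plotted      # how far the plotted line is from Ne * L

print(f"Ne * L = {product:,}; plotted line = {plotted:,}; ratio = {ratio:g}")
```

If the ratio is far from 1, either the assumed Ne or L differs from what the figure uses, or the line is drawn from a different quantity.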

Discussion of coalescent scaling

It would be useful to give people an idea of when msprime is fast and when it isn't, and why.

Some version of this plot from @petrelharp and a discussion would be great.

Feature section: mutation models

We need a section describing the general mutation models in msprime. If we get the SLiM mutation model implemented we should also discuss this (@petrelharp)

Should cite:

Seq-Gen
PyVolve

Any others?

@GertjanBisschop and @petrelharp, you seem like the right people to take on this section?

See #7 for detail on what a feature section should look like.

Unpack "extensible" a bit?

The paper title included the word "extensible", but I'm not quite sure what that means in this context: do we mean that people can join the dev team and add features, or we provide some sort of plug-in methodology, or that we can use tskit as an interchange format, or quite what? I think if this word is to be used in the title it needs a little more explanation - it's currently only mentioned in 2 other places in the text.

Citation data from SCOPUS

I looked this up out of curiosity - it's probably of no interest for the paper, but I thought I'd record it here in case it was.

This is from data collected 2021-06-27

Data:

year,msprime,ms,scrm,macs,cosi2
2014,0,119,0,26,0
2015,0,93,2,24,3
2016,1,131,8,21,6
2017,7,82,6,20,1
2018,24,96,7,18,1
2019,39,78,14,19,2
2020,49,77,7,22,1
2021,27,27,6,7,1

Code:

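import pandas as pd
import seaborn as sns

# Load the table above (saved as msprime.csv) and index it by year.
df = pd.read_csv("msprime.csv")
df = df.set_index("year")

# One line per simulator, with point markers and solid lines.
ax = sns.lineplot(data=df, markers=True, dashes=False)
ax.set_ylabel("Citations per year (SCOPUS)")
fig = ax.get_figure()
fig.savefig("plot.png")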
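
For anyone without pandas or seaborn to hand, here is a minimal standard-library sketch that parses the same table and prints msprime's share of the recorded citations per year. The share calculation is only an illustration and was not part of the original analysis. Note that 2021 is a partial year, since the data were collected on 2021-06-27.

```python
import csv
import io

# The SCOPUS table above, embedded so that the sketch is self-contained.
DATA = """year,msprime,ms,scrm,macs,cosi2
2014,0,119,0,26,0
2015,0,93,2,24,3
2016,1,131,8,21,6
2017,7,82,6,20,1
2018,24,96,7,18,1
2019,39,78,14,19,2
2020,49,77,7,22,1
2021,27,27,6,7,1
"""

rows = list(csv.DictReader(io.StringIO(DATA)))
shares = {}
for row in rows:
    year = int(row.pop("year"))
    # Remaining columns are citation counts per simulator.
    counts = {tool: int(n) for tool, n in row.items()}
    total = sum(counts.values())
    shares[year] = counts["msprime"] / total
    print(f"{year}: msprime {counts['msprime']}/{total} ({shares[year]:.0%})")
```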

Comparison with ARGON

@gtsambos has been doing some work comparing msprime with ARGON, so it'd be nice to include a plot somewhere illustrating this.

@gtsambos, can you paste in here the figure you have currently please so we can discuss?

ARG illustration

From #83:

In the ARG section, we might make a diagram showing something that's in the full ARG but not the minimal tree sequence, to make the distinction clear? As part (b) of Figure 5?

This sounds like an excellent idea to me. How about something like this?

This would make a good two panel figure then I think, (a) illustrating an ARG and the full-arg representation in msprime, and (b) showing how the extra ARG nodes come to dominate in large simulations.

Perhaps we could also show the simplified, minimal tree sequence too?

Any thoughts @JereKoskela?

Figure: performance of mutation simulations

Plot how long it takes to impose mutations on simulations for a range of values.

Add TOC to pdf?

When editing and looking at the overall structure, I think it might be really handy to have a table of contents in the thumbnail bar (often shown at the left of the pdf viewer).

According to https://tex.stackexchange.com/questions/42343/how-to-add-a-navigation-window-to-a-latex-generated-pdf-document we could do this by adding something like this to the preamble:

\usepackage[]{hyperref}
\hypersetup{
    pdftitle={Your title here},
    pdfauthor={Your name here},
    pdfsubject={Your subject here},
    pdfkeywords={keyword1, keyword2},
    bookmarksnumbered=true,     
    bookmarksopen=true,         
    bookmarksopenlevel=1,       
    colorlinks=true,            
    pdfstartview=Fit,           
    pdfpagemode=UseOutlines,    % this is the option you were looking for
    pdfpagelayout=TwoPageRight
}

Is that worth doing, or would that mess too much with the established document layout format that has been given to us? Or could we try it out and remove it if necessary before submission? I'm happy to try it out if given the go-ahead.

Title

Current title is: Msprime 1.0: an efficient and extensible coalescent simulation framework

I'm not totally in love with this (see #46 for discussion of what "extensible" means, and why it's maybe not so important). Any other suggestions?

What about Population scale ancestry and mutation simulation with msprime 1.0?

GC section signoff

I updated the GC section in #118 - can you read through this please @fbaumdicker and @hyanwong and sign off on it (or comment/make changes, as needed).

The figure should be done now apart from some visual tweaks.

Introduce the distinction between simulating ancestry & mutations earlier on?

It seems like we could do with talking about the difference between simulating ancestry and mutations near the top of the paper, and link to the API section for people who want further details.

Details for selection figure caption

We need to describe what we're simulating in the selection figure caption. @andrewkern, this is all yours.

Initial draft

I'm working on the initial draft and will update the repo once it's in reasonable shape (also depends on #1)

Figure: impact of including gene conversion

E.g. for human/mammalian levels of GC, which are usually about 10x the recombination rate.

Should show the effect on runtime (e.g. of a stdpopsim model) and on the size of the resulting tree sequence.

We might want to also show how much of the genome is actually affected by this (i.e. not much => you can often get away with not doing it), but that's getting into the weeds of analysing GC itself, rather than the implementation in msprime, so possibly not worth it.

Comparison with bacterial simulators

We would like to compare the performance of running msprime with gene conversion with some specialised bacterial simulators. See the original thread for details:

Originally posted by @fbaumdicker in #4 (comment)

@fbaumdicker, could you sketch out what would be needed here please? We're looking for a single figure here, that won't take too long to run or involve too much work.

Finalise ARG illustration figure

Here's what the ARG illustration figure currently looks like:

It would be good if the three panels had a common y axis, so that the nodes had the same height in each one.

Also would be nice if the (A), (B), (C) labels could be on top, for consistency with other figures.

make figures in real paper sizes

To get font sizes right we ought to be making our figures so that they don't have to be shrunk down by latex, so for instance, the figsize should have width <= 6.5 (which is apparently in inches). I'm happy to do this when it's time.
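
A minimal sketch of how this could be enforced in the plotting scripts. The 6.5 inch limit is the one quoted above; the helper name `paper_figsize` and the default aspect ratio are assumptions for illustration, not anything already in the repo.

```python
MAX_WIDTH_IN = 6.5  # maximum figure width quoted above, in inches


def paper_figsize(width_fraction=1.0, aspect=0.618):
    """Return (width, height) in inches for a figure that needs no rescaling.

    width_fraction: fraction of the full text width to use.
    aspect: height / width (the default here is an arbitrary assumption).
    """
    if not 0 < width_fraction <= 1:
        raise ValueError("width_fraction must be in (0, 1]")
    width = MAX_WIDTH_IN * width_fraction
    return (width, width * aspect)


full = paper_figsize()        # full-width figure
half = paper_figsize(0.5)     # e.g. one panel of a two-panel layout
print(full, half)
```

The returned tuple can be passed as the `figsize` argument to `matplotlib.pyplot.subplots`, so font sizes set in points come out at their true size in the PDF.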

runtime scaling of dtwf?

In #88, @sgravel said that dtwf scales like O(Ne^2 L) instead of O(Ne^2 L^2); currently there's no discussion of this. Should we add this? Is it discussed in the DTWF paper? Empirically, from that paper DTWF doesn't get faster than Hudson until quite long genomes (like, 30 Morgans, IIRC), so I think we may have decided this isn't a point we want to get into right now, since it'll be confusing in practice.

Confusing sentence in mutation appendix

l.1737:
If an edge spans a region of the genome with more than one mutation rate, this is done separately for each distinct region.

We use "region" twice in this sentence for two different things. I think we mean either:

If an edge spans regions of the genome with distinct mutation rates, this is done separately for each distinct region.

Or, more likely,

If an edge spans a region of the genome with more than one mutation rate, this is done separately for each mutation rate.

Perhaps it would be more clear to give an example of what multiple rates might refer to:

If an edge spans a region of the genome with more than one mutation rate, as may happen if we model multiple mutational processes, this is done separately for each mutation rate.
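
To make the first reading concrete (an edge crossing genomic regions with different rates), here is a hedged sketch of splitting an edge at the breakpoints of a piecewise-constant rate map and treating each piece separately. This is not msprime's implementation; the function names and the rate-map representation (sorted breakpoints plus one rate per interval) are assumptions for illustration.

```python
def split_edge_by_rate(left, right, breakpoints, rates):
    """Split the edge interval [left, right) at rate-map breakpoints.

    breakpoints: sorted positions, with len(breakpoints) == len(rates) + 1.
    Returns a list of (segment_left, segment_right, rate) tuples.
    """
    segments = []
    for start, end, rate in zip(breakpoints[:-1], breakpoints[1:], rates):
        seg_left = max(left, start)
        seg_right = min(right, end)
        if seg_left < seg_right:  # the edge overlaps this rate interval
            segments.append((seg_left, seg_right, rate))
    return segments


def expected_mutations(left, right, branch_length, breakpoints, rates):
    """Expected number of mutations on the edge, summed over its segments."""
    return sum(
        (seg_right - seg_left) * rate * branch_length
        for seg_left, seg_right, rate in split_edge_by_rate(
            left, right, breakpoints, rates
        )
    )


# Hypothetical example: two regions with rates 1e-8 and 2e-8,
# and an edge spanning [50, 150) with a branch length of 1000 generations.
segments = split_edge_by_rate(50, 150, [0, 100, 200], [1e-8, 2e-8])
mean = expected_mutations(50, 150, 1000, [0, 100, 200], [1e-8, 2e-8])
print(segments, mean)
```

Under the second reading (several mutational processes acting on the same region), the loop would instead run over processes, each with its own rate over the whole edge.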

Review instantaneous bottlenecks section

@KLohse, the "Instantaneous bottlenecks" section in the paper.tex document should be pretty much done now. Can you review please, and either make any edits required or confirm you're happy with it here please?

Author ORCIDs

If you have an ORCID and you would like it to be used for submission purposes, can you reply to this issue with the ID please (full URL ideally, so I can verify easily)

@agladstein
@andrewkern
@apragsdale
@awohns
@benjeffery
@castedo
@cdquinto
@daniel-goldstein
@DomNelson
@eldonb
@fbaumdicker
@gbinux
@GertjanBisschop
@grahamgower
@gregorgorjanc
@gtsambos
@hugovk
@hyanwong
@ivan-krukov
@JereKoskela
@jeromekelleher
@jgallowa07
@KLohse
@mmatschiner
@molpopgen
@mufernando
@nspope
@petrelharp
@saunack
@sgravel
@shajoezhu
@TPPSellinger
@winni2k

Feature section: selection

We need a feature section for the selective sweeps feature. See #7 for an overview of feature section structures.

@andrewkern, you seem like the obvious person for this?

Hudson's algorithm time complexity

There's been a bit of discussion on this, so I want to get the pieces together. Here's a plot from the 2016 paper:

The dots on the left are a fitted quadratic. I used this to conjecture that Hudson is quadratic in rho - which the HSW bound confirms - we think it should be O(rho^2 log^2 n).

What would we want to do to check if the number of events really is well characterised by this expression? Would fitting it jointly to all the data here be convincing?
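
One possible approach to the joint fit, sketched under assumptions: fit a single constant c in events ≈ c * rho^2 * log^2(n) by least squares through the origin across all (rho, n) combinations, then look at the relative residuals. The function names are hypothetical, and the data below are synthetic placeholders generated from the model itself, not measurements from the 2016 paper.

```python
import math


def fit_constant(observations):
    """Least-squares c for events ~ c * rho**2 * log(n)**2 (no intercept).

    observations: iterable of (rho, n, events) tuples.
    """
    sxy = sxx = 0.0
    for rho, n, events in observations:
        x = rho**2 * math.log(n) ** 2
        sxy += x * events
        sxx += x * x
    return sxy / sxx


def relative_residuals(observations, c):
    """(observed - predicted) / predicted for each observation."""
    out = []
    for rho, n, events in observations:
        predicted = c * rho**2 * math.log(n) ** 2
        out.append((events - predicted) / predicted)
    return out


# SYNTHETIC placeholder data generated from the model with c = 0.5;
# real event counts from simulations would replace this list.
true_c = 0.5
synthetic = [
    (rho, n, true_c * rho**2 * math.log(n) ** 2)
    for rho in (10, 100, 1000)
    for n in (10, 1000)
]

c_hat = fit_constant(synthetic)
residuals = relative_residuals(synthetic, c_hat)
print(c_hat, max(abs(r) for r in residuals))
```

With real data, systematic trends in the relative residuals against rho or n would suggest that the expression does not capture the scaling, whereas residuals scattered around zero would support it.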

Mutation illustration figure

Figure 2 currently looks like this:

We need to decide what to do with it. I think at the least we want to greatly reduce the number of samples.

@petrelharp, did you have some plans here?

I'm pretty lukewarm on the necessity of it, but happy to go with the consensus.

Discrepancies between discoal paper and selection appendix

There appear to be some discrepancies between the equations we're giving in the selection appendix and those in the discoal paper. Specifically, the denominator in the definition of mu(p) differs.

@andrewkern, can you comment please?

If the model is identical to what's given in the discoal paper, is there any point in reproducing it here? It's not like we refer to the equations or anything. If not, then I'm not sure there's a lot of value in the selection appendix, as the non-equation bits could easily be merged into the main-text section.

a few small comments

Thanks Jerome -
a few very minor comments below; and I would maybe like to go once more over it

use line numbers - put
\usepackage[right]{lineno}
in preamble, and then
\linenumbers
where they should start, e.g. before Introduction;
I think this will help the reviewers

the line numbers below refer to the line numbers for paper.tex on github :

156: Stochastic simulation is a key tool in population genetics.
The models involved are often analytically intractable, and simulation is then
the only way of evaluating a given inference.

184: The coalescent process (maybe refer here also to the papers by Kingman from 1982, who formally derived
the coalescent; I think Hudson did so only informally)

@Article{K82,
author = {J F C Kingman},
journal = {Stoch Proc Appl},
pages = {235--248},
title = {The coalescent},
volume = 13,
year = 1982
}

@Article{K82b,
author = {J F C Kingman},
journal = {J App Probab},
pages = {27--43},
title = {On the genealogy of large populations},
volume = {{19A}},
year = 1982
}

406: Let us define a `genome' as the chromosome ...

I usually think of a `genome' as all the chromosomes of a diploid (or polyploid) individual, so maybe for the sake of illustration simply talk about `chromosomes'?

654: could we explain what we mean by a `mature' code ?

line 1169: include a reference to Schweinsberg (2003), where the Beta-coalescent is first mentioned

@Article{schweinsberg03,
author = {J Schweinsberg},
journal = {Stoch Proc Appl},
pages = {107--139},
title = {Coalescent processes obtained from supercritical {G}alton-{W}atson processes},
volume = 106,
year = 2003
}

line 1174: include a reference to Birkner et al., where the Dirac-coalescent is first mentioned:

@Article{BBE13,
author = {M Birkner and J Blath and B Eldon},
journal = {Genetics},
pages = {255--290},
title = {An ancestral recombination graph for diploid populations with skewed offspring distribution},
volume = 193,
year = 2013
}

1188: `as a representation of the coalescent with recombination as a graph'

1713: the `Appendix' indicating the start of Appendix does not appear

2046: should it be \tan in the equation for \mu(p)
in the Selective sweeps trajectories ?

Feature section structure

The current narrative is based around having a section describing each of the major features in msprime. These should do the following:

Explain what the feature is, and why it's important.
Link into the literature for this feature, and especially list any existing simulators that implement this feature.
Describe (briefly) how this feature is implemented in msprime, and how it can be used.

G3 does encourage the use of code snippets to illustrate things, so we might consider also adding a short example of the Python API for each feature.

This is just a proposal, so please respond below if you think we should do things differently.

Author details

Since the author list got added in #121, we need to get the affiliations and acknowledgements. Can each author check in the tex file

The spelling of their name
Their position in the alphabetical list
Their affiliation
That they have acknowledged any appropriate grants in the Acknowledgements

We hope to submit the preprint in a week or so, so please do check these details ASAP.

Standardise colours to matplotlib palette?

I quite like having a set palette of colours - would it be OK to update the tree sequence example figure to use the standard matplotlib palette @petrelharp?

Cosmetic suggestions for fig 3

Figure 3 suggestion: I would spell out "Mutation rate" as a title to the legend to avoid redundancy and make the graph more visually self-contained.
I would also consider flipping the order of the items in the legend to have same sorting as in the data.

"Results" section - should we include example code / benchmarks?

I mentioned this at the end of the online meeting, but I bring it up again after having read the commented-out section in the paper layout that says, of the "Results" section:

For software, the authors are expected to present a
few examples that demonstrate the use of the software and benchmarks.

Does this mean that benchmarks etc (which could, or could not, be taken as meaning some of the plots, such as 1A) should go in this "Results & discussion" section? Or should we simply have a few examples showcasing the paragraphs in the methods - which we could do with "boxes with scripts supported by a narrative" as they recommend.

Or is all this too dependent on the G3 format, which is, I presume, where this structure comes from?

Some more details

Finding anything wrong with this manuscript is really hard, it's excellently written.

But here are four details that could be addressed:

In the sentence "For example, ms has been forked to output information about migrating segments (Rosenzweig et al., 2016), ancestral lineages (Chen and Chen, 2013), and ms’s fork msHOT (Hellenthal and Stephens, 2007), has in turn been forked to output information on local ancestry (Racimo et al., 2017).", the comma before "has in turn been" should probably be removed.
In the legend of Figure 3, "and recombination rate of 10-8" could be removed because apparently three different recombination rates were used. It might be worth specifying "recombination rate" instead of just "rate" in the inset.
In the sentence "Taking a typical chromosome to be 1 Morgan in length, these plots show, roughly, that simulating chromosome-length samples from a population of thousands of individuals takes seconds, while samples from a population of tens of thousands takes minutes.", the last "takes" should be replaced with "take".
In the sentence "However, the generation-by-generation approach of the DTWF is less efficient than the coalescent with recombination when the number of lineages is significantly less than the population size (the regime where the coalescent is an accurate approximation), which usually happens in the quite recent past Bhaskar et al. (2014).", the opening parenthesis of the Bhaskar et al. reference is misplaced.

Change Appendix to Methods?

At the moment the Appendix is a list of "how we do stuff for X". We could change this to a Methods section instead?

I don't have a strong opinion, just putting the idea out there.

Feature section: forwards simulators

We need a feature section for integration with forward sims. See #7 for an overview of feature section structures.

@petrelharp, would you like to field this, or will I bulk it out?

Author approval for submission

Dear authors,

The manuscript is very close to being finished, and I think it's ready to go after a final round of (hopefully) minor updates and tweaks. I've attached the current draft. Can you please read through, and if you are happy for it to be preprinted on bioRxiv and submitted to Genetics, please comment below in this thread. If you would like some changes made or have some comments to address before you're happy to approve, please either open a PR with those changes or a separate issue to discuss, or discuss on the Overleaf version (basically, please don't start a discussion thread here about the paper, so I can easily track who has approved and who has not).

paper.pdf

Paper format and journal

The paper is envisaged as a description of msprime and all of the features that it has, with comprehensive references to the literature of existing simulators, to emphasise msprime's large feature set and the fragmented state of the simulation ecosystem.

Some options:

Bioinformatics application note

A Bioinformatics application note is a pretty standard format for software papers like this. It would be hard to argue against publishing this paper in Bioinformatics, given the number of simulators that are already published in this format.

The downside is the format is very short. My current, very incomplete, draft is already longer than this. Given the number of references and the number of authors, I think it would be very terse indeed.

Heredity computer note

Heredity have recently launched a computer note format which might be appropriate.

Performance figure

A key argument in favour of msprime is that it's fast. It's faster than lots of other simulators that only have a subset of its features. We want a figure demonstrating this.

@daniel-goldstein and I just sketched out this:

The idea is that we compare msprime against other specialised simulators on a variety of applications, and show that it's faster (hopefully).

At the moment, we're looking at 4 panels. We have time on the y axis, and sample size on the x. Fix other parameters at human-ish, and tweak them for each panel so that we can actually run the simulations.

Standard coalescent: ms, msms, msprime
SMC approximations: MaCS, scrm, msprime_smc, msprime hudson
DTWF: Argon, msprime_dtwf
Sweeps: msms, discoal, msprime sweep

I'm least clear about the sweeps example: @andrewkern, what do you think we should do here? Any other simulators to consider?

Any other simulators we should consider?

Haldane's sieve reference

Actually, now that I think about it, I don't get the Haldane's sieve reference in the discussion. What's the connection to dominant and recessive alleles @petrelharp?

This is unsurprising, since the majority of simulation software development is performed by students, who are themselves learning to code, often without formal training in software development. Although this “Haldane’s sieve” of software has produced many good tools and enabled decades of research, it also represents a missed opportunity to invest as a community in shared infrastructure and mentorship in good software development practice.

Feature section: DTWF

We need a feature section for DTWF. See #7 for an overview of feature section structures.

Are there other backwards time, generation by generation simulators we should be citing here?

@DomNelson, can you pick this one up please?

Mention that msprime 1.0 can simulate large numbers of populations now

Version 0.x was very bad at simulating large demographies, worth mentioning that it's much better now.

Feature section: ground truth for N(t) methods and where are lineages

As part of the stdpopsim work, we added some very nifty features for computing ground truth values and related functions for computing the location of lineages over time. We should add a features section (see #7 for what this means) describing this.

@petrelharp, @apragsdale, you're the main people involved here; any thoughts?

Review paragraph on ABC/machine learning

#34 added a paragraph that's intended to motivate the need for efficient interchange of simulated data, and talks about what people do for ABC and machine learning. It would be very helpful if folks who know more about this could read over it, and update with any changes.

First paragraph of subsection called "Data interchange and interoperability"

Pinging @gtsambos, @grahamgower, @andrewkern

Feature section: multiple merger coalescents

We need a feature section for multiple merger coalescents. See #7 for an overview of feature section structures.

@JereKoskela, @TPPSellinger, @eldonb would one or more of you like to volunteer for this please?

Finalise recombination perf figure

@petrelharp, what do you think of the current version of the recombination perf figure? Do you want to make some tweaks, or can we fix up the right hand panel x axis labels and be done?

Authorship

Anyone who has contributed significantly to msprime will be an author on the paper describing the simulator (see #1 for discussion on journal/format). An author is someone who has contributed to the code, documentation, packaging or algorithm development. Minor doc fixes don't qualify (but this is a proposal, so if you don't agree please shout out).

Note that we're not currently including contributions to code that is now part of tskit. Such contributions will be acknowledged in a planned tskit paper at some point in the future when tskit has accumulated lots of features.

Here is the current proposal, mainly based on the contributions github page:

Ashander, Jaime, @ashander
Baumdicker, Franz, @fbaumdicker
Bisschop, Gertjan, @GertjanBisschop
Eldon, Bjarki, @eldonb
Ellerman, E. Castedo, @castedo
Galloway, Jared, @jgallowa07
Gladstein, Ariella, @agladstein
Goldstein, Daniel, @daniel-goldstein
Gorjanc, Gregor, @gregorgorjanc
Gower, Graham, @grahamgower
Gravel, Simon, @sgravel
Guo, Bing, @gbinux
Jeffery, Ben, @benjeffery
Kelleher, Jerome, @jeromekelleher
Kern, Andrew, @andrewkern
Koskela, Jere, @JereKoskela
Kretzschmar, Warren W., @winni2k
Krukov, Ivan, @ivan-krukov
Lohse, Konrad, @KLohse
Matschiner, Michael, @mmatschiner
Nelson, Dominic, @DomNelson
Pope, Nathaniel, @nspope
Quinto-Cortés, Consuelo D., @cdquinto
Ragsdale, Aaron, @apragsdale
Ralph, Peter, @petrelharp
Rodrigues, Murillo F., @mufernando
Saunack, Kumar, @saunack
Sellinger, Thibaut, @TPPSellinger
Thornton, Kevin, @molpopgen
Tsambos, Georgia, @gtsambos
Van Kemenade, Hugo, @hugovk
Wohns, Anthony W, @awohns
Wong, Yan, @hyanwong
Zhu, Sha, @shajoezhu

If you do not wish to be an author, please let me know and I will remove you from the list. Otherwise, please click on the "Watch" button above to follow updates on this repo.

Author order

~~All authors will be listed in alphabetical order of surname, as this seems like the least unfair way to do it.~~

Updated: as discussed here, we're adopting a 4-category author system, with authors listed alphabetically within those categories.

Feature section: simultaneous bottlenecks

We need a feature section for simultaneous bottlenecks. See #7 for an overview of feature section structures.

This one's all yours @KLohse

Finalise tree sequence definition section

#134 and #138 tried to clarify the tree sequence definition section, which I tried to merge into #141. In particular, this added the idea of a "ploid", which we explain before going on to talk about nodes and edges. I think this is the right approach, but the current discussion of ploids is half-hearted.

Anyone want to make a pass (@sgravel @petrelharp @andrewkern)?

Appendix for mutation algorithm

From #83:

Should we describe how the mutation simulation algorithm works? In an appendix?

This sounds like a good idea to me, and perfect for an appendix (I'm fine with lots of appendixes). I wouldn't be able to write it, though, so I'm assigning this one to you @petrelharp. If you don't want to do it, we'll just close.

big-picture comments

I've done a general read-through of the paper, and am recording big-picture comments here (for lack of a better place?).

Generally: it looks great! I'd say that right now it is a general overview of the new features, with demonstrations that these new features are indeed fast. There isn't much detail on how things actually work, except in the Appendix. I think this is probably good - how much detail we want depends on the audience, and adding more detail would make it too long.

In the "tree sequence" section, it doesn't yet explain how mutations are put in there.
However, the next section does, so maybe that section should be titled "mutations" not "simulating mutations"
We've got nothing showing the Demography features. We might have a table of Demography ingredients? Or a diagram of a demography with a few events, with the code to produce it alongside? Right now it feels a bit abstract, and arriving at "instantaneous bottlenecks" makes it feel like there will be sections for many more demographic ingredients (but there are not). Alternatively, something people will wonder is "will my old demography code work" and/or "how do I translate to the new demography"; perhaps we can address that here?
In the ARG section, we might make a diagram showing something that's in the full ARG but not the minimal tree sequence, to make the distinction clear? As part (b) of Figure 5?
In Figure 7, the y-axis is on a log scale. And, maybe we should extend the x-axis to 100Mb?
Maybe "simulation interface" and "development model" should go first?
In "development model" we could give an example of how we test things, to make it clear that it's rigorous.
There's no discussion of when you need to do forwards simulation (selection, continuous space, etc); maybe this belongs in the Introduction?
Should we describe how the mutation simulation algorithm works? In an appendix?
I wonder if there are some descriptive figures we'd like to include, e.g. a conceptual one that shows simulating trees and then adding mutations (using the nice tskit svg drawing capability), or example trees simulated with and without multiple mergers. This could spice up the performance figures (which all look the same).

I'm happy to do some of these things, but thought I should make one issue instead of ten - but, maybe if others agree a thing should be done, it should be made a separate issue? I'm not sure how to deal with this.
