msprime-1.0-paper's People

Contributors

andrewkern, apragsdale, castedo, cdquinto, daniel-goldstein, eldonb, fbaumdicker, grahamgower, gregorgorjanc, gtsambos, hugovk, hyanwong, jeromekelleher, klohse, mufernando, petrelharp, sgravel, tppsellinger

msprime-1.0-paper's Issues

Unpack "extensible" a bit?

The paper title included the word "extensible", but I'm not quite sure what that means in this context: do we mean that people can join the dev team and add features, or we provide some sort of plug-in methodology, or that we can use tskit as an interchange format, or quite what? I think if this word is to be used in the title it needs a little more explanation - it's currently only mentioned in 2 other places in the text.

Citation data from SCOPUS

I looked this up out of curiosity - it's probably of no interest for the paper, but I thought I'd record it here in case it was.

This is from data collected 2021-06-27

[Image: plot]

Data:

year,msprime,ms,scrm,macs,cosi2
2014,0,119,0,26,0
2015,0,93,2,24,3
2016,1,131,8,21,6
2017,7,82,6,20,1
2018,24,96,7,18,1
2019,39,78,14,19,2
2020,49,77,7,22,1
2021,27,27,6,7,1

Code:

# Assumes the "Data" block above has been saved as msprime.csv
import pandas as pd
import seaborn as sns

df = pd.read_csv("msprime.csv")
df = df.set_index("year")

# One line per simulator, with year on the x axis
ax = sns.lineplot(data=df, markers=True, dashes=False)
ax.set_ylabel("Citations per year (SCOPUS)")
fig = ax.get_figure()
fig.savefig("plot.png")

Comparison with ARGON

@gtsambos has been doing some work comparing msprime with ARGON, so it'd be nice to include a plot somewhere illustrating the comparison.

@gtsambos, can you paste in here the figure you have currently please so we can discuss?

ARG illustration

From #83:

In the ARG section, we might make a diagram showing something that's in the full ARG but not the minimal tree sequence, to make the distinction clear? As part (b) of Figure 5?

This sounds like an excellent idea to me. How about something like this?

[Image: Screenshot from 2021-07-23 10-11-19]

This would make a good two panel figure then I think, (a) illustrating an ARG and the full-arg representation in msprime, and (b) showing how the extra ARG nodes come to dominate in large simulations.

Perhaps we could also show the simplified, minimal tree sequence too?

Any thoughts @JereKoskela?

Add TOC to pdf?

When editing and looking at the overall structure, I think it might be really handy to have a table of contents in the thumbnail bar (often shown at the left of the pdf viewer).

According to https://tex.stackexchange.com/questions/42343/how-to-add-a-navigation-window-to-a-latex-generated-pdf-document we could do this by adding something like this to the preamble:

\usepackage[]{hyperref}
\hypersetup{
    pdftitle={Your title here},
    pdfauthor={Your name here},
    pdfsubject={Your subject here},
    pdfkeywords={keyword1, keyword2},
    bookmarksnumbered=true,     
    bookmarksopen=true,         
    bookmarksopenlevel=1,       
    colorlinks=true,            
    pdfstartview=Fit,           
    pdfpagemode=UseOutlines,    % this is the option you were looking for
    pdfpagelayout=TwoPageRight
}

Is that worth doing, or would that mess too much with the established document layout format that has been given to us? Or could we try it out and remove it if necessary before submission? I'm happy to try it out if given the go-ahead.

Title

Current title is: Msprime 1.0: an efficient and extensible coalescent simulation framework

I'm not totally in love with this (see #46 for discussion of what "extensible" means, and why it's maybe not so important). Any other suggestions?

What about Population scale ancestry and mutation simulation with msprime 1.0?

GC section signoff

I updated the GC section in #118 - can you read through this please @fbaumdicker and @hyanwong and sign off on it (or comment/make changes, as needed).

The figure should be done now apart from some visual tweaks.

Initial draft

I'm working on the initial draft and will update the repo once it's in reasonable shape (also depends on #1)

Figure: impact of including gene conversion

E.g. for human/mammalian levels of GC, which are usually about 10x the recombination rate.

Should show the effect on runtime (e.g. of a stdpopsim model) and on the size of the resulting tree sequence.

We might want to also show how much of the genome is actually affected by this (i.e. not much => you can often get away with not doing it), but that's getting into the weeds of analysing GC itself, rather than the implementation in msprime, so possibly not worth it.

Comparison with bacterial simulators

We would like to compare the performance of running msprime with gene conversion with some specialised bacterial simulators. See the original thread for details:

Originally posted by @fbaumdicker in #4 (comment)

@fbaumdicker, could you sketch out what would be needed here please? We're looking for a single figure here, that won't take too long to run or involve too much work.

Finalise ARG illustration figure

Here's what the ARG illustration figure currently looks like:

[Image: Screenshot from 2021-08-12 14-32-11]

It would be good if the three panels had a common y axis, so that the nodes had the same height in each one.

It would also be nice if the (A), (B), (C) labels could be on top, for consistency with other figures.

make figures in real paper sizes

To get font sizes right we ought to be making our figures so that they don't have to be shrunk down by latex, so for instance, the figsize should have width <= 6.5 (which is apparently in inches). I'm happy to do this when it's time.
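One way to keep widths within the text width is to derive every figsize from a single constant. A minimal sketch, assuming the 6.5 inch width mentioned above; the helper name `paper_figsize` and the default aspect ratio are illustrative assumptions, not something taken from the paper's plotting scripts.

```python
# Hypothetical helper: choose a figure size that LaTeX will not need to shrink.
TEXT_WIDTH_INCHES = 6.5  # width quoted in this issue; adjust for the journal template


def paper_figsize(fraction=1.0, aspect=0.618, text_width=TEXT_WIDTH_INCHES):
    """Return (width, height) in inches for a figure spanning `fraction` of the text width."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    width = text_width * fraction
    return (width, width * aspect)


full_width = paper_figsize()
half_width = paper_figsize(fraction=0.5)
```

The returned tuple can be passed straight to matplotlib, e.g. `plt.subplots(figsize=paper_figsize())`, so that the figure is included at its natural size and the font sizes in the plot match the printed ones.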

runtime scaling of dtwf?

In #88, @sgravel said that dtwf scales like O(Ne^2 L) instead of O(Ne^2 L^2); currently there's no discussion of this. Should we add this? Is it discussed in the DTWF paper? Empirically, from that paper DTWF doesn't get faster than Hudson until quite long genomes (like, 30 Morgans, IIRC), so I think we may have decided this isn't a point we want to get into right now, since it'll be confusing in practice.

Confusing sentence in mutation appendix

l.1737:
If an edge spans a region of the genome with more than one mutation rate, this is done separately for each distinct region.

We use "region" twice in this sentence for two different things. I think we mean either:

If an edge spans regions of the genome with distinct mutation rates, this is done separately for each distinct region.

Or, more likely,

If an edge spans a region of the genome with more than one mutation rate, this is done separately for each mutation rate.

Perhaps it would be more clear to give an example of what multiple rates might refer to:

If an edge spans a region of the genome with more than one mutation rate, as may happen if we model multiple mutational processes, this is done separately for each mutation rate.
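To make the first reading concrete, here is a minimal sketch of splitting an edge at the breakpoints of a piecewise-constant mutation rate map and computing the expected number of mutations on each piece. The function name, the rate-map representation and the numbers are illustrative assumptions; this is not msprime's implementation.

```python
def expected_mutations_per_piece(edge_left, edge_right, branch_length, breakpoints, rates):
    """
    Split the edge [edge_left, edge_right) at the rate-map breakpoints and
    return a list of (left, right, expected_mutations) tuples.

    breakpoints: sorted positions [x0, x1, ..., xk]; rates[i] applies on [x_i, x_{i+1}).
    """
    if len(rates) != len(breakpoints) - 1:
        raise ValueError("need exactly one rate per interval")
    pieces = []
    for i, rate in enumerate(rates):
        # Intersect the edge with the i-th rate interval.
        left = max(edge_left, breakpoints[i])
        right = min(edge_right, breakpoints[i + 1])
        if left < right:
            pieces.append((left, right, rate * (right - left) * branch_length))
    return pieces


# Illustrative numbers only: an edge spanning three rate intervals.
pieces = expected_mutations_per_piece(
    edge_left=50,
    edge_right=250,
    branch_length=10.0,
    breakpoints=[0, 100, 200, 300],
    rates=[1e-3, 2e-3, 0.0],
)
```

Under a Poisson mutation process, the number of mutations on each piece would then be drawn independently with the corresponding mean, which is the sense in which the work is "done separately" per piece.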

Review instantaneous bottlenecks section

@KLohse, the "Instantaneous bottlenecks" section in the paper.tex document should be pretty much done now. Can you review it please, and either make any edits required or confirm here that you're happy with it?

Author ORCIDs

Hudson's algorithm time complexity

There's been a bit of discussion on this, so I want to get the pieces together. Here's a plot from the 2016 paper:

[Image: pcbi 1004842 g002]

The dots on the left are a fitted quadratic. I used this to conjecture that Hudson is quadratic in rho - which the HSW bound confirms - we think it should be O(rho^2 log^2 n).

What would we want to do to check if the number of events really is well characterised by this expression? Would fitting it jointly to all the data here be convincing?
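One possible way to do the joint fit is a one-parameter least-squares fit of events ≈ c * rho^2 * log^2(n) over all (rho, n) pairs at once, followed by a look at the residuals. A minimal sketch: the data below are synthetic placeholders generated inside the example (not measurements from the 2016 paper), and the function names are hypothetical.

```python
import math


def fit_joint_constant(observations):
    """
    Least-squares estimate of c in  events ~ c * rho**2 * log(n)**2,
    fitted jointly over all (rho, n, events) triples.
    """
    sxy = 0.0
    sxx = 0.0
    for rho, n, events in observations:
        x = rho ** 2 * math.log(n) ** 2
        sxy += x * events
        sxx += x * x
    return sxy / sxx


def relative_residuals(observations, c):
    """Relative error of the fitted curve at each observation."""
    return [
        (events - c * rho ** 2 * math.log(n) ** 2) / events
        for rho, n, events in observations
    ]


# Synthetic placeholder data generated from the conjectured form with c = 0.5.
true_c = 0.5
synthetic = [
    (rho, n, true_c * rho ** 2 * math.log(n) ** 2)
    for rho in (10, 100, 1000)
    for n in (10, 1000, 100000)
]
c_hat = fit_joint_constant(synthetic)
residuals = relative_residuals(synthetic, c_hat)
```

With real event counts, a systematic trend in the relative residuals against rho or n would suggest that the expression is missing a term, whereas residuals scattered around zero would support it.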

Mutation illustration figure

Figure 2 currently looks like this:

[Image: Screenshot from 2021-08-09 17-16-11]

We need to decide what to do with it. I think at the least we want to greatly reduce the number of samples.

@petrelharp, did you have some plans here?

I'm pretty lukewarm on the necessity of it, but happy to go with the consensus.

Discrepancies between discoal paper and selection appendix

There appear to be some discrepancies between the equations we're giving in the selection appendix and those in the discoal paper. Specifically, the denominator in the definition of mu(p) differs.

@andrewkern, can you comment please?

If the model is identical to what's given in the discoal paper, is there any point in reproducing it here? It's not like we refer to the equations or anything. If not, then I'm not sure there's a lot of value in the selection appendix, as the non-equation bits could easily be merged into the main-text section.

a few small comments

Thanks Jerome -
a few very minor comments below; and I would maybe like to go once more over it

use line numbers - put
\usepackage[right]{lineno}
in preamble, and then
\linenumbers
where they should start, e.g. before Introduction;
I think this will help the reviewers

the line numbers below refer to the line numbers for paper.tex on github :

156: Stochastic simulation is a key tool in population genetics.
The models involved are often analytically intractable, and simulation is then
the only way of evaluating a given inference.

184: The coalescent process (maybe refer here also to the papers by Kingman from 1982, who formally derived
the coalescent; I think Hudson did so only informally)

@Article{K82,
author = {J F C Kingman},
journal = {Stoch Proc Appl},
pages = {235--248},
title = {The coalescent},
volume = 13,
year = 1982
}

@Article{K82b,
author = {J F C Kingman},
journal = {J App Probab},
pages = {27--43},
title = {On the genealogy of large populations},
volume = {{19A}},
year = 1982
}

406: Let us define a `genome' as the chromosome ...

I usually think of a `genome' as all the chromosomes of a diploid (or polyploid) individual, so maybe for the sake of illustration simply talk about `chromosomes'?

654: could we explain what we mean by a `mature' code ?

line 1169: include a reference to Schweinsberg (2003), where the Beta-coalescent is first mentioned

@Article{schweinsberg03,
author = {J Schweinsberg},
journal = {Stoch Proc Appl},
pages = {107--139},
title = {Coalescent processes obtained from supercritical {G}alton-{W}atson processes},
volume = 106,
year = 2003
}

line 1174: include a reference to Birkner et al., where the Dirac-coalescent is first mentioned:

@Article{BBE13,
author = {M Birkner and J Blath and B Eldon},
journal = {Genetics},
pages = {255--290},
title = {An ancestral recombination graph for diploid populations with skewed offspring distribution},
volume = 193,
year = 2013
}

1188: `as a representation of the coalescent with recombination as a graph'

1713: the `Appendix' indicating the start of Appendix does not appear

2046: should it be \tan in the equation for \mu(p)
in the Selective sweeps trajectories ?

Feature section structure

The current narrative is based around having a section describing each of the major features in msprime. These should do the following:

  1. Explain what the feature is, and why it's important.
  2. Link into the literature for this feature, and especially list any existing simulators that implement this feature.
  3. Describe (briefly) how this feature is implemented in msprime, and how it can be used.

G3 does encourage the use of code snippets to illustrate things, so we might consider also adding a short example of the Python API for each feature.

This is just a proposal, so please respond below if you think we should do things differently.

Author details

Since the author list got added in #121, we need to get the affiliations and acknowledgements. Can each author check in the tex file

  • The spelling of their name
  • Their position in the alphabetical list
  • Their affiliation
  • That they have acknowledged any appropriate grants in the Acknowledgements

We hope to submit the preprint in a week or so, so please do check these details ASAP.

Cosmetic suggestions for fig 3

Figure 3 suggestion: I would spell out "Mutation rate" as a title to the legend to avoid redundancy and make the graph more visually self-contained.
I would also consider flipping the order of the items in the legend to have same sorting as in the data.

"Results" section - should we include example code / benchmarks?

I mentioned this at the end of the online meeting, but I bring it up again after having read the commented-out section in the paper layout that says, of the "Results" section:

For software, the authors are expected to present a
few examples that demonstrate the use of the software and benchmarks.

Does this mean that benchmarks etc (which could, or could not, be taken as meaning some of the plots, such as 1A) should go in this "Results & discussion" section? Or should we simply have a few examples showcasing the paragraphs in the methods - which we could do with "boxes with scripts supported by a narrative" as they recommend.

Or is all this too dependent on the G3 format, which is, I presume, where this structure comes from?

Some more details

Finding anything wrong with this manuscript is really hard; it's excellently written.

But here are four details that could be addressed:

  1. In the sentence "For example, ms has been forked to output information about migrating segments (Rosenzweig et al., 2016), ancestral lineages (Chen and Chen, 2013), and ms’s fork msHOT (Hellenthal and Stephens, 2007), has in turn been forked to output information on local ancestry (Racimo et al., 2017).", the comma before "has in turn been" should probably be removed.
  2. In the legend of Figure 3, "and recombination rate of 10^-8" could be removed because apparently three different recombination rates were used. It might be worth specifying "recombination rate" instead of just "rate" in the inset.
  3. In the sentence "Taking a typical chromosome to be 1 Morgan in length, these plots show, roughly, that simulating chromosome-length samples from a population of thousands of individuals takes seconds, while samples from a population of tens of thousands takes minutes.", the last "takes" should be replaced with "take".
  4. In the sentence "However, the generation-by-generation approach of the DTWF is less efficient than the coalescent with recombination when the number of lineages is significantly less than the population size (the regime where the coalescent is an accurate approximation), which usually happens in the quite recent past Bhaskar et al. (2014).", the opening parenthesis of the Bhaskar et al. reference is misplaced.

Change Appendix to Methods?

At the moment the Appendix is a list of "how we do stuff for X". We could change this to a Methods section instead?

I don't have a strong opinion, just putting the idea out there.

Author approval for submission

Dear authors,

The manuscript is very close to being finished, and I think it's ready to go after a final round of (hopefully) minor updates and tweaks. I've attached the current draft. Can you please read through, and if you are happy for it to be preprinted on Biorxiv and submitted to Genetics, please comment below in this thread. If you would like some changes made or have some comments to address before you're happy to approve, please either open a PR with those changes or a separate issue to discuss, or discuss on the overleaf version (basically, please don't start a discussion thread here about the paper, so I can easily track who has approved and who has not).

[Attachment: paper.pdf]

@agladstein
@andrewkern
@apragsdale
@awohns
@benjeffery
@castedo
@cdquinto
@daniel-goldstein
@DomNelson
@eldonb
@fbaumdicker
@gbinux
@GertjanBisschop
@grahamgower
@gregorgorjanc
@gtsambos
@hugovk
@hyanwong
@ivan-krukov
@JereKoskela
@jeromekelleher
@jgallowa07
@KLohse
@mmatschiner
@molpopgen
@mufernando
@nspope
@petrelharp
@saunack
@sgravel
@shajoezhu
@TPPSellinger
@winni2k

Paper format and journal

The paper is envisaged as a description of msprime and all of the features that it has, with comprehensive references to the literature of existing simulators, to emphasise msprime's large feature set and the fragmented state of the simulation ecosystem.

Some options:

Bioinformatics application note

A Bioinformatics application note is a pretty standard format for software papers like this. It would be hard to argue against publishing this paper in Bioinformatics, given the number of simulators that are already published in this format.

The downside is the format is very short. My current, very incomplete, draft is already longer than this. Given the number of references and the number of authors, I think it would be very terse indeed.

Heredity computer note

Heredity have recently launched a computer note format which might be appropriate.

Performance figure

A key argument in favour of msprime is that it's fast. It's faster than lots of other simulators that only have a subset of its features. We want a figure demonstrating this.

@daniel-goldstein and I just sketched out this:

[Image: IMG_20200218_132955]

The idea is that we compare msprime against other specialised simulators on a variety of applications, and show that it's faster (hopefully).

At the moment, we're looking at 4 panels. We have time on the y axis, and sample size on the x. Fix other parameters at human-ish, and tweak them for each panel so that we can actually run the simulations.

  • Standard coalescent: ms, msms, msprime
  • SMC approximations: MaCS, scrm, msprime_smc, msprime hudson
  • DTWF: Argon, msprime_dtwf
  • Sweeps: msms, discoal, msprime sweep
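The panel layout above could be captured as a small configuration plus a timing harness. This is a sketch only: the `PANELS` dictionary mirrors the list above, while `time_simulator` and `stand_in_workload` are hypothetical and would need to be replaced with calls that launch the real simulators.

```python
import time

# Panel name -> simulators to compare, copied from the list above.
PANELS = {
    "Standard coalescent": ["ms", "msms", "msprime"],
    "SMC approximations": ["MaCS", "scrm", "msprime_smc", "msprime hudson"],
    "DTWF": ["Argon", "msprime_dtwf"],
    "Sweeps": ["msms", "discoal", "msprime sweep"],
}


def time_simulator(run, sample_size, replicates=3):
    """Return the best wall-clock time (seconds) of run(sample_size) over the replicates."""
    best = float("inf")
    for _ in range(replicates):
        start = time.perf_counter()
        run(sample_size)
        best = min(best, time.perf_counter() - start)
    return best


def stand_in_workload(sample_size):
    # Placeholder for launching a real simulator, e.g. via subprocess.
    return sum(range(sample_size))


# One row per (panel, simulator, sample size): time for the y axis, sample size for the x axis.
results = [
    (panel, simulator, n, time_simulator(stand_in_workload, n))
    for panel, simulators in PANELS.items()
    for simulator in simulators
    for n in (10, 100, 1000)
]
```

Keeping the results as flat rows makes it straightforward to load them into a table and draw one panel per entry in `PANELS`.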

I'm least clear about the sweeps example: @andrewkern, what do you think we should do here? Any other simulators to consider?

Any other simulators we should consider?

Haldane's sieve reference

Actually, now that I think about it, I don't get the Haldane's sieve reference in the discussion. What's the connection to dominant and recessive alleles @petrelharp?

This is unsurprising, since the majority of simulation software development is performed by students, who are themselves learning to code, often without formal training in software development. Although this “Haldane’s sieve” of software has produced many good tools and enabled decades of research, it also represents a missed opportunity to invest as a community in shared infrastructure and mentorship in good software development practice.

Feature section: DTWF

We need a feature section for DTWF. See #7 for an overview of feature section structures.

Are there other backwards time, generation by generation simulators we should be citing here?

@DomNelson, can you pick this one up please?

Review paragraph on ABC/machine learning

#34 added a paragraph that's intended to motivate the need for efficient interchange of simulated data, and talks about what people do for ABC and machine learning. It would be very helpful if folks who know more about this could read over it, and update with any changes.

First paragraph of subsection called "Data interchange and interoperability"

Pinging @gtsambos, @grahamgower, @andrewkern

Authorship

Anyone who has contributed significantly to msprime will be an author on the paper describing the simulator (see #1 for discussion on journal/format). An author is someone who has contributed to the code, documentation, packaging or algorithm development. Minor doc fixes don't qualify (but this is a proposal, so if you don't agree please shout out).

Note that we're not currently including contributions to code that is now part of tskit. Such contributions will be acknowledged in a planned tskit paper at some point in the future when tskit has accumulated lots of features.

Here is the current proposal, mainly based on the contributions github page:

Ashander, Jaime, @ashander
Baumdicker, Franz, @fbaumdicker
Bisschop, Gertjan, @GertjanBisschop
Eldon, Bjarki, @eldonb
Ellerman, E. Castedo, @castedo
Galloway, Jared, @jgallowa07
Gladstein, Ariella, @agladstein
Goldstein, Daniel, @daniel-goldstein
Gorjanc, Gregor, @gregorgorjanc
Gower, Graham, @grahamgower
Gravel, Simon, @sgravel
Guo, Bing, @gbinux
Jeffery, Ben, @benjeffery
Kelleher, Jerome, @jeromekelleher
Kern, Andrew, @andrewkern
Koskela, Jere, @JereKoskela
Kretzschmar, Warren W., @winni2k
Krukov, Ivan, @ivan-krukov
Lohse, Konrad, @KLohse
Matschiner, Michael, @mmatschiner
Nelson, Dominic, @DomNelson
Pope, Nathaniel, @nspope
Quinto-Cortés, Consuelo D., @cdquinto
Ragsdale, Aaron, @apragsdale
Ralph, Peter, @petrelharp
Rodrigues, Murillo F., @mufernando
Saunack, Kumar, @saunack
Sellinger, Thibaut, @TPPSellinger
Thornton, Kevin, @molpopgen
Tsambos, Georgia, @gtsambos
Van Kemenade, Hugo, @hugovk
Wohns, Anthony W, @awohns
Wong, Yan, @hyanwong
Zhu, Sha, @shajoezhu

If you do not wish to be an author, please let me know and I will remove you from the list. Otherwise, please click on the "Watch" button above to follow updates on this repo.

Author order

All authors will be listed in alphabetical order of surname, as this seems like the least unfair way to do it.

Updated: as discussed here, we're adopting a 4-category author system, with authors listed alphabetically within those categories.

Appendix for mutation algorithm

From #83:

Should we describe how the mutation simulation algorithm works? In an appendix?

This sounds like a good idea to me, and perfect for an appendix (I'm fine with lots of appendixes). I wouldn't be able to write it, though, so I'm assigning this one to you @petrelharp. If you don't want to do it, we'll just close.

big-picture comments

I've done a general read-through of the paper, and am recording big-picture comments here (for lack of a better place?).

Generally: it looks great! I'd say that right now it is a general overview of the new features, with demonstrations that these new features are indeed fast. There isn't much detail on how things actually work, except in the Appendix. I think this is probably good - how much detail we want depends on the audience, and adding more detail would make it too long.

  1. In the "tree sequence" section, it doesn't yet explain how mutations are put in there.
  2. However, the next section does, so maybe that section should be titled "mutations" not "simulating mutations"
  3. We've got nothing showing the Demography features. We might have a table of Demography ingredients? Or a diagram of a demography with a few events, with the code to produce it alongside? Right now it feels a bit abstract, and arriving at "instantaneous bottlenecks" feels like then there will be sections for many more demographic ingredients (but there are not). Alternatively, something people will wonder is "will my old demography code work" and/or "how do I translate to the new demography"; perhaps we can address that here?
  4. In the ARG section, we might make a diagram showing something that's in the full ARG but not the minimal tree sequence, to make the distinction clear? As part (b) of Figure 5?
  5. In Figure 7, the y-axis is on a log scale. And, maybe we should extend the x-axis to 100Mb?
  6. Maybe "simulation interface" and "development model" should go first?
  7. In "development model" we could give an example of how we test things, to make it clear that it's rigorous.
  8. There's no discussion of when you need to do forwards simulation (selection, continuous space, etc); maybe this belongs in the Introduction?
  9. Should we describe how the mutation simulation algorithm works? In an appendix?
  10. I wonder if there's some descriptive figures we'd like to include, e.g. a conceptual one that shows simulating trees and then adding mutations (using the nice tskit svg drawing capability), or example trees simulated with and without multiple mergers. This could spice up the performance figures (which all look the same).

I'm happy to do some of these things, but thought I should make one issue instead of ten - but, maybe if others agree a thing should be done, it should be made a separate issue? I'm not sure how to deal with this.
