popsim-consortium / demes-spec

Demes demographic model specification

Home Page: https://popsim-consortium.github.io/demes-spec-docs/

License: MIT License

Python 100.00%

demes-spec's Introduction

A data model for describing demographic histories.

The Demes specification defines a data model for describing one or more demes (also known as populations), how they change over time, and their relationships to one another. A human-readable Demes model is written as a YAML file, which facilitates model sharing, reuse, and interoperability.

Documentation

Users interested in learning how to read and/or write a Demes YAML file are referred to the tutorial in the Demes documentation. A more precise definition of the Demes specification can also be found in the documentation, in the form of an annotated schema.

Software

Software for working with Demes YAML files is located in external repositories, in particular the demes Python package. Links to additional Demes-related software can be found in the documentation.

demes-spec's People

Contributors

grahamgower jeromekelleher sgravel apragsdale

demes-spec's Issues

JSON doesn't support Infinity

Apparently the JSON spec explicitly does not permit Infinity, which we require to represent start times. The Python json implementation can deal with the special value Infinity, and converts to/from float infinity just fine. However, it seems this isn't a widespread extension. Not sure what to do about this, because it means we can't really serialize to json (at least, we can't expect this to be more portable than yaml).

https://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript
https://stackoverflow.com/questions/61841001/handling-infinity-values-in-json
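The behaviour described above can be reproduced with Python's standard json module. The sketch below also shows one possible workaround, storing infinity as the string "Infinity"; the helpers `encode_inf` and `decode_inf` are hypothetical and not part of any Demes implementation.

```python
import json
import math

model = {"start_time": math.inf, "end_time": 0}

# By default Python emits the non-standard token Infinity and reads it back.
lenient = json.dumps(model)
assert json.loads(lenient)["start_time"] == math.inf

# With allow_nan=False the encoder sticks to the JSON spec and refuses.
try:
    json.dumps(model, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def encode_inf(obj):
    """Hypothetical helper: replace float infinity with the string "Infinity"."""
    if isinstance(obj, dict):
        return {key: encode_inf(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [encode_inf(value) for value in obj]
    if isinstance(obj, float) and math.isinf(obj) and obj > 0:
        return "Infinity"
    return obj


def decode_inf(obj):
    """Hypothetical helper: inverse of encode_inf."""
    if isinstance(obj, dict):
        return {key: decode_inf(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [decode_inf(value) for value in obj]
    if obj == "Infinity":
        return math.inf
    return obj


# The encoded form is strict JSON, so any compliant parser can read it.
portable = json.dumps(encode_inf(model), allow_nan=False)
```

A sentinel string like this would also be converted if it appeared in a free-text field such as a description, so a real implementation would need to restrict the conversion to time fields.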

Inherited attributes

We've talked about how to deal with inherited attributes in a few places, so I thought I'd start an issue here to bring it together.

There's a tension between our goals of (a) providing a specification for humans to read that is non-repetitive; and (b) providing a specification for interchange that's as simple to implement as possible. For (a) we would definitely like to have inherited attributes, because we don't want to write down the same parameters over and again. For (b) we would kinda prefer not to have inherited attributes, because it does make the specification a bit more complex, and we would have to be quite precise about what demes parsers are expected to do.

So, here's a proposal. Our object tree currently looks like this:

root
+ --- description
+ --- demes
      + --- id
      + --- description
      + --- epochs
           + --- initial_size
           + --- final_size
+ --- pulses
     + --- source
     + --- dest
     + --- time
+ --- migrations
    + --- source
    + --- dest
    + --- rate

I don't like the idea of having things like cloning_rate defined as root attributes of the graph, because this implies that everything inherits this as a default (even pulses and migrations, say).

What if we added an explicit defaults section to the various levels of the tree, and the defaults are then given as fully qualified references? So, something like

description: An example demography to play around with cloning attributes.
generation_time: 1
time_units: generations
defaults:
  Epoch.cloning_rate: 0.05
  Epoch.initial_size: 1000

demes:
- id: pop1
  description: Population with epochs and changing cloning rates
  ancestors: root
  defaults:
    Epoch.cloning_rate: 0.1 # Overrides the default from higher in the tree
    Epoch.initial_size: 1e4
  epochs:
  - end_time: 500
  - initial_size: 1e2
    end_time: 100
  - end_time: 0
    cloning_rate: 0.5
- id: pop2
  description: Population with epochs and changing cloning rates
  ancestors: root
  epochs:
  - end_time: 500
  - initial_size: 1e2
    end_time: 0
    cloning_rate: 1.0

So, the semantics can be defined precisely (if a value isn't given, go up the tree looking at the defaults, and if you don't find any, error out), and we're not cluttering up the namespace with things that aren't actually related to the object in question. The problem of outputting a compact/simplified graph (I'm on the fence about what the right terminology is here) is one of figuring out what can be put into the defaults to minimise repetition. This is a parsimony problem, which I'm sure we can solve.
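A minimal sketch of the lookup rule described above, assuming the defaults at each level have already been parsed into plain dictionaries keyed by the fully qualified name. The function name `resolve_epoch_value` is hypothetical.

```python
def resolve_epoch_value(key, epoch, deme_defaults, graph_defaults):
    """Return epoch[key], falling back to deme-level, then graph-level defaults."""
    if key in epoch:
        return epoch[key]
    qualified = f"Epoch.{key}"
    # Walk up the tree: the nearest enclosing defaults section wins.
    for defaults in (deme_defaults, graph_defaults):
        if qualified in defaults:
            return defaults[qualified]
    raise KeyError(f"no value or default found for {qualified}")


# Values taken from the example above.
graph_defaults = {"Epoch.cloning_rate": 0.05, "Epoch.initial_size": 1000}
pop1_defaults = {"Epoch.cloning_rate": 0.1, "Epoch.initial_size": 1e4}
pop2_defaults = {}

pop1_first = resolve_epoch_value(
    "cloning_rate", {"end_time": 500}, pop1_defaults, graph_defaults
)
pop2_first = resolve_epoch_value(
    "cloning_rate", {"end_time": 500}, pop2_defaults, graph_defaults
)
```

Here `pop1_first` picks up the deme-level override and `pop2_first` falls through to the graph-level default.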

The downside is that parsers have to be a bit more complicated, but I think this is something we can live with. So, to be clear, we get rid of the distinction between the low-level fully qualified graph that's used for interchange and settle on this single form which requires that implementations understand how default values work.

What do we think? I know I'm vacillating on what should and shouldn't go in the JSON schema definition (apologies!), but I feel like we're very close to something we can define precisely and start building on!

describe the simulation model

I'm thinking this would be analogous to the section in the msprime docs, and would include things like:

Forwards-time convention
The digressions in popsim-consortium/demes-python#7, regarding child deme proportions, and parent/offspring generations

Enforce that root deme start times are inf.

popsim-consortium/demes-python#169

initial_size<1 or final_size<1 should be an error

We currently enforce that Ne>0, but it should be Ne>=1. Unless there is some circumstance under which 0 < Ne < 1 makes sense?

Define the scope

We need to define the scope of the spec somehow. We know that demes is not a "universal specification format" for population genetic simulations and inference methods. It is for describing demographic models that people estimate and simulate. But, how do we concisely say what that actually means, and avoid feature creep in the future?

One way we might do this is to say that Demes is concerned with population-level processes only, and any parameters that are about the individual genomes are out of scope. So, this means that things like mutation rates, recombination rates, selection coefficients are out of scope, but things like migration and dispersal rates are in scope. There's probably a grey area somewhere, but this is biology, so there's never going to be a perfectly crisp definition of anything.

This is something we'd need to deal with in the paper too, but the specification needs to prominently define what it's about. See also https://github.com/apragsdale/demes-paper/issues/4 and https://github.com/apragsdale/demes-paper/issues/5

reference implementation doesn't output deme name in `asdict()`

As found during conversion of id to name in #61. Easy fix, but we should probably have a check that the resolved json files can be read back in to see if there's any other issues lurking.

Require at least one Epoch in a Deme

Currently we're implicitly defining a single epoch via some attributes of the Deme class. From a specification/parser writing perspective, it would be considerably simpler to just require the existence of at least one epoch. So, from the examples:

demes:
  constant_size_deme:
    initial_size: 1000

demes:
  constant_size_deme:
    epochs:
    - initial_size: 1000
      end_time: 0

Is the first that much more readable than the second? Are we expecting the models that people will be writing down to contain many demes with just one epoch?

transfer docs from demes-python repo

All the intro/tutorial stuff properly belongs here, as it should be implementation independent.

clarify migration rate units/meaning

We currently say "the rate of migration per generation" (which was copied from python docstrings). Is this supposed to be a proportion of the destination population? The probability that any given individual in the source population migrates in a given generation? Something else? Presumably this should be bounded above by 1.0?

build/render docs and find them a home

In #1, I created sphinx docs. We aren't building these using CI yet. We should. But this opens the question of where the docs should live. demes-python is currently building sphinx docs in a CI action and pushing the built docs to the https://github.com/popsim-consortium/demes-docs repository, which is then hosted via github pages at https://popsim-consortium.github.io/demes-docs/main/index.html. So we could make a new demes-spec-docs repository, and do the same thing. Or we could push the demes-spec docs to the same demes-docs repository, e.g. in its own subdirectory.

Or we could choose to avoid sphinx for the spec docs, and do something else? Github pages supports rendering markdown natively, which is probably the simplest thing we could do. But automatically building docs from the spec.yaml file wouldn't work (or would be a custom job).

HDM: Provide definition of population size resolution

We don't currently define how population size resolution should work. I suggest something like

In the most ancient epoch for a given deme, at least one of start_size and end_size must be specified. If either value is not specified, it is assigned the value of the other size. For example, if start_size has been assigned the value 100 and end_size is null, end_size is resolved to 100.

For each subsequent epoch after the most ancient epoch, if its start_size is not specified it is assigned the value of the previous epoch's end_size. If its end_size is not specified it is assigned the value of its start_size.
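The two rules above can be sketched as follows, assuming epochs are given as dictionaries ordered from most ancient to most recent and that a missing key means "not specified". The function name `resolve_sizes` is hypothetical.

```python
def resolve_sizes(epochs):
    """Fill in start_size and end_size for epochs ordered oldest first."""
    resolved = []
    previous_end = None
    for index, epoch in enumerate(epochs):
        start = epoch.get("start_size")
        end = epoch.get("end_size")
        if index == 0:
            # Most ancient epoch: at least one size is required.
            if start is None and end is None:
                raise ValueError(
                    "the most ancient epoch needs start_size or end_size"
                )
            if start is None:
                start = end
        elif start is None:
            # Later epochs inherit the previous epoch's end_size.
            start = previous_end
        if end is None:
            end = start
        resolved.append({**epoch, "start_size": start, "end_size": end})
        previous_end = end
    return resolved


sizes = resolve_sizes([{"start_size": 100}, {"end_size": 200}, {}])
```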

Make single Migration class?

I know we've been over this a few times, but I wanted to get a final discussion here for the record. From a parser implementation and specification perspective it would be a lot simpler if we just have a single class of migration. There are very good reasons for wanting both types, so I wonder if we could work around the issue as follows:

Migrations all have a start_time, end_time and a rate (blah blah blah). Migrations are either symmetric or asymmetric. In the asymmetric case, the source and dest demes must be specified. In the symmetric case the list of demes is specified. Either one of source and dest OR demes must be specified, but not both.

This seems to give us the behaviour we want, while simplifying the spec a bit, doesn't it?
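As a rough sketch of how a parser might enforce the "source and dest OR demes, but not both" rule, assuming each migration arrives as a dictionary. The function name `classify_migration` is hypothetical.

```python
def classify_migration(migration):
    """Return "symmetric" or "asymmetric", or raise if the fields are mixed."""
    has_pair_field = "source" in migration or "dest" in migration
    has_demes = "demes" in migration
    if has_pair_field and has_demes:
        raise ValueError("specify either source and dest, or demes, not both")
    if has_demes:
        return "symmetric"
    if "source" in migration and "dest" in migration:
        return "asymmetric"
    raise ValueError("a migration needs both source and dest, or demes")


kind_sym = classify_migration({"demes": ["A", "B"], "rate": 1e-5})
kind_asym = classify_migration({"source": "A", "dest": "B", "rate": 1e-5})
```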

link to software using/implementing Demes in the docs

This would be as an advertisement that if you have, or write, a Demes file, that you can indeed use it for something. Probably as a third section on the introduction page, in alphabetical order. If it gets too long we can move it to its own page in the future.

demes-python
fwdpy11
moments
msprime (soon)
stdpopsim (soon)

A one-line description of each might be nice too.

Should ``Graph.description`` really be mandatory?

Currently the spec says that the top-level description field in the graph is mandatory. I see the logic for this: we want to try to encourage people to describe their graphs. I wonder if it's a bit heavy-handed though.

Thoughts @apragsdale @molpopgen?

If we keep it as mandatory, is the empty string valid input?

include discrete-time and continuous-time interpretations of relevant params

The spec should clearly describe

both the discrete-time and continuous-time semantics/interpretations for: times, open-closed time intervals, and migrations.
both the discrete-N and continuous-N semantics for deme sizes. (e.g. rounding in the discrete case)

create JSON schema

See #14.
https://json-schema.org/

describe best-practices for specifying a demes graph

E.g.

for two extant populations, define (Ancestral, (child1, child2)), rather than (child1, (child1 continuation, child2)).
popsim-consortium/demes-python#46

examples/zigzag* population size parameters are 5x too low

See popsim-consortium/stdpopsim#745 and popsim-consortium/demes-python#179

Error conditions for pulses?

It seems like it should be an error if we specify the same pulse twice, or if we have contradictory proportion values for the same (source, dest, time) combinations. The parser SHOULD raise an error in this case.

Define these conditions in the spec and implement in the reference implementation.
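One way such a check could look, assuming pulses are dictionaries with source, dest, time and proportion keys. This is an illustrative sketch, not the reference implementation, and `check_pulses` is a hypothetical name.

```python
def check_pulses(pulses):
    """Raise if a (source, dest, time) combination appears more than once."""
    seen = {}
    for pulse in pulses:
        key = (pulse["source"], pulse["dest"], pulse["time"])
        if key in seen:
            if seen[key] == pulse["proportion"]:
                raise ValueError(f"duplicate pulse {key}")
            raise ValueError(f"contradictory proportions for pulse {key}")
        seen[key] = pulse["proportion"]


ok_pulses = [
    {"source": "a", "dest": "b", "time": 10, "proportion": 0.1},
    {"source": "a", "dest": "b", "time": 20, "proportion": 0.1},
]
check_pulses(ok_pulses)  # distinct times, so no error is raised
```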

update the reference implementation

In particular, there need to be changes made to address #38/#39 and #55. The test cases added in #68 can be used for verification against the demes-python implementation.

Clean up tutorial documentation

I've noticed some typos and errors in the Tutorial from my previous PR filling in docs (#47). This is just a note-to-self that the high-level YAML tutorial needs some attention, and we want to provide some basic API tutorial. However, since the API sounds like it's mostly meant to be used by developers putting together support for demes in their own software, we might move that section over to the Developer page.

HDM: describe how parameters obtain their values when they are omitted

The semantics of optional parameters like start_time, end_time, initial_size, final_size must be precisely described in the documentation. It might be useful to start with an example of a "complete" demography, where nothing is omitted, and then discuss how the model remains unambiguous when removing one or more parameters.

disallow pulse migrations at time=0?

I think this defines an event where the outcome can't be observed, so we should explicitly disallow it. We already don't allow a deme to have start_time=0.

HDM: Definition of default semantics for size_function

The parser should output a fully resolved model, with non-null values for every field that affects the model. Thus, we need to specify the semantics of what the default values should be. We could do something like:

The value of an Epoch's size_function should be resolved after the epoch's start_size and end_size values have been resolved. If the start_size and end_size are equal, the size function should be resolved to "constant", or otherwise to exponential.

However, I think it's simpler if we just default to exponential, and leave it up to the implementation to realise that the growth rate is 0. They will need to do this anyway, unless we make it part of the spec that parsers should raise an error if the size function is exponential and the start and end sizes are the same. This seems like it would end up being fragile and annoying to me, though.
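Both options can be sketched briefly. The first function follows the "constant if equal, otherwise exponential" rule; the second shows that an implementation which always treats the epoch as exponential recovers a growth rate of zero when the sizes match. The function names are hypothetical, and the growth-rate formula assumes times are measured backwards from the present, so that start_time is greater than end_time and both are finite.

```python
import math


def resolve_size_function(epoch):
    """Resolve size_function after start_size and end_size are known."""
    if epoch.get("size_function") is not None:
        return epoch["size_function"]
    if epoch["start_size"] == epoch["end_size"]:
        return "constant"
    return "exponential"


def exponential_growth_rate(start_size, end_size, start_time, end_time):
    """Rate r with end_size = start_size * exp(r * (start_time - end_time))."""
    return math.log(end_size / start_size) / (start_time - end_time)


flat = resolve_size_function({"start_size": 100, "end_size": 100})
growing = resolve_size_function({"start_size": 100, "end_size": 200})
flat_rate = exponential_growth_rate(100, 100, 50, 0)
```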

reserve a top-level `version` keyword in the yaml format, for future use

It's almost certain that we'll want to change the format in the future. We should anticipate this change.

MDM: Specify error conditions on overlapping migrations

Suppose we have two migrations that involve the same demes and have overlapping time intervals. This is fairly clearly an error, which the parser should probably pick up, so we should define the error condition.

It's not so obvious how to phrase (or detect) this when we have a mixture of symmetric and asymmetric migrations, though. Any ideas?
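One possible approach, offered only as a sketch: expand every symmetric migration into its directed (source, dest) pairs and then compare time intervals per directed pair. This assumes start_time and end_time are already resolved, that times run backwards (start_time greater than end_time), and that intervals sharing only an endpoint do not overlap. The names `expand` and `find_overlaps` are hypothetical.

```python
from itertools import permutations


def expand(migration):
    """Return (source, dest, start_time, end_time) tuples for one migration."""
    if "demes" in migration:
        pairs = permutations(migration["demes"], 2)
    else:
        pairs = [(migration["source"], migration["dest"])]
    return [
        (source, dest, migration["start_time"], migration["end_time"])
        for source, dest in pairs
    ]


def find_overlaps(migrations):
    """Return the directed pairs whose migration intervals overlap."""
    seen = {}  # (source, dest) -> list of (start_time, end_time)
    conflicts = []
    for migration in migrations:
        for source, dest, start, end in expand(migration):
            for other_start, other_end in seen.get((source, dest), []):
                # Two backwards-time intervals overlap if each starts
                # before (i.e. at a larger time than) the other ends.
                if start > other_end and other_start > end:
                    conflicts.append((source, dest))
            seen.setdefault((source, dest), []).append((start, end))
    return conflicts


mixed = [
    {"demes": ["A", "B"], "start_time": 100, "end_time": 0, "rate": 1e-5},
    {"source": "A", "dest": "B", "start_time": 50, "end_time": 10, "rate": 1e-4},
]
conflicts = find_overlaps(mixed)
```

In this example the asymmetric A to B migration clashes with the A to B half of the symmetric one, while the B to A direction is unaffected.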

Rules about Epoch start/end times are unclear

The current rules around how a Deme/Epoch's start_time is obtained are fairly clear, but it's not obvious to me what we're supposed to do with end_time. Are we assuming that end_time should always be specified for an Epoch, and it's only start_time that can be omitted?

It's not obvious why we would allow for start_time values to be filled in, but not end_time values, since they're equally redundant in either direction. Suppose we have the intervals

[(a, b), (b, c), (c, d)]

Then in principle we could have stuff like the following without ambiguity (where X is "null")

[(a, X), (b, X), (c, d)]
[(a, b), (X, X), (c, d)]
[(a, X), (b, X), (c, d)]
[(a, X), (b, c), (X, d)]

Should we allow any mixture of missing values? If so, how do we spell out what the rules are? I'm generally finding it a bit icky trying to nail down what the semantics are here on how time "flows" through the epochs and graph nodes. I'm not a huge fan of the redundancy, but I'm not really seeing a better way to do it. The only thing that occurs to me is to store just the list times=[a, b, c, d] as an attribute of Deme, but that has a whole bunch of its own issues.
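To make the "any mixture of missing values" option concrete, here is a sketch of one possible rule: each epoch must start where the previous one ends, so a missing value at a boundary can be copied from its neighbour, and it is an error if both sides of a boundary are missing. The function name `fill_times` is hypothetical, and None stands in for X.

```python
def fill_times(intervals):
    """Fill None entries in a list of (start_time, end_time) pairs."""
    filled = [list(pair) for pair in intervals]
    for i in range(1, len(filled)):
        previous_end, start = filled[i - 1][1], filled[i][0]
        if previous_end is None and start is None:
            raise ValueError(
                f"boundary between epochs {i - 1} and {i} is undetermined"
            )
        if previous_end is None:
            filled[i - 1][1] = start
        elif start is None:
            filled[i][0] = previous_end
        elif previous_end != start:
            raise ValueError(f"epochs {i - 1} and {i} are not contiguous")
    if filled[0][0] is None or filled[-1][1] is None:
        raise ValueError("the outermost start and end times must be given")
    return [tuple(pair) for pair in filled]


# The patterns listed above, with a, b, c, d = 300, 200, 100, 0.
full = [(300, 200), (200, 100), (100, 0)]
case_1 = fill_times([(300, None), (200, None), (100, 0)])
case_2 = fill_times([(300, 200), (None, None), (100, 0)])
case_3 = fill_times([(300, None), (200, 100), (None, 0)])
```

Under this rule all three patterns resolve to the same fully specified list of intervals.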

vocabulary for demes

I think we need to settle on a clear and consistent vocabulary for both the spec and our documentation. Here's a start, feel free to suggest different names:

demes graph - an object that conforms to the schema of our data model. This could be in reference to a specific implementation of a demographic model. E.g. the Gutenkunst et al. OOA demes graph. This does not specify which format/encoding the graph is in. In the context of our documentation, and within the code, we can probably drop the 'demes' prefix when there's no chance of confusion.
graph encoding - the format the graph is in. This could be an interchange format like YAML/JSON, or a python demes.Graph object, or some other in memory data structure such as nested dictionaries/lists.
~~compact graph~~ simplified graph - a demes graph with redundant information excluded. Users will typically write their yaml files as a simplified demes graph. [EDIT: call this 'simplified']
~~redundant graph~~ fully-qualified graph - a demes graph with all the redundant information included. A demes.Graph object is an example of a fully-qualified demes graph, and all python API users will use the fully-qualified information in the demes.Graph object. One can also serialise a fully-qualified graph to a YAML or JSON encoded file. This might then be used by non-python applications that don't want to deal with simplified demes graphs themselves. [EDIT: call this 'fully qualified']

spec versioning

The spec will probably want a version number/string. I'd imagine we might want to start somewhere close to 1.0?

Change Deme.id to Deme.name

I think we should change the current id attribute on Deme to name. There's two main reasons for this:

Using id is intrusive for downstream implementations as they will already have some concept of a population/deme ID, and it's unlikely to match up with what we're requiring here. For example, msprime/tskit have zero-indexed integers as population IDs, and SLiM has one-indexed integers as population IDs. Neither can map the concept of "Deme id" directly into their data models, requiring icky workarounds where we add a "name" which corresponds to the Demes ID. This is both messy and will confuse users.
It's not as descriptive. "ID" has connotations of some unique, opaque identifier for a thing. We don't want people to give demes IDs, we want people to name them, in meaningful, memorable ways. Calling it a name gets this idea across much better.

Thoughts/objections?

HDM: symmetric migration semantics when start_time and/or end_time are omitted

What should this mean?

time_units: generations
defaults:
  epoch: {start_size: 1000}
demes:
- id: A
- id: B
- id: C
  start_time: 100
  ancestors: [B]
migrations:
- demes: [A, B, C]
  rate: 1e-5

The current behaviour is that there will be no migration between A<->B before time 100, because we set the migration start_time to the minimum value of all participating demes' start_times. This gives a simple implementation, so I think we should keep this behaviour. But it's not necessarily the most intuitive behaviour. One could easily interpret this as meaning that A<->C and B<->C have migrants only after time 100, but A<->B has migrants at all times.
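The two interpretations can be compared with a small sketch, assuming A and B are root demes with start_time resolved to infinity and C starts at time 100, as in the example above. The function names are hypothetical.

```python
import math
from itertools import combinations

deme_start_times = {"A": math.inf, "B": math.inf, "C": 100}


def shared_start_time(demes, start_times):
    """Current behaviour: one start_time for the whole symmetric migration."""
    return min(start_times[name] for name in demes)


def pairwise_start_times(demes, start_times):
    """Alternative reading: each pair starts once both of its demes exist."""
    return {
        (a, b): min(start_times[a], start_times[b])
        for a, b in combinations(demes, 2)
    }


shared = shared_start_time(["A", "B", "C"], deme_start_times)
pairwise = pairwise_start_times(["A", "B", "C"], deme_start_times)
```

With the shared rule every pair, including A and B, starts migrating at time 100; with the pairwise rule only the pairs involving C are delayed.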

MDM: Define selfing_rate and cloning_rate

We don't currently have much of a definition of what the selfing_rate and cloning_rate parameters mean. (Related: #33, #12 )

See also the wider discussion of how much complexity in life history processes we can/should try to capture here: https://github.com/apragsdale/demes-paper/issues/5

Move detailed descriptions out of JSON schema

It's a good idea to put the full description of the spec into the JSON schema yaml file, but I think it has some limitations. It makes it more difficult to cross reference sections, and the formatting we can do is pretty limited. Also, (and the clincher for me) it results in pretty much unreadable error messages coming out of jsonschema.validate(), which writes out the full schema text and you therefore get pages of text you need to scroll through.

So, I suggest we structure the schema description as an RST file, and put in terse descriptions with URL section links to the full documentation in the schema text. It should be easy enough to make stable URLs for these, now?

formal description of the keywords in the demes yaml format

Along with #8, we want a precise and complete description of all the keywords (description, doi, demes, epoch, etc), their scope, and semantics.

Allow extra fields in the specification?

An important question we should settle is whether we allow extra information to be included in the JSON/yaml object. That is, if there's an extra key at any level in the graph description, should parsers ignore this or should they throw an error? If we advise parsers to ignore extra data, then this gives a natural way for people to "extend" the specification by adding in more stuff to a demography model that only their simulator/inference method understands. This may be a good thing under the "be lenient in what you receive and strict on what you send" principle.

However, it may also be quite a bad thing in that it encourages fragmentation and random extensions in various directions that may or may not be appropriate. Another bad outcome is that it reduces our ability to effectively version the schema. Suppose msprime version x supports Demes version 1.0, and then Demes version 1.1 comes out which includes some extra information which msprime simulates. Then, msprime version x will happily accept the Demes version 1.1 model and silently produce a different simulation to what it would if it had had a Demes 1.1 parser. This is surely a bad thing from a reproducibility perspective.

So, on balance, I think we should require that parsers raise an error if they see any information in the JSON payload that they don't recognise.
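A strict parser along these lines could be as simple as the following sketch. The set of allowed pulse keys is illustrative only, and `check_keys` is a hypothetical helper rather than part of any existing implementation.

```python
def check_keys(obj, allowed, where):
    """Raise if obj contains any key that is not in the allowed set."""
    unknown = sorted(set(obj) - set(allowed))
    if unknown:
        raise KeyError(f"unrecognised field(s) in {where}: {', '.join(unknown)}")


# Illustrative key set for a pulse; the real list belongs in the spec.
PULSE_KEYS = {"source", "dest", "time", "proportion"}

good_pulse = {"source": "a", "dest": "b", "time": 10, "proportion": 0.1}
check_keys(good_pulse, PULSE_KEYS, "pulse")  # passes silently

extended_pulse = {**good_pulse, "mutation_rate": 1e-8}
try:
    check_keys(extended_pulse, PULSE_KEYS, "pulse")
    rejected = False
except KeyError:
    rejected = True
```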

Related to #20

Thoughts?

Model resolution

Here's an initial pass at writing down the resolution order for values in the graph, so that we go from a simplified graph to a fully-qualified one (we should consider the inverse operation separately).

Values that are specified in the defaults section of the JSON document are propagated downwards through the document tree and set in a preorder traversal.
To resolve times and population sizes within demes, we start at the roots and visit demes in preorder (so, all ancestors will be fully qualified when we visit a particular deme). Within a deme, the times of each epoch are determined uniquely by the information supplied and the time information of parent demes. Population sizes are uniquely determined by the information within a deme.
Migrations are then resolved.
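The requirement that all ancestors are fully qualified before a deme is visited amounts to an ancestors-first ordering of the demes. A sketch, assuming each deme is a dictionary with an id and an optional list of ancestors (the function name `resolution_order` is hypothetical):

```python
def resolution_order(demes):
    """Return deme ids ordered so each deme follows all of its ancestors."""
    ancestors = {deme["id"]: deme.get("ancestors", []) for deme in demes}
    order = []
    in_progress = set()

    def visit(name):
        if name in order:
            return
        if name in in_progress:
            raise ValueError(f"ancestry cycle involving {name}")
        in_progress.add(name)
        for ancestor in ancestors[name]:
            visit(ancestor)
        in_progress.discard(name)
        order.append(name)

    for name in ancestors:
        visit(name)
    return order


order = resolution_order(
    [
        {"id": "C", "ancestors": ["B"]},
        {"id": "B", "ancestors": ["A"]},
        {"id": "A"},
    ]
)
```

Epoch times and sizes would then be resolved deme by deme in this order, followed by migrations.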

How does this sound?

transfer issues from demes-python repo

Issues with the demes-spec tag can probably be transferred without thinking. Those with additional tags might need more thought.

Use RFC 2119 terminology in spec

Use terms like SHOULD, MUST etc in the RFC 2119 sense, including the recommended header.

add set of test json files

There's a question about how much CI etc we do with this. My vote would be to keep it fairly minimal.

I think it will be useful for the demes-spec repository to include a set of test yaml (or json) files. There would be two sets, one for valid input, and one invalid input. The intention is for these to be available for the development/testing of new implementations. The reference implementation could then be tested against these files in CI.

Originally posted by @grahamgower in #26 (comment)

If time_units is "generations" should we allow ``generation_time`` to be set?

The value won't be used, so maybe it should be an error to specify it? On the other hand, it could be useful information that might have been used in inference or something?

The reference parser currently throws an error.

MDM: Simultaneous pulses occur in the order specified

Suppose we have two pulses at time t with (source, dest) = (a, b) and (b, c) respectively. Should we allow this? Should we specify that pulses SHOULD be executed in the order they are listed, so that some fraction of a's population may end up in c immediately after t?
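The effect of ordering can be illustrated by tracking where each deme's individuals originate. The sketch below assumes that a pulse with proportion p replaces a fraction p of the dest deme with migrants from the source deme, and that pulses are applied one after another in list order; `apply_pulses` is a hypothetical name.

```python
def apply_pulses(ancestry, pulses):
    """Apply pulses in order; ancestry[deme][origin] is a fraction."""
    ancestry = {deme: dict(mix) for deme, mix in ancestry.items()}
    for pulse in pulses:
        p = pulse["proportion"]
        source_mix = ancestry[pulse["source"]]
        dest_mix = ancestry[pulse["dest"]]
        origins = set(source_mix) | set(dest_mix)
        # A fraction p of dest is replaced by a sample of the current source.
        ancestry[pulse["dest"]] = {
            origin: (1 - p) * dest_mix.get(origin, 0.0)
            + p * source_mix.get(origin, 0.0)
            for origin in origins
        }
    return ancestry


before = {"a": {"a": 1.0}, "b": {"b": 1.0}, "c": {"c": 1.0}}
pulses = [
    {"source": "a", "dest": "b", "proportion": 0.5},
    {"source": "b", "dest": "c", "proportion": 0.5},
]
listed_order = apply_pulses(before, pulses)
reversed_order = apply_pulses(before, pulses[::-1])
```

In the listed order, a quarter of c traces back to a immediately after t; in the reversed order none of it does, which is why the spec needs to say which ordering applies.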

MDM: Define the deme "size". Haploid, or diploid?

I fear we might have overlooked this fundamentally important decision.

HDM: Selfing and cloning rate should default to zero

Currently no definition of what the default for cloning_rate and selfing_rate is. In the interest of being explicit about what values we expect in a fully resolved model, we should give these a default of 0.

Can't see a good reason to give them a non-zero default, and leaving a "null" value in the resolved output just pushes the decision onto the simulator, leading to ambiguity.

We should also give clear language on what a simulator should do in the case of seeing a non-zero value for these parameters if it doesn't support the feature (I guess this is a more general point too - programs should explicitly check for parts of the model they don't support and raise an error).

Add top-level "metadata" field

Stdpopsim is an important downstream user of Demes, and it looks like we will often need to store the mutation rate with a given model:

popsim-consortium/stdpopsim#839

The right place to do this is in the Demes yaml file. The question is, how do we facilitate this. I see three options:

Add a mutation_rate field
Allow users to put extra stuff into the yaml, wherever they want
Add a top-level "metadata" (or "extra", or something) field where people can dump whatever they want.

I see pros and cons to all three. Thoughts?

add LICENSE file

Docs/specs need this too, unfortunately.

Rename to start_size/end_size?

It would perhaps be slightly easier to grok the connection between times and sizes if they shared the same prefix - i.e, start_size is the population size at start_time and end_size is the size at end_time. Are "initial" and "final" just synonyms for "start" and "end" here, without really adding any insight?

I'd imagine we went through this before... apologies for raking up stuff that's been settled already!

add "linear" size_function example to the tutorial

See popsim-consortium/demes-python#296 and popsim-consortium/demes-python#310

We probably want an example in the tutorial as well as a description in the spec. In the spec, we should explicitly state that when the size_function isn't specified, it is set to "constant" (if start_size == end_size) or "exponential" (if start_size != end_size).

clear description of demes yaml format

We should write a BNF/EBNF description of the declarative file format. It should be fairly short, because we don't have so much syntax or many keywords. At the very least, it will be nice to have a complete description to include in the documentation. BNF is machine readable though, and there are many tools that can do useful things with this (and the fuzzer suggested in popsim-consortium/demes-python#24 could consume this format).

Our spec defines a general data model for demographies, not a yaml format

In popsim-consortium/demes-python#105, there's some discussion relating to nested python dicts, or json format. The yaml format we've ended up with is fine and all, but could just as easily use json, or some other arbitrary format that lets us define the nested data structures and their attributes. So just how tied to yaml are we? If folks find other formats useful for interchange, presumably they'll proliferate instead (e.g. json). Should we be clearer with our framing of the spec, and just say we define a general model structure (with yaml as our intended interchange format)?

(Fun fact, json is a proper subset of the yaml spec, so valid json is valid yaml!).