
demes-spec's Introduction

A data model for describing demographic histories.

The Demes specification defines a data model for describing one or more demes (also known as populations), how they change over time, and their relationships to one another. A human-readable Demes model is written as a YAML file, which facilitates model sharing, reuse, and interoperability.

Documentation

Users interested in learning how to read and/or write a Demes YAML file are referred to the tutorial in the Demes documentation. A more precise definition of the Demes specification can also be found in the documentation, in the form of an annotated schema.

Software

Software for working with Demes YAML files is located in external repositories, in particular the demes Python package. Links to additional Demes-related software can be found in the documentation.

demes-spec's People

Contributors

apragsdale, grahamgower, jeromekelleher, molpopgen

demes-spec's Issues

JSON doesn't support Infinity

Apparently the JSON spec explicitly does not permit Infinity, which we require to represent start times. The Python json implementation can deal with the special value Infinity, and converts to/from float infinity just fine. However, it seems this isn't a widespread extension. Not sure what to do about this, because it means we can't really serialize to json (at least, we can't expect this to be more portable than yaml).

https://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript
https://stackoverflow.com/questions/61841001/handling-infinity-values-in-json
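
The behaviour described above is easy to check with Python's standard json module. This is only an illustrative sketch: the start_time key is borrowed from the discussion, and the encode_time/decode_time helpers are a hypothetical workaround (storing infinity as a string), not something the spec prescribes.

```python
import json
import math

model = {"start_time": math.inf}

# By default Python writes the non-standard token Infinity.
encoded = json.dumps(model)

# With allow_nan=False the encoder follows the JSON spec and refuses.
try:
    json.dumps(model, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# Python reads its own extension back as float infinity.
decoded = json.loads(encoded)


# Hypothetical workaround: represent infinity as the string "Infinity".
def encode_time(t):
    return "Infinity" if math.isinf(t) else t


def decode_time(t):
    return math.inf if t == "Infinity" else t


portable = json.dumps({"start_time": encode_time(math.inf)}, allow_nan=False)
```

The string form is valid JSON for any parser, at the cost of every reader having to know about the convention.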

Inherited attributes

We've talked about how to deal with inherited attributes in a few places, so I thought I'd start an issue here to bring it together.

There's a tension between our goals of (a) providing a specification for humans to read that is non-repetitive; and (b) providing a specification for interchange that's as simple to implement as possible. For (a) we would definitely like to have inherited attributes, because we don't want to write down the same parameters over and again. For (b) we would kinda prefer not to have inherited attributes, because it does make the specification a bit more complex, and we would have to be quite precise about what demes parsers are expected to do.

So, here's a proposal. Our object tree currently looks like this:

root
+ --- description
+ --- demes
      + --- id
      + --- description
      + --- epochs
           + --- initial_size
           + --- final_size
+ --- pulses
     + --- source
     + --- dest
     + --- time
+ --- migrations
    + --- source
    + --- dest
    + --- rate

I don't like the idea of having things like cloning_rate defined as root attributes of the graph, because this implies that everything inherits this as a default (even pulses and migrations, say).

What if we added an explicit defaults section to the various levels of the tree, and the defaults are then given as fully qualified references? So, something like

description: An example demography to play around with cloning attributes.
generation_time: 1
time_units: generations
defaults:
  Epoch.cloning_rate: 0.05
  Epoch.initial_size: 1000

demes:
  - id: pop1
    description: Population with epochs and changing cloning rates
    ancestors: root
    defaults:
      Epoch.cloning_rate: 0.1 # Overrides the default from higher in the tree
      Epoch.initial_size: 1e4
    epochs:
      - end_time: 500
      - initial_size: 1e2
        end_time: 100
      - end_time: 0
        cloning_rate: 0.5
  - id: pop2
    description: Population with epochs and changing cloning rates
    ancestors: root
    epochs:
      - end_time: 500
      - initial_size: 1e2
        end_time: 0
        cloning_rate: 1.0

So, the semantics can be defined precisely (if a value isn't given, go up the tree looking at the defaults, and if you don't find any, error out), and we're not cluttering up the namespace with things that aren't actually related to the object in question. The problem of outputting a compact/simplified graph (I'm on the fence about what the right terminology is here) is one of figuring out what can be put into the defaults to minimise repetition. This is a parsimony problem, which I'm sure we can solve.

The downside is that parsers have to be a bit more complicated, but I think this is something we can live with. So, to be clear, we get rid of the distinction between the low-level fully qualified graph that's used for interchange and settle on this single form which requires that implementations understand how default values work.
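
To make the lookup rule concrete, here is a minimal sketch of "go up the tree looking at the defaults, and error out if nothing is found". It assumes the model has already been loaded into plain nested dicts; the function name resolve_epoch_value and the dict layout are hypothetical, and only the Epoch.<name> key style comes from the proposal above.

```python
def resolve_epoch_value(name, epoch, deme, graph):
    """Return the epoch's own value, else the nearest enclosing default."""
    if name in epoch:
        return epoch[name]
    key = f"Epoch.{name}"
    # Walk up the tree: deme-level defaults first, then graph-level defaults.
    for scope in (deme, graph):
        defaults = scope.get("defaults", {})
        if key in defaults:
            return defaults[key]
    raise ValueError(f"no value or default found for {key}")


# Data mirroring the example document above.
graph = {"defaults": {"Epoch.cloning_rate": 0.05, "Epoch.initial_size": 1000}}
pop1 = {"defaults": {"Epoch.cloning_rate": 0.1, "Epoch.initial_size": 1e4}}
pop2 = {}

pop1_rate = resolve_epoch_value("cloning_rate", {"end_time": 500}, pop1, graph)
pop2_rate = resolve_epoch_value("cloning_rate", {"end_time": 500}, pop2, graph)
```

With this rule pop1's first epoch picks up the deme-level override and pop2's first epoch falls back to the graph-level default.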

What do we think? I know I'm vacillating on what should and shouldn't go in the JSON schema definition (apologies!), but I feel like we're very close to something we can define precisely and start building on!

Define the scope

We need to define the scope of the spec somehow. We know that demes is not a "universal specification format" for population genetic simulations and inference methods. It is for describing demographic models that people estimate and simulate. But, how do we concisely say what that actually means, and avoid feature creep in the future?

One way we might do this is to say that Demes is concerned with population-level processes only, and any parameters that are about the individual genomes are out of scope. So, this means that things like mutation rates, recombination rates, and selection coefficients are out of scope, but things like migration and dispersal rates are in scope. There's probably a grey area somewhere, but this is biology, so there's never going to be a perfectly crisp definition of anything.

This is something we'd need to deal with in the paper too, but the specification needs to prominently define what it's about. See also https://github.com/apragsdale/demes-paper/issues/4 and https://github.com/apragsdale/demes-paper/issues/5

Require at least one Epoch in a Deme

Currently we're implicitly defining a single epoch via some attributes of the Deme class. From a specification/parser writing perspective, it would be considerably simpler to just require the existence of at least one epoch. So, from the examples:

demes:
  constant_size_deme:
    initial_size: 1000

demes:
  constant_size_deme:
    epochs:
    - initial_size: 1000
      end_time: 0

Is the first that much more readable than the second? Are we expecting the models that people will be writing down to contain many demes with just one epoch?

clarify migration rate units/meaning

We currently say "the rate of migration per generation" (which was copied from python docstrings). Is this supposed to be a proportion of the destination population? The probability that any given individual in the source population migrates in a given generation? Something else? Presumably this should be bounded above by 1.0?

build/render docs and find them a home

In #1, I created sphinx docs. We aren't building these using CI yet. We should. But this opens the question of where the docs should live. demes-python is currently building sphinx docs in a CI action and pushing the built docs to the https://github.com/popsim-consortium/demes-docs repository, which is then hosted via github pages at https://popsim-consortium.github.io/demes-docs/main/index.html. So we could make a new demes-spec-docs repository, and do the same thing. Or we could push the demes-spec docs to the same demes-docs repository, e.g. in its own subdirectory.

Or we could choose to avoid sphinx for the spec docs, and do something else? Github pages supports rendering markdown natively, which is probably the simplest thing we could do. But automatically building docs from the spec.yaml file wouldn't work (or would be a custom job).

HDM: Provide definition of population size resolution

We don't currently define how population size resolution should work. I suggest something like

In the most ancient epoch for a given deme, at least one of start_size and end_size must be specified. If either value is not specified, it is assigned the value of the other size. For example, if start_size has been assigned the value 100 and end_size is null, end_size is resolved to 100.

For each subsequent epoch after the most ancient epoch, if its start_size is not specified it is assigned the value of the previous epoch's end_size. If its end_size is not specified it is assigned the value of its start_size.
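
A minimal sketch of the two rules above, assuming each epoch is a plain dict and the list is ordered from the most ancient epoch to the most recent. The function name resolve_sizes is hypothetical.

```python
def resolve_sizes(epochs):
    """Fill in missing start_size/end_size values, oldest epoch first."""
    resolved = []
    for index, epoch in enumerate(epochs):
        start = epoch.get("start_size")
        end = epoch.get("end_size")
        if index == 0:
            # Most ancient epoch: at least one size must be given.
            if start is None and end is None:
                raise ValueError("first epoch needs start_size or end_size")
            if start is None:
                start = end
        elif start is None:
            # Later epochs inherit the previous epoch's end_size.
            start = resolved[-1]["end_size"]
        if end is None:
            end = start
        resolved.append({**epoch, "start_size": start, "end_size": end})
    return resolved


sizes = resolve_sizes([{"start_size": 100}, {"end_size": 200}, {}])
```

Here the three epochs resolve to (100, 100), (100, 200) and (200, 200).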

Make single Migration class?

I know we've been over this a few times, but I wanted to get a final discussion here for the record. From a parser implementation and specification perspective it would be a lot simpler if we just have a single class of migration. There are very good reasons for wanting both types, so I wonder if we could work around the issue as follows:

Migrations all have a start_time, end_time and a rate (blah blah blah). Migrations are either symmetric or asymmetric. In the asymmetric case, the source and dest demes must be specified. In the symmetric case the list of demes is specified. Either one of source and dest OR demes must be specified, but not both.

This seems to give us the behaviour we want, while simplifying the spec a bit, doesn't it?
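
As a rough illustration of the "either source and dest OR demes, but not both" rule, a parser check might look like the sketch below. The function name migration_kind and the use of plain dicts are assumptions for the example.

```python
def migration_kind(migration):
    """Classify a migration dict as symmetric or asymmetric, or raise."""
    has_demes = "demes" in migration
    has_source = "source" in migration
    has_dest = "dest" in migration
    if has_demes and (has_source or has_dest):
        raise ValueError("specify either demes, or source and dest, not both")
    if has_demes:
        return "symmetric"
    if has_source and has_dest:
        return "asymmetric"
    raise ValueError("migration needs demes, or both source and dest")


kind_sym = migration_kind({"demes": ["A", "B"], "rate": 1e-5})
kind_asym = migration_kind({"source": "A", "dest": "B", "rate": 1e-5})
```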

link to software using/implementing Demes in the docs

This would be as an advertisement that if you have, or write, a Demes file, that you can indeed use it for something. Probably as a third section on the introduction page, in alphabetical order. If it gets too long we can move it to its own page in the future.

  • demes-python
  • fwdpy11
  • moments
  • msprime (soon)
  • stdpopsim (soon)

A one-line description of each might be nice too.

Should ``Graph.description`` really be mandatory?

Currently the spec says that the top-level description field in the graph is mandatory. I see the logic for this: we want to try to encourage people to describe their graphs. I wonder if it's a bit heavy-handed though.

Thoughts @apragsdale @molpopgen?

If we keep it as mandatory, is the empty string valid input?

Error conditions for pulses?

It seems like it should be an error if we specify the same pulse twice, or if we have contradictory proportion values for the same (source, dest, time) combinations. The parser SHOULD raise an error in this case.

Define these conditions in the spec and implement in the reference implementation.
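
One way the check could be phrased, sketched under the assumption that pulses are plain dicts with source, dest, time and proportion keys (the function name check_pulses is hypothetical):

```python
def check_pulses(pulses):
    """Raise ValueError on repeated or contradictory pulses."""
    seen = {}
    for pulse in pulses:
        key = (pulse["source"], pulse["dest"], pulse["time"])
        if key in seen:
            if seen[key] == pulse["proportion"]:
                raise ValueError(f"duplicate pulse {key}")
            raise ValueError(f"contradictory proportions for pulse {key}")
        seen[key] = pulse["proportion"]


# Passes silently: the two pulses differ in time.
check_pulses([
    {"source": "A", "dest": "B", "time": 10, "proportion": 0.1},
    {"source": "A", "dest": "B", "time": 20, "proportion": 0.1},
])
```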

Clean up tutorial documentation

I've noticed some typos and errors in the Tutorial from my previous PR filling in docs (#47). This is just a note-to-self that the high-level YAML tutorial needs some attention, and we want to provide some basic API tutorial. However, since the API sounds like it's mostly meant to be used by developers putting together support for demes in their own software, we might move that section over to the Developer page.

HDM: describe how parameters obtain their values when they are omitted

The semantics of optional parameters like start_time, end_time, initial_size, final_size must be precisely described in the documentation. It might be useful to start with an example of a "complete" demography, where nothing is omitted, and then discuss how the model remains unambiguous when removing one or more parameters.

disallow pulse migrations at time=0?

I think this defines an event where the outcome can't be observed, so we should explicitly disallow it. We already don't allow a deme to have start_time=0.

HDM: Definition of default semantics for size_function

The parser should output a fully resolved model, with non-null values for every field that affects the model. Thus, we need to specify the semantics of what the default values should be. We could do something like:

The value of an Epoch's size_function should be resolved after the epoch's start_size and end_size values have been resolved. If the start_size and end_size are equal, the size function should be resolved to "constant", or otherwise to "exponential".

However, I think it's simpler if we just default to exponential, and leave it up to the implementation to realise that the growth rate is 0. They will need to do this anyway, unless we make it part of the spec that parsers should raise an error if the size function is exponential and the start and end sizes are the same. This seems like it would end up being fragile and annoying to me, though.
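
The two options can be compared side by side. In this sketch the function names and the always_exponential flag are hypothetical, and growth_rate assumes times are measured backwards from the present so that start_time > end_time.

```python
import math


def resolve_size_function(epoch, always_exponential=False):
    """Pick a size_function for an epoch whose sizes are already resolved."""
    if epoch.get("size_function") is not None:
        return epoch["size_function"]
    if always_exponential:
        return "exponential"
    if epoch["start_size"] == epoch["end_size"]:
        return "constant"
    return "exponential"


def growth_rate(start_size, end_size, start_time, end_time):
    """Exponential growth rate per time unit; zero when the sizes match."""
    return math.log(end_size / start_size) / (start_time - end_time)


flat = {"start_size": 100, "end_size": 100}
rule_one = resolve_size_function(flat)
rule_two = resolve_size_function(flat, always_exponential=True)
flat_rate = growth_rate(100, 100, 50, 0)
```

Under the second option the implementation sees "exponential" with a growth rate of exactly zero, which it has to handle as a constant-size epoch.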

MDM: Specify error conditions on overlapping migrations

Suppose we have two migrations that involve the same demes and have overlapping time intervals. This is fairly clearly an error, which the parser should probably pick up, so we should define the error condition.

It's not so obvious how to phrase (or detect) this when we have a mixture of symmetric and asymmetric migrations, though. Any ideas?

Rules about Epoch start/end times are unclear

The current rules around how a Deme/Epoch's start_time is obtained are fairly clear, but it's not obvious to me what we're supposed to do with end_time. Are we assuming that end_time should always be specified for an Epoch, and it's only start_time that can be omitted?

It's not obvious why we would allow for start_time values to be filled in, but not end_time values, since they're equally redundant in either direction. Suppose we have the intervals

[(a, b), (b, c), (c, d)]

Then in principle we can have stuff like the following without ambiguity (where X is "null")

[(a, X), (b, X), (c, d)]
[(a, b), (X, X), (c, d)]
[(a, X), (b, X), (c, d)]
[(a, X), (b, c), (X, d)]

Should we allow any mixture of missing values? If so, how do we spell out what the rules are? I'm generally finding it a bit icky trying to nail down what the semantics are here on how time "flows" through the epochs and graph nodes. I'm not a huge fan of the redundancy, but I'm not really seeing a better way to do it. The only thing that occurs to me is to store just the list times=[a, b, c, d] as an attribute of Deme, but that has a whole bunch of its own issues.
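
If any mixture were allowed, one possible rule is that each boundary between adjacent epochs must be given on at least one side. The sketch below is an assumption about how that could work, not the current behaviour; intervals are (start_time, end_time) tuples with None standing in for X, and resolve_intervals is a hypothetical name.

```python
def resolve_intervals(intervals):
    """Fill missing epoch times using the shared boundary of neighbours."""
    times = [list(pair) for pair in intervals]
    if not times:
        return []
    for i in range(len(times) - 1):
        end, next_start = times[i][1], times[i + 1][0]
        if end is None and next_start is None:
            raise ValueError(f"boundary after epoch {i} is unspecified")
        if end is None:
            times[i][1] = next_start
        elif next_start is None:
            times[i + 1][0] = end
        elif end != next_start:
            raise ValueError(f"epochs {i} and {i + 1} do not meet")
    if times[0][0] is None or times[-1][1] is None:
        raise ValueError("first start_time and last end_time are required")
    return [tuple(pair) for pair in times]


a, b, c, d = 300, 200, 100, 0
filled = resolve_intervals([(a, None), (b, c), (None, d)])
```

Each boundary is handled independently, so a single pass is enough, and contradictory values on the two sides of a boundary are reported rather than silently overwritten.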

vocabulary for demes

I think we need to settle on a clear and consistent vocabulary for both the spec and our documentation. Here's a start, feel free to suggest different names:

  • demes graph - an object that conforms to the schema of our data model. This could be in reference to a specific implementation of a demographic model. E.g. the Gutenkunst et al. OOA demes graph. This does not specify which format/encoding the graph is in. In the context of our documentation, and within the code, we can probably drop the 'demes' prefix when there's no chance of confusion.
  • graph encoding - the format the graph is in. This could be an interchange format like YAML/JSON, or a python demes.Graph object, or some other in memory data structure such as nested dictionaries/lists.
  • simplified graph (previously 'compact graph') - a demes graph with redundant information excluded. Users will typically write their yaml files as a simplified demes graph. [EDIT: call this 'simplified']
  • fully-qualified graph (previously 'redundant graph') - a demes graph with all the redundant information included. A demes.Graph object is an example of a fully-qualified demes graph, and all python API users will use the fully-qualified information in the demes.Graph object. One can also serialise a fully-qualified graph to a YAML or JSON encoded file. This might then be used by non-python applications that don't want to deal with simplified demes graphs themselves. [EDIT: call this 'fully qualified']

spec versioning

The spec will probably want a version number/string. I'd imagine we might want to start somewhere close to 1.0?

Change Deme.id to Deme.name

I think we should change the current id attribute on Deme to name. There's two main reasons for this:

  1. Using id is intrusive for downstream implementations as they will already have some concept of a population/deme ID, and it's unlikely to match up with what we're requiring here. For example, msprime/tskit have zero-indexed integers as population IDs, and SLiM has one-indexed integers as population IDs. Neither can map the concept of "Deme id" directly into their data models, requiring icky workarounds where we add a "name" which corresponds to the Demes ID. This is both messy and will confuse users.
  2. It's not as descriptive. "ID" has connotations of some unique, opaque identifier for a thing. We don't want people to give demes IDs, we want people to name them, in meaningful, memorable ways. Calling it a name gets this idea across much better.

Thoughts/objections?

HDM: symmetric migration semantics when start_time and/or end_time are omitted

What should this mean?

time_units: generations
defaults:
  epoch: {start_size: 1000}
demes:
- id: A
- id: B
- id: C
  start_time: 100
  ancestors: [B]
migrations:
- demes: [A, B, C]
  rate: 1e-5

The current behaviour is that there will be no migration between A<->B before time 100, because we set the migration start_time to the minimum value of all participating demes' start_times. This gives a simple implementation, so I think we should keep this behaviour. But it's not necessarily the most intuitive behaviour. One could easily interpret this as meaning that A<->C and B<->C have migrants only after time 100, but A<->B has migrants at all times.
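
A small sketch of the current behaviour, expanding a symmetric migration into pairwise asymmetric ones. The function name expand_symmetric and the dict layout are hypothetical, and the end_time rule (maximum of the demes' end_times) is an assumption made by analogy with the start_time rule described above.

```python
import math
from itertools import permutations


def expand_symmetric(migration, demes):
    """Turn one symmetric migration into asymmetric (source, dest) pairs."""
    names = migration["demes"]
    start = migration.get("start_time")
    if start is None:
        # Current behaviour: the most recent start_time of any participant.
        start = min(demes[name]["start_time"] for name in names)
    end = migration.get("end_time")
    if end is None:
        # Assumed by analogy: the oldest end_time of any participant.
        end = max(demes[name]["end_time"] for name in names)
    return [
        {"source": s, "dest": d, "start_time": start,
         "end_time": end, "rate": migration["rate"]}
        for s, d in permutations(names, 2)
    ]


demes = {
    "A": {"start_time": math.inf, "end_time": 0},
    "B": {"start_time": math.inf, "end_time": 0},
    "C": {"start_time": 100, "end_time": 0},
}
pairs = expand_symmetric({"demes": ["A", "B", "C"], "rate": 1e-5}, demes)
```

All six pairwise migrations, including A to B, start at time 100. The alternative interpretation would instead compute start and end per pair.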

Move detailed descriptions out of JSON schema

It's a good idea to put the full description of the spec into the JSON schema yaml file, but I think it has some limitations. It makes it more difficult to cross reference sections, and the formatting we can do is pretty limited. Also, (and the clincher for me) it results in pretty much unreadable error messages coming out of jsonschema.validate(), which writes out the full schema text and you therefore get pages of text you need to scroll through.

So, I suggest we structure the schema description as an RST file, and put in terse descriptions with URL section links to the full documentation in the schema text. It should be easy enough to make stable URLs for these, now?

Allow extra fields in the specification?

An important question we should settle is whether we allow extra information to be included in the JSON/yaml object. That is, if there's an extra key at any level in the graph description, should parsers ignore this or should they throw an error? If we advise parsers to ignore extra data, then this gives a natural way for people to "extend" the specification by adding in more stuff to a demography model that only their simulator/inference method understands. This may be a good thing under the "be lenient in what you receive and strict on what you send" principle.

However, it may also be quite a bad thing in that it encourages fragmentation and random extensions in various directions that may or may not be appropriate. Another bad outcome is that it reduces our ability to effectively version the schema. Suppose msprime version x supports Demes version 1.0, and then Demes version 1.1 comes out which includes some extra information which msprime simulates. Then, msprime version x will happily accept the Demes version 1.1 model and silently produce a different simulation to what it would if it had had a Demes 1.1 parser. This is surely a bad thing from a reproducibility perspective.

So, on balance, I think we should require that parsers raise an error if they see any information in the JSON payload that they don't recognise.
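
A strict parser along these lines could be as simple as the following sketch. The helper name reject_unknown_keys and the ALLOWED_PULSE_KEYS set are illustrative only; the real list of permitted fields would come from the schema.

```python
# Illustrative only: the authoritative field list lives in the schema.
ALLOWED_PULSE_KEYS = {"source", "dest", "time", "proportion"}


def reject_unknown_keys(obj, allowed, where):
    """Raise ValueError if obj has any key outside the allowed set."""
    unknown = sorted(set(obj) - set(allowed))
    if unknown:
        raise ValueError(f"unknown field(s) in {where}: {', '.join(unknown)}")


pulse = {"source": "A", "dest": "B", "time": 10, "proportion": 0.1}
reject_unknown_keys(pulse, ALLOWED_PULSE_KEYS, "pulse")  # passes silently
```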

Related to #20

Thoughts?

Model resolution

Here's an initial pass at writing down the resolution order for values in the graph, so that we go from a simplified graph to a fully-qualified one (we should consider the inverse operation separately).

  • Values that are specified in the defaults section of the JSON document are propagated downwards through the document tree and set in a preorder traversal.
  • To resolve times and population sizes within demes, we start at the roots and visit demes in preorder (so, all ancestors will be fully qualified when we visit a particular deme). Within a deme, the times of each epoch are determined uniquely by the information supplied and the time information of parent demes. Population sizes are uniquely determined by the information within a deme.
  • Migrations are then resolved.

How does this sound?
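
The property the second bullet relies on is that every ancestor is visited before its descendants. A minimal sketch of one such ordering, assuming demes are dicts with an id and an optional list of ancestors (visit_order is a hypothetical name, and this depth-first walk is just one way to obtain the order):

```python
def visit_order(demes):
    """Return deme ids so that ancestors always precede descendants."""
    by_id = {deme["id"]: deme for deme in demes}
    order = []
    state = {}

    def visit(deme_id):
        if deme_id not in by_id:
            raise ValueError(f"unknown ancestor {deme_id!r}")
        if state.get(deme_id) == "done":
            return
        if state.get(deme_id) == "visiting":
            raise ValueError(f"cycle in ancestry involving {deme_id!r}")
        state[deme_id] = "visiting"
        for ancestor in by_id[deme_id].get("ancestors", []):
            visit(ancestor)
        state[deme_id] = "done"
        order.append(deme_id)

    for deme in demes:
        visit(deme["id"])
    return order


order = visit_order([
    {"id": "C", "ancestors": ["B"]},
    {"id": "A"},
    {"id": "B"},
])
```

Times and sizes can then be resolved deme by deme in this order, since each deme's ancestors are already fully qualified when it is reached.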

add set of test json files

There's a question about how much CI etc we do with this. My vote would be to keep it fairly minimal.

I think it will be useful for the demes-spec repository to include a set of test yaml (or json) files. There would be two sets, one for valid input, and one invalid input. The intention is for these to be available for the development/testing of new implementations. The reference implementation could then be tested against these files in CI.

Originally posted by @grahamgower in #26 (comment)

MDM: Simultaneous pulses occur in the order specified

Suppose we have two pulses at time t with (source, dest) = (a, b) and (b, c) respectively. Should we allow this? Should we specify that pulses SHOULD be executed in the order they are listed, so that some fraction of a's population may end up in c immediately after t?
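
A worked example shows why the order matters. The sketch below tracks ancestry fractions forwards in time, treating a pulse as replacing a given proportion of dest with migrants from source; the proportion of 0.5 and the function name apply_pulses are illustrative assumptions.

```python
def apply_pulses(ancestry, pulses):
    """Apply (source, dest, proportion) pulses one after another."""
    result = {deme: dict(parts) for deme, parts in ancestry.items()}
    for source, dest, proportion in pulses:
        src, dst = result[source], result[dest]
        result[dest] = {
            origin: (1 - proportion) * dst.get(origin, 0.0)
            + proportion * src.get(origin, 0.0)
            for origin in set(src) | set(dst)
        }
    return result


start = {"a": {"a": 1.0}, "b": {"b": 1.0}, "c": {"c": 1.0}}
in_order = apply_pulses(start, [("a", "b", 0.5), ("b", "c", 0.5)])
swapped = apply_pulses(start, [("b", "c", 0.5), ("a", "b", 0.5)])
```

With the listed order, a quarter of c's ancestry traces back to a immediately after t; with the order swapped, none of it does.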

HDM: Selfing and cloning rate should default to zero

Currently no definition of what the default for cloning_rate and selfing_rate is. In the interest of being explicit about what values we expect in a fully resolved model, we should give these a default of 0.

Can't see a good reason to give them a non-zero default, and leaving a "null" value in the resolved output just pushes the decision onto the simulator, leading to ambiguity.

We should also give clear language on what a simulator should do in the case of seeing a non-zero value for these parameters if it doesn't support the feature (I guess this is a more general point too - programs should explicitly check for parts of the model they don't support and raise an error).
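
Both halves of this could look roughly like the sketch below: the parser fills in zeros, and a simulator that lacks the feature refuses non-zero values. The function names and the supported argument are hypothetical.

```python
RATE_FIELDS = ("cloning_rate", "selfing_rate")


def resolve_rates(epoch):
    """Parser side: missing or null rates become an explicit 0."""
    resolved = dict(epoch)
    for name in RATE_FIELDS:
        if resolved.get(name) is None:
            resolved[name] = 0
    return resolved


def check_supported(epoch, supported=()):
    """Simulator side: refuse non-zero rates for unsupported features."""
    for name in RATE_FIELDS:
        if epoch[name] != 0 and name not in supported:
            raise NotImplementedError(f"{name}={epoch[name]} is not supported")


epoch = resolve_rates({"end_time": 0, "selfing_rate": 0.2})
check_supported(epoch, supported=("selfing_rate",))  # passes silently
```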

Add top-level "metadata" field

Stdpopsim is an important downstream user of Demes, and it looks like we will often need to store the mutation rate with a given model:

popsim-consortium/stdpopsim#839

The right place to do this is in the Demes yaml file. The question is, how do we facilitate this. I see three options:

  1. Add a mutation_rate field
  2. Allow users to put extra stuff into the yaml, wherever they want
  3. Add a top-level "metadata" (or "extra", or something) field where people can dump whatever they want.

I see pros and cons to all three. Thoughts?

Rename to start_size/end_size?

It would perhaps be slightly easier to grok the connection between times and sizes if they shared the same prefix - i.e, start_size is the population size at start_time and end_size is the size at end_time. Are "initial" and "final" just synonyms for "start" and "end" here, without really adding any insight?

I'd imagine we went through this before... apologies for raking up stuff that's been settled already!

clear description of demes yaml format

We should write a BNF/EBNF description of the declarative file format. It should be fairly short, because we don't have so much syntax or many keywords. At the very least, it will be nice to have a complete description to include in the documentation. BNF is machine readable though, and there are many tools that can do useful things with this (and the fuzzer suggested in popsim-consortium/demes-python#24 could consume this format).

Our spec defines a general data model for demographies, not a yaml format

In popsim-consortium/demes-python#105, there's some discussion relating to nested python dicts, or json format. The yaml format we've ended up with is fine and all, but we could just as easily use json, or some other arbitrary format that lets us define the nested data structures and their attributes. So just how tied to yaml are we? If folks find other formats useful for interchange, presumably they'll proliferate instead (e.g. json). Should we be clearer with our framing of the spec, and just say we define a general model structure (with yaml as our intended interchange format)?

(Fun fact, json is a proper subset of the yaml spec, so valid json is valid yaml!).
