
nextclade_data_workflows's Introduction

Checking new tree

  1. Download the generated files into the nextclade_data_workflows repo:

    scp -rC [email protected]:~/nextclade_data_workflows/sars-cov-2/output output
  2. Plug them into nextclade.org advanced view.

  3. Filter to new nodes and check that:

    • clades are clean
    • no big outliers
  4. Check tag.json is up to date (ideally update in profiles/tag.json for posterity)

  5. Check qc.json does not regress (ideally update in profiles/qc.json for posterity). Beware: codons are 0-indexed.

  6. Optionally run scripts/common_stops.py and scripts/common_frameshifts.py to add stops/frameshifts that have become more common to qc.json
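
One way to check step 5 is to diff the previous and the new qc.json and look at every path whose value changed. The sketch below is a generic JSON diff, not part of the workflow; the function names and the keys in the usage comment are illustrative assumptions, not taken from the repo.

```python
_MISSING = object()  # sentinel for paths present in only one file


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {path: value} pairs."""
    flat = {}
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(obj, list) and obj:
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}/{index}"))
    else:
        # Scalars and empty containers are stored as leaves.
        flat[prefix] = obj
    return flat


def changed_paths(old, new):
    """Return the sorted paths that were added, removed, or changed."""
    old_flat, new_flat = flatten(old), flatten(new)
    return sorted(
        path
        for path in old_flat.keys() | new_flat.keys()
        if old_flat.get(path, _MISSING) != new_flat.get(path, _MISSING)
    )


# Usage (file locations are assumptions):
#   import json
#   old = json.load(open("profiles/qc.json"))
#   new = json.load(open("output/.../qc.json"))
#   print("\n".join(changed_paths(old, new)))
```

List entries are compared by index, so inserting an entry at the front of a list reports every later entry as changed.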

Identifying the most common frameshifts and stop codons

  1. Download metadata to data/metadata_raw.tsv

  2. Run the snakemake workflow with the following commands/targets:

    snakemake --profile=profiles/clades pre-processed/frameshifts.tsv -R select_frameshifts
    snakemake --profile=profiles/clades pre-processed/stops.tsv -R select_stops
  3. Format the most common stops/frameshifts into the qc.json JSON format using:

    python3 scripts/common_stops.py
    python3 scripts/common_frameshifts.py
  4. Manually check the results for plausibility and add them to qc.json
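
To illustrate the kind of tallying step 3 performs, here is a minimal sketch that counts observed stop codons and emits entries with 0-indexed codons. It is not the code in scripts/common_stops.py; the input shape (1-based `(gene, codon)` pairs), the `min_count` threshold, and the `geneName`/`codon` output keys are assumptions made for this example.

```python
from collections import Counter


def common_stops(observed, min_count=2):
    """Tally (gene, 1-based codon) pairs and keep the frequent ones.

    Returns a list of dicts, most frequent first, with the codon
    converted to 0-indexed numbering.
    """
    counts = Counter(observed)
    return [
        {"geneName": gene, "codon": codon - 1}  # convert to 0-indexed
        for (gene, codon), n in counts.most_common()
        if n >= min_count
    ]
```

The output can be serialised with `json.dumps(..., indent=2)` and reviewed by hand before anything is pasted into qc.json.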

Committing to data repo

  1. Go to the nextclade_data_workflows repo

  2. Checkout branch, open PR to master

  3. Copy output from workflow repo to data repo

    cp -r output/sars-cov-2/references/MN908947/versions/ ../../nextclade_data/data/datasets/sars-cov-2/references/MN908947/versions
  4. Update changelog.md

  5. Get Ivan to review

  6. Merge into master

Release process

Follow release guidelines as outlined here: https://github.com/nextstrain/nextclade_data#dataset-release-process

nextclade_data_workflows's People

Contributors

corneliusroemer, huddlej, rneher

Forkers

huddlej

nextclade_data_workflows's Issues

ENH: Put all recombinants on single branch from root

To prevent ancestral reconstruction from inferring S:614G at the root, I can add all recombinants on a single subtree from the root, as opposed to adding each one individually at the root.

Right now, giving recombinants a long branch length works, but in the future, with more recombinants, this may not be enough.
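
The regrouping proposed in this issue can be sketched on a nested-dict tree. This is an illustration only: the `name`/`children` node layout, the `is_recombinant` predicate, and the `recombinants` container name are assumptions, and branch lengths are not modelled.

```python
def group_recombinants(root, is_recombinant, container_name="recombinants"):
    """Return a copy of root whose recombinant children hang off one node.

    Only the direct children of root are inspected. The input dict is
    not modified; child subtrees are reused, not deep-copied.
    """
    keep, moved = [], []
    for child in root.get("children", []):
        (moved if is_recombinant(child) else keep).append(child)
    if moved:
        # A single internal node now carries every recombinant.
        keep.append({"name": container_name, "children": moved})
    return {**root, "children": keep}


# Example with made-up node names:
tree = {
    "name": "root",
    "children": [{"name": "B.1"}, {"name": "XA"}, {"name": "XB"}],
}
grouped = group_recombinants(tree, lambda node: node["name"].startswith("X"))
```

After the call, the root has two children, `B.1` and `recombinants`, so the root sees one recombinant branch instead of one per lineage.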

Don't count nucleotides 0-100 towards divergence

Position 44 flip-flops frequently because it is often not properly sequenced.

We should exclude it from the divergence calculation so that flip-flops do not lengthen the tree.

Nextclade's distance calculation should exempt 0-100 as well, but that is a separate feature: it requires an addition to virus_properties.json (masked sites for placement).
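
As a rough illustration of the proposal, divergence could be counted over mutations outside a masked range. The sketch assumes mutations are strings such as `C44T` (reference base, position, new base) and treats the range 0-100 as inclusive at both ends; both points are assumptions, since the issue does not specify them.

```python
def divergence(mutations, masked_start=0, masked_end=100):
    """Count nucleotide mutations whose position lies outside the mask.

    Positions masked_start..masked_end (inclusive) are ignored.
    """
    total = 0
    for mutation in mutations:
        position = int(mutation[1:-1])  # strip the two flanking bases
        if not (masked_start <= position <= masked_end):
            total += 1
    return total
```

With this definition, `divergence(["C44T", "C241T", "A23403G"])` counts only the two mutations outside the masked region.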

ENH: Make BA.2 build side-product of main build

Right now we have the builds on separate branches. It would be nice to produce them all in one workflow to make maintenance easier.

The BA.2 build forks after tree building: it's a clean subtree from the common ancestor of BA.2 and BA.3, rooted on BA.2

ENH: Add recombinants back in, as subtree connected to root

For each recombinant, make a separate subtree (including the root), and connect it at the root.

This way we don't generate non-biological internal nodes that could attract attachments that we don't want to have (e.g. somewhat reverted B.1.1.7 attaching at internal parent node of B.1.1.7 and XB).

ENH: Build proper tree for XBB.1.5 rather than the simplistic current one

Right now, all recombinant trees are built very simplistically, based on Pango hierarchy. This presents issues now that XBB keeps growing with more lineages.

The way forward is to build the XBB tree separately, rooted on the reconstructed founder, then do ancestral reconstruction with the founder attached to just Wuhan/BA.2 (depending on tree), then graft the result over to the other tree after reconstruction (to prevent recombinants from messing up the reconstruction).

Request: Add vaccine strains and clade defining viruses to NextClade flu dataset

Hi there,

On NextClade tree view, it would be very useful to include and mark (if possible) vaccine strains and clade defining viruses.

For vaccine strains, similar to Nextstrain portal, it would be great to include an X in the tree view of NextClade. At the minimum including the virus strains would be very useful.

Clade defining viruses are a little bit more tricky. Not only do clades continually get added, but potentially viruses which define these clades could change. This is related to nextstrain/seasonal-flu#66. I can provide a list of clade defining viruses to add to NextClade, at minimum this would be super useful as the user can manually search for the clade defining viruses.

Thanks,

Ammar

ENH: Add escape score etc as colorings

Add colorings:

  • Escape score
  • ACE2 score
  • Composite score
  • Growth advantage as modelled by the Wenseleers, Gerstung, and Murrell models
  • Date of 5th (non misdated) sequence or something like this
  • Total reversions
  • Diff of reversions
  • Number of S mutations
  • Diff of S mutations
  • Key RBD mutations
  • Diff of key RBD mutations
    ...

If you have further ideas, feel free to comment
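
As one possible reading of "Number of S mutations" and "Diff of S mutations", the sketch below counts mutations in a gene and subtracts the parent's count from the child's. The `GENE:mutation` string format follows the `S:614G` notation used elsewhere in these issues; interpreting "diff" as child minus parent is an assumption.

```python
def count_gene_mutations(mutations, gene="S"):
    """Count mutations written as 'GENE:change' that fall in gene."""
    return sum(1 for m in mutations if m.split(":", 1)[0] == gene)


def diff_vs_parent(child_mutations, parent_mutations, gene="S"):
    """Difference in per-gene mutation count between child and parent."""
    return count_gene_mutations(child_mutations, gene) - count_gene_mutations(
        parent_mutations, gene
    )
```

The same two functions would cover the RBD colorings if the input were first restricted to a list of key RBD mutations.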

ENH: Attach recombinants at most recent common ancestor

This way we don't have the artificial non-specific clade "recombinant"

For example, XBB is derived from 21L and 22D - 22D derives from 21L, so XBB should be 21L, and attach at 21L, or a more recent common ancestor.

Importantly, the branch length to recombinants needs to be long, so that the ancestral reconstruction is not misled.

It would be nice if recombinants could be visually indicated in Auspice, e.g. with a dashed line instead of a solid line.
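
Finding the attachment point amounts to a most-recent-common-ancestor lookup over the clade hierarchy. A minimal sketch, assuming the hierarchy is available as a child-to-parent mapping; the mapping below is a small illustrative excerpt, not the full clade definition:

```python
def ancestors(node, parent):
    """Return node followed by its ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(a, b, parent):
    """Most recent common ancestor of a and b, or None if unrelated."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    return None


# Illustrative child -> parent excerpt of the clade hierarchy.
PARENT = {"22D": "21L", "21L": "21M", "21K": "21M"}

attach_at = mrca("21L", "22D", PARENT)
```

For the XBB example, `attach_at` is `21L`, because 22D descends from 21L and a node counts as its own ancestor here.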
