Coder Social home page Coder Social logo

wook2014 / covizu Goto Github PK

View Code? Open in Web Editor NEW

This project forked from poonlab/covizu

0.0 0.0 0.0 12.73 MB

Rapid analysis and visualization of coronavirus genome variation

Home Page: https://filogeneti.ca/CoVizu/

License: MIT License

Python 44.02% Shell 0.01% CSS 2.78% HTML 19.10% JavaScript 34.09%

covizu's Introduction

CoVizu: Real-time visualization of SARS-COV-2 genomic diversity

CoVizu is an open source project to develop a `near real time' SARS-CoV-2 genome analysis and visualization system that highlights potential cases of importation from other countries or ongoing community transmission.

The current mode of visualization employed by CoVizu that we are tentatively referring to as a "beadplot":

How to read a beadplot:

  • Each horizontal line segment represents a unique SARS-CoV-2 genomic sequence variant. The emergence of a single new mutation in an infection is sufficient to establish a new variant. A given variant may be observed multiple times as identical genome sequences, where `identity' is loosely defined to accommodate genomes with incomplete coverage and ambiguous base calls. (Following GISAID's definition of a "complete genome", we require a minimum sequence length of 29,000 nt.)
  • Each circle represents one or more cases of a variant that were sampled on a given date. The size (area) of the circle is proportional to the number of sequences.
  • Cases are arranged in chronological order from left to right.
  • Vertical lines connect variants that are related by a minimum spanning tree, which gives a rough approximation of transmission links. The variant at the bottom terminus of the vertical line is the putative source.
  • The relative location of variants along the vertical axis does not convey any information. The variants are sorted with respect to the vertical axis such that ancestral variants are always below their "descendant" variants.

It is not feasible to reconstruct accurate links using only genomic data. However, our objective is to identify population-level events like importations into Canada, not to attribute a transmission to a specific source individual.

Rationale

There is a rapidly accumulating number of genome sequences of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) collected at sites around the world, predominantly available through the Global Intiative on Sharing All Influenza Data (GISAID) database. The public release of these genome sequences in near real-time is an unprecedented resource for molecular epidemiology and public health. For example, nextstrain has been at the forefront of analyzing and communicating the global distribution of SARS-CoV-2 genomic variation.

The central feature of nextstrain is a reconstruction of a time-scaled phylogeny (a tree-based model of how infections are related by common ancestors back in time). Geographic locations of samples are mapped onto the tree by colour to communicate the global distribution of the pandemic over time. However, it is not obvious how useful (actionable) information can be best extracted from these data, particularly in the context of a regional public health authority.

A significant challenge in using a time-scaled tree to visualize the evolutionary (and, to some approximation, epidemiological) relationships among sampled infections is that most of the image is occupied by ancestral lineages whose existence were reconstructed from the observed infections. One of our basic assumptions in developing CoVizu is that ancestral genomes are directly sampled --- we think this is not unreasonable given the relatively slow rate of molecular evolution in comparison to the virus transmission rate.

Another limitation of the tree visualization is that it does not convey information about observing the same genome sequence from multiple samples over time. There is no means to differentiate identical sequences in a phylogeny because there are no phylogenetically informative sites that separate them. One could extend the tips of the tree to span the time period of sample collection and mark sampled genomes as in the beadplot above. However, the time scale of sampling identical genomes is relatively short compared to the evolutionary history of the virus that is represented by the tree.

Current workflow

  1. Sequences from the last 24 hours are bulk downloaded from the GISAID database. All developers have signed the GISAID data access agreement, and sequences are not being re-distributed.

  2. Sequences are aligned pairwise against the SARS-COV-2 reference genome using the short read mapper minimap2 and a Python wrapper minimap2.py that applies the CIGAR string to each genome to either reconstitute the aligned sequence or extract all differences from the reference.

  3. Genomes are classified into Pangolin lineages using the script pangolearn.py.

  4. A single representative genome is selected for each Pangolin lineage. We take the most recent sample that pass all of our filtering criteria (<1% uncalled bases, genetic divergence consistent with molecular clock, with all problematic sites filtered out).

  5. A time-scaled tree is reconstructed using a combination of fasttree2 and TreeTime.

  6. For all genomes within each lineage, we extract all genetic differences from the reference genome as "features", which provides a highly compact representation of that genome. We calculate the symmetric difference of each genome to every other genome and cache the result on the filesystem.

  7. For each lineage, we generate 100 replicate bootstrap samples of the feature set union to convert the symmetric differences into distance matrices, treating every genetic difference equally (Manhattan distance). Each distance matrix is used to reconstruct a neighbor-joining tree using RapidNJ.

  8. For each lineage, a consensus tree is calculated from the set of bootstrap trees and converted into a beadplot.

Acknowledgements

The development and validation of these scripts was made possible by the labs who have generated and contributed SARS-COV-2 genomic sequence data that is curated and published by GISAID. We sincerely thank these labs for making this information available to the public and open science.

covizu's People

Contributors

artpoon avatar gopigugan avatar rouxcil avatar momomomoliu avatar bonnielu avatar horaciobam avatar kwade4 avatar babarlelephant avatar abayomi-olabode avatar ewong347 avatar ebrintn avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.