Genome Crossovers

Just a bit of fun to keep track of which genomes (in this case genomes with a set of predicted proteins) are available where.

I am mainly taking information from the NCBI, Ensembl, JGI and EuPathDB portals (and their constituent portals, e.g. Mycocosm) as of ~April 2018. Occasionally I add in other genomes not available from these 'main' portals to cover my own needs. I am only really interested in keeping track of the Eukaryotes (for another project, OrcharDB). I have also split them into 4 rough 'groups'. I am not making any specific comment on these groupings; it's just a little easier to keep track of them this way.

Fungi

The total number of genomes from each of the 'big' genome portals is as below:

Ensembl | JGI | FungiDB | Other | NCBI | Total | Unique Total
761     | 874 | 81      | 25    | 900  | 2641  | 1493
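As a quick sanity check on the table above, the Total column is just the sum of the per-portal counts, and the gap between Total and Unique Total gives a rough idea of how many entries are duplicated across portals. A minimal Python sketch, using only the numbers from the Fungi table (the variable names are mine, and reading the gap as "redundant entries" is my interpretation):

```python
# Per-portal counts copied from the Fungi table.
fungi_counts = {"Ensembl": 761, "JGI": 874, "FungiDB": 81, "Other": 25, "NCBI": 900}

# "Total" counts a genome once for every portal that hosts it.
total = sum(fungi_counts.values())

# "Unique Total" (from the table) counts each genome only once.
unique_total = 1493

# Entries that repeat a genome already listed in another portal.
redundant = total - unique_total

print(total)      # 2641
print(redundant)  # 1148
```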

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

  • whilst NCBI contains the most genomes, only 91 are unique to it
  • whereas JGI has the largest number of unique genomes, with 530 not available elsewhere
  • the 4 genome portals share 39 of the same genomes
  • with Ensembl and NCBI having the largest overlap
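For anyone unfamiliar with how an UpSet-style plot arrives at numbers like these, the idea is to count each genome under the exact combination of portals that hold it. The following is a rough Python sketch of that counting step only, not the R code used for the actual plot; the function name `exclusive_intersections` and the genome labels are made up for illustration:

```python
from collections import Counter


def exclusive_intersections(portals):
    """Count genomes by the exact set of portals in which they appear."""
    membership = {}
    for portal, genomes in portals.items():
        for genome in genomes:
            membership.setdefault(genome, set()).add(portal)
    # Each genome contributes to exactly one combination of portals.
    return Counter(frozenset(found_in) for found_in in membership.values())


# Toy data: placeholder genome labels, not real portal contents.
toy_portals = {
    "NCBI": {"genome_a", "genome_b", "genome_c"},
    "JGI": {"genome_b", "genome_d", "genome_e"},
    "Ensembl": {"genome_b", "genome_c"},
}

counts = exclusive_intersections(toy_portals)
for combo, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(combo), n)
```

In this toy example two genomes are unique to JGI, one is unique to NCBI, one is shared by NCBI and Ensembl only, and one is in all three portals.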

I think that's pretty revealing: if you're going to try and cover taxonomic diversity in your analyses, you're going to need to use more than just NCBI! Of course, you will need to explore the data more thoroughly; some genera have more sequencing projects than others and that is likely inflating some of the numbers...
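One simple way to look for that kind of inflation is to collapse each genome name to its genus and count how many genomes each genus contributes. A small Python sketch, assuming the names are in "Genus species [strain]" form; the function name `genomes_per_genus` and the example list are illustrative only and are not taken from the data files in this repository:

```python
from collections import Counter


def genomes_per_genus(names):
    """Count genomes per genus, taking the first word of each name as the genus."""
    return Counter(name.split()[0] for name in names if name.strip())


# Illustrative names only, not a listing from any portal.
example_names = [
    "Saccharomyces cerevisiae S288C",
    "Saccharomyces cerevisiae YJM789",
    "Saccharomyces paradoxus",
    "Candida albicans",
    "Neurospora crassa",
]

per_genus = genomes_per_genus(example_names)
for genus, n in per_genus.most_common():
    print(genus, n)
```

A genus with many entries relative to the rest is a hint that strain-level or repeat sequencing projects are inflating the totals.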

You can access the data here, and the code to make the plot here; there are other examples in the Fungi directory.

Plants

Continuing with the loose definitions, plants = anything green...

The total number of genomes from each of the 'big' genome portals is as below:

Ensembl | JGI | NCBI | Total | Unique Total
53      | 85  | 87   | 225   | 148

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

  • NCBI wins this time with the largest number of unique genomes
  • but JGI is a close second
  • and there are surprisingly few shared between the portals

You can access the data here, and the code to make the plot here.

Metazoa

The total number of genomes from each of the 'big' genome portals is as below:

Ensembl | JGI | NCBI | Total | Unique Total
162     | 26  | 371  | 559   | 433

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

  • Very few taxa exist in all 3 portals
  • NCBI seems to be very metazoan heavy!

You can access the data here, and the code to make the plot here.

Protists / Other

The total number of genomes from each of the 'big' genome portals is as below:

Ensembl | JGI | NCBI | EuPathDB | Other | Total | Unique Total
170     | 22  | 9    | 93       | 36    | 330   | 275

UpSetR Plot

Their intersections are as below; this is like a Venn diagram but 1000x better:

For example, we can see that:

  • Ensembl has the largest collection of protists in one place
  • Lots of protists exist in their own genome portals, or in subsidiaries of others, e.g. EuPathDB
  • NCBI may be underrepresented in this graph!

You can access the data here, and the code to make the plot here.

Caveats

These lists were curated from various sources. Not everyone makes their information easily accessible. So, I have probably missed taxa and if it's your favourite one, then I apologise!

NCBI

They have a list hidden away in their FTP site @ ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/. Is it complete? Hard to say. Is it full of duplicates and multiple assemblies? Yes. If anyone knows of a better list then please feel free to let me know and tell NCBI to make it more obvious.
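To give a feel for the duplicate problem, here is a hedged Python sketch that groups rows of a tab-separated listing by organism name, so that multiple assemblies of the same organism collapse into one entry. The column names (`#Organism/Name`, `Assembly`), the function name `group_by_organism` and the sample rows are all assumptions for illustration; check the header of whichever report file you actually download.

```python
import csv
import io


def group_by_organism(tsv_text, name_column="#Organism/Name"):
    """Group rows of a tab-separated report by organism name.

    name_column is an assumed header; adjust it to match the real file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = {}
    for row in reader:
        name = row[name_column].strip()
        grouped.setdefault(name, []).append(row)
    return grouped


# Made-up sample rows with placeholder assembly identifiers.
sample = (
    "#Organism/Name\tAssembly\n"
    "Candida albicans\tassembly_1\n"
    "Candida albicans\tassembly_2\n"
    "Neurospora crassa\tassembly_3\n"
)

grouped = group_by_organism(sample)
for organism, rows in grouped.items():
    print(organism, len(rows))
```

Grouping by exact name will not catch differently spelled strains or synonyms, so it only removes the most obvious duplicates.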

JGI

Fungi seem to be all contained in Mycocosm now, which is great but wasn't always the case. Other taxa (plants/protists) are not all in their own '-cosms' or '-zomes' though, so it's a bit more hit and miss. I have to parse their XMLs with scripts here or just look at the genome portal web pages, one by one.

Ensembl

Most of the constituent portals have lists and tables with the information I want. By far the easiest to use and extract data from! Yay!

EuPathDB + Others

Pretty easy from EuPathDB and its constituent portals, as they have a list function which also shows taxa with predicted proteins. Nice.

Contributors

guyleonard
