This project forked from kzra/vhost-classifier

License: MIT License

VHost-Classifier

For a list of taxonIDs, VHost-Classifier will filter out the viruses and then sort these viruses into groups based on their host lineage.

The VHost-Classifier algorithm uses the Virus-Host DB, the NCBI Taxonomy DB and inbuilt predictive rules to achieve a high rate of virus host classification. VHost-Classifier will classify virus taxonIDs to family resolution.
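As a rough illustration of the Virus-Host DB lookup described above, the sketch below builds a mapping from virus taxon ID to host lineage. It is not the project's own code. The column names "virus tax id" and "host lineage" are assumptions about the layout of virushostdb.tsv, the function name load_host_lineages is hypothetical, and the sample rows are made-up placeholders rather than real database entries.

```python
import csv
import io


def load_host_lineages(tsv_handle):
    """Map each virus taxon ID to the set of host lineages listed for it."""
    lineages = {}
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        virus_id = row["virus tax id"]      # assumed column name
        host_lineage = row["host lineage"]  # assumed column name
        if host_lineage:                    # skip rows with no recorded host
            lineages.setdefault(virus_id, set()).add(host_lineage)
    return lineages


# Tiny in-memory stand-in for virushostdb.tsv (placeholder rows, not real data).
sample = io.StringIO(
    "virus tax id\tvirus name\thost tax id\thost lineage\n"
    "1\tExample virus A\t100\tEukaryota; Metazoa; Chordata\n"
    "2\tExample virus B\t\t\n"
)
host_lineages = load_host_lineages(sample)
```

A virus with no host lineage in the table (ID "2" here) is simply absent from the mapping. Under this sketch, those are the taxon IDs that would fall through to the predictive rules or the environment-based sorting.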

Viruses that could not be assigned a host are sorted by the environment they were sequenced from. To do this, VHost-Classifier uses the IMG/VR database and built-in predictive rules.

When benchmarked on 1000 randomly selected viral taxon IDs from NCBI, the software classified 93% of them to the rank of Class and 37% of them to the rank of Family, with an accuracy of 100%. The list of these random taxon IDs can be found in the random_ids.csv file.

Usage:

Clone the repository and run the script from within the cloned directory.

python VHost_Classifier.py [TaxonID.tsv] [VirusHostDB.tsv] [Output Dir] [-i] [-g] [-n]

[TaxonID.tsv]: a .tsv list of taxonIDs to be classified (one taxon ID per row).

[VirusHostDB.tsv]: a copy of the Virus-Host DB, which can be downloaded by running:
wget ftp://ftp.genome.jp/pub/db/virushostdb/virushostdb.tsv

[Output Dir]: the name of the directory to output results to (must be unique).

[-i]: optional argument, specify the value to start indexing the input taxonIDs from (default 0).

[-g]: optional argument, the taxonomic ranks to bin to: PCO (Phylum, Class, Order) or POF (Phylum, Order, Family). The default is PCO.

[-n]: optional argument, supply a file of scientific names alongside the taxon IDs (use this if the taxon ID list returns an index error).
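The argument list above can be mirrored with Python's argparse module. This is only a sketch of the documented interface, not the parser the script actually uses. The destination names (index_start, ranks, names_file) and the function name build_parser are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented positional and optional arguments."""
    parser = argparse.ArgumentParser(
        description="Sketch of the VHost-Classifier command line."
    )
    parser.add_argument("taxon_ids", help="file of taxon IDs, one per row")
    parser.add_argument("virus_host_db", help="copy of the Virus-Host DB (.tsv)")
    parser.add_argument("output_dir", help="directory to write results to")
    parser.add_argument("-i", dest="index_start", type=int, default=0,
                        help="value to start indexing input taxon IDs from")
    parser.add_argument("-g", dest="ranks", choices=["PCO", "POF"], default="PCO",
                        help="taxonomic ranks to bin to")
    parser.add_argument("-n", dest="names_file", default=None,
                        help="file of scientific names for the taxon IDs")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["random_ids.csv", "VirusHostDB.tsv", "VHC_Run_1",
     "-i", "1", "-g", "POF", "-n", "random_names.csv"]
)
defaults = build_parser().parse_args(["ids.tsv", "db.tsv", "out"])
```

Restricting -g with choices means an unsupported rank code is rejected at parse time rather than partway through a run.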

Example:

python VHost_Classifier.py random_ids.csv VirusHostDB.tsv VHC_Run_1 -i 1 -g POF -n random_names.csv

This classifies the taxon IDs in random_ids.csv using the Virus-Host DB file VirusHostDB.tsv and writes the results to VHC_Run_1. The input taxon IDs are indexed from 1 in the output .csv files and binned to Phylum, Order, Family. Scientific names are read from random_names.csv.

Dependencies:
Python 3
ETE3 Toolkit for Python 3
Note: on the first run, ETE3 will download the NCBI taxonomy database.
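To show how the ETE3 taxonomy lookup could be used to keep only viral taxon IDs, here is a minimal sketch. It assumes that a taxon is viral when its NCBI lineage contains taxon ID 10239 (Viruses), and that ETE3's NCBITaxa.get_lineage raises ValueError for unknown IDs. The function names is_virus and filter_viruses are hypothetical and do not come from VHost-Classifier itself. The lookup is injectable so the logic can be tried without downloading the taxonomy database.

```python
VIRUSES_TAXID = 10239  # NCBI Taxonomy ID of "Viruses"


def is_virus(taxid, get_lineage):
    """Return True if the lineage of taxid contains the Viruses node."""
    try:
        lineage = get_lineage(taxid)
    except ValueError:
        # Assumption: the lookup signals an unknown taxon ID with ValueError.
        return False
    return VIRUSES_TAXID in (lineage or [])


def filter_viruses(taxids, get_lineage=None):
    """Keep only the taxon IDs whose lineage marks them as viral."""
    if get_lineage is None:
        # Imported lazily: constructing NCBITaxa may trigger the database download.
        from ete3 import NCBITaxa
        get_lineage = NCBITaxa().get_lineage
    return [taxid for taxid in taxids if is_virus(taxid, get_lineage)]


# Stand-in lookup with abbreviated lineages, used instead of the real database.
toy_lineages = {11320: [1, 10239, 11320], 9606: [1, 131567, 2759, 9606]}


def toy_lineage(taxid):
    if taxid not in toy_lineages:
        raise ValueError(f"{taxid} taxid not found")
    return toy_lineages[taxid]


viral = filter_viruses([11320, 9606, 42], get_lineage=toy_lineage)
```

Calling filter_viruses without the get_lineage argument would use ETE3 directly, and the first such call is where the database download is expected to happen.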

Output: VHost-Classifier will create directories and write .csv files in each directory.

Reading the .csv files: the first column contains the taxon IDs and the second column contains the index position of each taxon ID in the input file (counted from the value given by -i). The final column contains the virus name. In each directory a counts.csv file is also written, which contains the number of taxon IDs in each taxonomic class.
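The column layout described above can be read with the standard csv module. The sketch below assumes comma-separated rows and treats any row whose second field is not an integer as a header. The Record type and read_records function are hypothetical names, and the sample rows are placeholders rather than real output.

```python
import csv
import io
from typing import NamedTuple


class Record(NamedTuple):
    taxon_id: str
    input_index: int
    virus_name: str


def read_records(handle):
    """Parse rows as (taxon ID, input index, virus name); the name is the last column."""
    records = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        try:
            index = int(row[1])
        except ValueError:
            continue  # assumed to be a header row
        records.append(Record(row[0], index, row[-1]))
    return records


# Placeholder rows standing in for one of the output files.
sample = io.StringIO("101,1,Example virus A\n102,2,Example virus B\n")
records = read_records(sample)
```

Taking the virus name from the last field rather than a fixed position keeps the sketch usable if the files carry extra columns between the index and the name.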

VHC-Analysis: run this script from within the Host-Assigned directory of the run you want to analyse. The script will walk the directory tree and write each Counts.csv file to a Total_Counts.csv file, which is saved in the Host-Assigned directory. This file makes it easier to compare the overall host diversity of the viruses in your input.
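The directory walk described above could look roughly like the following. This is a sketch under assumptions, not the VHC-Analysis script itself: the function name collect_counts is hypothetical, the counts files are matched case-insensitively, and each row is copied unchanged with the relative directory path prepended, since the exact layout of the counts files is not specified here.

```python
import csv
import os


def collect_counts(root, counts_name="Counts.csv", total_name="Total_Counts.csv"):
    """Gather every counts file under root into one file saved in root."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower() != counts_name.lower():
                continue
            relative_dir = os.path.relpath(dirpath, root)
            with open(os.path.join(dirpath, name), newline="") as handle:
                for row in csv.reader(handle):
                    if row:  # ignore empty lines
                        rows.append([relative_dir] + row)
    rows.sort()  # os.walk order is not guaranteed, so sort for stable output
    total_path = os.path.join(root, total_name)
    with open(total_path, "w", newline="") as out:
        csv.writer(out).writerows(rows)
    return total_path
```

Calling collect_counts(".") from inside the Host-Assigned directory would match the usage described above. Prefixing each row with its directory keeps track of which host group a count came from once everything is in a single file.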

Citation:
Kitson, E. and Suttle, C.A. (2019) VHost-Classifier: Virus-Host Classification using natural language processing. Bioinformatics.

References:
Virus-Host DB: Mihara, Tomoko, et al. "Linking virus genomes with host taxonomy." Viruses 8.3 (2016): 66.

IMG/VR: Paez-Espino, David, et al. "IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses." Nucleic Acids Research (2016): gkw1030.
