mashpit's Introduction

Mashpit

License: GPL v2

Create a database of mash signatures and find the genomes most similar to a target sample.

Installation

Option 1. Install with Conda/Mamba (Recommended)

conda create -n mashpit -c conda-forge -c bioconda 'mashpit=0.9.6'
conda activate mashpit

Option 2. Install with pip

1. Dependency: Install NCBI datasets

curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets'
chmod +x datasets
export PATH=$PATH:$PWD

2. Install mashpit using pip:

pip install mashpit

Or clone the repository from GitHub and install from source:

git clone https://github.com/tongzhouxu/mashpit.git
cd mashpit
pip install . 

Mashpit Database

A mashpit database is a directory containing:

  • $DB_NAME.db
  • $DB_NAME.sig
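As a quick sanity check on the layout described above, the following sketch reports which of the two expected files are missing from a database directory. The helper name `missing_database_files` is hypothetical and is not part of mashpit; it only assumes the `$DB_NAME.db` / `$DB_NAME.sig` naming stated in the list.

```python
import tempfile
from pathlib import Path


def missing_database_files(db_dir, db_name):
    """Return the expected database files that are absent from db_dir.

    Hypothetical helper: it only checks the two file names listed in the
    README and does not inspect their contents.
    """
    db_dir = Path(db_dir)
    expected = [f"{db_name}.db", f"{db_name}.sig"]
    return [name for name in expected if not (db_dir / name).is_file()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that holds only the .db file.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "salmonella.db").touch()
        print(missing_database_files(tmp, "salmonella"))  # prints ['salmonella.sig']
```

An empty list means both files are present, so the directory at least has the shape of a mashpit database.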

A mashpit database can be built from either:

  1. A taxonomic name
    A standard database is a collection of representative genomes, one from each SNP cluster on Pathogen Detection. By default, mashpit downloads the latest Pathogen Detection version for the specified species and finds the centroid of each SNP cluster (SNP tree).
  2. BioSample accessions
    A custom database is a collection of genomes based on a provided list of BioSample accessions.
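To make the idea of a cluster "centroid" more concrete, here is a minimal sketch that picks the member with the smallest total distance to all other members (a medoid). This is an assumption about what "centroid" could mean, not a description of how mashpit actually selects representatives; the function name and the toy distance matrix are hypothetical.

```python
def medoid_index(distances):
    """Return the index of the item with the smallest total distance to the rest.

    `distances` is a square, symmetric matrix given as a list of lists.
    Assumption: a medoid stands in for the README's "centroid"; mashpit's
    real selection logic may differ.
    """
    if not distances:
        raise ValueError("distance matrix is empty")
    totals = [sum(row) for row in distances]
    return min(range(len(totals)), key=totals.__getitem__)


# Toy pairwise distances for three genomes in one hypothetical SNP cluster.
toy_cluster = [
    [0.0, 0.1, 0.4],
    [0.1, 0.0, 0.2],
    [0.4, 0.2, 0.0],
]
print(medoid_index(toy_cluster))  # prints 1
```

The second genome wins here because its row sum (0.3) is the lowest, so it would be the representative kept for this toy cluster.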

Usage

1. Build a mashpit database

usage: mashpit build [-h] [--quiet] [--number NUMBER] [--ksize KSIZE] [--species SPECIES] [--email EMAIL] [--key KEY] [--pd_version PD_VERSION] [--list LIST] {taxon,accession} name

positional arguments:
  {taxon,accession}     mashpit database type.
  name                  mashpit database name

optional arguments:
  -h, --help            show this help message and exit
  --quiet               disable logs
  --number NUMBER       maximum number of hashes for sourmash, default is 1000
  --ksize KSIZE         kmer size for sourmash, default is 31
  --species SPECIES     species name
  --email EMAIL         Entrez email
  --key KEY             Entrez api key
  --pd_version PD_VERSION
                        a specified Pathogen Detection version (PDG accession). Default is the latest.
  --list LIST           Path to a list of NCBI BioSample accessions
  • Example command
mashpit build taxon salmonella --species Salmonella

Note: Supported species names can be found in this list
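The `--ksize` and `--number` options above control how each genome is sketched. The following is a rough, standard-library-only illustration of a bottom-k MinHash sketch: cut the sequence into k-mers, hash each one, and keep the smallest `number` hashes. It is not sourmash's implementation. As stated assumptions, it uses SHA-1 in place of sourmash's hash function and skips canonical (reverse-complement) k-mers, and the function names are hypothetical.

```python
import hashlib


def kmers(sequence, ksize):
    """Yield every substring of length ksize from the sequence."""
    sequence = sequence.upper()
    for start in range(len(sequence) - ksize + 1):
        yield sequence[start:start + ksize]


def bottom_k_sketch(sequence, ksize=31, number=1000):
    """Return up to `number` of the smallest 64-bit k-mer hashes, sorted.

    Illustration only: SHA-1 is a stand-in hash, and reverse complements
    are ignored, so the values will not match real sourmash signatures.
    """
    hashes = set()
    for kmer in kmers(sequence, ksize):
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return sorted(hashes)[:number]


# "ACGTACGTAC" has four distinct 4-mers; the sketch keeps the three smallest hashes.
sketch = bottom_k_sketch("ACGTACGTAC", ksize=4, number=3)
print(len(sketch))  # prints 3
```

A larger `number` keeps more hashes per genome, which makes similarity estimates more precise at the cost of a bigger signature file.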

2. Query against a mashpit database

usage: mashpit query [-h] [--number NUMBER] [--threshold THRESHOLD] [--annotation ANNOTATION] sample database

positional arguments:
  sample                path to query sample
  database              path to the database folder

optional arguments:
  -h, --help            show this help message and exit
  --number NUMBER       number of isolates in the query output, default is 200
  --threshold THRESHOLD
                        minimum Jaccard similarity for mashtree, default is 0.85
  --annotation ANNOTATION
                        mashtree tip annotation, default is none
  • Example command
mashpit query sample.fasta path/to/database
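The `--number` and `--threshold` options can be pictured with a small sketch. It ranks database entries by plain set Jaccard similarity to the query, keeps the top `number` for the output, and then selects the entries at or above `threshold` as candidates for the tree. How mashpit combines the two options internally is an assumption here, and `jaccard`, `rank_matches`, and `tree_candidates` are hypothetical names, not mashpit functions.

```python
def jaccard(a, b):
    """Jaccard similarity of two hash collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_matches(query, database, number=200):
    """Return the `number` most similar (name, similarity) pairs, best first."""
    scored = [(name, jaccard(query, hashes)) for name, hashes in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:number]


def tree_candidates(ranked, threshold=0.85):
    """Keep only matches whose similarity is at or above the threshold."""
    return [(name, score) for name, score in ranked if score >= threshold]


# Toy hash sets standing in for real signatures.
query_hashes = {1, 2, 3, 4}
toy_database = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8}}

ranked = rank_matches(query_hashes, toy_database, number=2)
print(ranked)                         # prints [('A', 1.0), ('B', 0.6)]
print(tree_candidates(ranked, 0.85))  # prints [('A', 1.0)]
```

With the default threshold of 0.85, only near-identical entries such as "A" would reach the tree, while "B" would still appear in the ranked output.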

Optional: Update the database

usage: mashpit update [-h] [--metadata METADATA] [--quiet] database name

positional arguments:
  database             path for the database folder
  name                 database name

optional arguments:
  -h, --help           show this help message and exit
  --metadata METADATA  metadata file in csv format
  --quiet              disable logs
  • Example command
mashpit update path/to/database salmonella

mashpit's People

Contributors

tongzhouxu, dependabot[bot], lskatz

mashpit's Issues

Merging two databases

Just a thought for a future version (and not now) but it might be kind of awesome if we could merge two databases. If it works, it could be a mechanism for users to update local databases without having to remake them.

README.md

Need to add main documentation

  • Synopsis
  • Installation
  • Requirements
  • For the impatient -- some basic wording on how to run tests.sh

Results to screen

Could you change it so that the results are printed to stdout instead of to a file? And if so, document it in the readme?

Comparison capability

Need a method so that we can run

Mashpit query Mashpit.sqlite sketch1.msh [sketch2.msh...]

Where Mashpit.sqlite is a database created by the Mashpit creation scripts and sketch1.msh is a regular mash sketch file.

Unit tests

Need to have unit tests. Need to use standard python methods for unit tests, e.g., https://docs.python.org/3/library/unittest.html

Suggestions for your first unit tests:

  • test --version and --help flags to make sure there is a desired stdout and exit code
  • Creating a small (3-genome?) database
    • test metadata against expected output
    • test signatures file against expected output
    • test a dist operation against expected distance
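A first test along the lines suggested above might look like the following sketch, which uses the standard `unittest` module to check that `--help` prints usage text and exits with code 0. Because the real entry point is not shown here, `build_query_parser` is a hypothetical stand-in that mirrors the `mashpit query` usage text; a real test would import mashpit's own parser instead.

```python
import argparse
import contextlib
import io
import unittest


def build_query_parser():
    """Hypothetical stand-in for mashpit's parser, mirroring `mashpit query`."""
    parser = argparse.ArgumentParser(prog="mashpit query")
    parser.add_argument("sample", help="path to query sample")
    parser.add_argument("database", help="path to the database folder")
    parser.add_argument("--number", type=int, default=200)
    parser.add_argument("--threshold", type=float, default=0.85)
    return parser


class QueryParserTest(unittest.TestCase):
    def test_help_prints_usage_and_exits_zero(self):
        stdout = io.StringIO()
        # argparse prints help to stdout and then raises SystemExit(0).
        with contextlib.redirect_stdout(stdout), self.assertRaises(SystemExit) as ctx:
            build_query_parser().parse_args(["--help"])
        self.assertEqual(ctx.exception.code, 0)
        self.assertIn("mashpit query", stdout.getvalue())

    def test_defaults_match_documented_values(self):
        args = build_query_parser().parse_args(["sample.fasta", "path/to/database"])
        self.assertEqual((args.number, args.threshold), (200, 0.85))
```

Saved as a test module, this can be run with `python -m unittest`. The database-level tests in the list above would follow the same pattern, comparing metadata and signature files against stored expected outputs.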

Sourmash

Consider using sourmash instead of system calls to mash

database should be a parameter

Could you make the database a parameter for each of the scripts? It should not always be ./mashpit.db. For example, if I had a separate database for each species, then I might have differently named databases. Or I might be in a different directory.

Test database

Please create one markdown file describing how to create a test database with about 1000 samples. It should be such that any typical user can follow the instructions and that the database will come out exactly the same in my hands and in your hands and in another person's hands, etc.

Requirements for python branch

Need to add requirements for Python branch into a future README.md

This is my first attempt:

  • Mash >= v2.0
  • Skesa >= v2.4.0
  • sratoolkit
  • Python3 modules
    • Biopython
    • sqlite
