linsalrob / partie

License: MIT License

Makefile 4.75% Perl 49.48% R 7.51% Shell 31.65% Python 6.61%
bioinformatics sra metagenomic-data metagenomics metagenomes

partie's Introduction

Edwards Lab DOI License: MIT

PARTIE

PARTIE is a program to partition sequence read archive (SRA) metagenomics data into amplicon and shotgun data sets. The user-supplied annotations of the data sets cannot be trusted, so PARTIE separates the data automatically.

PARTIE takes a subsample of the data, measures several different parameters associated with the sequences, and uses those parameters to classify the sequences based on a trained random forest.

Currently, PARTIE classifies the data based on:

  • percent_unique_kmer: The percent of the sequences that are represented by unique k-mers
  • percent_16S: The percent of the sequences that are similar to 16S genes
  • percent_phage: The percent of the sequences that are similar to phage genes
  • percent_Prokaryote: The percent of the sequences that are similar to prokaryotic genes (those from Bacteria and Archaea).

We typically classify the data sets into three groups:

  • WGS: Random Community Metagenomes (including metatranscriptomes)
  • AMPLICON: 16S metabarcoding projects
  • OTHER: everything else

We have released two files:

  • SRA_Metagenome_Types.tsv is a tab-separated file with two columns: the SRA run ID and the classification of the sequence.
  • SRA_PARTIE_DATA.txt is a tab-separated file with the PARTIE data described above, in case you want to generate your own classification. The columns are ID, percent unique k-mer, percent 16S rRNA, percent phage, percent prokaryote, and PARTIE annotation.
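
As a rough illustration of the six-column layout described above, the following Python sketch reads lines in the SRA_PARTIE_DATA.txt format into named records. The `PartieRecord` type and `read_partie_data` function are hypothetical names used only for this example, and the sketch assumes that any line without four numeric columns (for instance a header row, if the file has one) can be skipped.

```python
import csv
import io
from typing import NamedTuple


class PartieRecord(NamedTuple):
    """One row of the six-column table (hypothetical helper type)."""
    run_id: str
    percent_unique_kmer: float
    percent_16s: float
    percent_phage: float
    percent_prokaryote: float
    annotation: str


def read_partie_data(handle):
    """Yield a PartieRecord for every well-formed tab-separated line."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 6:
            continue  # skip blank or malformed lines
        try:
            values = [float(x) for x in row[1:5]]
        except ValueError:
            continue  # assumption: non-numeric rows (e.g. a header) are ignored
        yield PartieRecord(row[0], *values, row[5])


# In-memory example using a row quoted in the project's issue tracker.
sample = io.StringIO("SRR360776\t86.69725743\t0\t0\t0.01\tAMPLICON\n")
records = list(read_partie_data(sample))
print(records[0].run_id, records[0].annotation)
```

To read the released file instead, pass an open file handle, for example `open("SRA_PARTIE_DATA.txt", newline="")`.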

The file SRA_Update_Time shows the time of the last update of the SRA.

Installation

Please see the installation page to find out about the prerequisites and to install the databases for PARTIE.

Testing

Please see the test suite for PARTIE

Running PARTIE

You can provide PARTIE with several different inputs. We use the extension to figure out what kind of input you have provided.

  • fasta DNA sequence files (ending .fna, .fa, or .fasta)
  • fastq DNA sequence files (ending .fq or .fastq)
  • SRA run IDs. Append .sra to the end (e.g. DRR023185.sra)
  • A text file with a list of SRA ids (ending .txt)
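
The extension rules listed above can be sketched in Python as follows. This is only an illustration of the mapping, not the logic in partie.pl itself: the function name `guess_input_type` and the returned labels are hypothetical, and the sketch assumes the match is case-sensitive.

```python
import os

# Extensions taken from the list above.
FASTA_EXTENSIONS = {".fna", ".fa", ".fasta"}
FASTQ_EXTENSIONS = {".fq", ".fastq"}


def guess_input_type(name: str) -> str:
    """Map a file name or run ID to an input type by its extension."""
    ext = os.path.splitext(name)[1]
    if ext in FASTA_EXTENSIONS:
        return "fasta"
    if ext in FASTQ_EXTENSIONS:
        return "fastq"
    if ext == ".sra":
        return "sra_run"      # an SRA run ID with .sra appended
    if ext == ".txt":
        return "sra_id_list"  # a text file listing SRA IDs
    raise ValueError(f"unrecognized input extension: {name!r}")


print(guess_input_type("DRR023185.sra"))
```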

Run partie with fasta or fastq files

perl partie.pl <fasta file>
perl partie.pl <fastq file>

Run partie with an SRA ID

perl partie.pl SRAID.sra

For more examples, see the testing documentation.

Classifying the data

We have provided a pre-built classifier. If you would like to rebuild it, you can run the training code, but you should not need to.

Once you have the output from partie, you can run the classifier:

Rscript RandomForest/PARTIE_Classification.R outputfile.txt

For more examples, see the testing documentation.

Size Restrictions

There is a minimum limit to how much data we need before we can accurately classify something. For example, we can't really classify a metagenome (or an amplicon library) that has a single 150 bp read.

We are not yet sure what the minimum sequence depth is for accurate classification, and we are still working on it. Our preliminary analysis suggests that we need about 5 Mbp of sequence to get an accurate prediction. Below that, we're just not sure. So at the moment we only create a prediction for data sets that have at least 5,000,000 bp of sequence.
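
A minimal Python sketch of such a size check for a fasta file is shown below. It is not the filter PARTIE itself applies: the names `count_fasta_bases` and `has_enough_sequence` are hypothetical, and treating exactly 5,000,000 bp as sufficient is an assumption.

```python
import io

MIN_BASES = 5_000_000  # threshold quoted in the text above


def count_fasta_bases(handle) -> int:
    """Sum the lengths of all sequence lines, ignoring '>' header lines."""
    total = 0
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def has_enough_sequence(handle, minimum: int = MIN_BASES) -> bool:
    """Return True when the file holds at least `minimum` bases (assumed inclusive)."""
    return count_fasta_bases(handle) >= minimum


# Two tiny reads: far below the threshold, so no prediction would be made.
tiny = io.StringIO(">read1\nACGT\nACGT\n>read2\nAC\n")
print(has_enough_sequence(tiny))
```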

However, this may be a bit low and we should perhaps increase this, because many, many datasets in the 5MB range are reconstructed genomes rather than metagenomes. But there are also plenty of real metagenomes that are that size.

In a future release, we may train PARTIE to try to distinguish reconstructed genomes from metagenomes.

MAGS

We have specifically labeled the 7,889 metagenome-assembled genomes from Phil Hugenholtz's study as MAGS. We will also add this label to other metagenome-assembled genomes we identify.

Zero restrictions

There are several SRA datasets that have zero reads, zero bases, and zero data. We have denoted these as "NO DATA". There are a couple of possible explanations: either they have been deleted from the SRA for some reason (and probably replaced with something else), or they are protected by dbGaP or something similar. We're working on a solution for that.

Databases

We have now included the human database in PARTIE. This is a pre-built bowtie2 index for hg38.fa.gz (md5sum: 1c9dcaddfa41027f17cd8f7a82c7293b) from UCSC. That is the human genome we compare everything to. We were undercounting the number of human matches in the database for a number of samples. These are being recounted (as of 27 Jan 20), and we will update those numbers when the recount is complete.

partie's People

Contributors

blankenberg, deprekate, linsalrob, pjtorres

partie's Issues

Repeated entry in tables (SRR360776)

The run SRR360776 appears twice in the tables:

$ grep SRR360776 SRA_PARTIE_DATA.txt SRA_Metagenome_Types.tsv    
...
SRA_PARTIE_DATA.txt:SRR360776   86.69725743     0       0       0.01    AMPLICON
SRA_PARTIE_DATA.txt:SRR360776   86.6972574342988        0       0       0.0100000000000051      WGS
...
SRA_Metagenome_Types.tsv:SRR360776      AMPLICON
SRA_Metagenome_Types.tsv:SRR360776      WGS

(removed some irrelevant matches)

Most confusingly, as you can see, it is classified once as AMPLICON and once as WGS.
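
For anyone who wants to check for other runs with the same problem, here is a small Python sketch that reports run IDs carrying more than one classification in a two-column, tab-separated file such as SRA_Metagenome_Types.tsv. The function name `find_conflicting_runs` is hypothetical, and the second ID in the sample data is a made-up placeholder, not a real run.

```python
import csv
import io
from collections import defaultdict


def find_conflicting_runs(handle):
    """Return {run_id: sorted labels} for runs that have more than one label."""
    labels = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        labels[row[0]].add(row[1])
    return {run: sorted(seen) for run, seen in labels.items() if len(seen) > 1}


# The SRR360776 rows come from the grep output above; EXAMPLE0001 is a placeholder.
sample = io.StringIO(
    "SRR360776\tAMPLICON\n"
    "SRR360776\tWGS\n"
    "EXAMPLE0001\tWGS\n"
)
conflicts = find_conflicting_runs(sample)
print(conflicts)
```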

partie/db databases still maintained?

Hi,

I was trying to create/update the databases in partie/db by running make, but this seems to time out without success. Are you still maintaining these databases for download?

TIA

using partie with accessions or text files

I have been trying to feed an SRA accession to partie. Those attempts always fail for me, including the tests at https://github.com/linsalrob/partie/blob/master/TEST.md

The underlying issue seems to be that fastq-dump (at least in the versions I have tried) expects an argument specifying the line width after --fasta. Changing line 162 in the partie perl script to specify default as the line width fixed this issue for me when using fastq-dump 2.11.0.

The "Running PARTIE" section here on GitHub also mentions the option of providing a .txt file with IDs, but I did not manage to get that working. Does this functionality just not exist in the current release?

Thank you for the great tool, it is very useful to me!

16S db corrupted?

Hello,

I am trying to use partie. After following the installation instructions, I tried running the tool and got this error:

Error: 16S database corrupted
Runing the command /home/software/miniconda3/envs/ww_env/bin/bowtie2-inspect -s .//db/16SMicrobial
/home/software/miniconda3/envs/ww_env/bin/bowtie2-inspect-s: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory

I have tried deleting and redownloading the database, but I continue to get this error.

I just wanted to check whether the databases are still maintained, or whether I am doing something wrong.

Thanks
