
A pipeline to identify pathogenic microorganisms from scRNA-seq raw data.

License: MIT License



PathogenTrack

PathogenTrack is an unsupervised computational tool that uses unmapped single-cell RNA-seq reads to characterize intracellular pathogens at the single-cell level. It is a Python-based pipeline that identifies and quantifies reads from intracellular pathogenic viruses and bacteria in individual cells. PathogenTrack has been tested on both simulated and real scRNA-seq datasets and performed robustly. The details are described in our paper, "PathogenTrack and Yeskit: tools for identifying intracellular pathogens from single-cell RNA-sequencing datasets as illustrated by application to COVID-19".

System Requirements

PathogenTrack has been tested on Linux (CentOS 7) and macOS (11.6.1).

Installation

PathogenTrack can be installed in two steps:

1. Install Miniconda on Linux or macOS. For details, please refer to Miniconda Installation.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh    # For Linux users
bash Miniconda3-latest-Linux-x86_64.sh                                        # For Linux users
# ---------------------------------------------------------------------------------------------
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh   # For MacOS users
bash Miniconda3-latest-MacOSX-x86_64.sh                                       # For MacOS users

2. Install PathogenTrack and its dependencies.

conda env create -f environment.yml

Users are strongly advised to install this software with conda. The dependencies and their tested versions are listed below.

Package Version
python 3.6.10
biopython 1.78
fastp 0.12.4
star 2.7.5a
umi_tools 1.1.1
kraken2 2.1.1

Databases Preparation

1. Prepare the Human genome database

Download the Human GRCh38 genome and genome annotation file, and then decompress them:

wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.toplevel.fa.gz
wget ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/Homo_sapiens.GRCh38.101.gtf.gz
gzip -d Homo_sapiens.GRCh38.101.gtf.gz

Build STAR Index with the following command:

STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ./ \
     --genomeFastaFiles ./Homo_sapiens.GRCh38.dna.toplevel.fa \
     --sjdbGTFfile ./Homo_sapiens.GRCh38.101.gtf \
     --sjdbOverhang 100

2. Prepare Kraken2 database

wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz
tar zxf minikraken_8GB_202003.tgz

Run PathogenTrack

Before running PathogenTrack, you should run cellranger or alevin to get the single cells' gene expression matrix. Here, we take the simulated 10X sequencing data as an example:

First, we use cellranger to get scRNA-seq expression matrix and valid barcodes:

cellranger count --id cellranger_out --transcriptome /path/to/cellranger_database/ --fastqs /path/to/fastq_dir/

Attention

Three files must be ready to run PathogenTrack: 1) the valid barcodes.tsv file; 2) the raw scRNA-seq Read 1 FASTQ file (xxx_R1.fastq.gz); 3) the raw scRNA-seq Read 2 FASTQ file (xxx_R2.fastq.gz).

Then we run PathogenTrack to identify and quantify pathogen expression at the single-cell level:

(Users should change the '/path/to/' in the following command to the databases' real paths)

conda activate PathogenTrack
PathogenTrack count --project_id PathogenTrack_out \
                    --pattern CCCCCCCCCCCCCCCCNNNNNNNNNN \
                    --min_reads 10 --confidence 0.11 \
                    --star_index /path/to/STAR_index/ \
                    --kraken_db /path/to/minikraken_8GB_20200312/ \
                    --barcode barcodes.tsv \
                    --read1 test_S1_L001_R1_001.fastq.gz \
                    --read2 test_S1_L001_R2_001.fastq.gz 

IMPORTANT: Read 1 in the example consists of a 16 bp cell barcode (CB) followed by a 10 bp UMI, so the --pattern is CCCCCCCCCCCCCCCCNNNNNNNNNN (16 C and 10 N). Users must adjust the pattern to match the layout of their own Read 1. (For 10X Genomics scRNA-seq Chemistry Version 2: 16 bp CB and 10 bp UMI; for Version 3: 16 bp CB and 12 bp UMI.)
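The pattern rule above can be sketched in a few lines of Python. This is only an illustration, not part of PathogenTrack: the function name build_pattern and the CHEMISTRY_LAYOUT table are hypothetical, and the lengths are the ones quoted in the note above.

```python
def build_pattern(cb_len: int, umi_len: int) -> str:
    """Return a --pattern string: one 'C' per cell-barcode base, then one 'N' per UMI base."""
    if cb_len <= 0 or umi_len <= 0:
        raise ValueError("cell barcode and UMI lengths must be positive")
    return "C" * cb_len + "N" * umi_len


# (CB length, UMI length) as stated in the README note; hypothetical lookup table.
CHEMISTRY_LAYOUT = {"v2": (16, 10), "v3": (16, 12)}

if __name__ == "__main__":
    for name, (cb_len, umi_len) in CHEMISTRY_LAYOUT.items():
        print(name, build_pattern(cb_len, umi_len))
```

The resulting string can be passed as the --pattern value. For other library preparations, check the Read 1 layout in the kit documentation before choosing the lengths.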

Note: Processing one sample may take 4-6 hours, depending on the available computational resources and the size of the raw single-cell data.

Please see QUICK_START.md for a complete tutorial.

Questions

If you have any questions/problems with PathogenTrack, feel free to leave an issue! We will try our best to provide support, address new issues, and keep improving this software.

Citation

Wei Zhang, Xiaoguang Xu, Ziyu Fu, Jian Chen, Saijuan Chen, Yun Tan. PathogenTrack and Yeskit: tools for identifying intracellular pathogens from single-cell RNA-sequencing datasets as illustrated by application to COVID-19. Front. Med., https://doi.org/10.1007/s11684-021-0915-9

The preprint version can be found here.


pathogentrack's Issues

About how to track cov19

Hi,

In your paper, I saw that you analyzed viruses.
But this part seems to be missing from the tutorial. How do I run virus detection?
Is there an option that we can switch on to call viruses?

Regards,
Junyi

Use with 5' RNA-seq

Hello! It was nice to discover your tool. I was wondering if it is possible to use it on 5' RNA-seq data. What adjustments would be needed, if you think it is possible?

Thank you!

Multiple read1 and read2 files for PathogenTrack count

Dear PathogenTrack team,
Thanks for your excellent work.
I installed your package and it ran smoothly with your test data.
However, for my own data there are multiple FASTQ files for read1 and read2, as listed below, and all the reads belong to one sample, HC1. What should I do with my data?
I would very much appreciate your help.

HC1-1_S1_L003_R1_001.fastq.gz HC1-5_S1_L003_R1_001.fastq.gz
HC1-1_S1_L003_R2_001.fastq.gz HC1-5_S1_L003_R2_001.fastq.gz
HC1-2_S1_L003_R1_001.fastq.gz HC1-6_S1_L003_R1_001.fastq.gz
HC1-2_S1_L003_R2_001.fastq.gz HC1-6_S1_L003_R2_001.fastq.gz
HC1-3_S1_L003_R1_001.fastq.gz HC1-7_S1_L003_R1_001.fastq.gz
HC1-3_S1_L003_R2_001.fastq.gz HC1-7_S1_L003_R2_001.fastq.gz
HC1-4_S1_L003_R1_001.fastq.gz HC1-8_S1_L003_R1_001.fastq.gz
HC1-4_S1_L003_R2_001.fastq.gz HC1-8_S1_L003_R2_001.fastq.gz

Getting empty microbes.tsv file

After running the code for about 8 hours, I am getting an empty microbes.tsv file with just barcodes in it and no counts. Any idea how we can resolve this, please?

Unclassified

Question regarding the unclassified reads from Kraken 2: in our previous 16S analysis with Silva, we tried to limit unclassified reads to < 10%. With PathogenTrack, however, this comes out between 85% and 91%, including in your own tutorial. Should we be worried about incomplete capture?

91.20 15487988 15487988 U 0 unclassified
8.80 1494822 18457 R 1 root
8.69 1476358 23863 R1 131567 cellular organisms
8.55 1452457 0 D 2759 Eukaryota
8.55 1452457 0 D1 33154 Opisthokonta
8.55 1452457 0 K 33208 Metazoa
8.55 1452457 0 K1 6072 Eumetazoa
8.55 1452457 0 K2 33213 Bilateria
8.55 1452457 0 K3 33511 Deuterostomia
8.55
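For readers unfamiliar with the report above: in the standard Kraken 2 report layout, the six columns are the percentage of reads in the clade, the number of reads in the clade, the number of reads assigned directly to the taxon, a rank code, the NCBI taxonomy ID, and the name. The following is a minimal sketch for reading such lines and recomputing the unclassified share. It is not part of PathogenTrack; ReportRow, parse_report_line, and unclassified_percent are hypothetical names, and the sample lines are copied from the report quoted above.

```python
from typing import Iterable, NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of reads covered by this clade
    clade_reads: int    # reads covered by this clade
    direct_reads: int   # reads assigned directly to this taxon
    rank: str           # rank code, e.g. U, R, D, K
    taxid: int          # NCBI taxonomy ID
    name: str           # scientific name


def parse_report_line(line: str) -> ReportRow:
    """Parse one report line; any whitespace (tabs or spaces) separates columns."""
    fields = line.split(None, 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    pct, clade, direct, rank, taxid, name = fields
    return ReportRow(float(pct), int(clade), int(direct), rank, int(taxid), name.strip())


def unclassified_percent(rows: Iterable[ReportRow]) -> float:
    """Recompute the unclassified share from the 'U' and root ('R') rows."""
    rows = list(rows)
    unclassified = sum(r.clade_reads for r in rows if r.rank == "U")
    classified = sum(r.clade_reads for r in rows if r.rank == "R")
    total = unclassified + classified
    if total == 0:
        raise ValueError("report contains no reads")
    return 100.0 * unclassified / total


if __name__ == "__main__":
    sample = [
        "91.20 15487988 15487988 U 0 unclassified",
        "8.80 1494822 18457 R 1 root",
        "8.69 1476358 23863 R1 131567 cellular organisms",
    ]
    parsed = [parse_report_line(s) for s in sample]
    print(f"unclassified: {unclassified_percent(parsed):.2f}%")
```

The sketch only shows how to read the numbers; it does not say what unclassified fraction is acceptable for a given dataset.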

Mitochondrial genes classified as bacteria

Hi! First of all, thank you for the amazing tool.

I am processing some scRNA-seq data and I'm getting a lot of hits for Bacillus thuringiensis. It turns out that these reads align to human mitochondrial genes when I run nt-BLAST. Is there a way to filter out these mitochondrial reads that are being classified as bacteria? Thank you.

Use PathogenTrack when a sample has multiple pairs of fastq files

Thank you for your wonderful work on "PathogenTrack and Yeskit: tools for identifying intracellular pathogens from single-cell RNA-sequencing datasets as illustrated by application to COVID-19"!

I noticed that the input of PathogenTrack (count) is a single pair of FASTQ files (read1 and read2). How should I use PathogenTrack when a sample has multiple pairs of FASTQ files?

For example, Sample01_S1_L001_R1_001.fastq.gz, Sample01_S1_L001_R2_001.fastq.gz, Sample01_S1_L002_R1_001.fastq.gz, Sample01_S1_L002_R2_001.fastq.gz.

Best wishes
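One common workaround for multi-lane samples, not confirmed by the PathogenTrack authors, is to merge the lanes into a single Read 1 file and a single Read 2 file before running the count step. Gzip files can be concatenated byte for byte, and the result is still a valid gzip file. A minimal sketch follows, assuming the Read 1 and Read 2 parts are passed in the same lane order so that read pairs stay in sync; merge_fastq_gz is a hypothetical helper name.

```python
import shutil
from pathlib import Path
from typing import Sequence, Union

PathLike = Union[str, Path]


def merge_fastq_gz(parts: Sequence[PathLike], merged: PathLike) -> None:
    """Concatenate gzipped FASTQ files, in the given order, into one gzip file."""
    if not parts:
        raise ValueError("no input files given")
    with open(merged, "wb") as out:
        for part in parts:
            # Raw byte copy: each input stays a separate gzip member in the output.
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Example call (file names taken from the question above; run once per read):
# merge_fastq_gz(["Sample01_S1_L001_R1_001.fastq.gz", "Sample01_S1_L002_R1_001.fastq.gz"],
#                "Sample01_merged_R1.fastq.gz")
# merge_fastq_gz(["Sample01_S1_L001_R2_001.fastq.gz", "Sample01_S1_L002_R2_001.fastq.gz"],
#                "Sample01_merged_R2.fastq.gz")
```

The merged files could then be passed to --read1 and --read2. Whether the pipeline handles the merged input as intended should be checked on a small subset first.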
