The baldr's intro from aerijman

BALDR

BALDR is a pipeline for reconstructing human or rhesus macaque immunoglobulin(Ig)/B cell receptor(BCR) sequences from single cell RNA-Seq data generated by Illumina sequencing.

BALDR is based on the de novo assembly of RNA-Seq reads. It allows reconstructions using following methods:

IG-mapped_Unmapped (human and rhesus) - Assemble reads mapping to Ig loci + Unmapped reads [default for human]
FilterNonIG (rhesus)- Assemble reads after filtering those that match to non-Ig genes in the genome [default for rhesus]
Unfiltered (human and rhesus) - Use all reads for assembling transcripts and select Ig transcript models
IG-mapped_only (human and rhesus) - Assemble reads mapping to Ig loci
Recombinome-mapped (human) - Assemble reads mapping to the Ig recombinome
IMGT-mapped (human) - Assemble reads mapping to IMGT V(D)J and C sequences

Installation (manual)

Pre-requisites

Trimmomatic 0.32
Trinity v2.3.2 (Newer versions are not compatible)
bowtie2 2.3.0
STAR v2.5.2b
samtools v1.3.1
IgBLAST v1.6.1 (Newer versions are not compatible)
seqtk 1.2
Perl 5

Install all the pre-requisites Clone or download the BALDR package.

cd BALDR
chmod +x BALDR

Docker image

You can pull the image from DockerHub

docker pull bosingerlab/baldr

You can use the Dockerfile to build an image.

mkdir baldr_docker
cd baldr_docker
wget https://raw.githubusercontent.com/BosingerLab/BALDR/master/Dockerfile
docker build -t bosingerlab/baldr .

Running BALDR

STAR genome index

If the --STAR_index flag is not provided, BALDR will download and generate the genome index. This takes time and if a docker container is used, it may end up using a lot more space for the docker image. It is thus recommended that the STAR genome index be generated before running BALDR. A pre-built index can also be loaded into the shared memory which will significantly decrease the runtime.

Human

mkdir -p STAR_GRCh38/STAR_GRCh38_index
cd STAR_GRCh38
wget ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.gtf.gz
gunzip *.gz
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir STAR_GRCh38_index --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh38.86.gtf --sjdbOverhang 100

Rhesus macaque

mkdir -p STAR_MacaM/STAR_MacaM_index
cd STAR_MacaM
wget https://www.unmc.edu/rhesusgenechip/MacaM_Rhesus_Genome_v7.fasta.bz2
bunzip2 MacaM_Rhesus_Genome_v7.fasta.bz2
wget https://www.unmc.edu/rhesusgenechip/RhesusGenomeUpload/MacaM_Rhesus_Genome_Annotation_v7.8.2.gtf
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir STAR_MacaM_index --genomeFastaFiles MacaM_Rhesus_Genome_v7.fasta --sjdbGTFfile MacaM_Rhesus_Genome_Annotation_v7.8.2.gtf --sjdbOverhang 100

Loading genome index in the shared memory

Before running BALDR, it is recommended to load the genome index in the memory

STAR --genomeDir </path/to/STAR/index/folder> --genomeLoad LoadAndExit

When BALDR has been run for all cells, run the following command to clear the genome index from the shared memory

STAR --genomeDir </path/to/STAR/index/folder> --genomeLoad Remove

Command line usage

Single-end:
./BALDR --single <file.fastq.gz> <options>

Paired-end:
./BALDR --paired <R1.fastq.gz,R2.fastq.gz> <options>

Options:
  --method       One or more reconstruction methods. For multiple methods, separte only by comma
  		 human: IG-mapped_Unmapped (default), Unfiltered, IG-mapped_only, IMGT-mapped, Recombinome-mapped 
  		 rhesus_monkey: FilterNonIG (default), Unfiltered, IG-mapped_only, IG-mapped_Unmapped
  --organism     human (default) or rhesus_monkey
  --trinity      Path for Trinity (e.g. ~/trinityrnaseq-Trinity-v2.3.2/Trinity) (required)
  --adapter      Path for the Trimmomatic adapter file (e.g. ~/Trimmomatic-0.36/adapters/NexteraPE-PE.fa) (required)
  --trimmomatic  Path for trimmomatic.jar file (e.g. ~/Trimmomatic-0.36/trimmomatic-0.36.jar) (required)
  --igblastn     Path for igblastn (e.g. ~/ncbi-igblast-1.6.1/bin/igblastn) (required)
  --db           Path for custom IgBLAST database. The folder should contain the following files: 
		 human_IG_V.*,human_IG_D.*,human_IG_J.*,human_IG_C.*,human_gl.aux OR
		 rhesus_monkey_V.*,rhesus_monkey_D.*,rhesus_monkey_J.*,rhesus_monkey_C.*,rhesus_monkey_gl.aux
		 If any file is not available, it can be copied from resources	 
  --STAR         Path for the STAR binary (required)
  --STAR_index   Path for the STAR aligner genome index
  --sharedMemory Flag to indicate if STAR should use LoadAndKeep for shared memory. (Default NoSharedMemory)
  --BALDR        Path for the BALDR directory (e.g. ~/BALDR) (required)
  --memory       Max memory for Trinity (default 32G)
  --threads      number of threads for STAR/bowtie2/Trinity (default 1)
  --version      Version
  --help         Print this help

Running BALDR using the Docker image

# Single end
docker run -v </path/to/STAR/index>:/genome -v </path/to/fastq/files>:/data -w /data bosingerlab/baldr /home/tools/BALDR-master/BALDR --single /data/<file1_R1_001.fastq.gz> --trimmomatic /home/tools/Trimmomatic-0.38/trimmomatic-0.38.jar --adapter /home/tools/Trimmomatic-0.38/adapters/NexteraPE-PE.fa --BALDR /home/tools/BALDR-master --trinity /home/tools/trinityrnaseq-Trinity-v2.3.2/Trinity --memory 32G --threads 16 --STAR /home/tools/STAR-2.6.0c/bin/Linux_x86_64/STAR --STAR_index /genome --igblastn  /home/tools/ncbi-igblast-1.6.1/bin/igblastn

# Paired end
docker run -v </path/to/STAR/index>:/genome -v </path/to/fastq/files>:/data -w /data bosingerlab/baldr /home/tools/BALDR-master/BALDR --paired /data/<file1_R1_001.fastq.gz>,/data/<file1_R2_001.fastq.gz> --trimmomatic /home/tools/Trimmomatic-0.38/trimmomatic-0.38.jar --adapter /home/tools/Trimmomatic-0.38/adapters/NexteraPE-PE.fa --BALDR /home/tools/BALDR-master --trinity /home/tools/trinityrnaseq-Trinity-v2.3.2/Trinity --memory 64G --threads 32 --STAR /home/tools/STAR-2.6.0c/bin/Linux_x86_64/STAR --STAR_index /genome --igblastn  /home/tools/ncbi-igblast-1.6.1/bin/igblastn

Output directories

The output is written in the working directory. The following folders are created:

Trimmed - contains trimmed reads output by Trimmomatic
STAR - contains output of STAR aligner
IG-mapped_Unmapped/FilterNonIG/Unfiltered/IG-mapped/Recombinome-mapped/IMGT-mapped - Folder created based on number of methods used Each of the above folder contains:
1. Trinity - output of Trinity assembly
2. IgBLAST - output of IgBLAST
3. IgBLAST/tabular - (ii) parsed into a tabular format
4. IgBLAST_quant - (iii) + number of reads mapped to the complete sequence and VDJ sequence
5. IgBLAST_quant_sorted - (iv) sorted by number of reads mapped to the complete sequence
6. IgBLAST_quant_sorted_filtered - (v) filtered to retain only productive sequences and remove redundant models. The final results are saved in this folder.

Aggregate results for all cells

The summary_IGH.pl and summary_IGKL.pl scripts in BALDR/lib can be used to aggregate results for all the cells.

./summary_IGH.pl <path/to/IgBLAST_quant_sorted_filtered>
./summary_IGKL.pl <path/to/IgBLAST_quant_sorted_filtered>

This will generate a tab separated file with all filtered models for all cells.

The columns of the Results_IGH_rank_all.txt file are as follows:

Column number	Column	Description
1	SampleID	Cell/Sample file name
2	Chain	Heavy
3	Method	Method used
4	Rank	Rank sorted by # reads mapped
5	Query	Transcript ID
6	Error	Rare cases of misassembly where both IGH and IGL genes predicted
7	Variable coords	V start to C start/V start to J end (if no C hit)
8	Model length	Length of complete chain
9	Bowtie2 count	Number of reads mapping to complete sequence
10	VDJ length	Length of VDJ region
11	VDJ Bowtie2 count	Number of reads mapping to VDJ sequence
12	V subject id	blast results for top V hit
13	V % identity
14	V alignment length
15	V mismatches
16	V gap opens
17	V gaps
18	V q. start
19	V q. end
20	V s. start
21	V s. end
22	V evalue
23	V bit score
24	J subject id	blast results for top J hit
25	J % identity
26	J alignment length
27	J mismatches
28	J gap opens
29	J gaps
30	J q. start
31	J q. end
32	J s. start
33	J s. end
34	J evalue
35	J bit score
36	C subject id	blast results for top C hit
37	C % identity
38	C alignment length
39	C mismatches
40	C gap opens
41	C gaps
42	C q. start
43	C q. end
44	C s. start
45	C s. end
46	C evalue
47	C bit score
48	Top V gene match	IgBLAST results
49	Top D gene match
50	Top J gene match
51	Chain type
52	stop codon
53	V-J frame
54	Productive
55	Strand
56	V end
57	V-D junction
58	D region
59	D-J junction
60	J start
61	CDR3
62	CDR3_nucleotide
63	CDR3_aminoacid
64	Orientation
65	Sequence
66	Sequence(+)	Sequence in the + orientation
67	VDJ sequence	VDJ sequence in the + orientation

The columns of the Results_IGKL_rank_all.txt file are as follows:

Column number	Column	Description
1	Sample	Cell/Sample file name
2	Chain	Light
3	Method	Method used
4	Rank	Rank sorted by # reads mapped
5	Query	Transcript ID
6	Error	Rare cases of misassembly where both IGH and IGL genes predicted
7	Variable coords	V start to C start/V start to J end (if no C hit)
8	Model length	Length of complete chain
9	Bowtie2 count	Number of reads mapping to complete sequence
10	VJ length	Length of VJ region
11	VJ Bowtie2 count	Number of reads mapping to VJ sequence
12	V subject id	blast results for top V hit
13	V % identity
14	V alignment length
15	V mismatches
16	V gap opens
17	V gaps
18	V q. start
19	V q. end
20	V s. start
21	V s. end
22	V evalue
23	V bit score
24	J subject id	blast results for top J hit
25	J % identity
26	J alignment length
27	J mismatches
28	J gap opens
29	J gaps
30	J q. start
31	J q. end
32	J s. start
33	J s. end
34	J evalue
35	J bit score
36	C subject id	blast results for top C hit
37	C %identity
38	C alignment length
39	C mismatches
40	C gap opens
41	C gaps
42	C q. start
43	C q. end
44	C s. start
45	C s. end
46	C evalue
47	C bit score
48	Top V gene match	IgBLAST results
49	Top J gene match
50	Chain type
51	stop codon
52	V-J frame
53	Productive
54	Strand
55	V end
56	V-J junction
57	J start
58	CDR3
59	CDR3_nucleotide
60	CDR3_aminoacid
61	Orientation
62	Sequence
63	Sequence(+)	Sequence in the + orientation
64	VJ sequence	VJ sequence in the + orientation

Clonal assignment

... in progress

aerijman / baldr Goto Github PK

baldr's Introduction

BALDR

Installation (manual)

Pre-requisites

Docker image

Running BALDR

STAR genome index

Loading genome index in the shared memory

Command line usage

Running BALDR using the Docker image

Output directories

Aggregate results for all cells

Clonal assignment

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent