UMI (Unique Molecular Identifier) Processing Pipeline

NCGR and WPI are supported by the Gordon and Betty Moore Foundation grant number 4823 to develop single cell RNA seq in plants. This research includes preparation and sequencing of Illumina RNA-seq libraries having Unique Molecular Identifiers (UMI) to enable counting of mRNA molecules [1]. We developed a series of programs in Perl and C to process sequence data containing UMI. Programmers and analysts familiar with handling sequence data with these languages will be able to use and adapt these programs and we hope you will find them useful. However, this is not finished software. Its usefulness will be limited unless you can read and edit C and Perl, and unless you are familiar with sequence files, formats, and common practices. You will need to install libraries, edit makefiles to match your environment, and set the path to your Perl. Our use case only involves single ended sequencing at this stage. Steps in the workflow and the scripts to use are as follows:

Quality filter your sequence data

Use the C project src/c/fastq_qual_filter. This requires the Better String Library shared object to be in your LD_LIBRARY_PATH and bstring.h should be somewhere sensible. This library makes your string manipulations in C safer and easier. Processing big FASTQ files in Perl can be slow which is why C is used here. Go into src/c/fastq_qual_filter and edit the makefile. The compiled program takes a FASTQ file and filters on any substring of your choice. This is important because you may wish to apply a stricter quality threshold to the UMI itself than the rest of your sequence, or your UMI position in the read may not be the same as ours. You can filter the entire read on one quality then take the output and then filter just on the UMI. This assumes your FASTQ file has been demultiplexed already, i.e. contains only one index, or that you don't care what the indices are.

Usage: fastq_qual_filter infile goodfile badfile logfile min_ok offset length

infile: path to a fastq file
goodfile: path to output fastq file for passing reads
badfile: path to output fastq file for failing reads
logfile: path to file for logging (not used)
min_ok: pass/fail mean quality threshold
offset: 0-based position to begin the substring to examine
length: length of substring to examine

Like: fastq_qual_filter disposable.fq good.fq bad.fq log 30 0 10

Warning: not much arg checking is done so unexpected behavior may be encountered. For example if you enter a substring longer than the read length, or negative values. Output files will be overwritten without warning.

Remove the UMI from each sequence read

Use the C project src/c/fastq_umi_clipper. This requires the Better String Library (https://github.com/msteinert/bstring) shared object to be in your LD_LIBRARY_PATH and bstring.h should be somewhere sensible. Edit the makefile to match your environment. You tell the program how long the UMI is at the start of each read. It replaces the index at the end of the FASTQ header with the UMI. Optionally, it erases more bases at the 5' end of each read. This is used when you make your RNA-seq library using the Clontech SMRTER template switching system, which leaves GGG or GGGG at the 5' end of the read and you don't want these bases in your alignments.

Usage: fastq_umi_clipper fastqfilename clip_length g_length
- fastqfilename: path to a fastq file
- clip_length: number of nucleotides to move from the start of the read and put in header 1
- g_length: number of nucleotides (Gs) to delete after the UMI (can be 0)
Warning: not much arg checking is done so unexpected behavior may be encountered. For example if you enter a clip_length longer than the read length, or negative values. Output files will be overwritten without warning.

Align your reads to a genomic reference sequence

Do this however you want. We use STAR specifying no introns. To get to the next step you need a sorted SAM file.

Eliminate duplicate UMI at the the same genomic position

Run the Perl script umi_counterizer.pl on your sorted SAM file. This requires the packages Text::Levenshtein::XS and Try::Tiny.

Usage: umi_counterizer.pl sorted_samfile max_edit_distance At the same mapping position on the genome, UMI greater than max_edit_distance from each other will be considered different UMI and not collapsed into one.

What it does:
- Open a sorted SAM file.
- Identify UMI duplicates at the same position.
- Collapse duplicates pairwise into most abundant UMI
- Count the UMIs at each position
- Print to STDOUT
You end up with the UMI associated with reads mapping to positions on the genome: position, number of UMI, the UMIs (space delimited):
```
Chr01:219194 1 CAAGGTATAG
Chr01:219196 6 TGATGGAGGG TCAACCGGGT TGGGGAAAGG AAGTAGAAGG TATGTAACAG GCTTCCAGGG
Chr01:219204 1 GTGGGGTCCT
Chr01:219221 1 TCCTTTGGTT
Chr01:219226 2 GGTCAAACCA TGGTCATCCC
Chr01:219243 2 TGGTTTGTGA CAGTTTGTGC
Chr01:219245 3 GGGGCCCACT TCTGTATGTA CGCGGGTCCC
etc.
```

Some logging is done and sent to all_sam_umis.txt (every UMI observed in the SAM file) and collapsed_sam_umis.txt (non-redundant list of UMI found in the SAM file). Both are sorted.

Find the genomic positions of every UMI

Next step is to find all of the positions where a UMI maps in the genome. You do with with the C project in src/c/umi_analyzer_uthash. Keeping track of positions associated with a UMI requires a hash or map structure, which C lacks. Unfortunately using a scripting language for this task is far too slow. Fortunately there are C implementations of hashes. One of these is uthash, which is implemented entirely as preprocessor macros. So put uthash.h somewhere sensible and there is no need to link to a library. The input to this program is the output of umi_counterizer.pl. There is no flexibility here, and not much error checking. The emphasis here is solely on speed, making important assumptions about the structure of the input. This makes it possible to use fgets() instead of getline() to peel lines off a filehandle, and strtok_r() expects space delimited input lines. You need to edit exe.c before compiling. Specify the length of the UMI and the maximum length of the position string in bytes. Also specify the maximum expected length of a line in the umi counts file.

Some logging is done to record all UMI found in the input file. These are to be found in analyzer_umis.txt. The list should be the same as collapsed_sam_umis.txt from the umi_counterizer.pl step previous.

Output is each UMI and where it maps in the genome. This file is usually called umi_positions.txt.

TGTAAATTGG Chr01:76001 Chr01:76002 
GTTGCGGGCT Chr01:76002 
ACAGGTTAGC Chr01:76007 
GGGCAGTGTT Chr01:76059 
GTTTGGGGGG Chr01:76065 Chr01:6054609 
TAAATGAGGG Chr01:76075 
GCGTTGTCTT Chr01:76079 
CGCGTTGTCT Chr01:76080 
TCTGTGGCGC Chr01:76087 
GCAGTGAGGG Chr01:76103 Chr01:76104 
GGTTTTGTGG Chr01:76122 Chr01:76123 
TGGGTTGGGG Chr01:76122 Chr01:76123 Chr01:76124 Chr01:10482858 
etc.

Count UMI at each genomic position

Next step is to count unique UMI at any position where a UMI is found. This is one more step towards what we really want, which is a count of UMI for every gene, which would allow differential gene expression analysis. Run the Perl script umi_clusterizer.pl using the umi_positions.txt file from the previous step as input. The output will be positions on the genome and the number of unique UMI found in reads mapping to those positions:

Chr01:10482857  25
Chr01:10482858  70
Chr01:10482859  90
Chr01:10482860  33
Chr01:10482861  11
Chr01:10482862  5
Chr01:10482865  3
Chr01:10482866  3
Chr01:10482867  4
Chr01:10482868  3
Chr01:10482869  1
etc.

Obtain a table of genes and UMI mapping to those genes

This is the final step in obtaining an gene expression count table. This is done by looking up the positions of the UMI in the previous step and finding the corresponding annotations in a GFF file. Run the Perl script gff3_parserizer.pl:

Usage: gff3_parserizer.pl gff_file umi_file

gff_file: GFF fprmatted annotations for your organism. Positions are grouped on the feature "gene" so your GFF file must contain these.
umi_file: output file from previous step

This script makes a GFF3 file into a structure suitable for fast look ups of genes just based on a position. It makes a large table in memory of every nucleotide position in any gene, pointing to the attributes field of that gene. Each input position is looked up in this table and the corresponding gene identified. The UMI counts at the psotion are added to those for the gene.

The output is a gene and UMI count table like this, which is the start of a differential gene expression analysis experiment:

ID=Pp3c18_8100;gene_id=Pp3c18_8100;ancestorIdentifier=Pp3c18_8100.v3.1   1987
ID=Pp3c22_5610;gene_id=Pp3c22_5610;ancestorIdentifier=Pp3c22_5610.v3.1   1827
ID=Pp3c21_3950;gene_id=Pp3c21_3950;ancestorIdentifier=Pp3c21_3950.v3.1   1486
ID=Pp3c3_18750;gene_id=Pp3c3_18750;ancestorIdentifier=Pp3c3_18750.v3.1   1251
ID=Pp3c19_21160;gene_id=Pp3c19_21160;ancestorIdentifier=Pp3c19_21160.v3.1   1033
ID=Pp3c7_14450;gene_id=Pp3c7_14450;ancestorIdentifier=Pp3c7_14450.v3.1   1032
ID=Pp3c13_7900;gene_id=Pp3c13_7900;ancestorIdentifier=Pp3c13_7900.v3.1   861
ID=Pp3c19_20900;gene_id=Pp3c19_20900;ancestorIdentifier=Pp3c19_20900.v3.1   835
ID=Pp3c13_5930;gene_id=Pp3c13_5930;ancestorIdentifier=Pp3c13_5930.v3.1   722
ID=Pp3c3_10750;gene_id=Pp3c3_10750;ancestorIdentifier=Pp3c3_10750.v3.1   710
ID=Pp3c5_4230;gene_id=Pp3c5_4230;ancestorIdentifier=Pp3c5_4230.v3.1   625
etc.

[1] Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., Lönnerberg, P., Linnarsson, S., 2014. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166. doi:10.1038/nmeth.2772

To be documented

umi_proximitizer.pl umi_proximitizer_detail.pl

leangreen / umi-analysis Goto Github PK

umi-analysis's Introduction

UMI (Unique Molecular Identifier) Processing Pipeline

Quality filter your sequence data

Remove the UMI from each sequence read

Align your reads to a genomic reference sequence

Eliminate duplicate UMI at the the same genomic position

Find the genomic positions of every UMI

Count UMI at each genomic position

Obtain a table of genes and UMI mapping to those genes

To be documented

umi-analysis's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent