Coder Social home page Coder Social logo

leangreen / umi-analysis Goto Github PK

View Code? Open in Web Editor NEW

This project forked from dfajar2/umi-analysis

0.0 2.0 0.0 10.58 MB

Scripts for processing Illumina sequence data containing unique molecular identifiers (UMI)

License: Other

C 43.20% Makefile 1.21% Perl 55.59%

umi-analysis's Introduction

UMI (Unique Molecular Identifier) Processing Pipeline

NCGR and WPI are supported by the Gordon and Betty Moore Foundation grant number 4823 to develop single cell RNA seq in plants. This research includes preparation and sequencing of Illumina RNA-seq libraries having Unique Molecular Identifiers (UMI) to enable counting of mRNA molecules [1]. We developed a series of programs in Perl and C to process sequence data containing UMI. Programmers and analysts familiar with handling sequence data with these languages will be able to use and adapt these programs and we hope you will find them useful. However, this is not finished software. Its usefulness will be limited unless you can read and edit C and Perl, and unless you are familiar with sequence files, formats, and common practices. You will need to install libraries, edit makefiles to match your environment, and set the path to your Perl. Our use case only involves single ended sequencing at this stage. Steps in the workflow and the scripts to use are as follows:

Quality filter your sequence data

Use the C project src/c/fastq_qual_filter. This requires the Better String Library shared object to be in your LD_LIBRARY_PATH and bstring.h should be somewhere sensible. This library makes your string manipulations in C safer and easier. Processing big FASTQ files in Perl can be slow which is why C is used here. Go into src/c/fastq_qual_filter and edit the makefile. The compiled program takes a FASTQ file and filters on any substring of your choice. This is important because you may wish to apply a stricter quality threshold to the UMI itself than the rest of your sequence, or your UMI position in the read may not be the same as ours. You can filter the entire read on one quality then take the output and then filter just on the UMI. This assumes your FASTQ file has been demultiplexed already, i.e. contains only one index, or that you don't care what the indices are.

Usage: fastq_qual_filter infile goodfile badfile logfile min_ok offset length

  • infile: path to a fastq file

  • goodfile: path to output fastq file for passing reads

  • badfile: path to output fastq file for failing reads

  • logfile: path to file for logging (not used)

  • min_ok: pass/fail mean quality threshold

  • offset: 0-based position to begin the substring to examine

  • length: length of substring to examine

    Like: fastq_qual_filter disposable.fq good.fq bad.fq log 30 0 10

    Warning: not much arg checking is done so unexpected behavior may be encountered. For example if you enter a substring longer than the read length, or negative values. Output files will be overwritten without warning.

    Remove the UMI from each sequence read

    Use the C project src/c/fastq_umi_clipper. This requires the Better String Library (https://github.com/msteinert/bstring) shared object to be in your LD_LIBRARY_PATH and bstring.h should be somewhere sensible. Edit the makefile to match your environment. You tell the program how long the UMI is at the start of each read. It replaces the index at the end of the FASTQ header with the UMI. Optionally, it erases more bases at the 5' end of each read. This is used when you make your RNA-seq library using the Clontech SMRTER template switching system, which leaves GGG or GGGG at the 5' end of the read and you don't want these bases in your alignments.

    Usage: fastq_umi_clipper fastqfilename clip_length g_length

    • fastqfilename: path to a fastq file
    • clip_length: number of nucleotides to move from the start of the read and put in header 1
    • g_length: number of nucleotides (Gs) to delete after the UMI (can be 0)

    Warning: not much arg checking is done so unexpected behavior may be encountered. For example if you enter a clip_length longer than the read length, or negative values. Output files will be overwritten without warning.

    Align your reads to a genomic reference sequence

    Do this however you want. We use STAR specifying no introns. To get to the next step you need a sorted SAM file.

    Eliminate duplicate UMI at the the same genomic position

    Run the Perl script umi_counterizer.pl on your sorted SAM file. This requires the packages Text::Levenshtein::XS and Try::Tiny.

    Usage: umi_counterizer.pl sorted_samfile max_edit_distance At the same mapping position on the genome, UMI greater than max_edit_distance from each other will be considered different UMI and not collapsed into one.

    What it does:

    • Open a sorted SAM file.
    • Identify UMI duplicates at the same position.
    • Collapse duplicates pairwise into most abundant UMI
    • Count the UMIs at each position
    • Print to STDOUT

    You end up with the UMI associated with reads mapping to positions on the genome: position, number of UMI, the UMIs (space delimited):

    Chr01:219194 1 CAAGGTATAG
    Chr01:219196 6 TGATGGAGGG TCAACCGGGT TGGGGAAAGG AAGTAGAAGG TATGTAACAG GCTTCCAGGG
    Chr01:219204 1 GTGGGGTCCT
    Chr01:219221 1 TCCTTTGGTT
    Chr01:219226 2 GGTCAAACCA TGGTCATCCC
    Chr01:219243 2 TGGTTTGTGA CAGTTTGTGC
    Chr01:219245 3 GGGGCCCACT TCTGTATGTA CGCGGGTCCC
    etc.
    

Some logging is done and sent to all_sam_umis.txt (every UMI observed in the SAM file) and collapsed_sam_umis.txt (non-redundant list of UMI found in the SAM file). Both are sorted.

Find the genomic positions of every UMI

Next step is to find all of the positions where a UMI maps in the genome. You do with with the C project in src/c/umi_analyzer_uthash. Keeping track of positions associated with a UMI requires a hash or map structure, which C lacks. Unfortunately using a scripting language for this task is far too slow. Fortunately there are C implementations of hashes. One of these is uthash, which is implemented entirely as preprocessor macros. So put uthash.h somewhere sensible and there is no need to link to a library. The input to this program is the output of umi_counterizer.pl. There is no flexibility here, and not much error checking. The emphasis here is solely on speed, making important assumptions about the structure of the input. This makes it possible to use fgets() instead of getline() to peel lines off a filehandle, and strtok_r() expects space delimited input lines. You need to edit exe.c before compiling. Specify the length of the UMI and the maximum length of the position string in bytes. Also specify the maximum expected length of a line in the umi counts file.

Some logging is done to record all UMI found in the input file. These are to be found in analyzer_umis.txt. The list should be the same as collapsed_sam_umis.txt from the umi_counterizer.pl step previous.

Output is each UMI and where it maps in the genome. This file is usually called umi_positions.txt.

TGTAAATTGG Chr01:76001 Chr01:76002 
GTTGCGGGCT Chr01:76002 
ACAGGTTAGC Chr01:76007 
GGGCAGTGTT Chr01:76059 
GTTTGGGGGG Chr01:76065 Chr01:6054609 
TAAATGAGGG Chr01:76075 
GCGTTGTCTT Chr01:76079 
CGCGTTGTCT Chr01:76080 
TCTGTGGCGC Chr01:76087 
GCAGTGAGGG Chr01:76103 Chr01:76104 
GGTTTTGTGG Chr01:76122 Chr01:76123 
TGGGTTGGGG Chr01:76122 Chr01:76123 Chr01:76124 Chr01:10482858 
etc.

Count UMI at each genomic position

Next step is to count unique UMI at any position where a UMI is found. This is one more step towards what we really want, which is a count of UMI for every gene, which would allow differential gene expression analysis. Run the Perl script umi_clusterizer.pl using the umi_positions.txt file from the previous step as input. The output will be positions on the genome and the number of unique UMI found in reads mapping to those positions:

Chr01:10482857  25
Chr01:10482858  70
Chr01:10482859  90
Chr01:10482860  33
Chr01:10482861  11
Chr01:10482862  5
Chr01:10482865  3
Chr01:10482866  3
Chr01:10482867  4
Chr01:10482868  3
Chr01:10482869  1
etc.

Obtain a table of genes and UMI mapping to those genes

This is the final step in obtaining an gene expression count table. This is done by looking up the positions of the UMI in the previous step and finding the corresponding annotations in a GFF file. Run the Perl script gff3_parserizer.pl:

Usage: gff3_parserizer.pl gff_file umi_file

  • gff_file: GFF fprmatted annotations for your organism. Positions are grouped on the feature "gene" so your GFF file must contain these.
  • umi_file: output file from previous step

This script makes a GFF3 file into a structure suitable for fast look ups of genes just based on a position. It makes a large table in memory of every nucleotide position in any gene, pointing to the attributes field of that gene. Each input position is looked up in this table and the corresponding gene identified. The UMI counts at the psotion are added to those for the gene.

The output is a gene and UMI count table like this, which is the start of a differential gene expression analysis experiment:

ID=Pp3c18_8100;gene_id=Pp3c18_8100;ancestorIdentifier=Pp3c18_8100.v3.1   1987
ID=Pp3c22_5610;gene_id=Pp3c22_5610;ancestorIdentifier=Pp3c22_5610.v3.1   1827
ID=Pp3c21_3950;gene_id=Pp3c21_3950;ancestorIdentifier=Pp3c21_3950.v3.1   1486
ID=Pp3c3_18750;gene_id=Pp3c3_18750;ancestorIdentifier=Pp3c3_18750.v3.1   1251
ID=Pp3c19_21160;gene_id=Pp3c19_21160;ancestorIdentifier=Pp3c19_21160.v3.1   1033
ID=Pp3c7_14450;gene_id=Pp3c7_14450;ancestorIdentifier=Pp3c7_14450.v3.1   1032
ID=Pp3c13_7900;gene_id=Pp3c13_7900;ancestorIdentifier=Pp3c13_7900.v3.1   861
ID=Pp3c19_20900;gene_id=Pp3c19_20900;ancestorIdentifier=Pp3c19_20900.v3.1   835
ID=Pp3c13_5930;gene_id=Pp3c13_5930;ancestorIdentifier=Pp3c13_5930.v3.1   722
ID=Pp3c3_10750;gene_id=Pp3c3_10750;ancestorIdentifier=Pp3c3_10750.v3.1   710
ID=Pp3c5_4230;gene_id=Pp3c5_4230;ancestorIdentifier=Pp3c5_4230.v3.1   625
etc.

[1] Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., Lönnerberg, P., Linnarsson, S., 2014. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166. doi:10.1038/nmeth.2772


To be documented

umi_proximitizer.pl umi_proximitizer_detail.pl

umi-analysis's People

Contributors

phlatphish avatar

Watchers

James Cloos avatar LeanGreen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.