Coder Social home page Coder Social logo

vivekbhr / icetea Goto Github PK

View Code? Open in Web Editor NEW
2.0 1.0 0.0 9.04 MB

R/Bioconductor package for analysis of single and paired-end capped-RNAseq data (CAGE/RAMPAGE/MAPCap)

Home Page: https://vivekbhr.github.io/icetea/

License: GNU General Public License v3.0

R 100.00%
cage transcriptomics rna-seq expression

icetea's People

Contributors

kayla-morrell avatar nturaga avatar vivekbhr avatar vobencha avatar

Stargazers

 avatar  avatar

Watchers

 avatar

icetea's Issues

filterDuplicates with `UMI` option

filterDuplicates assumes that UMI is present and throws an error for CAGE data (no UMI) . Add an option called UMI (default = TRUE) such that users could set this to FALSE to remove duplicates using only 5' end.

plot read metrics

provide a function to plot the read processing matrices from the CapSet object

count paired-end

In order to avoid filtering the files for single end reads, paired-end reads should be counted.

  • Find a way to consider both reads of a fragment during PCR duplicate filtering.
  • During TSS calling, first base of the read 1 (for MAPCap/RAMPAGE) or read 2 (for TruSeq) should be overlapped and counted within the windows. 2nd read should be ignored.
  • But since bamFiltering would not be needed after implementing this, we can turn on the counting of fragments in case of gene-level summarization of counts later (for DE, correlation with RNAseq etc.)

reduce deps

try to reduce dependencies : plyr and reshape2 can go away

multicore improvements

  • Initiate parallel backend early when multiple calls to bplapply is used in the function
  • add BPPARAM to fitdiffTSS

Move to subjunc for alignment

Move to subjunc for alignment since subread performs bad at the splice anchors, leading to false mapping of the spliced read.

calculate FRIT score

during TSS detection, also get number of reads per sample within peaks and save within the S4 object

Travis

Follwing changes pending for unit testing.

  • Activate travis builds
  • Write testthat checks.
  • Make all examples work (if possible).

use summarizeoverlaps

Instead of using regioncounts and windowcounts from csaw, use summarizeOverlaps so that strand could be considered.

selective input of reads into duplicate removal

This would be a big change from the beginning of the workflow. In principle it's possible to do the PCR duplicate removal and demultiplexing at the same time. The idea is :

  • During mapping, add the readGroupID and readGroup tags to the bam file, based on the fastq sample index (I don't know how).
  • Directly use the filterDuplicates function, but input the reads from the bam file selectively by tag, then output one filtered file per tag, using the tagFilter arg in ScanBamParam

S4 implementation

For ease of use and going along with bioc infrastructure, I am implementing S4 classes to handle the analysis..

  • S4 class CapSet contructed with raw fastq files and barcode info.

  • function trimfastqIndex should use this and trim the index, plus should attach the path to trimmed fastq into it.

  • function demultiplex_fastq can use the trimmed output to demultiplex fastq and add the demult_reads info + path to demult files into sampleinfo slot.

  • function mapCaps should be able to execute in a multiplexed and demultiplxed mode, depending upon whether or not the sampleinfo slot has multiplxed info. It should then attach the prop of mapped reads into the object (either in sampleinfo or somewhere else.

  • function splitBAM_byIndex may or may not be needed depending on pre/post mapping demultiplexing. in case of post-mapping demultiplexing, it can still use the capset object and fill the demult_reads info in the sampleInfo slot.

  • function filterduplicates should be able to use the CapSet object. In this case the function would run on all the bamfiles and append the stats of dup removal in the sampleInfo slot called dedup_reads. If not given the CapSet object, it would work as a single sample function (current status).

  • function detectTSS should also be able to use the capset object, in this case It would append the design matrix to the object. In this case the samplenames should match the existing sample names in the sampleinfo slot.

  • To push it further, the function fit_diffTSS can also use the CapSet object, replacing the args bam.files and design and taking it from the object. detect_diffTSS would stay the same.

New plotting functions can then be added, taking info from the sampleinfo slot, one can make the plots for mapped reads, demult_reads, dedup_reads and reads_inTSS (i.e. number of reads falling in the detected TSS) for each sample

taking into consideration strandedness

Looking in to detetcTSS.R file and the strandBinCounts function precisely (line 28), reads are counted as single-end no matter the info here (CSobject@paired_end).

fdata <-
        GenomicAlignments::summarizeOverlaps(
            features = windows$gr.plus,
            reads = bam.files,
            mode = "IntersectionStrict",
            ignore.strand = FALSE,
            inter.feature = ignoreMultiMap,
            singleEnd = TRUE,
            fragments = FALSE,
            preprocess.reads = func,
            param = bam_param,
            BPPARAM = bp_param)

So I think that you don't want to count twice the reads in both the plus (fdata) and minus signs (rdata), but what if we have a stranded library.

fdata <-
        GenomicAlignments::summarizeOverlaps(
            features = windows$gr.plus,
            reads = BamFileList(bam.files,asMates = TRUE),
            mode = "IntersectionStrict",
            ignore.strand = FALSE,
            inter.feature = ignoreMultiMap,
            singleEnd = FALSE,
            fragments = FALSE,
            strandMode = 2,
            preprocess.reads = func,
            param = bam_param,
            BPPARAM = bp_param)

And same thing in line 203:

wider <-
                suppressWarnings({
                GenomicAlignments::summarizeOverlaps(
                    features = neighbors,
                    reads = BamFileList(bam.files,asMates = TRUE),
                    mode = "IntersectionStrict",
                    ignore.strand = FALSE,
                    inter.feature = FALSE,
                    singleEnd = FALSE,
                    fragments = FALSE,
                    strandMode=2,
                    preprocess.reads = ppfunc,
                    param = bamParams,
                    BPPARAM = bpParams)
                  })

Would you tell me if this is ok? or if counting reads as paired end will have an effect on the analysis?
Thanks in advance.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.