
umi4cats's Introduction

UMI4Cats

[Badges: GitHub issues · Lifecycle: stable · R-CMD-check-bioc]

Bioconductor release status: build status badges for the devel and release branches (see the Bioconductor build reports).

The goal of UMI4Cats is to provide an easy-to-use package to analyze UMI-4C contact data.

Installation

You can install the latest release of UMI4Cats from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("UMI4Cats")

If you want to test the development version, you can install it from the GitHub repository:

BiocManager::install("Pasquali-lab/UMI4Cats")

Now you can load the package using library(UMI4Cats).

Basic usage

For detailed instructions on how to use UMI4Cats, please see the vignette.

library(UMI4Cats)
## 0) Download example data -------------------------------
path <- downloadUMI4CexampleData()

## 1) Generate Digested genome ----------------------------
# The selected RE in this case is DpnII (|GATC), so cut_pos is 0 and res_enz is "GATC".
hg19_dpnii <- digestGenome(
    cut_pos = 0,
    res_enz = "GATC",
    name_RE = "DpnII",
    ref_gen = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    out_path = file.path(tempdir(), "digested_genome/")
)

## 2) Process UMI-4C fastq files --------------------------
raw_dir <- file.path(path, "CIITA", "fastq")

contactsUMI4C(
    fastq_dir = raw_dir,
    wk_dir = file.path(path, "CIITA"),
    bait_seq = "GGACAAGCTCCCTGCAACTCA",
    bait_pad = "GGACTTGCA",
    res_enz = "GATC",
    cut_pos = 0,
    digested_genome = hg19_dpnii,
    bowtie_index = file.path(path, "ref_genome", "ucsc.hg19.chr16"),
    ref_gen = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    threads = 5
)

## 3) Get filtering and alignment stats -------------------
statsUMI4C(wk_dir = file.path(path, "CIITA"))

## 4) Analyze UMI-4C results ------------------------------
# Load sample processed file paths
files <- list.files(file.path(path, "CIITA", "count"),
    pattern = "*_counts.tsv",
    full.names = TRUE
)

# Create colData including all relevant information
colData <- data.frame(
    sampleID = gsub("_counts.tsv.gz", "", basename(files)),
    file = files,
    stringsAsFactors = FALSE
)

library(tidyr)
colData <- colData %>%
    separate(sampleID,
        into = c("condition", "replicate", "viewpoint"),
        remove = FALSE
    )

# Load UMI-4C data and generate UMI4C object
umi <- makeUMI4C(
    colData = colData,
    viewpoint_name = "CIITA",
    grouping = "condition"
)

## 5) Perform differential test ---------------------------
umi <- fisherUMI4C(umi,
    grouping = "condition",
    filter_low = 20
)

## 6) Plot results ----------------------------------------
plotUMI4C(umi,
    grouping = "condition",
    ylim = c(0, 15),
    xlim = c(10.75e6, 11.25e6)
)

Code of Conduct

Please note that the UMI4Cats project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

umi4cats's People

Contributors

jwokaty, lshep, mireia-bioinfo, msubirana, nturaga


umi4cats's Issues

Duplicated normalization in calculateAdaptativeTrend

Hi,
Thanks for developing this package.

In the calculateAdaptativeTrend function, I think the following line performs a duplicate normalization:
if (!is.null(assays(umi4c)$norm_mat) & normalized) vector_trend <- vector_trend * assays(umi4c)$norm_mat

When normalized == TRUE, vector_trend is computed from the already-normalized dgram, so it should not be multiplied by norm_mat again.
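
A minimal sketch of the guard this would imply, assuming the caller knows whether the trend was computed from raw or normalized counts (the scaleTrend name and the trend_is_normalized flag are hypothetical, not part of UMI4Cats):

library(SummarizedExperiment)

scaleTrend <- function(umi4c, vector_trend, trend_is_normalized = TRUE) {
    # Only rescale by norm_mat when the trend was computed from raw counts;
    # if it already reflects the normalized dgram, leave it untouched.
    if (!is.null(assays(umi4c)$norm_mat) && !trend_is_normalized) {
        vector_trend <- vector_trend * assays(umi4c)$norm_mat
    }
    vector_trend
}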

[BUG] UMI4Cats problems reported in the Multiple platform build/check report for BioC 3.19

According to the Multiple platform build/check report for BioC 3.19,
the UMI4Cats package has the following problem(s):

o ERROR for 'R CMD check' on nebbiolo1. See the details here:
https://master.bioconductor.org/checkResults/3.19/bioc-LATEST/UMI4Cats/nebbiolo1-checksrc.html

Please take the time to address this by committing and pushing
changes to your package at git.bioconductor.org

Notes:

  • This was the status of your package at the time this email was sent to you.
    Given that the online report is updated daily (in normal conditions) you
    could see something different when you visit the URL(s) above, especially if
    you do so several days after you received this email.

  • It is possible that the problems reported in this report are false positives,
    either because another package (from CRAN or Bioconductor) breaks your
    package (if yours depends on it) or because of a Build System problem.
    If this is the case, then you can ignore this email.

  • Please check the report again 24h after you've committed your changes to the
    package and make sure that all the problems have gone.

  • If you have questions about this report or need help with the
    maintenance of your package, please use the Bioc-devel mailing list:

    https://bioconductor.org/help/mailing-list/

    (all package maintainers are requested to subscribe to this list)

For immediate notification of package build status, please
subscribe to your package's RSS feed. Information is at:

https://bioconductor.org/developers/rss-feeds/

Thanks for contributing to the Bioconductor project!

Single-end?

Hi,
Is it possible to use UMI4Cats with single-end 4C data?
Thanks,
Best,
Stéphanie

Issue with contactsUMI4C terminating on splitUMI4C step.

Hi, thank you for making the UMI4Cats pipeline! Apologies in advance if this error is caused by my own use of the pipeline. I have managed to digest my genome with DpnII; however, the contactsUMI4C part of the pipeline keeps terminating on splitUMI4C with the following error. Do you have any advice on how I can resolve this problem? Thanks again for making this pipeline! All the best, Sunniyat

[2021-11-10 12:20:01] Starting splitUMI4C using:

Work directory: UMI4C_IRX3_CRISPR-304755451/FASTQ_Generation_i7_only-484826572/PF382_L001-ds.fad6bdb0503f4b64b7c71c0f112f8ead/
Cut position: 0
Restriction enzyme: GATC
Number of reads loaded into memory: 1e+09
Error in .normarg_at2(at, x) :
some ranges in 'at' are off-limits with respect to their corresponding sequence in
'x'
In addition: Warning messages:
1: In if (nchar(barcode) > unique(width(reads_fqR1))) { :
the condition has length > 1 and only the first element will be used
2: In S4Vectors:::V_recycle(end, x, x_what = "end", skeleton_what = "x") :
'length(x)' is not a multiple of 'NROW(end)'
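
The first warning compares nchar(barcode) against unique(width(reads_fqR1)), which has length > 1 when the read lengths vary. A hedged diagnostic is therefore to check whether the R1 reads all share a single width (the file name below is a placeholder):

library(ShortRead)

# Placeholder path; point this at one of the R1 FASTQ files being processed.
r1 <- readFastq("path/to/sample_R1.fastq.gz")
table(width(r1))  # more than one width here would explain the warning above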

[BUG] When digesting with a restriction enzyme that cuts at position > 0, fragments contain a gap

Hello guys!

I'm using your package to analyse our UMI-4C datasets, but we're using a different restriction enzyme (HindIII) that cuts at position +1 of the target sequence.

I use the digestGenome function with the argument options you specify in the table of section 3.2 (Reference genome digestion) of your vignette, as follows:

library(BSgenome.Hsapiens.UCSC.hg19)
refgen <- BSgenome.Hsapiens.UCSC.hg19

hg19_hindiii <- digestGenome(
  res_enz = "AAGCTT",
  cut_pos = 1,
  name_RE = "HindIII",
  ref_gen = refgen,
  out_path = file.path(tempdir(), "digested_genome/")
)

The problem is that the fragments generated by the digestion are not contiguous; they have a missing base between them, i.e. the first fragment finishes at position 16006 and the next one starts at 16008, so the base at position 16007 is missing.

GRanges object with 837646 ranges and 1 metadata column:
           seqnames            ranges strand |                     id
              <Rle>         <IRanges>  <Rle> |            <character>
       [1]     chr1           1-16006      * |     fragment_HindIII_1
       [2]     chr1       16008-24570      * |     fragment_HindIII_2
       [3]     chr1       24572-27980      * |     fragment_HindIII_3

I know one base is not a big issue for the analysis, but could you fix this bug?

Thanks so much in advance.
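
Until the off-by-one is fixed upstream, a hedged workaround is to close one-base gaps in the digested fragments yourself. The sketch below assumes frags is a GRanges of fragments like the one printed above; closeFragmentGaps is a hypothetical helper, not a UMI4Cats function:

library(GenomicRanges)

closeFragmentGaps <- function(frags) {
    frags <- sort(frags)
    same_chr <- as.character(seqnames(frags))[-1] ==
        as.character(seqnames(frags))[-length(frags)]
    # A one-base gap means the next fragment starts two bases after the
    # previous fragment ends (e.g. 1-16006 followed by 16008-24570).
    one_bp_gap <- same_chr & (start(frags)[-1] - end(frags)[-length(frags)] == 2)
    idx <- which(one_bp_gap)  # fragments preceding a one-base gap
    end(frags)[idx] <- end(frags)[idx] + 1
    frags
}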

UMI counts look strange

Hello! I am using your package with 4C-seq data from two biological replicates from a single condition, and the UMI counts look really strange. I used your vignette code to read in my data and generate UMIs from forward and reverse reads - raw and with optical duplicates removed (dedup). Based on various QC metrics from fastqc and Picard tools, the two replicates look extremely similar. Rep2 has slightly more reads, but that's the only notable difference. However, the number of UMIs generated with this package is extremely different between the two replicates, with the direction of the change flipping after duplicate removal. I have analyzed the data separately two other ways, using the pipe4C R package and my own custom code, and I'm not seeing major differences in the overall abundance of counts. Attached is the plot generated by statsUMI4C. Do you know what could be causing this to happen? I would really like to use your package in the future to compare two conditions, so I'd like to understand what is going on with the UMI counts. Thanks!
UMI4Cats_Stats_dedup.pdf

Possible confusion

Hi, in the .singleCounterUMI4C function, the idea of the following code is to retrieve ligations whose readID matches reps_pos_umis$readID. However, rather than matching the readID, I think it should match reps_pos_umis$umi (i.e., the 10-character UMI from the sequence header). If I am correct, the code could be updated. Thanks!

The original code: final_ligations <- unlist(ligations[names(ligations) %in% unique(reps_pos_umis$readID)])
The pseudo-code I propose: final_ligations <- unlist(ligations[ligation_umi matches unique(reps_pos_umis$umi)])
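
A hedged sketch of that proposal in concrete R (ligation_umis is a hypothetical vector holding the UMI associated with each element of ligations; this is not the package's current code):

# Keep ligations whose UMI, rather than read ID, matches a retained UMI.
keep <- ligation_umis %in% unique(reps_pos_umis$umi)
final_ligations <- unlist(ligations[keep])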

R version

Dear developer,

I am asking about the version of R required for UMI4Cats. As specified, R version > 4.0.0 is needed, but on my machine only 3.6.0 is available. I've successfully installed the package by modifying the dependency configuration. I would like to know whether UMI4Cats is also compatible with R 3.6.0. Thanks!

Non paired-end FASTQ files

Hi,

I've been trying to use UMI4Cats, but I keep getting this message when I use contactsUMI4C:

Error in prepUMI4C(fastq_dir = fastq_dir, wk_dir = wk_dir, file_pattern = file_pattern, : Non paired-end FASTQ files with the extension _RX.fastq, _RX.fq, _RX.fq.gz or _RX.fastq.gz inVolumes/MVH_Linux/UMI-4C/ABCA4/fastq

Tried several files, files from different runs... files look like this:
@A00785:56:HHLLTDRXX:1:2101:1832:1000 1:N:0:ATCACG
NTTTGTCCTCGGCAAGAGCAATGTTTCTATCTTAGATAAACAGTACTCCTGGCAGAGGTAATAGCCTGAAGTGGGAATAAGCCTGGCTTATTCCAGAAACAGCAAGTAGCCCAATGTGGCTGGAGCAGAGTCAGGGAAGGGTGAAATAGT
+
#F,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFF,FFFFFFFF

and

@A00785:56:HHLLTDRXX:1:2101:1832:1000 2:N:0:ATCACG
CTTATAGTGATCAGAAGAGCTTATTTGTCCTGTCCCTTTCAACATCCAAGTTTTCTCTCTTACCTCCTCTTCTACTATTTCACCCTTCCCTGACTCTGCTCCAGCCACATTGGGCTACTTGCTGTTTCTGGAATAAGCCAGGCTTATTCCC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFF,FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF

Paired files have exactly the same name (except for R1 and R2). The test data provided with the installation worked fine.

Any help would be appreciated
Mattias
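
For what it's worth, the error means contactsUMI4C() could not find pairs of files ending in _R1/_R2 with a .fastq, .fq, .fq.gz or .fastq.gz extension. If the files use a different suffix (for example _1/_2), a hedged renaming sketch like the one below might help; the original naming pattern is an assumption, so adjust the regular expression to your files:

fastq_dir <- "/Volumes/MVH_Linux/UMI-4C/ABCA4/fastq"  # directory from the error above
old <- list.files(fastq_dir, pattern = "\\.f(ast)?q(\\.gz)?$", full.names = TRUE)
# Assumes mates are currently suffixed _1/_2; rewrite them as _R1/_R2.
new <- sub("_([12])(\\.f(ast)?q(\\.gz)?)$", "_R\\1\\2", old)
file.rename(old, new)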

[BUILD ERROR] Latest BioC 3.16 build not passing with error

According to the Multiple platform build/check report for BioC 3.16,
the UMI4Cats package has the following problem(s):

Please take the time to address this by committing and pushing
changes to your package at git.bioconductor.org

It seems to be failing when building the vignette, specifically after the enzyme digestion. The skipped lines belong to the processing R chunk:

Generating digested genome using:
 > Restriction enzyme sequence: GATC 
 > Restriction enzyme cut position: 0 
 > Restriction enzyme name: dpnII 
 > Reference genome: BSgenome.Hsapiens.UCSC.hg19 
 > Output path: /tmp/RtmpO4hMpS/digested_genome/
Finished genome digestion.
Quitting from lines 267-292 (UMI4Cats.Rmd) 
Error: processing vignette 'UMI4Cats.Rmd' failed with diagnostics:
`string` must be a vector, not a <DNAStringSet> object.
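
A hedged illustration of what that error usually means (the example values below are made up): stringr-style functions expect a plain character vector, so a DNAStringSet has to be converted with as.character() first.

library(Biostrings)

seqs <- DNAStringSet(c("GATCAAGT", "GATCTTAC"))
# stringr::str_sub(seqs, 1, 4)               # fails: not a character vector
stringr::str_sub(as.character(seqs), 1, 4)   # works after conversion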

[BUG] Different reads have the same read number

In the .singlePrepUMI4C function, new_id_R2 and new_id_R1 are computed from seq_len(length(filtered_reads_fqR1)), which can repeat across different iterations of the 'repeat' loop, i.e. different reads will get the same read number.

This new_id_xx is used later on in 'gr_sp <- GenomicRanges::GRangesList(GenomicRanges::split(gr, gr$readID))'. I have the feeling that this will cause problems.
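
A hedged sketch of one way to keep the IDs unique across iterations (n_processed is a hypothetical counter carried between iterations; this is not the package's current code):

# Offset the per-chunk IDs by the number of reads already processed so that
# reads from different iterations of the repeat loop never share an ID.
new_id_R1 <- n_processed + seq_len(length(filtered_reads_fqR1))
new_id_R2 <- new_id_R1
n_processed <- n_processed + length(filtered_reads_fqR1)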

Plotting UMIs normalised trend across entire domainogram or visible region

Hi, thank you for making such a brilliant package to help analyse UMI-4C data. This may not be an issue but rather the way I am using the package, so apologies in advance if this is due to my own error.

I would like to ask if there is a way to force plotting of the UMI normalised trend line across the entire visible coordinates of the plot. I had a look at the plotUMI4C function for additional arguments to help do this but was unable to identify relevant settings. Any advice you may have would be great. I have attached an example of the plot to this post to help explain what I mean. Best wishes, Sunniyat
Rplot-pdgfrbp1-allsil-v-jurkat

Apply to reads with unequal read length

Hi, just another comment: there can be situations where the read lengths in a FASTQ are not unique, for example after adaptor trimming. In the package there are multiple places where the unique() function is used to collect information from different reads; this could be an issue.
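
As a hedged workaround (an assumption on my part, not UMI4Cats functionality), trimmed reads could be cut to a common length before processing so that unique(width(...)) returns a single value; the file names below are placeholders:

library(ShortRead)

fq <- readFastq("trimmed_R1.fastq.gz")                     # placeholder input
fq_uniform <- narrow(fq, start = 1, end = min(width(fq)))  # cut to shortest read
writeFastq(fq_uniform, "trimmed_uniform_R1.fastq.gz", compress = TRUE)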

Address comments for Bioconductor submission

These are the comments from the Bioconductor reviewer that we need to address (from Bioconductor/Contributions#1504)


build report

  • Strongly recommend unit tests to check for correct functionality of the
    package.

gitignore

  • Please also git ignore any extra directories or files for other
    applications, like the dev directory, the _pkgdown.yml file, and the
    pkgdown directory.

  • Make sure doing git clone does not include any of those files/directories

README

  • Are both the Rmd and md files necessary?

  • Make sure to update the instructions to pull from the Bioconductor
    repository for the official/stable version rather than your GitHub (for
    development). It can include both if you like.

DESCRIPTION

  • Please remove LazyData: true we have rarely found this useful and can
    slow installation.

NAMESPACE

  • Since your classes extend SummarizedExperiment, I highly recommend
    importing the full package rather than using importFrom. The full range of
    functionality provided by SummarizedExperiment should be automatically
    available to your objects, but not without importing the package.

inst

  • Please have an inst/scripts directory that describes what is in
    inst/extdata and how you created any data there. This should include any
    source information, filtering, etc. It doesn't have to be strictly code;
    it can be text or pseudo-code.

  • The citation file should be readable independently of package
    installation. The current CITATION will not display properly on
    Bioconductor. Again, readCitationFile(<pathtofile>) should run without
    error without the package being loaded.

vignette

  • I suggest renaming the vignette from umi4cats-usage to UMI4Cats or
    umi4cats, as it is much more intuitive for a user to just call the package
    name to get the vignette (i.e. vignette('UMI4Cats')).

  • Why not run with the sample data you provide with
    downloadUMI4CexampleData()? I haven't looked at the R code for this
    function yet, but by default it should use a temp directory, not the
    user's working directory, to download data. I'm wondering if you
    implemented any caching mechanism so users would not have to download
    multiple times? I would suggest looking at BiocFileCache.

  • All defined output directories should write to a tempdir for demo
    purposes (i.e. any use of out_path in the vignette).

R/man

  • downloadUMI4CexampleData should cache the tar files to save time
    downloading in the future (see the BiocFileCache sketch after this list).

Originally posted by @lshep in Bioconductor/Contributions#1504 (comment)
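
A hedged sketch of the caching approach suggested above for downloadUMI4CexampleData(), using BiocFileCache so the tarball is fetched only once (the URL and cache location below are placeholders, not the package's actual values):

library(BiocFileCache)

bfc <- BiocFileCache(cache = tools::R_user_dir("UMI4Cats", "cache"), ask = FALSE)
url <- "https://example.org/UMI4Cats_example_data.tar.gz"  # placeholder URL
hit <- bfcquery(bfc, url, field = "rname")
if (nrow(hit) == 0) {
    path <- bfcadd(bfc, rname = url, fpath = url)   # first run: download and cache
} else {
    path <- bfcrpath(bfc, rnames = url)             # later runs: reuse cached copy
}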

[BUG] getViewpointCoordinates() seemed not to work for bait on the negative strand

Context

Hi, when I used a bait from the negative strand, getViewpointCoordinates() did not work.

Code

getViewpointCoordinates(
    bait_seq = "AAAGAGAATAAGAATAGTTATAATACAAA",
    bait_pad = "TTAGAGAT",
    res_enz = "GATC",
    ref_gen = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    sel_seqname = "chr11"
)

This prompts an error:

Error in stop_if_wrong_length(what, ans_len) :
'ranges' must have the length of the object to construct (1) or length 1

Checking the error traceback:

traceback()
5: stop(wmsg(what, " must have the length of the object ", "to construct (",
ans_len, ") or length 1"))
4: stop_if_wrong_length(what, ans_len)
3: new_GRanges("GRanges", seqnames = seqnames, ranges = ranges,
strand = strand, mcols = mcols, seqinfo = seqinfo)
2: GenomicRanges::GRanges(seqnames = sel_seqname, ranges = pos_viewpoint)
1: getViewpointCoordinates(bait_seq = "AAAGAGAATAAGAATAGTTATAATACAAA",
bait_pad = "TTAGAGAT", res_enz = "GATC", ref_gen = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
sel_seqname = "chr11")

Do you know how to solve the problem? Thank you a lot.
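
In case it helps while this is unresolved, one hedged workaround is to pass the reverse complement of the bait and pad so the viewpoint is searched on the plus strand. Whether this recovers the intended viewpoint should be checked manually, so treat the sketch below as untested:

library(Biostrings)

# Reverse-complement the bait and pad (an assumption, not a documented fix).
bait_rc <- as.character(reverseComplement(DNAString("AAAGAGAATAAGAATAGTTATAATACAAA")))
pad_rc  <- as.character(reverseComplement(DNAString("TTAGAGAT")))

getViewpointCoordinates(
    bait_seq = bait_rc,
    bait_pad = pad_rc,
    res_enz = "GATC",
    ref_gen = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    sel_seqname = "chr11"
)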
