leekgroup / recount-website

Code for the Recount project

Home Page: https://jhubiostatistics.shinyapps.io/recount/

License: MIT License

R 60.88% Shell 7.07% Python 5.01% JavaScript 4.06% CSS 0.40% Roff 22.58%
rnaseq bioconductor rstats recount annotation-agnostic r

recount-website's People

Contributors

andrewejaffe, klugem, lcolladotor, nellore, shanellis

recount-website's Issues

SRP051606, 3 samples missing

Hi,
I wanted to perform some analysis using recount2 data from SRP051606. However, I find that 3 samples (SRR1736492, SRR1736495, and SRR1736498) are missing.
colData(rse_gene)$run

[1] "SRR1736482" "SRR1736483" "SRR1736484" "SRR1736485" "SRR1736486" "SRR1736487" "SRR1736488" "SRR1736489"
[9] "SRR1736490" "SRR1736491" "SRR1736493" "SRR1736494" "SRR1736497" "SRR1736496" "SRR1736499" "SRR1736500"
[17] "SRR1736501" "SRR1736502" "SRR1736503" "SRR1736505" "SRR1736504" "SRR1736507" "SRR1736506" "SRR1736508"
[25] "SRR1736509" "SRR1736510"

Although the project lists 29 samples, I find data for only 26.
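
One way to confirm which runs are absent is to compare the runs present in the object against the runs expected for the project. The sketch below hard-codes the observed vector from the output above instead of loading rse_gene, and it assumes the 29 expected accessions are the consecutive range SRR1736482 to SRR1736510. That range is an assumption that happens to contain 29 runs; the authoritative list is the SRA run table for the project.

```r
# Assumption: the project's 29 runs are the consecutive accessions below.
expected <- paste0("SRR", 1736482:1736510)

# Runs reported by colData(rse_gene)$run in the issue.
observed <- c(
  "SRR1736482", "SRR1736483", "SRR1736484", "SRR1736485", "SRR1736486",
  "SRR1736487", "SRR1736488", "SRR1736489", "SRR1736490", "SRR1736491",
  "SRR1736493", "SRR1736494", "SRR1736497", "SRR1736496", "SRR1736499",
  "SRR1736500", "SRR1736501", "SRR1736502", "SRR1736503", "SRR1736505",
  "SRR1736504", "SRR1736507", "SRR1736506", "SRR1736508", "SRR1736509",
  "SRR1736510"
)

# Runs that are expected but not present in the object.
missing_runs <- setdiff(expected, observed)
print(missing_runs)
```

With a loaded object, `observed` would be replaced by `colData(rse_gene)$run`.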

<NA> values in rse_tx.RData

Hey

I've noticed that there are some transcripts that contain NA's in the assay count table. E.g. ENST00000622420.1 in DRP001055. In this case there are NA's for all four samples. In the GTEx data there are a total of 4.2 million NA's e.g. for transcript ENST00000604479.5, but here it's only for a subset of the samples.

Could you please verify that for me and let me know how to interpret this? I've been struggling with this for a few days now.

Thank you
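
To make the report easier to reproduce, the NA pattern can be summarized per transcript. The following is a minimal sketch on a toy matrix with made-up values and hypothetical transcript names; with the real data, `counts` would presumably come from something like `assay(rse_tx)`.

```r
# Toy transcript-by-sample count matrix (hypothetical values and names).
counts <- matrix(
  c( 5, NA,  3, NA,
    NA, NA, NA, NA,
     1,  2,  3,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3"), paste0("sample", 1:4))
)

# Number of NA entries per transcript (row).
na_per_tx <- rowSums(is.na(counts))

# Transcripts that are NA in every sample vs. only in a subset of samples.
all_na  <- names(na_per_tx)[na_per_tx == ncol(counts)]
some_na <- names(na_per_tx)[na_per_tx > 0 & na_per_tx < ncol(counts)]

# Total number of NA entries in the table.
total_na <- sum(is.na(counts))

print(na_per_tx)
print(total_na)
```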

missing two samples in SRP041538

Hi!

I'm using ReCount2 and it's very helpful and easy to use, I really appreciate all the effort.
I was working with counts from SRP041538 and realized that the file contains only 187 of the 189 samples. Two samples are missing: SRR1265670 and SRR1265673.
I was wondering whether this is due to a quality issue or a mistake.

Thank you

prep_sample.R fails due to mismatched exon numbers

I've been trying to follow the workflow suggested here for generating recount contributions, but I'm having an issue with the prep_sample.R script. Specifically, the script crashes at line 137 with the following error message:

nb of rows in 'assay' (603139) must equal nb of rows in 'rowData' (329092)

I've tracked this down and it seems that the Gencode-v25.bed file contains 603139 entries, while the exons_all data structure (line 136) contains 329092 rows, so I believe this incompatibility is the source of the error. exons_all is built from recount_exons, which as far as I can tell is part of the recount package.

Looking at some other recount related code, it appears that some checks were recently modified that migrated the expected number of rows from 329092 to 603139 when the Gencode BED file reference was updated. So my hunch is that perhaps the recount_exons object was not appropriately updated to make this script functional.

Could you offer any suggestions about how to fix this issue? For reference, I'm using recount version 1.4.2

Thanks!
Thomas

Error with prep_sample.R

Hi there,

I encountered an error with prep_sample.R after counts_exon_*.tsv is created. The error comes from the "SummarizedExperiment" object and says that the number of rows in the assay must equal the number of rows in the row data. I had successfully run rail-RNA on my FASTQ files and used the bigwig output. Here is how I ran it:
Rscript prep_sample.R -f ~/recount2/coverage_bigwigs/VCA.bw -c ~/recount2/cross_sample_results/counts.tsv.gz -b /usr/local/bin/bwtool -w ~/src/WiggleTools/bin/wiggletools -a TRUE -d /Users/udp3f/recount2/tmp

Could you please suggest what the issue could be?

-Uma

prep_merge.R fails for input with only one sample

jx_info after loading (prep_merge.R:1-207)

> head(jx_info, 2)
    chr  start    end sample_ids reads
1 chr1+ 439348 632904          0     1
2 chr1+ 779093 803918          0     1

prep_merge.R:210-211:

jx_info_samples <- strsplit(jx_info$sample_ids, ',')
jx_info_reads <- strsplit(jx_info$reads, ',')

Error: Error in strsplit(jx_info$sample_ids, ",") : non-character argument

possible fix:

jx_info_samples <- strsplit(as.character(jx_info$sample_ids), ',')
jx_info_reads <- strsplit(as.character(jx_info$reads), ',')
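
The likely cause is that read.table infers column types: with a single sample there are no commas in sample_ids or reads, so both columns are parsed as integers and strsplit rejects them. A minimal sketch reproducing this with the two rows shown above, plus an alternative to as.character that forces the types at read time (the colClasses choice is a suggestion, not something from prep_merge.R):

```r
# The two single-sample rows from the issue, tab-separated.
one_sample <- "chr1+\t439348\t632904\t0\t1\nchr1+\t779093\t803918\t0\t1"
cols <- c("chr", "start", "end", "sample_ids", "reads")

# Default type inference: sample_ids and reads become integer columns.
jx_info <- read.table(text = one_sample, sep = "\t", header = FALSE,
                      stringsAsFactors = FALSE, check.names = FALSE,
                      col.names = cols)
print(class(jx_info$sample_ids))

# Forcing character columns keeps strsplit working for any sample count.
jx_fixed <- read.table(text = one_sample, sep = "\t", header = FALSE,
                       stringsAsFactors = FALSE, check.names = FALSE,
                       col.names = cols,
                       colClasses = c("character", "integer", "integer",
                                      "character", "character"))
jx_info_samples <- strsplit(jx_fixed$sample_ids, ",")
jx_info_reads <- strsplit(jx_fixed$reads, ",")
print(jx_info_samples)
```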

prep_merge does not work for junction file

The deliverable "junctions.tsv.gz" is as described in RailRNA guideline:
junctions.tsv.gz: a labeled matrix whose (i, j)th element is the number of reads in sample j covering intron i. Introns are in the first column. Each one takes the form <chromosome>;<strand (+/-)>;<1-based start position (inclusive)>;<1-based end position (exclusive)>.

The following part of prep_merge.R code does not work for the junction file junctions.tsv.gz.

## Code for creating rse_jx
message(paste(Sys.time(), 'reading', opt$jx_file))
jx_info <- read.table(opt$jx_file, sep = '\t', header = FALSE,
    stringsAsFactors = FALSE, check.names = FALSE)
colnames(jx_info) <- c('chr', 'start', 'end', 'sample_ids', 'reads')

## Create the counts matrix
message(paste(Sys.time(), 'processing count information'))
jx_info_samples <- strsplit(jx_info$sample_ids, ',')
jx_info_reads <- strsplit(jx_info$reads, ',')
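
The mismatch is that the code above expects five columns (chr, start, end, sample_ids, reads), while junctions.tsv.gz is a matrix keyed by a semicolon-delimited intron label. As a rough sketch of bridging the two, the label can be split into separate columns. The labels below are hypothetical, and the four-field layout (chromosome, strand, start, end) is an assumption based on the Rail-RNA description quoted above.

```r
# Hypothetical intron labels in the assumed <chr>;<strand>;<start>;<end> form.
labels <- c("chr1;+;439348;632904", "chr1;-;779093;803918")

# Split each label on ';' and stack the pieces into a character matrix.
parts <- do.call(rbind, strsplit(labels, ";", fixed = TRUE))

# Build a coordinate table; per the description, end is exclusive.
introns <- data.frame(
  chr    = parts[, 1],
  strand = parts[, 2],
  start  = as.integer(parts[, 3]),
  end    = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)
print(introns)
```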

request for standalone text files of counts

Hi. It would be great if you posted tab-delimited files for gene, transcript, exon and junction counts that contain all row and column identifiers and don't need to be joined with other files, at least for TCGA and GTEx. This wouldn't make the count text files much larger than they are, but it would make them useful. Currently, you need to extract data from R data frames whose structures differ for exons, genes, etc., and some of which are too large to load at once if you don't have enough memory; at least with text files, you can parse on the fly. For junctions and exons, I'd suggest including the gene and coordinates. Thanks.

Edit: I see the functionality in recount to download region-specific data, so the memory argument may not be valid, but I still think standalone text files would be useful.
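
Until such files exist, a self-describing table can be written from a count matrix already in memory. This is only a sketch with a toy matrix; the feature IDs, sample names, and the feature_id column name are all made up for illustration.

```r
# Toy gene-by-sample count matrix (hypothetical IDs and values).
counts <- matrix(c(10, 0, 3, 7), nrow = 2,
                 dimnames = list(c("ENSG_A", "ENSG_B"), c("s1", "s2")))

# Put the row identifiers into an explicit first column.
out <- data.frame(feature_id = rownames(counts), counts,
                  check.names = FALSE, stringsAsFactors = FALSE)

# Write a tab-delimited file that carries both row and column identifiers.
tsv_file <- tempfile(fileext = ".tsv")
write.table(out, tsv_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm nothing else is needed to interpret the file.
back <- read.delim(tsv_file, check.names = FALSE, stringsAsFactors = FALSE)
print(back)
```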

prep_sample.R bug

genes <- recount_genes
counts_gene <- lapply(split(as.data.frame(exon_counts), count_groups), colSums)
counts_gene <- do.call(rbind, counts_gene)
Error: object 'genes' not found
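
The snippet does not show where the error is raised, so the cause cannot be pinned down from it alone. For reference, the two counts_gene lines sum exon counts within each gene; the sketch below runs them on toy data (exon_counts and count_groups are made-up stand-ins) and compares the result with base R's rowsum().

```r
# Toy exon-by-sample counts and the gene each exon belongs to (hypothetical).
exon_counts <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
                      dimnames = list(NULL, c("s1", "s2")))
count_groups <- c("geneA", "geneA", "geneB")

# Same aggregation as in the issue: split rows by gene, sum each column.
counts_gene <- lapply(split(as.data.frame(exon_counts), count_groups), colSums)
counts_gene <- do.call(rbind, counts_gene)
print(counts_gene)

# rowsum() computes the same per-group column sums in one call.
counts_gene_alt <- rowsum(exon_counts, count_groups)
print(counts_gene_alt)
```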

would it be possible to set a filter by tissue of analysis?

Hi,
First thanks a lot for this work, it's really nice to be able to access this data.
I wondered if it would be possible to set a filter by tissue to select only studies that have the tissue of interest.
It seems that when I search for a tissue in the search field, it matches the abstract but not the actual tissue of analysis. Maybe I'm doing it the wrong way, so I would love to know the most direct way to restrict by tissue.
Thanks a lot in advance!
Best,
Laure

download csv for filtered studies

feature request: add a download csv button for list of filtered studies displayed on the shiny app


collab: "hey, can you find some public data to look at X?"

me: "sure."

(5 minutes pass)

me: "here's a spreadsheet. what looks interesting to you?"

etc.

Version number for the database

Hi,

Is there versioning available for recount2 (beyond the global v2 label)? I'd like to be able to see when additional (analysis-ready) data is added and check whether my local version is out of date.

wiggletools raises an error

in prep_sample.R
I had to replace
system(paste(opt$wiggletools, 'AUC', auc_file, bw))
by
system(paste(opt$wiggletools, 'print' , auc_file, 'AUC', bw))
to make it work!

Access to novel exons using reference-free quantification

To whom it may concern,

Hi, I'm wondering whether novel exons have been taken into account and integrated into the recount2 resources.
For example, a regular pipeline quantifies exon expression based on annotation, losing information about novel exons within isoforms. Stringtie/Stringtie2 can usually reveal novel exons specifically expressed in samples.

I'm not sure whether recount2 integrates functionality like that of Stringtie2 to reveal novel exons beyond the well-annotated ones in Gencode. Thanks.

Best,

Yu

one-base exons in TCGA & GTEx data

Hi.

I might be doing something incorrect, but I find one-base and other short exons in TCGA and GTEx exon count data. I wasn't able to find an explanation in your Genome Biology paper, so I would appreciate a pointer if it's there somewhere.

Here is an example with a small file from TCGA:

> load("rse_exon_bile_duct.Rdata")               
> rse_exon@rowRanges[]   
GRanges object with 603139 ranges and 0 metadata columns:
                     seqnames                 ranges strand
                        <Rle>              <IRanges>  <Rle>
  ENSG00000000003.14     chrX [100627109, 100628669]      -
  ENSG00000000003.14     chrX [100628670, 100629986]      -
  ENSG00000000003.14     chrX [100630759, 100630866]      -
(...)
> rse_exon@rowRanges[520:522]  
GRanges object with 3 ranges and 0 metadata columns:
                     seqnames               ranges strand
                        <Rle>            <IRanges>  <Rle>
  ENSG00000001631.15     chr7 [92245913, 92245914]      -
  ENSG00000001631.15     chr7 [92245915, 92245915]      - <<<<<<<<<<<<<<<<<<<
  ENSG00000001631.15     chr7 [92245916, 92245920]      -

If these short exons are artifacts, are they "stealing" counts from legitimate exons? If so, is there a rational way to recapture these counts, like adding the counts to the longest overlapping exon, or is it safe to just ignore them? Should I be using derfinder?

Thanks.
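
A quick way to see how common these short exons are is to compute exon widths and flag the short ones. The sketch below uses the three ranges printed above in a plain data frame; on the real object, something like `width(rowRanges(rse_exon))` should give the same widths. The one-base cutoff is an arbitrary choice for illustration.

```r
# The three ranges shown above (chr7, minus strand), 1-based inclusive.
exons <- data.frame(
  gene  = rep("ENSG00000001631.15", 3),
  start = c(92245913, 92245915, 92245916),
  end   = c(92245914, 92245915, 92245920)
)

# Width of an inclusive range is end - start + 1.
exons$width <- exons$end - exons$start + 1

# Flag exons at or below an (arbitrary) width cutoff.
min_width <- 1
short_exons <- exons[exons$width <= min_width, ]
print(exons)
print(short_exons)
```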

zero-based stop positions in BED files for splice junctions

Hi.

Please check my math, but it looks like the stop positions in recount BED files for splice junctions are 0-based instead of 1-based.

Here is an example. I choose a small study at random and click on jx_bed. Here are two KnownGene junctions from this file:

chr1 1407019 1407129 ... -
chr1 1486235 1486542 ... +

In real life (UCSC browser 1-based coordinates), the first intron goes from 1407020 to 1407130, and the second from 1486236 to 1486543. Stop positions are 1-based in BED format. I hate that too. :O)

It would be useful to include a comment in the files themselves or in the file info column to the effect that these are zero-based intronic hg38 coordinates, and assuming you don't want to re-create all the files, to call the files something other than BED.

Thanks.
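
The arithmetic can be laid out explicitly. In standard BED, the start is 0-based and the end is exclusive, so the 1-based inclusive interval is (start + 1, end). If both positions in the file are 0-based, as reported here, the 1-based interval is (start + 1, end + 1). A small sketch using the two junctions above (the column names are made up for illustration):

```r
# The two junctions quoted from the jx_bed file.
jx <- data.frame(
  chr       = c("chr1", "chr1"),
  bed_start = c(1407019, 1486235),
  bed_end   = c(1407129, 1486542),
  strand    = c("-", "+")
)

# Interpretation 1: standard BED (0-based start, exclusive end).
jx$std_start <- jx$bed_start + 1
jx$std_end   <- jx$bed_end

# Interpretation 2: both positions 0-based, as described in this issue.
jx$reported_start <- jx$bed_start + 1
jx$reported_end   <- jx$bed_end + 1

print(jx)
```

Only the second interpretation reproduces the intron coordinates given in the issue (1407020 to 1407130 and 1486236 to 1486543).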

temp directory

An argument to determine temp directory instead of using tempdir() would be beneficial!
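
As a sketch of what this could look like, a small helper can fall back to tempdir() only when no directory is supplied. The function name resolve_tmp_dir and its tmp_dir argument are hypothetical and not part of the existing scripts.

```r
# Hypothetical helper: use a caller-supplied directory, else tempdir().
resolve_tmp_dir <- function(tmp_dir = NULL) {
  if (is.null(tmp_dir)) {
    return(tempdir())
  }
  # Create the requested directory if it does not exist yet.
  if (!dir.exists(tmp_dir)) {
    dir.create(tmp_dir, recursive = TRUE)
  }
  tmp_dir
}

# Default behaviour is unchanged.
default_dir <- resolve_tmp_dir()

# A custom location (placed under tempdir() here only to keep the demo tidy).
custom_dir <- resolve_tmp_dir(file.path(tempdir(), "recount_tmp"))
scratch_file <- tempfile(pattern = "counts_", tmpdir = custom_dir)
print(custom_dir)
```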
