leekgroup / recount-website

Code for the Recount project

Home Page: https://jhubiostatistics.shinyapps.io/recount/

License: MIT License

R 60.88% Shell 7.07% Python 5.01% JavaScript 4.06% CSS 0.40% Roff 22.58%

rnaseq bioconductor rstats recount annotation-agnostic r

recount-website's Introduction

Reproducible RNA-seq analysis using recount2

Website: jhubiostatistics.shinyapps.io/recount/
Bioconductor package: recount. Check the vignette for more information.
Paper available via Nature Biotechnology.

The code in this repository started at nellore/runs, in case you are interested in the full commit history.

recount-website's Issues

Availability of bigwig files?

Hi there,

Do you have the bigwig files for recount available? I tried to manually stitch together what I thought the link format might be e.g.

http://duffel.rail.bio/recount/v2/SRP042161/mean_SRP042161.bw

This returns a 404, however.

I'm filing this separately from the Google Chrome issue just to stay on target :)

SRP051606, 3 samples missing

Hi,
I wanted to perform some analysis using recount2 data from SRP051606. However, I find that 3 samples (SRR1736492, SRR1736495, and SRR1736498) are missing.
colData(rse_gene)$run

 [1] "SRR1736482" "SRR1736483" "SRR1736484" "SRR1736485" "SRR1736486" "SRR1736487" "SRR1736488" "SRR1736489"
 [9] "SRR1736490" "SRR1736491" "SRR1736493" "SRR1736494" "SRR1736497" "SRR1736496" "SRR1736499" "SRR1736500"
[17] "SRR1736501" "SRR1736502" "SRR1736503" "SRR1736505" "SRR1736504" "SRR1736507" "SRR1736506" "SRR1736508"
[25] "SRR1736509" "SRR1736510"

Although the project lists 29 samples, I find data for only 26.
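A set difference is a quick way to confirm which runs are absent. The sketch below uses base R only. The `expected` vector assumes that the 29 runs are numbered contiguously from SRR1736482 to SRR1736510, which is inferred from the IDs above and not checked against SRA.

```r
# Runs returned by colData(rse_gene)$run in the report above (26 in total)
observed <- paste0("SRR1736", c(482:491, 493, 494, 497, 496,
                                499:503, 505, 504, 507, 506, 508:510))

# ASSUMPTION: the study's 29 runs are the contiguous range 482..510
expected <- paste0("SRR1736", 482:510)

# Runs expected for the study but absent from the downloaded object
missing_runs <- setdiff(expected, observed)
print(missing_runs)
```

With a real object, `observed` would simply be `colData(rse_gene)$run`, and `expected` would come from the study's run table.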

Missing column names for tsv files.

Thanks for a great resource. When I download tsv files from the recount2 website (for instance, the GTEx data http://duffel.rail.bio/recount/SRP012682/counts_gene.tsv.gz), I have no way of linking this to genes. It seems the column names are missing. Are these available? Would be useful to have them in the file.

Unable to click download links in Chrome?

Currently, if I visit https://jhubiostatistics.shinyapps.io/recount/ in Google Chrome, I am not able to click the download links in the table. I'm not exactly sure, but it might be because the links are http while the hosting page is https.

Copying the link and pasting into a new tab or into wget works
Firefox also works (after a short wait time after link click)

504 gateway timeout

Hi, I'm trying to download the TCGA lung dataset,
http://duffel.rail.bio/recount/v2/TCGA/rse_exon_lung.Rdata

however, I keep getting a 504 gateway timeout. Is there an FTP site somewhere?

thanks!

Rail web server error

Hello!

When trying to access the RAIL server from Recount2, I got an error. Here is the link: http://duffel.rail.bio/recount/v2/SRP025982/rse_gene.Rdata

Here is the error:

Ability to Identify Newest Data Sets

It would be convenient if the results of data set search could be sorted by year of public release.

<NA> values in rse_tx.RData

Hey

I've noticed that some transcripts contain NAs in the assay count table, e.g. ENST00000622420.1 in DRP001055. In this case there are NAs for all four samples. In the GTEx data there are a total of 4.2 million NAs, e.g. for transcript ENST00000604479.5, but there they affect only a subset of the samples.

Could you please verify that for me and let me know how to interpret this? I've been struggling with this for a few days now.

Thank you
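To make a report like this easy to reproduce, the NAs can be tallied per transcript and split into "all samples" and "some samples" cases. This is a minimal base R sketch on a toy matrix. The count values, the sample names, and the `tx_complete` row are made up for illustration, and only the two transcript IDs come from the report above.

```r
# Toy transcript-by-sample matrix standing in for the rse_tx assay
# (hypothetical values; a real matrix would come from the downloaded object)
counts <- matrix(
    c(NA, NA, NA, NA,
       5, NA,  7,  2,
       1,  0,  3,  4),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        c("ENST00000622420.1", "ENST00000604479.5", "tx_complete"),
        paste0("sample", 1:4)
    )
)

# Total number of NA entries in the matrix
na_total <- sum(is.na(counts))

# Number of NA entries per transcript
na_per_tx <- rowSums(is.na(counts))

# Transcripts that are NA in every sample vs. only in some samples
all_na     <- names(na_per_tx)[na_per_tx == ncol(counts)]
partial_na <- names(na_per_tx)[na_per_tx > 0 & na_per_tx < ncol(counts)]
```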

missing two samples in SRP041538

Hi!

I'm using recount2 and it's very helpful and easy to use; I really appreciate all the effort.
I was working with counts from SRP041538 and realized that the file contains only 187 of 189 samples: SRR1265670 and SRR1265673 are missing.
I was wondering whether this is due to a quality issue or a mistake.

Thank you

prep_sample.R fails due to mismatched exon numbers

I've been trying to follow the workflow suggested here for generating recount contributions, but I'm having an issue with the prep_sample.R script. Specifically, the script crashes at line 137 with the following error message:

nb of rows in 'assay' (603139) must equal nb of rows in 'rowData' (329092)

I've tracked this down and it seems that the Gencode-v25.bed file contains 603139 entries, while the exons_all data structure (line 136) contains 329092 rows, so I believe this incompatibility is the source of the error. exons_all is built from recount_exons, which as far as I can tell is a part of the recount package

Looking at some other recount related code, it appears that some checks were recently modified that migrated the expected number of rows from 329092 to 603139 when the Gencode BED file reference was updated. So my hunch is that perhaps the recount_exons object was not appropriately updated to make this script functional.

Could you offer any suggestions about how to fix this issue? For reference, I'm using recount version 1.4.2

Thanks!
Thomas

Error with prep_sample.R

Hi there,

I encountered an error with prep_sample.R after counts_exon_*.tsv is created. The error comes from the “SummarizedExperiment” object and says that the number of rows in the assay must equal the number of rows in the row data. I had successfully run Rail-RNA on my FASTQ files and used the bigwig output. Here is how I ran it:
Rscript prep_sample.R -f ~/recount2/coverage_bigwigs/VCA.bw -c ~/recount2/cross_sample_results/counts.tsv.gz -b /usr/local/bin/bwtool -w ~/src/WiggleTools/bin/wiggletools -a TRUE -d /Users/udp3f/recount2/tmp

Could you please suggest what the issue could be?

-Uma

prep_merge.R fails for input with only one sample

jx_info after loading (prep_merge.R:1-207)

> head(jx_info, 2)
    chr  start    end sample_ids reads
1 chr1+ 439348 632904          0     1
2 chr1+ 779093 803918          0     1

prep_merge.R:210-211:

jx_info_samples <- strsplit(jx_info$sample_ids, ',')
jx_info_reads <- strsplit(jx_info$reads, ',')

Error: Error in strsplit(jx_info$sample_ids, ",") : non-character argument

possible fix:

jx_info_samples <- strsplit(as.character(jx_info$sample_ids), ',')
jx_info_reads <- strsplit(as.character(jx_info$reads), ',')
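The failure can be reproduced without the full pipeline. With a single sample, every `sample_ids` and `reads` entry is a bare number, so `read.table()` types those columns as integer and `strsplit()` rejects them. The sketch below rebuilds the two rows shown above in base R and demonstrates both the proposed `as.character()` fix and an alternative that forces the column types at read time. The `colClasses` variant is only a suggestion and is not taken from prep_merge.R.

```r
# Single-sample junction table, typed the way read.table() would infer it
jx_info <- data.frame(
    chr = c("chr1+", "chr1+"),
    start = c(439348L, 779093L),
    end = c(632904L, 803918L),
    sample_ids = c(0L, 0L),
    reads = c(1L, 1L),
    stringsAsFactors = FALSE
)

# strsplit() on an integer column fails; capture the message instead of stopping
err_msg <- tryCatch(
    strsplit(jx_info$sample_ids, ","),
    error = function(e) conditionMessage(e)
)

# Proposed fix: coerce to character before splitting
jx_info_samples <- strsplit(as.character(jx_info$sample_ids), ",")
jx_info_reads <- strsplit(as.character(jx_info$reads), ",")

# The same call still handles comma-separated multi-sample entries
multi_sample <- strsplit(as.character(c("0,3", "1")), ",")

# Alternative (an assumption, not the script's code): fix the types on input
jx_txt <- "chr1+\t439348\t632904\t0\t1\nchr1+\t779093\t803918\t0\t1"
jx_info_typed <- read.table(
    text = jx_txt, sep = "\t", header = FALSE, stringsAsFactors = FALSE,
    col.names = c("chr", "start", "end", "sample_ids", "reads"),
    colClasses = c("character", "integer", "integer", "character", "character")
)
```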

prep_merge does not work for junction file

The deliverable "junctions.tsv.gz" is described in the Rail-RNA guidelines as follows:
junctions.tsv.gz: a labeled matrix whose (i, j)th element is the number of reads in sample j covering intron i. Introns are in the first column. Each one takes the form <chromosome>;<strand (+/-)>;<1-based start position (inclusive)>;<1-based end position (exclusive)>.

The following part of the prep_merge.R code does not work for the junction file junctions.tsv.gz.

## Code for creating rse_jx
message(paste(Sys.time(), 'reading', opt$jx_file))
jx_info <- read.table(opt$jx_file, sep = '\t', header = FALSE,
    stringsAsFactors = FALSE, check.names = FALSE)
colnames(jx_info) <- c('chr', 'start', 'end', 'sample_ids', 'reads')

## Create the counts matrix
message(paste(Sys.time(), 'processing count information'))
jx_info_samples <- strsplit(jx_info$sample_ids, ',')
jx_info_reads <- strsplit(jx_info$reads, ',')
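The mismatch is one of layout: the code above expects five tab-separated columns, whereas junctions.tsv.gz packs each intron into a single semicolon-delimited ID followed by per-sample counts. A rough base R sketch of splitting such IDs is shown here. The example IDs are hypothetical, and the field order (chromosome, strand, start, end) is an assumption based on the description quoted above.

```r
# Hypothetical intron IDs in the assumed "<chromosome>;<strand>;<start>;<end>" layout
ids <- c("chr1;+;439348;632904", "chr1;-;779093;803918")

# Split each ID on ';' and stack the pieces into a character matrix
parts <- do.call(rbind, strsplit(ids, ";", fixed = TRUE))

jx <- data.frame(
    chr = parts[, 1],
    strand = parts[, 2],
    start = as.integer(parts[, 3]),
    # The description gives an exclusive end; subtract 1 for an inclusive end
    end_inclusive = as.integer(parts[, 4]) - 1L,
    stringsAsFactors = FALSE
)
```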

TCGA metadata issue in column age_at_initial_pathologic_diagnosis

Hi,

See leekgroup/recount#8 for details about the issue as noted from the recount Bioconductor package.

I already started the process to fix the TCGA metadata starting from commit 20e8184. Currently, I'm waiting for https://github.com/leekgroup/recount-website/blob/master/metadata/tcga_prep/tcga_meta.R to re-run.

Best,
Leo

request for standalone text files of counts

Hi. It would be great if you posted tab-delimited files for gene, transcript, exon and junction counts that contain all row and column identifiers and don't need to be joined with other files, at least for TCGA and GTEx. This wouldn't make the count text files much larger than they are, but it would make them useful. Currently, you need to extract data from R data frames whose structures differ for exons, genes, etc., and some of which are too large to load at once if you don't have enough memory; at least with text files, you can parse on the fly. For junctions and exons, I'd suggest including the gene and coordinates. Thanks.

Edit: I see the functionality in recount to download region-specific data, so the memory argument may not be valid, but I still think standalone text files would be useful.

prep_sample.R bug

genes <- recount_genes
counts_gene <- lapply(split(as.data.frame(exon_counts), count_groups), colSums)
counts_gene <- do.call(rbind, counts_gene)
Error: object 'genes' not found
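For context, the two `counts_gene` lines collapse exon-level counts to gene-level counts by summing the rows that share a group label. The toy example below shows that step in isolation using base R. The matrix values and the `geneA`/`geneB` labels are hypothetical, and `rowsum()` is shown only as an equivalent base R shortcut, not as what prep_sample.R does.

```r
# Toy exon-by-sample counts (hypothetical values)
exon_counts <- matrix(
    c(10, 0,
       5, 2,
       3, 1,
       7, 4),
    nrow = 4, byrow = TRUE,
    dimnames = list(NULL, c("sampleA", "sampleB"))
)

# Hypothetical gene label for each exon row
count_groups <- c("geneA", "geneA", "geneB", "geneB")

# Same pattern as in the snippet above: split rows by gene, sum each column
counts_gene <- lapply(split(as.data.frame(exon_counts), count_groups), colSums)
counts_gene <- do.call(rbind, counts_gene)

# Equivalent one-liner in base R
counts_gene_alt <- rowsum(exon_counts, count_groups)
```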

would it be possible to set a filter by tissue of analysis?

Hi,
First, thanks a lot for this work; it's really nice to be able to access this data.
I wondered if it would be possible to set a filter by tissue to select only studies that have the tissue of interest.
It seems that when I look for a tissue in the search field, it matches the abstract but not the actual tissue of analysis. Maybe I am doing it the wrong way too; I would love to know the most direct way to restrict by tissue.
Thanks a lot in advance!
Best,
Laure

download csv for filtered studies

feature request: add a download csv button for list of filtered studies displayed on the shiny app

collab: "hey, can you find some public data to look at X?"

me: "sure."

(5 minutes pass)

me: "here's a spreadsheet. what looks interesting to you?"

etc.

Version number for the database

Hi,

Is there versioning available for recount2 (beyond the global v2 label)? I'd like to be able to see when additional (analysis-ready) data is added and check whether my local version is out of date.

Line 130 'exons' should be replaced by 'exon'

## Load exons info
exon <- recount_exons

## Create rse_exon
exons_all <- unlist(exons)
Error in unlist(exons) : object 'exons' not found

wiggletools raises an error

In prep_sample.R, I had to replace
system(paste(opt$wiggletools, 'AUC', auc_file, bw))
with
system(paste(opt$wiggletools, 'print' , auc_file, 'AUC', bw))
to make it work!

Access to novel exons using reference-free quantification

To whom it may concern,

Hi, I'm wondering whether novel exons were taken into account and integrated into your resources when recount2 was compiled.
For example, a regular pipeline quantifies exon expression based on annotation, losing information about novel exons within isoforms. StringTie/StringTie2 can usually reveal novel exons specifically expressed in samples.

I'm not sure whether recount2 integrates StringTie2-like functionality to reveal novel exons beyond the well-annotated ones in Gencode. Thanks.

Best,

one-base exons in TCGA & GTEx data

Hi.

I might be doing something incorrect, but I find one-base and other short exons in TCGA and GTEx exon count data. I wasn't able to find an explanation in your Genome Biology paper, so I would appreciate a pointer if it's there somewhere.

Here is an example with a small file from TCGA:

> load("rse_exon_bile_duct.Rdata")               
> rse_exon@rowRanges[]   
GRanges object with 603139 ranges and 0 metadata columns:
                     seqnames                 ranges strand
                        <Rle>              <IRanges>  <Rle>
  ENSG00000000003.14     chrX [100627109, 100628669]      -
  ENSG00000000003.14     chrX [100628670, 100629986]      -
  ENSG00000000003.14     chrX [100630759, 100630866]      -
(...)
> rse_exon@rowRanges[520:522]  
GRanges object with 3 ranges and 0 metadata columns:
                     seqnames               ranges strand
                        <Rle>            <IRanges>  <Rle>
  ENSG00000001631.15     chr7 [92245913, 92245914]      -
  ENSG00000001631.15     chr7 [92245915, 92245915]      - <<<<<<<<<<<<<<<<<<<
  ENSG00000001631.15     chr7 [92245916, 92245920]      -

If these short exons are artifacts, are they "stealing" counts from legitimate exons? If so, is there a rational way to recapture these counts, like adding the counts to the longest overlapping exon, or is it safe to just ignore them? Should I be using derfinder?

Thanks.
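Short exons like the one flagged above can be listed by computing each range's width as end minus start plus one. The sketch below does this in base R on the three ranges printed above. The `min_width` cutoff is a hypothetical value chosen only to show the filter.

```r
# The three ranges printed above for ENSG00000001631.15 (chr7, minus strand)
exons <- data.frame(
    gene_id = rep("ENSG00000001631.15", 3),
    start = c(92245913L, 92245915L, 92245916L),
    end = c(92245914L, 92245915L, 92245920L),
    stringsAsFactors = FALSE
)

# Width of a closed interval [start, end]
exons$width <- exons$end - exons$start + 1L

# Rows that are exactly one base wide
one_base <- which(exons$width == 1L)

# HYPOTHETICAL cutoff for "short" exons, for illustration only
min_width <- 3L
short_exons <- exons[exons$width < min_width, ]
```

On the real object, `width()` applied to the GRanges returned by `rowRanges(rse_exon)` gives the same widths without rebuilding a data frame.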

zero-based stop positions in BED files for splice junctions

Hi.

Please check my math, but it looks like the stop positions in recount BED files for splice junctions are 0-based instead of 1-based.

Here is an example. I chose a small study at random and clicked on jx_bed. Here are two KnownGene junctions from this file:

chr1 1407019 1407129 ... -
chr1 1486235 1486542 ... +

In real life (UCSC browser 1-based coordinates), the first intron goes from 1407020 to 1407130, and the second from 1486236 to 1486543. Stop positions are 1-based in BED format. I hate that too. :O)

It would be useful to include a comment in the files themselves or in the file info column to the effect that these are zero-based intronic hg38 coordinates, and assuming you don't want to re-create all the files, to call the files something other than BED.
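Until the files are annotated or renamed, the shift can be applied on the user side. The base R sketch below takes the two junctions quoted above and adds 1 to both start and end, which reproduces the intron coordinates given in the report. Shifting both columns follows the reporter's observation about these particular files. In a standard BED file only the start would need the +1, since BED ends already coincide with 1-based inclusive ends.

```r
# The two junctions quoted above, as they appear in the jx_bed file
jx_bed <- data.frame(
    chr = c("chr1", "chr1"),
    start = c(1407019L, 1486235L),
    end = c(1407129L, 1486542L),
    strand = c("-", "+"),
    stringsAsFactors = FALSE
)

# ASSUMPTION (from the report): both columns are zero-based, so shift both
jx_1based <- transform(jx_bed, start = start + 1L, end = end + 1L)
print(jx_1based)
```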