thelovelab / tximport Goto Github PK

View Code? Open in Web Editor NEW

134.0 15.0 33.0 366 KB

Transcript quantification import for modular pipelines

R 93.41% TeX 6.59%

rnaseq bioconductor deseq2

tximport's Introduction

tximport

Import and summarize transcript-level estimates for transcript- and gene-level analysis

Description of methods and analysis described in:

Charlotte Soneson, Michael I. Love, Mark D. Robinson. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences, F1000Research, 4:1521, December 2015. doi: 10.12688/f1000research.7563.1

Imports transcript-level abundance, estimated counts and transcript lengths, and summarizes into matrices for use with downstream statistical analysis packages such as edgeR, DESeq2, limma-voom. Average transcript length, weighted by sample-specific transcript abundance estimates, is provided as a matrix which can be used as an offset for different expression of gene-level counts.

See examples in the vignette.

Notes:

tximport as of version 1.3.9 will import inferential replicates (Gibbs samples or bootstrap samples) from Salmon, Sailfish or kallisto.
Though we provide here functionality for performing gene-level differential expression using summarized transcript-level estimates, this is does not mean we suggest that users only perform gene-level analysis. Gene-level differential expression can be complemented with transcript- or exon-level analysis. The argument txOut=TRUE can be used to generate transcript-level matrices.

tximport's People

Contributors

Stargazers

Watchers

tximport's Issues

more flexible file designation for sailfish

Trying to import sailfish results that aren't in the typical dir/quant.sf and dir/stats.tsv convention, because I've moved/renamed some things:

> sf <- tximport("data/sailfish-txps.txt", type="salmon", gene2tx=grch38_gt)
reading in files
1 Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'data/stats.tsv': No such file or directory

Coming from this bit of code:

tmp2 <- read.table(file.path(dirname(x), "stats.tsv"),

tximport is strictly looking in the same path as the results file for stats.tsv. Suggest allowing explicitly specifying this file as an argument, but default back to this?

tximport of kallisto h5s generated with multiple threads is not deterministic

I quantified bulk RNA-seq with kallisto quant with default arguments (therefore, no bootstraps) and ten threads.

kallisto quant --threads 10 --index {input.idx} --output-dir {output.kout} {input.fqs}

I then imported the h5 files via tximport.

tximport("abundance.h5", tx2gene=tx2gene, type="kallisto", countsFromAbundance="scaledTPM")

The results of tximport are saved to TSV. If I repeat this process again with no changes to get another TSV, I find that the two TSVs are different.

$ md5sum scounts.tsv.gz old.scounts.tsv.gz
35c170054847cd6af28d35262022fc85  scounts.tsv.gz
1c5b5477e03c64f298363a48baecacdb  old.scounts.tsv.gz
$ zcat scounts.tsv.gz | sort | md5sum
c661db690d5efd537745312044f334df  -
$ zcat old.scounts.tsv.gz | sort | md5sum
4eb4196d95072574eb8ab5c6fb04106e  -

If I set kallisto to single-threaded execution, making no other changes, I get a deterministic result: the same TSV every run.

I report this issue here rather than with kallisto, because kallisto authors have already responded to threading and determinism here: pachterlab/kallisto#236 (comment). They say that the multi-threaded kallisto output only looks different due to randomness in which threads finish first.

tximport of feature counts?

I can't find a way to do the featureCounts tximport. I have this error although I chose countsFromAbundance = "no".

# path <- file.path(targets$sample_label, "abundance.tsv")
# all(file.exists(path))
#
# Tx_1 <- roslin.annotation[, c(8, 24)]
# # colnames(Tx_1)
# names(Tx_1) <- c("target_id", "gene_name")
# Tx_1 <- as_tibble(Tx_1)

gene_counts <- tximport(path,
                       type = "none",
                       tx2gene = Tx_1,
                       txOut = FALSE,
                       countsFromAbundance = "no",
                       importer = read_tsv, # did this as I was getting an error > reading in files with read_tsv
1 Error in importer(files[i]) : could not find function "importer"
                       countsCol = "counts",
                       lengthCol = "Length",
                       txIdCol = "target_id",
                       ignoreTxVersion = TRUE,
                       ignoreAfterBar = TRUE)
Rows: 45713 Columns: 3                                                                                                                             
── Column specification ────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): target_id
dbl (2): Length, counts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in tximport(path, type = "none", tx2gene = Tx_1, txOut = FALSE,  : 
  argument "abundanceCol" is missing, with no default

Returns TPM when abundanceCol is set to any value for type='rsem'

Desired behavior is to return FPKMs when abundanceCol='FPKM' and throw an error if an invalid value is set - TPMs are returned regardless of its value currently without a warning to the user.

e.g.
tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE, abundanceCol = 'FPKM')

tximport_1.18.0

Error in 1:nrow(m): argument of length 0

I get the error Error in 1:nrow(m): argument of length 0 when trying to import either kallisto or sailfish output. Here's a reprex.

Load packages used here.

library(dplyr)
library(readr)
library(tximport)
library(annotables) # devtools::install_github(stephenturner/annotables)

To summarize transcripts to the gene level we'll need to use a transcript-to-gene annotation. I provide this in the annotables package (blog post; install from github with devtools::install_github("stephenturner/annotables")). The grch38_gt table contains mappings for 216133 transcripts to 65988 genes.

annotables::grch38_gt

## Source: local data frame [216,133 x 2]
## 
##            ensgene          enstxp
##              (chr)           (chr)
## 1  ENSG00000210049 ENST00000387314
## 2  ENSG00000211459 ENST00000389680
## 3  ENSG00000210077 ENST00000387342
## 4  ENSG00000210082 ENST00000387347
## 5  ENSG00000209082 ENST00000386347
## 6  ENSG00000198888 ENST00000361390
## 7  ENSG00000210100 ENST00000387365
## 8  ENSG00000210107 ENST00000387372
## 9  ENSG00000210112 ENST00000387377
## 10 ENSG00000198763 ENST00000361453
## ..             ...             ...

Download data from here: https://dl.dropboxusercontent.com/u/66281/sailfish-txps.txt to data/ directory. Take a look at what the file looks like:

readLines("data/sailfish-txps.txt", n=20)

##  [1] "# sailfish (quasi-mapping-based) v0.7.6"        
##  [2] "# [ program ] => sailfish "                     
##  [3] "# [ command ] => quant "                        
##  [4] "# [ index ] => { sailfish-grch38-index }"       
##  [5] "# [ libType ] => { IU }"                        
##  [6] "# [ threads ] => { 3 }"                         
##  [7] "# [ geneMap ] => { Homo_sapiens.GRCh38.82.gtf }"
##  [8] "# [ mates1 ] => { SRR493366_1.fastq }"          
##  [9] "# [ mates2 ] => { SRR493366_2.fastq }"          
## [10] "# [ output ] => { SRR493366-sfem }"             
## [11] "# [ mapping rate ] => { 95.2197% }"             
## [12] "# Name\tLength\tTPM\tNumReads"                  
## [13] "ENST00000415118\t8\t0\t0"                       
## [14] "ENST00000448914\t13\t0\t0"                      
## [15] "ENST00000631435\t12\t0\t0"                      
## [16] "ENST00000434970\t9\t0\t0"                       
## [17] "ENST00000632684\t12\t0\t0"                      
## [18] "ENST00000431440\t16\t0\t0"                      
## [19] "ENST00000454691\t18\t0\t0"                      
## [20] "ENST00000451044\t17\t0\t0"

Try importing:

sf <- tximport("data/sailfish-txps.txt", type="salmon", gene2tx=grch38_gt)

## Error in 1:nrow(m): argument of length 0

Error in importing RSEM gene level result files

I used this line of code to import RSEM ".gene.results" files:
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)，and it returns error："all(c(geneIdCol, abundanceCol, lengthCol) %in% names(raw)) is not TRUE" ，
then i explicitly define the column names like in the line of code below：
txi.rsem <- tximport("ceshi2.tsv", type = "rsem", txIn = FALSE, txOut = FALSE, tx2gene = NULL, countsFromAbundance = "no", varReduce = FALSE, dropInfReps = FALSE, ignoreTxVersion = FALSE, geneIdCol = "gene_id", txIdCol = "transcript_id", abundanceCol = "TPM", countsCol = "expected_count", lengthCol = "effective_length")
it returns the same error.
my files are just like the picture :

Tximport issue with new R version

I use the package called Tximport for long time to generate the tables I use for RNA-Seq. The code is quite simple. Here it is:

gene_id_import <- read.delim("gene_id_import.txt") # This is the file with the transcripts version and their respective gene names.
library(tximport)
filepath = "C:/ad4gy/MM"
files<- sapply(list.dirs(path = filepath, recursive = FALSE)[grep("_quant2TRIMbias",list.dirs(path = filepath, recursive = FALSE))],function(x) paste0(x,"/quant.sf"))# This command searchs for the quant.sf files inside each folder
#Tximport with length scaled TPM values as output (scaled up to library size and average transcript length over samples)
txi<- tximport(files= files, type= "salmon", tx2gene = gene_id_import, countsFromAbundance = "lengthScaledTPM")# When I run this line, now it’s giving the error:

reading in files with read_tsv
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid bytes in UTF8 string.
"seqBias": [], "": "–gcBias", "auxDir": "aux_in
(right here) ------^

#save your output file typing:
write.table(txi, file = "counts.txt", quote = TRUE, sep = "\t")

I have no idea about this error or how to correct it. Besides, it’s in the last line of the code. I noticed the mistake started after I upgrade to R4.0.3
Do you any idea how to fix it?
I appreciate all the help?
Regards,

Importing transcript abundance datasets generated with Salmon 0.14.0

Hi!
I have recently updated Salmon to the latest version (0.14.0) and I'm not anymore able to import Salmon outputs into R using the code
tximport(file, type = "salmon", txOut = TRUE, countsFromAbundance = "no")

This is the error that I get when I run the code above:

txi <- tximport(file, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
reading in files with read_tsv
Error in readBin(bootCon, "integer", n = expected.n) :
invalid 'n' argument

Any suggestion?
Thanks

Edo

`ignoreAfterBar` not used if `txOut=TRUE`

It seems that ignoreAfterBar is not used if transcript-level output is requested. Am I overlooking a reason why this should not be supported (i.e., return only the part of the transcript ID before the first |)?

Move `DESeqDataSetFromTximport` into tximport for now?

Since the DESeqDataSetFromTximport function is only available in the development version of DESeq2, does it make sense to move the function into tximport until the next DESeq2 stable release?

effective length for salmon

instead of using the length column in quant.sf which is actual tx length, use the effective length in stats.tsv

Error: all(file.exists(files)) is not TRUE

I'm working with some Salmon files using mouse RNA-Seq. I have followed the steps of the vignette and it seems to work, until I reach the tximport step when I run the following command:

txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, reader = read_tsv)

I get the following error:

Error: all(file.exists(files)) is not TRUE

I am not sure if this is an issue with tximport or with my own data.

My tx2gene looks like this:

TXNAME,GENEID
ENSMUSG00000064372,mt-Tp
ENSMUSG00000064371,mt-Tt
ENSMUSG00000064370,mt-Cytb
ENSMUSG00000064369,mt-Te

and my files looks like this:

                                                                                                   sample_1

"/Documents/data/samples/deconv//salmon/sample_1/quant.genes.sf"
sample_2
"/Documents/data/samples/deconv//salmon/sample_2/quant.genes.sf"
sample_3
"~/Documents/data/samples/deconv//salmon/sample_3/quant.genes.sf"

Thanks in advance for pointing out to a solution or a glaring error from my side.

Regards

Please read this before submitting an issue or PR

If you have a question about usage of the software, do not post here, instead post to the Bioconductor support site:

https://support.bioconductor.org

This GitHub repo is to facilitate browsing of the code, and pull requests. For non-minor pull requests, it's a very good idea to first post to the support site link above, where we can discuss beforehand. I probably will not accept major PR that haven't been discussed beforehand, because I would need to have time to ensure that it won't cause unintended consequences on the user base.

Error in H5Fopen

Hi,mikelove
I used the kallisto for RNA-seq, then import the result "*.h5" by the tximport tools, but I get the following error:
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.

I am not sure if this is an issue with tximport or with my own data.

rename argument: gene2tx => tx2gene

Outdated vignette?

I just wanted to indicate, that the currently linked bioconductor vignette is not the same as the tximport.Rmd in this repository.
Especially the edgeR-part in the vignette seems to be outdated in comparison to the .Rmd file (the changes were committed before the 26th of June, which is the date of the linked vignette).
Or am I missing something?

Add option to import only whitelisted barcodes from alevin

Alevin returns a whitelist.txt file - add option to import only barcodes listed therein.

Recomended count scaling method for DTU analysis

When using tximport to import transcript level quantification with the end goal of performing DTU analysis which of the 4 count scaling methods would you recommend? (Asking since dtuScaledTPM was introduced after the Swimming downstream paper recommending "scaledTPM").

salmon/sailfish importer rework

this line won't work, number of lines depends on call

https://github.com/mikelove/tximport/blob/master/R/tximport.R#L82

Missing release tags

Hello,
May I ask for release tags? Tximport is part of the pigx-rnaseq pipeline (https://github.com/BIMSBbioinfo/pigx_rnaseq/) that I have just packaged for Debian. With release tags we would be informed about updates. The version on BioConductor is 1.12.3, which we could also use if that is what you prefer.
Many thanks
Steffen

feature request: modularize gene-level summarization when txOut=TRUE

I wanted to use tximport to read in a bunch of sailfish output files then write out abundances/counts for both genes and transcripts. I ended up doing the import twice, once for txOut=TRUE and once where it was FALSE. Would be nice to allow usage where txOut=FALSE, then apply a later function that would perform the summarization as a separate step.

e.g.

txdata <- tximport(files, txOut=FALSE)
gndata <- txGeneSum(txdata)

Error against the result of salmon with option --numBootstraps

Hi, @mikelove

I tried to perform tximport against the result files of salmon,
but it seems not properly be performed.

library("tximport")
srr <- paste0("SRR", 3670977:3670992)
file_salmon_options <- file.path("analysis/salmon_options", srr, "quant.sf")
names(file_salmon_options) <- srr
txi.salmon_options <- tximport(
	file_salmon_options, type="salmon", txOut=TRUE)
# reading in files with read_tsv
# 1 Error in readBin(bootCon, "integer", n = expected.n) :
#  invalid 'n' argument

The job of salmon is performed with the option "--numBootstraps 100",
and I think this part might be related to the error,
because the error is related to a metadata file (meta_info.json),
which is saved in the directory containing the result of salmon,
and specifying the option changes the value of "num_bootstraps" as 100.
https://github.com/mikelove/tximport/blob/8373c95d36fd342f2f53598ef6bc984021a982b2/R/infReps.R#L75

Could you confirm that this error is reproduced in your environment?

The script of the salmon I performed is this:

#!/bin/bash

srrs=({77..92})

for srr in ${srrs[@]}; do
	tools/salmon-latest_linux_x86_64/bin/salmon \
	quant \
	-i data/transcripts_index_salmon \
	-p 4 \
	-l A \
	--numBootstraps 100 \
	--seqBias \
	--gcBias \
	--posBias \
	-r data/SRR36709${srr}.fastq.gz \
	-o analysis/salmon_options/SRR36709${srr}
done

and the results of salmon I performed is here↓
https://www.dropbox.com/s/qw495oagjj5wb8l/analysis.zip?dl=0

The result of sessionInfo is as follows.

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /home/koki/Software/R-3.5.0/lib/libRblas.so
LAPACK: /home/koki/Software/R-3.5.0/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.3.1    jsonlite_1.6   tximport_1.8.0

loaded via a namespace (and not attached):
 [1] compiler_3.5.0  R6_2.4.0        hms_0.4.2       tools_3.5.0
 [5] rhdf5_2.24.0    pillar_1.4.1    tibble_2.1.3    Rcpp_0.12.19
 [9] crayon_1.3.4    pkgconfig_2.0.2 rlang_0.3.0.1   Rhdf5lib_1.2.1

Different length and counts depending on the number of loaded samples

Hi,

I noticed something weird using tximport by summarizing at the gene level.
Depending on the number of samples I load I don't have the same length and counts for my genes in the corresponding samples.

Quant_Matrix_tximport=tximport(**AlltheFiles**,txIn = T, txOut = F, type="salmon", countsFromAbundance ="lengthScaledTPM", tx2gene = tx2gene)
Quant_Matrix_tximport1=tximport(**20FirstFiles**,txIn = T, txOut = F, type="salmon", countsFromAbundance ="lengthScaledTPM", tx2gene = tx2gene)

Example:
When all samples are load I have Sample1:Gene 1 --> Counts=386.3943113
When 20 samples are loaded I have Sample1:Gene1 --> Counts=383.4657441

I used "no" option insted to check. I did not see any differences at the count level but at the length of the genes level.

Is this something that is expected ?

Florian Rocher

make it easier to read in rsem tx-level ests

cann't import kallisto abundance.h5/.tsv

Hi! I just want to import the kallisto abundance.h5/abundance.tsv by tximport, but I failed.

 ~/RNA-seq/Analysis/GSE112055/quant_out/SRR6868519]$ cat run_info.json
{
        "n_targets": 141398,
        "n_bootstraps": 100,
        "n_processed": 59161050,
        "n_pseudoaligned": 50375628,
        "n_unique": 18126755,
        "p_pseudoaligned": 85.1,
        "p_unique": 30.6,
        "kallisto_version": "0.46.2",
        "index_version": 10,
        "start_time": "Mon May  4 17:44:24 2020",
        "call": "kallisto quant -i /home2/ymwang/linqin/RNA-seq/fasta/release99/transcripts_Mus.idx -o quant_out/SRR6868519 -t 24 -b 100 download_GSE112055/SRR6868519.sra_1.fastq download_GSE112055/SRR6868519.sra_2.fastq"
}

~/RNA-seq/Analysis/GSE112055/quant_out/SRR6868519]$ head abundance.tsv
target_id       length  eff_length      est_counts      tpm
ENSMUST00000177564.1    16      8.71429 0       0
ENSMUST00000196221.1    9       8.66667 0       0
ENSMUST00000179664.1    11      8.25    0       0
ENSMUST00000178537.1    12      9.25    0       0
ENSMUST00000178862.1    14      8       0       0
ENSMUST00000179520.1    11      8.25    0       0
ENSMUST00000179883.1    16      8.71429 0       0
ENSMUST00000195858.1    10      9.66667 0       0
ENSMUST00000179932.1    12      9.25    0       0

> library(tximport)
> library(rhdf5)
> sample_names = c("SRR6868519",paste0("SRR686852", 0:8))
> quant_files = file.path("/home2/ymwang/linqin/RNA-seq/Analysis/GSE112055/quant_out", sample_names, "abundance.h5")
> file.exists(quant_files)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> txi = tximport(files = quant_files, type = "kallisto", txOut = TRUE)
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate

> quant_files = file.path("/home2/ymwang/linqin/RNA-seq/Analysis/GSE112055/quant_out", sample_names, "abundance.tsv")
> file.exists(quant_files)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> txi = tximport(files = quant_files, type = "kallisto", txOut = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate
In addition: Warning messages:
1: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
2: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
>

Then, I reproduced the example(https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html). Interestingly, it had different results for two kallisto files in package tximportData.

#kallisto_boot dir
> files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5")
> names(files) <- paste0("sample", 1:6)
> all(file.exists(files))
[1] TRUE
> txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate
In addition: Warning messages:
1: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
2: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.

> files <- file.path(dir, "kallisto_boot", samples$run, "abundance.tsv.gz")
> names(files) <- paste0("sample", 1:6)
> all(file.exists(files))
[1] TRUE
> txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate
In addition: Warning messages:
1: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
2: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.

#kallisto dir
> files <- file.path(dir, "kallisto", samples$run, "abundance.tsv.gz")
> file.exists(files)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
> txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6
summarizing abundance
summarizing counts
summarizing length
Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
> txi.kallisto.tsv <- tximport(files, type = "kallisto", txOut=T)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6
Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.

What's the problem with it? does tximport need special outputs from kallisto with special arguments?

import using tximport for the rsem count table to get integers?

The rsem table doesn't have integers so it is common to use tximport before running differential gene expression downstream analysis. I am wondering what tximport do in rsem table? what kind of modification?
I am trying to use SARTools (deseq2 method) to get my differential expressed genes but the input count table needs to be an integer so I was thinking of just using round() as I am not sure what tximport modification is doing for the rsem table.

Move Alevin import code to C++

Plans to move the Alevin import code to C++, big improvements in speed shown by @k3yavi

Code will probably go into fishpond as it is designed to be a package for Salmon to Bioc methods and utilities.

Feature Idea : Derive abundance from inferential replicates

It would be nice to allow tximport to optionally derive the abundance for each transcript in each sample by computing some summary statistic (e.g. median or mean) of the inferential replicates, if they are available. These can sometimes be more robust or accurate than individual point estimates, and if inferential replicates are already being read in and summarized to obtain variances, the overhead for doing what is suggested above should be marginal.

error in 1:nrow(m): argument of length 0

Dear all,
I struggle with the error
Error in 1:nrow(x) : argument of length 0

I was hoping that tximport can help here.
Specifically, I imported a shapefile for block groups that include census tracts for the US. I am trying to compute dissimilarity, spatial proximity, absolute clustering, eta2, and isolation indexes for each census tract based on group blocks using OasisR package. The code worked for larger units using a different shapefile (Metropolitan areas based on tracts). However, now I am receiving an error I mentioned above. Here is the full code:
data<-readOGR(dsn="/Users/16033/Desktop/Exp_shapefiles_dropped", layer="Export_Output_2000b")
library(OasisR)
x<-data@data
tractlist <- unique(x$gisjoin_t1)
ntract <- length(tractlist)
result <- as.data.frame(matrix(0, nrow = ntract, ncol = 6))
names(result) <- c("gisjoin_t1", "Dissim", "SpatProx","AbsClus","Eta2","Isol")
result$gisjoin_t1 <- tractlist
for (i in 1:ntract)
{
prov <- subset(x, gisjoin_t1 == tractlist[i])
result$Dissim[i] <- Atkinson(prov[,2:3])
}
After this command I am getting an error:
Error in 1:nrow(x) : argument of length 0
... Same for other indexes.
Here is the first 50 observation from my file
reconstructed <- structure(list(gisjoin_t1 = c(200130000100, 200130000100, 200130000100, 200160000100, 200160000200, 200160000200, 200200000101, 200200000101, 200200000101, 200200000102, 200200000102, 200200000102, 200200000102, 200200000201, 200200000201, 200200000202, 200200000202, 200200000202, 200200000202, 200200000203, 200200000203, 200200000203, 200200000203, 200200000203, 200200000204, 200200000204, 200200000300, 200200000400, 200200000500, 200200000500, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000701, 200200000701, 200200000701, 200200000701, 200200000702, 200200000702, 200200000702, 200200000703, 200200000703, 200200000703, 200200000703, 200200000703), black_n = c(18, 13, 14, 8, 127, 30, 9, 17, 30, 12, 6, 5, 0, 37, 57, 19, 22, 27, 24, 44, 25, 63, 44, 30, 29, 23, 1106, 814, 49, 92, 25, 129, 121, 146, 136, 104, 100, 120, 258, 109, 85, 153, 121, 166, 262, 106, 99, 173, 34, 197), mhwhite_n = c(231, 127, 163, 279, 949, 831, 342, 1561, 2197, 762, 1160, 845, 1147, 1307, 1130, 968, 1341, 1230, 1518, 1081, 1796, 2410, 1551, 1151, 1196, 884, 3498, 4909, 559, 477, 101, 335, 291, 341, 383, 295, 271, 221, 790, 630, 504, 487, 450, 963, 1283, 603, 355, 642, 458, 700), his_n = c(151, 59, 129, 22, 323, 228, 22, 59, 69, 16, 33, 28, 41, 64, 93, 50, 48, 49, 30, 30, 38, 129, 59, 55, 48, 40, 503, 475, 41, 181, 12, 49, 101, 115, 146, 47, 98, 109, 124, 70, 109, 66, 63, 83, 136, 51, 40, 78, 44, 70)), row.names = 0:49, class = "data.frame")

As far as I read, I need to post debunk=FALSE somewhere, but I am not sure where. Thank you in advance
Sylwia

Dealing with multispecies using tximport

Hi,

I have a methodological question about how to deal with multispecies mapping and tximport.

I have RNA-seq samples that are a blend of 3 different species interacting with each other. So I performed the mapping with Salmon using an hybrid reference composed of the transcriptomes of those 3 species.
Now I wonder how to correctly use tximport on the outputs of Salmon in order to generate one counting matrix for each those 3 species. Because normalization according to the library size will be done for each species independently I was wondering if I should separate the salmon outputs according to the species before running tximport.
From the documentation of tximport, I should use "countsFromAbundance=no". This parameter will then use the NumReads column of the Salmon output which appears to partly rely on the relative abundance for each transcript. Is NumReads partly based on the library size ? If it's not I guess that dividing the results per species before or after tximport won't have any effect but if NumReads dependent of the library size then I should run tximport on the complete output of Salmon.

Thanks for your help,

Florian Rocher

can't read alevin output from salmon 0.14.0

Tried with latest version of tximport installed with devtools::install_github("mikelove/tximport") 1.13.3

tximport::tximport(file.path(alevin_dir, 'quants_mat.gz'), type = 'alevin')

Error in mat[, j] <- readBin(con, double(), endian = "little", n = num.genes) : 
  number of items to replace is not a multiple of replacement length

Probably related to this:

The binary output format of alevin, quants_mat.gz, has been changed into a sparse single precision format. In pratice we saw the file size reduced to as big as half the size of the original file.

Transcript expression collapsing to gene-level expression

Sorry, I have raw read counts from FeatureCounts, can I still use tximport to collapsing transcrip expression to gene level?

Which measure to use for gene Fold change analysis between pre- and post-treatment samples and for gene comparison within a sample

Hi, new to bulkRNAseq analysis and a bit confused about different normalization measures that are used.

My understanding is that TPMs are fine for comparison of gene expression levels within a sample (so also for looking at correlation of expression between two genes in many samples separately) while for looking at fold change expression of list of genes of interest in two samples (e.g. pre and post treatment) the way to go is to transform the estimated counts from e.g. kallisto (imported with tximport using the CountsfromAbindance = "no" option) with the edgeR pipeline described in the tximport vignette. What confuses me is:

what is obtained at the end of the edgeR code described in the vignette are called CPMs (cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE) ) while to me these appear very similar to TMMs instead (since they are calculated accounting both for different average transcript length between samples and for composition biases between samples). Am I getting this wrong?
the vignette appears to suggest that processing the data with the estimated ("raw") counts with the edgeR pipeline has a similar effect as using scaledTPM or lengthscaledTPM but in fact what is obtained are very different counts. Quoting from one of your answer in (https://support.bioconductor.org/p/84883/): "These two have library size differences baked in. The column sum is equal to the number of mapped reads. So not comparable across samples." While as per 1) above I believe that the "CPMs" obtained with the edgeR vignette CAN be used for across sample comparison. Am I right?

In conclusion, what measure would you suggest to use for calculating logFC of a list of genes of interest between a pre and a post treatment sample? Is it the CPMs as obtained via the edgeR pipeline in the vignette or something else? And if the first, I guess those "CMPs" could be used also for looking at how pairs of genes correlate in different samples (one point per sample in a correlation plot)

Thank you very much in advance for any help tyou will ge able to give me.

Marco

filtering lowly expreed genes after or before normalization ?

Hi, I need to figure out which approach is more appropriate regarding filtering lowly expressed genes. According to tximport manual, it is recommended to follow following commands for EdgeR analysis:
library(edgeR)

cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)

and to continue with y as a DGE object. In my analysis I filtered out the lowly expressed genes based on the cpm value (for instance, cpm value is greater than 1 in at least the number of small group of samples) using "keep.lib.sizes=FALSE" after doing above mentioned normalization.
I am now confused if my approach is appropriate and if I should do the normalization after filtering?

Thanks for your help.
Best,

Unexpected behavior when processing RSEM transcript level files

I used this line of code to process RSEM ".isoforms.results" files:
txi.RSEM <- tximport(files, type = "rsem", reader = read_tsv, txIn = TRUE, txOut = TRUE)
The rownames of the resulting list objects, are gene ID's, which is not what I expected, since I put the txOut option to true.

When I explicitly define the column names like in the line of code below, it works like expected (the rownames of the resulting list objects are now transcript ID's):
txi.RSEM <- tximport(files, type = "none", geneIdCol = "gene_id", txIdCol = "transcript_id",
abundanceCol = "TPM", countsCol = "expected_count", lengthCol = "length", reader =
read_tsv, txIn = TRUE, importer = read.delim, txOut = TRUE)

Idea: Supporting alevin output

(Unsure if this idea belongs here, in the tximeta repo, or in the salmon repo. Apologies if I guessed wrong. Tagging @rob-p and @k3yavi for their thoughts).

I'm looking into using the alevin tool from salmon for processing 3'-tag scRNA-seq data (currently just in the planning stage for such a workflow).
In such a workflow I'll ultimately be wanting to get the gene-sample count matrix created by alevin into a SingleCellExperiment (ideally with all the metadata via the very awesome tximeta).

Looking at the alevin docs (https://salmon.readthedocs.io/en/latest/alevin.html#output):

by default alevin dumps a per-cell level gene-count matrix in a binary-compressed format with the row and column indexes in a separate file.
A typical run of alevin will generate 3 files:
quants_mat.gz – Compressed count matrix.
quants_mat_cols.txt – Column Header (Gene-ids) of the matrix.
quants_mat_rows.txt – Row Index (CB-ids) of the matrix.
Alevin can also dump the count-matrix in a human readable – comma-separated-value (CSV) format

It's probably worth noting that the CSV version has cells along the rows and genes along the columns (i.e. opposite of SingleCellExperiment).

So a few questions:

@mikelove

Have you explored importing alevin output into R/Bioconductor?
Would you be interested in having this functionality in tximport?

@rob-p and @k3yavi

You note that the output format for scRNA-seq is itself an area of open research. Do you think the current formats are stable enough for me (and others?) to invest time in writing importers for R/Bioconductor?
Is there a more detailed description of the 'compressed count matrix' format (I'd like to see whether it maps to one of the sparse matrix formats in the R Matrix package).
Somewhat tangentially, have you considered HDF5 support (recognising that this I think this is quite a heavy thing to incorporate with and probably overkill for experiments with a few thousand cells)?

Thanks, I'm eager to hear your thoughts
Pete

ignoreAfterBar does not apply with txOut=T and countsFromAbundance != "no"

Thank you for an eminent and very convenient software.

I'm importing kallisto data from h5 files and want output the transcripts. With Gencode as a reference this comes with an inconvenient transcript naming adding |-separated information after the transcript ID. Exporting genes, this is handled by the ignoreAfterBar flag, but this never takes effect when exporting transcripts:

https://github.com/mikelove/tximport/blob/master/R/tximport.R#L578-L595

Is this intentional?

better warning if no tx2gene provided

if users don't provide a tx2gene argument, this could either be a mistake -- they didn't know they needed to tell the function which tx belong to which genes -- or they could want to have tx-level output, the current warning is:

Error: all(c(geneIdCol, lengthCol, abundanceCol) %in% names(raw)) is not TRUE

this could say, either provide tx2gene or specify txOut=TRUE

work with cuffnorm collated files

update docs and/or code wrt sailfish support

Typo in instructions for using ensembldb

Stupidly made a PR against the BioC mirror https://github.com/Bioconductor-mirror/tximport/pull/1. Same line doesn't seem to exist in vignette in this repo, so unsure if advice not updated or has been removed

Read in files in parallel

I am working with a large smart-seq2 dataset with 23k samples (preprocessed with Salmon).

I have seen tximport supports sparse matrices for single cell data, which is nice, but even with readr it takes very long to parse the salmon output files.

I checked the code and it seems that importing simply runs in a loop. Do you think this could be easily replaced with e.g. mclapply?

Best,
Gregor

sparse matrix support?

I was wondering whether it would possible to support sparse matrices as output of tximport.
This might make tximport more attractive for people working with scRNA-seq data.

I'm currently running into problems when importing >400 single cells because of memory limits.
Since the memory issues occur while importing the cells, converting the data to sparse format based on the SummarizedExperiment would not help because I can't get up to that point yet.

A current solution is to import the data in parts, and convert those to sparse matrices.

Enabling reading from cloud storage

It is great that the tximport function allows users to supply their own importer function.

I keep most of my files on AWS S3. The aws.s3 package provides great functionality to access S3 objects, e.g. via the s3read_using function.

So I tried to read my Salmon output files directly from AWS, using the following importer:

importer = aws.s3::s3read_using(FUN = readr::read_tsv)

Unfortunately, the tximport functions refuses to run unless the file.exists function returns TRUE for all paths, so S3 paths are not valid.

https://github.com/mikelove/tximport/blob/9b9d3e6e0d0843cfec2048e5dbd0a8ddca189a71/R/tximport.R#L145

Perhaps checking whether a file.exists could be made optional?

Thanks a lot for making this great package available!
Thomas

Check for existence of abundance.h5 before checking for rhdf5?

Currently the check for the presence of rhdf5 precedes the check of whether there is actually an abundance.h5 file to read (in readInfRepKallisto). This might cause errors in situations where one has only the abundance.tsv file, rhdf5 is not available, and one doesn't set dropInfReps=TRUE.

Allow missing-ness in individual transcriptomes

I have use cases for building matrices of TPM/counts where the length(txId) != length(raw[[txIdCol]]):

Combining data from human/mouse xenografts with measure of the same human or mouse genetic background. Obviously you'll want to account for all human/mouse reads during mapping and quantitation. Maybe you could filter unwanted transcript measurements from your input files before tximport, but this doesn't scale nicely.
Newer experiments where quantitation included a small number of "extra" transcripts such as virus or noncoding RNAs.
Newer experiments where quantitation was performed using the same transcripts but sorted differently.

I've created a patch that allows the user to combine input files with a subset of common features. Maybe this is of general interest? Sorry about the random edits in the diff - if you update the repository and Github I could do a proper PR.

From 1cf014290593e9fceec4decfc364af9c34ecf905 Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 14:30:36 -0400
Subject: [PATCH 1/4] Match rows during Salmon import, supporting import of
 sub/superset transcriptomes.

---
 R/tximport.R | 75 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 38 insertions(+), 37 deletions(-)

diff --git a/R/tximport.R b/R/tximport.R
index 93c1dc4..a80a4bc 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -4,7 +4,7 @@
 #' external software and optionally summarizes abundances, counts, and transcript lengths
 #' to the gene-level (default) or outputs transcript-level matrices
 #' (see \code{txOut} argument).
-#' While \code{tximport} summarizes to the gene-level by default, 
+#' While \code{tximport} summarizes to the gene-level by default,
 #' the user can also perform the import and summarization steps manually,
 #' by specifing \code{txOut=TRUE} and then using the function \code{summarizeToGene}.
 #' Note however that this is equivalent to \code{tximport} with
@@ -17,7 +17,7 @@
 #'   \item avoid gene-level summarization by specifying \code{txOut=TRUE}
 #'   \item set \code{geneIdCol} to an appropriate column in the files
 #' }
-#' 
+#'
 #' See \code{vignette('tximport')} for example code for generating a
 #' \code{tx2gene} data.frame from a \code{TxDb} object.
 #' Note that the \code{keys} and \code{select} functions used
@@ -41,16 +41,16 @@
 #' @param countsFromAbundance character, either "no" (default), "scaledTPM", or "lengthScaledTPM",
 #' for whether to generate estimated counts using abundance estimates scaled up to library size
 #' (scaledTPM) or additionally scaled using the average transcript length over samples and
-#' the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM, 
+#' the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM,
 #' then the counts are no longer correlated with average transcript length,
 #' and so the length offset matrix should not be used.
 #' @param tx2gene a two-column data.frame linking transcript id (column 1) to gene id (column 2).
-#' the column names are not relevant, but this column order must be used. 
+#' the column names are not relevant, but this column order must be used.
 #' this argument is required for gene-level summarization for methods
 #' that provides transcript-level estimates only
 #' (kallisto, Salmon, Sailfish)
 #' @param reader a function to replace read.delim in the pre-set importer functions,
-#' for example substituting read_tsv from the readr package will substantially 
+#' for example substituting read_tsv from the readr package will substantially
 #' speed up tximport
 #' @param geneIdCol name of column with gene id. if missing, the gene2tx argument can be used
 #' @param txIdCol name of column with tx id
@@ -66,7 +66,7 @@
 #' (default FALSE)
 #' @param txi list of matrices of trancript-level abundances, counts, and
 #' lengths produced by \code{tximport}, only used by \code{summarizeToGene}
-#' 
+#'
 #' @return a simple list with matrices: abundance, counts, length.
 #' A final element 'countsFromAbundance' carries through
 #' the character argument used in the tximport call.
@@ -77,20 +77,20 @@
 #' @describeIn tximport Import tx-level quantifications and summarize
 #' abundances, counts and lengths to gene-level (default)
 #' or simply output tx-level matrices
-#' 
+#'
 #' @references
 #'
 #' Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015):
 #' Differential analyses for RNA-seq: transcript-level estimates
 #' improve gene-level inferences. F1000Research.
 #' \url{http://dx.doi.org/10.12688/f1000research.7563.1}
-#' 
+#'
 #' @examples
 #'
 #' # load data for demonstrating tximport
 #' # note that the vignette shows more examples
 #' # including how to read in files quickly using the readr package
-#' 
+#'
 #' library(tximportData)
 #' dir <- system.file("extdata", package="tximportData")
 #' samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
@@ -124,7 +124,7 @@ tximport <- function(files,
   countsFromAbundance <- match.arg(countsFromAbundance, c("no","scaledTPM","lengthScaledTPM"))
   stopifnot(all(file.exists(files)))
   if (!txIn & txOut) stop("txOut only an option when transcript-level data is read in (txIn=TRUE)")
-  
+
   # kallisto presets
   if (type == "kallisto") {
     geneIdCol="gene_id"
@@ -134,7 +134,7 @@ tximport <- function(files,
     lengthCol <- "eff_length"
     importer <- reader
     }
-  
+
   # salmon/sailfish presets
   if (type %in% c("salmon","sailfish")) {
     geneIdCol="gene_id"
@@ -142,9 +142,9 @@ tximport <- function(files,
     abundanceCol <- "TPM"
     countsCol <- "NumReads"
     lengthCol <- "EffectiveLength"
-    importer <- function(x) reader(x, comment='#') 
+    importer <- function(x) reader(x, comment='#')
   }
-  
+
   # rsem presets
   if (type == "rsem") {
     txIn <- FALSE
@@ -154,11 +154,11 @@ tximport <- function(files,
     lengthCol <- "effective_length"
     importer <- reader
   }
-  
+
   if (type == "cufflinks") {
     stop("reading from collated files not yet implemented")
   }
-  
+
   # if input is tx-level, need to summarize abundances, counts and lengths to gene-level
   if (txIn) {
     message("reading in files")
@@ -171,7 +171,7 @@ tximport <- function(files,
       if ((i == 1) &
           (type %in% c("salmon","sailfish")) &
           !("EffectiveLength" %in% names(raw))) {
-        lengthCol <- "Length" 
+        lengthCol <- "Length"
         # because the comment lines have the same comment character
         # as the header, need to name the column names
         importer <- function(x) {
@@ -183,7 +183,7 @@ tximport <- function(files,
         raw <- as.data.frame(importer(files[i]))
       }
       #####################################################################
-      
+
       # does the table contain gene association or was an external tx2gene table provided?
       if (is.null(tx2gene) & !txOut) {
         # e.g. Cufflinks includes the gene ID in the table
@@ -204,11 +204,11 @@ tximport <- function(files,
         }
       } else {
         # e.g. Salmon and kallisto do not include the gene ID, need an external table
-        stopifnot(all(c(lengthCol, abundanceCol) %in% names(raw)))
+        stopifnot(any(c(lengthCol, abundanceCol) %in% names(raw)))
         if (i == 1) {
           txId <- raw[[txIdCol]]
         } else {
-          stopifnot(all(txId == raw[[txIdCol]]))
+          stopifnot(any(txId == raw[[txIdCol]]))
         }
       }
       # create empty matrices
@@ -220,9 +220,11 @@ tximport <- function(files,
         countsMatTx <- mat
         lengthMatTx <- mat
       }
-      abundanceMatTx[,i] <- raw[[abundanceCol]]
-      countsMatTx[,i] <- raw[[countsCol]]
-      lengthMatTx[,i] <- raw[[lengthCol]]
+      idx <- match(txId, raw[[txIdCol]])
+      abundanceMatTx[,i] <- raw[idx, abundanceCol]
+      countsMatTx[,i] <- raw[idx, countsCol]
+      lengthMatTx[,i] <- raw[idx, lengthCol]
+      lengthMatTx[,i] <- raw[idx, lengthCol]
     }
     message("")

@@ -236,12 +238,12 @@ tximport <- function(files,

     txi[["countsFromAbundance"]] <- NULL
     txiGene <- summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance)
-    return(txiGene)  
-    
+    return(txiGene)
+
   # e.g. RSEM already has gene-level summaries
   # just combine the gene-level summaries across files
   } else {
-  
+
     message("reading in files")
     for (i in seq_along(files)) {
       message(i," ",appendLF=FALSE)
@@ -259,7 +261,7 @@ tximport <- function(files,
       countsMat[,i] <- raw[[countsCol]]
       lengthMat[,i] <- raw[[lengthCol]]
     }
-  } 
+  }
   message("")
   return(list(abundance=abundanceMat, counts=countsMat, length=lengthMat,
               countsFromAbundance="no"))
@@ -283,11 +285,11 @@ summarizeToGene <- function(txi,
   abundanceMatTx <- txi$abundance
   countsMatTx <- txi$counts
   lengthMatTx <- txi$length
-  
+
   txId <- rownames(abundanceMatTx)
   stopifnot(all(txId == rownames(countsMatTx)))
   stopifnot(all(txId == rownames(lengthMatTx)))
-  
+
   # need to associate tx to genes
   # potentially remove unassociated transcript rows and warn user
   if (!is.null(tx2gene)) {
@@ -309,20 +311,20 @@ summarizeToGene <- function(txi,
     txId <- txId[sub.idx]
     geneId <- tx2gene$gene[match(txId, tx2gene$tx)]
   }
-  
+
   # summarize abundance and counts
   message("summarizing abundance")
   abundanceMat <- rowsum(abundanceMatTx, geneId)
   message("summarizing counts")
   countsMat <- rowsum(countsMatTx, geneId)
   message("summarizing length")
-  
-  # the next lines calculate a weighted average of transcript length, 
+
+  # the next lines calculate a weighted average of transcript length,
   # weighting by transcript abundance.
   # this can be used as an offset / normalization factor which removes length bias
   # for the differential analysis of estimated counts summarized at the gene level.
   weightedLength <- rowsum(abundanceMatTx * lengthMatTx, geneId)
-  lengthMat <- weightedLength / abundanceMat   
+  lengthMat <- weightedLength / abundanceMat

   # pre-calculate a simple average transcript length
   # for the case the abundances are all zero for all samples.
@@ -332,14 +334,14 @@ summarizeToGene <- function(txi,
   aveLengthSampGene <- tapply(aveLengthSamp, geneId, mean)

   stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat)))
-  
+
   # check for NaN and if possible replace these values with geometric mean of other samples.
   # (the geometic mean here implies an offset of 0 on the log scale)
-  # NaN come from samples which have abundance of 0 for all isoforms of a gene, and 
+  # NaN come from samples which have abundance of 0 for all isoforms of a gene, and
   # so we cannot calculate the weighted average. our best guess is to use the average
   # transcript length from the other samples.
   lengthMat <- replaceMissingLength(lengthMat, aveLengthSampGene)
-  
+
   if (countsFromAbundance != "no") {
     countsSum <- colSums(countsMat)
     if (countsFromAbundance == "lengthScaledTPM") {
@@ -350,7 +352,7 @@ summarizeToGene <- function(txi,
     newSum <- colSums(newCounts)
     countsMat <- t(t(newCounts) * (countsSum/newSum))
   }
-  
+
   return(list(abundance=abundanceMat, counts=countsMat, length=lengthMat,
               countsFromAbundance=countsFromAbundance))
 }
@@ -383,4 +385,3 @@ replaceMissingLength <- function(lengthMat, aveLengthSampGene) {
 ##            dimnames=list(levels(f), colnames(m)))
 ##   }
 ## }
-
-- 
2.5.4 (Apple Git-61)


From 26ce2bd0d703df9d210aff0115e0888c5343b1d2 Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 14:41:25 -0400
Subject: [PATCH 2/4] Test for any membership.

---
 R/tximport.R | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/tximport.R b/R/tximport.R
index a80a4bc..d01bd68 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -208,7 +208,7 @@ tximport <- function(files,
         if (i == 1) {
           txId <- raw[[txIdCol]]
         } else {
-          stopifnot(any(txId == raw[[txIdCol]]))
+          stopifnot(any(txId %in% raw[[txIdCol]]))
         }
       }
       # create empty matrices
-- 
2.5.4 (Apple Git-61)


From 005267ccf9c4df054c2b92a034b7bea12a54c6ee Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 15:13:19 -0400
Subject: [PATCH 3/4] Add argument exposing allowMissingIds.

---
 R/tximport.R | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/R/tximport.R b/R/tximport.R
index d01bd68..2fc53c6 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -118,7 +118,8 @@ tximport <- function(files,
                      lengthCol,
                      importer,
                      collatedFiles,
-                     ignoreTxVersion=FALSE) {
+                     ignoreTxVersion=FALSE,
+                     allowMissingIds=FALSE) {

   type <- match.arg(type, c("none","kallisto","salmon","sailfish","rsem"))
   countsFromAbundance <- match.arg(countsFromAbundance, c("no","scaledTPM","lengthScaledTPM"))
@@ -159,6 +160,9 @@ tximport <- function(files,
     stop("reading from collated files not yet implemented")
   }

+  # choose whether to allow missing transcripts
+  if (allowMissingIds) { allAny <- any } else { allAny <- all }
+  
   # if input is tx-level, need to summarize abundances, counts and lengths to gene-level
   if (txIn) {
     message("reading in files")
@@ -196,19 +200,20 @@ tximport <- function(files,

 ")
         }
-        stopifnot(all(c(lengthCol, abundanceCol) %in% names(raw)))
+        stopifnot(allAny(c(lengthCol, abundanceCol) %in% names(raw)))
         if (i == 1) {
           geneId <- raw[[geneIdCol]]
         } else {
-          stopifnot(all(geneId == raw[[geneIdCol]]))
+          stopifnot(allAny(geneId %in% raw[[geneIdCol]]))
         }
       } else {
         # e.g. Salmon and kallisto do not include the gene ID, need an external table
-        stopifnot(any(c(lengthCol, abundanceCol) %in% names(raw)))
+        stopifnot(allAny(c(lengthCol, abundanceCol) %in% names(raw)))
         if (i == 1) {
           txId <- raw[[txIdCol]]
         } else {
-          stopifnot(any(txId %in% raw[[txIdCol]]))
+          stopifnot(allAny(txId %in% raw[[txIdCol]]))
+          idx <- match(txId, raw[[txIdCol]])
         }
       }
       # create empty matrices
@@ -220,7 +225,6 @@ tximport <- function(files,
         countsMatTx <- mat
         lengthMatTx <- mat
       }
-      idx <- match(txId, raw[[txIdCol]])
       abundanceMatTx[,i] <- raw[idx, abundanceCol]
       countsMatTx[,i] <- raw[idx, countsCol]
       lengthMatTx[,i] <- raw[idx, lengthCol]
-- 
2.5.4 (Apple Git-61)


From 2870a1d82d1ffc4a3a576d1fe8bcb19e65cff2d0 Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 15:20:04 -0400
Subject: [PATCH 4/4] Fix index generation logic.

---
 R/tximport.R | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/R/tximport.R b/R/tximport.R
index 2fc53c6..7816f74 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -162,7 +162,7 @@ tximport <- function(files,

   # choose whether to allow missing transcripts
   if (allowMissingIds) { allAny <- any } else { allAny <- all }
-  
+
   # if input is tx-level, need to summarize abundances, counts and lengths to gene-level
   if (txIn) {
     message("reading in files")
@@ -213,7 +213,6 @@ tximport <- function(files,
           txId <- raw[[txIdCol]]
         } else {
           stopifnot(allAny(txId %in% raw[[txIdCol]]))
-          idx <- match(txId, raw[[txIdCol]])
         }
       }
       # create empty matrices
@@ -225,6 +224,7 @@ tximport <- function(files,
         countsMatTx <- mat
         lengthMatTx <- mat
       }
+      idx <- match(txId, raw[[txIdCol]])
       abundanceMatTx[,i] <- raw[idx, abundanceCol]
       countsMatTx[,i] <- raw[idx, countsCol]
       lengthMatTx[,i] <- raw[idx, lengthCol]
-- 
2.5.4 (Apple Git-61)

Import processed data

I am working with public data that's been quantified (using kallisto) on the transcript level. However, I am rather interested in a gene-level analysis and was hoping to aggregate the data to the gene level. I only have the processed transcript-level estimated counts matrix available, and an object to link transcripts to genes. I should also be able to get the transcript lengths.

However, I did not find documentation on how to do the aggregation using tximport if we can't get our hands on the quant files. Suppose we have the count matrix, tx2gene object and the transcript lengths, is it recommended to try and recreate the list structure created as output from tximport?

lose transcript names for salmon inferential replicates

Hello! I was using the following command to import salmon quantification with bootstraps
txi <- tximport(files, type="salmon", txOut=TRUE, dropInfReps=FALSE, varReduce=FALSE)
problem is the output lost the transcript names, just 1,2,3...for row names.
It was fine when I use the following command, without bootstraps
txi <- tximport(files, type="salmon", txOut=TRUE, dropInfReps=TRUE)
Has anyone had similar issue?

thelovelab / tximport Goto Github PK

tximport's Introduction

tximport

tximport's People

Contributors

Stargazers

Watchers

Forkers

tximport's Issues

Recommend Projects

Recommend Topics

Recommend Org