thelovelab / tximport Goto Github PK
View Code? Open in Web Editor NEWTranscript quantification import for modular pipelines
Transcript quantification import for modular pipelines
Thank you for an eminent and very convenient software.
I'm importing kallisto data from h5 files and want output the transcripts. With Gencode as a reference this comes with an inconvenient transcript naming adding |-separated information after the transcript ID. Exporting genes, this is handled by the ignoreAfterBar flag, but this never takes effect when exporting transcripts:
https://github.com/mikelove/tximport/blob/master/R/tximport.R#L578-L595
Is this intentional?
Hi, new to bulkRNAseq analysis and a bit confused about different normalization measures that are used.
My understanding is that TPMs are fine for comparison of gene expression levels within a sample (so also for looking at correlation of expression between two genes in many samples separately) while for looking at fold change expression of list of genes of interest in two samples (e.g. pre and post treatment) the way to go is to transform the estimated counts from e.g. kallisto (imported with tximport using the CountsfromAbindance = "no" option) with the edgeR pipeline described in the tximport vignette. What confuses me is:
what is obtained at the end of the edgeR code described in the vignette are called CPMs (cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE) ) while to me these appear very similar to TMMs instead (since they are calculated accounting both for different average transcript length between samples and for composition biases between samples). Am I getting this wrong?
the vignette appears to suggest that processing the data with the estimated ("raw") counts with the edgeR pipeline has a similar effect as using scaledTPM or lengthscaledTPM but in fact what is obtained are very different counts. Quoting from one of your answer in (https://support.bioconductor.org/p/84883/): "These two have library size differences baked in. The column sum is equal to the number of mapped reads. So not comparable across samples." While as per 1) above I believe that the "CPMs" obtained with the edgeR vignette CAN be used for across sample comparison. Am I right?
In conclusion, what measure would you suggest to use for calculating logFC of a list of genes of interest between a pre and a post treatment sample? Is it the CPMs as obtained via the edgeR pipeline in the vignette or something else? And if the first, I guess those "CMPs" could be used also for looking at how pairs of genes correlate in different samples (one point per sample in a correlation plot)
Thank you very much in advance for any help tyou will ge able to give me.
Marco
I use the package called Tximport for long time to generate the tables I use for RNA-Seq. The code is quite simple. Here it is:
gene_id_import <- read.delim("gene_id_import.txt") # This is the file with the transcripts version and their respective gene names.
library(tximport)
filepath = "C:/ad4gy/MM"
files<- sapply(list.dirs(path = filepath, recursive = FALSE)[grep("_quant2TRIMbias",list.dirs(path = filepath, recursive = FALSE))],function(x) paste0(x,"/quant.sf"))# This command searchs for the quant.sf files inside each folder
#Tximport with length scaled TPM values as output (scaled up to library size and average transcript length over samples)
txi<- tximport(files= files, type= "salmon", tx2gene = gene_id_import, countsFromAbundance = "lengthScaledTPM")# When I run this line, now it’s giving the error:
reading in files with read_tsv
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid bytes in UTF8 string.
"seqBias": [], "": "–gcBias", "auxDir": "aux_in
(right here) ------^
#save your output file typing:
write.table(txi, file = "counts.txt", quote = TRUE, sep = "\t")
I have no idea about this error or how to correct it. Besides, it’s in the last line of the code. I noticed the mistake started after I upgrade to R4.0.3
Do you any idea how to fix it?
I appreciate all the help?
Regards,
I get the error Error in 1:nrow(m): argument of length 0
when trying to import either kallisto or sailfish output. Here's a reprex.
Load packages used here.
library(dplyr)
library(readr)
library(tximport)
library(annotables) # devtools::install_github(stephenturner/annotables)
To summarize transcripts to the gene level we'll need to use a transcript-to-gene annotation. I provide this in the annotables package (blog post; install from github with devtools::install_github("stephenturner/annotables")
). The grch38_gt
table contains mappings for 216133 transcripts to 65988 genes.
annotables::grch38_gt
## Source: local data frame [216,133 x 2]
##
## ensgene enstxp
## (chr) (chr)
## 1 ENSG00000210049 ENST00000387314
## 2 ENSG00000211459 ENST00000389680
## 3 ENSG00000210077 ENST00000387342
## 4 ENSG00000210082 ENST00000387347
## 5 ENSG00000209082 ENST00000386347
## 6 ENSG00000198888 ENST00000361390
## 7 ENSG00000210100 ENST00000387365
## 8 ENSG00000210107 ENST00000387372
## 9 ENSG00000210112 ENST00000387377
## 10 ENSG00000198763 ENST00000361453
## .. ... ...
Download data from here: https://dl.dropboxusercontent.com/u/66281/sailfish-txps.txt to data/
directory. Take a look at what the file looks like:
readLines("data/sailfish-txps.txt", n=20)
## [1] "# sailfish (quasi-mapping-based) v0.7.6"
## [2] "# [ program ] => sailfish "
## [3] "# [ command ] => quant "
## [4] "# [ index ] => { sailfish-grch38-index }"
## [5] "# [ libType ] => { IU }"
## [6] "# [ threads ] => { 3 }"
## [7] "# [ geneMap ] => { Homo_sapiens.GRCh38.82.gtf }"
## [8] "# [ mates1 ] => { SRR493366_1.fastq }"
## [9] "# [ mates2 ] => { SRR493366_2.fastq }"
## [10] "# [ output ] => { SRR493366-sfem }"
## [11] "# [ mapping rate ] => { 95.2197% }"
## [12] "# Name\tLength\tTPM\tNumReads"
## [13] "ENST00000415118\t8\t0\t0"
## [14] "ENST00000448914\t13\t0\t0"
## [15] "ENST00000631435\t12\t0\t0"
## [16] "ENST00000434970\t9\t0\t0"
## [17] "ENST00000632684\t12\t0\t0"
## [18] "ENST00000431440\t16\t0\t0"
## [19] "ENST00000454691\t18\t0\t0"
## [20] "ENST00000451044\t17\t0\t0"
Try importing:
sf <- tximport("data/sailfish-txps.txt", type="salmon", gene2tx=grch38_gt)
## Error in 1:nrow(m): argument of length 0
Hi,
I have a methodological question about how to deal with multispecies mapping and tximport.
I have RNA-seq samples that are a blend of 3 different species interacting with each other. So I performed the mapping with Salmon using an hybrid reference composed of the transcriptomes of those 3 species.
Now I wonder how to correctly use tximport on the outputs of Salmon in order to generate one counting matrix for each those 3 species. Because normalization according to the library size will be done for each species independently I was wondering if I should separate the salmon outputs according to the species before running tximport.
From the documentation of tximport, I should use "countsFromAbundance=no". This parameter will then use the NumReads column of the Salmon output which appears to partly rely on the relative abundance for each transcript. Is NumReads partly based on the library size ? If it's not I guess that dividing the results per species before or after tximport won't have any effect but if NumReads dependent of the library size then I should run tximport on the complete output of Salmon.
Thanks for your help,
Florian Rocher
I used this line of code to import RSEM ".gene.results" files:
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE),and it returns error:"all(c(geneIdCol, abundanceCol, lengthCol) %in% names(raw)) is not TRUE" ,
then i explicitly define the column names like in the line of code below:
txi.rsem <- tximport("ceshi2.tsv", type = "rsem", txIn = FALSE, txOut = FALSE, tx2gene = NULL, countsFromAbundance = "no", varReduce = FALSE, dropInfReps = FALSE, ignoreTxVersion = FALSE, geneIdCol = "gene_id", txIdCol = "transcript_id", abundanceCol = "TPM", countsCol = "expected_count", lengthCol = "effective_length")
it returns the same error.
my files are just like the picture :
Sorry, I have raw read counts from FeatureCounts, can I still use tximport to collapsing transcrip expression to gene level?
It would be nice to allow tximport
to optionally derive the abundance for each transcript in each sample by computing some summary statistic (e.g. median or mean) of the inferential replicates, if they are available. These can sometimes be more robust or accurate than individual point estimates, and if inferential replicates are already being read in and summarized to obtain variances, the overhead for doing what is suggested above should be marginal.
I am working with a large smart-seq2 dataset with 23k samples (preprocessed with Salmon).
I have seen tximport supports sparse matrices for single cell data, which is nice, but even with readr
it takes very long to parse the salmon output files.
I checked the code and it seems that importing simply runs in a loop. Do you think this could be easily replaced with e.g. mclapply?
Best,
Gregor
Hi!
I have recently updated Salmon to the latest version (0.14.0) and I'm not anymore able to import Salmon outputs into R using the code
tximport(file, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
This is the error that I get when I run the code above:
txi <- tximport(file, type = "salmon", txOut = TRUE, countsFromAbundance = "no")
reading in files with read_tsv
Error in readBin(bootCon, "integer", n = expected.n) :
invalid 'n' argument
Any suggestion?
Thanks
Edo
Dear all,
I struggle with the error
Error in 1:nrow(x) : argument of length 0
I was hoping that tximport can help here.
Specifically, I imported a shapefile for block groups that include census tracts for the US. I am trying to compute dissimilarity, spatial proximity, absolute clustering, eta2, and isolation indexes for each census tract based on group blocks using OasisR package. The code worked for larger units using a different shapefile (Metropolitan areas based on tracts). However, now I am receiving an error I mentioned above. Here is the full code:
data<-readOGR(dsn="/Users/16033/Desktop/Exp_shapefiles_dropped", layer="Export_Output_2000b")
library(OasisR)
x<-data@data
tractlist <- unique(x$gisjoin_t1)
ntract <- length(tractlist)
result <- as.data.frame(matrix(0, nrow = ntract, ncol = 6))
names(result) <- c("gisjoin_t1", "Dissim", "SpatProx","AbsClus","Eta2","Isol")
result$gisjoin_t1 <- tractlist
for (i in 1:ntract)
{
prov <- subset(x, gisjoin_t1 == tractlist[i])
result$Dissim[i] <- Atkinson(prov[,2:3])
}
After this command I am getting an error:
Error in 1:nrow(x) : argument of length 0
... Same for other indexes.
Here is the first 50 observation from my file
reconstructed <- structure(list(gisjoin_t1 = c(200130000100, 200130000100, 200130000100, 200160000100, 200160000200, 200160000200, 200200000101, 200200000101, 200200000101, 200200000102, 200200000102, 200200000102, 200200000102, 200200000201, 200200000201, 200200000202, 200200000202, 200200000202, 200200000202, 200200000203, 200200000203, 200200000203, 200200000203, 200200000203, 200200000204, 200200000204, 200200000300, 200200000400, 200200000500, 200200000500, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000600, 200200000701, 200200000701, 200200000701, 200200000701, 200200000702, 200200000702, 200200000702, 200200000703, 200200000703, 200200000703, 200200000703, 200200000703), black_n = c(18, 13, 14, 8, 127, 30, 9, 17, 30, 12, 6, 5, 0, 37, 57, 19, 22, 27, 24, 44, 25, 63, 44, 30, 29, 23, 1106, 814, 49, 92, 25, 129, 121, 146, 136, 104, 100, 120, 258, 109, 85, 153, 121, 166, 262, 106, 99, 173, 34, 197), mhwhite_n = c(231, 127, 163, 279, 949, 831, 342, 1561, 2197, 762, 1160, 845, 1147, 1307, 1130, 968, 1341, 1230, 1518, 1081, 1796, 2410, 1551, 1151, 1196, 884, 3498, 4909, 559, 477, 101, 335, 291, 341, 383, 295, 271, 221, 790, 630, 504, 487, 450, 963, 1283, 603, 355, 642, 458, 700), his_n = c(151, 59, 129, 22, 323, 228, 22, 59, 69, 16, 33, 28, 41, 64, 93, 50, 48, 49, 30, 30, 38, 129, 59, 55, 48, 40, 503, 475, 41, 181, 12, 49, 101, 115, 146, 47, 98, 109, 124, 70, 109, 66, 63, 83, 136, 51, 40, 78, 44, 70)), row.names = 0:49, class = "data.frame")
As far as I read, I need to post debunk=FALSE somewhere, but I am not sure where. Thank you in advance
Sylwia
When using tximport to import transcript level quantification with the end goal of performing DTU analysis which of the 4 count scaling methods would you recommend? (Asking since dtuScaledTPM was introduced after the Swimming downstream paper recommending "scaledTPM").
Trying to import sailfish results that aren't in the typical dir/quant.sf
and dir/stats.tsv
convention, because I've moved/renamed some things:
> sf <- tximport("data/sailfish-txps.txt", type="salmon", gene2tx=grch38_gt)
reading in files
1 Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'data/stats.tsv': No such file or directory
Coming from this bit of code:
tmp2 <- read.table(file.path(dirname(x), "stats.tsv"),
tximport is strictly looking in the same path as the results file for stats.tsv
. Suggest allowing explicitly specifying this file as an argument, but default back to this?
Hi,
I noticed something weird using tximport by summarizing at the gene level.
Depending on the number of samples I load I don't have the same length and counts for my genes in the corresponding samples.
Quant_Matrix_tximport=tximport(**AlltheFiles**,txIn = T, txOut = F, type="salmon", countsFromAbundance ="lengthScaledTPM", tx2gene = tx2gene)
Quant_Matrix_tximport1=tximport(**20FirstFiles**,txIn = T, txOut = F, type="salmon", countsFromAbundance ="lengthScaledTPM", tx2gene = tx2gene)
Example:
When all samples are load I have Sample1:Gene 1 --> Counts=386.3943113
When 20 samples are loaded I have Sample1:Gene1 --> Counts=383.4657441
I used "no" option insted to check. I did not see any differences at the count level but at the length of the genes level.
Is this something that is expected ?
Florian Rocher
Hello,
May I ask for release tags? Tximport is part of the pigx-rnaseq pipeline (https://github.com/BIMSBbioinfo/pigx_rnaseq/) that I have just packaged for Debian. With release tags we would be informed about updates. The version on BioConductor is 1.12.3, which we could also use if that is what you prefer.
Many thanks
Steffen
Tried with latest version of tximport
installed with devtools::install_github("mikelove/tximport")
1.13.3
tximport::tximport(file.path(alevin_dir, 'quants_mat.gz'), type = 'alevin')
Error in mat[, j] <- readBin(con, double(), endian = "little", n = num.genes) :
number of items to replace is not a multiple of replacement length
Probably related to this:
The binary output format of alevin, quants_mat.gz, has been changed into a sparse single precision format. In pratice we saw the file size reduced to as big as half the size of the original file.
It is great that the tximport
function allows users to supply their own importer function.
I keep most of my files on AWS S3. The aws.s3 package provides great functionality to access S3 objects, e.g. via the s3read_using
function.
So I tried to read my Salmon output files directly from AWS, using the following importer:
importer = aws.s3::s3read_using(FUN = readr::read_tsv)
Unfortunately, the tximport
functions refuses to run unless the file.exists
function returns TRUE
for all paths, so S3 paths are not valid.
https://github.com/mikelove/tximport/blob/9b9d3e6e0d0843cfec2048e5dbd0a8ddca189a71/R/tximport.R#L145
Perhaps checking whether a file.exists
could be made optional?
Thanks a lot for making this great package available!
Thomas
(Unsure if this idea belongs here, in the tximeta repo, or in the salmon repo. Apologies if I guessed wrong. Tagging @rob-p and @k3yavi for their thoughts).
I'm looking into using the alevin tool from salmon for processing 3'-tag scRNA-seq data (currently just in the planning stage for such a workflow).
In such a workflow I'll ultimately be wanting to get the gene-sample count matrix created by alevin into a SingleCellExperiment (ideally with all the metadata via the very awesome tximeta).
Looking at the alevin docs (https://salmon.readthedocs.io/en/latest/alevin.html#output):
by default alevin dumps a per-cell level gene-count matrix in a binary-compressed format with the row and column indexes in a separate file.
A typical run of alevin will generate 3 files:
quants_mat.gz – Compressed count matrix.
quants_mat_cols.txt – Column Header (Gene-ids) of the matrix.
quants_mat_rows.txt – Row Index (CB-ids) of the matrix.
Alevin can also dump the count-matrix in a human readable – comma-separated-value (CSV) format
It's probably worth noting that the CSV version has cells along the rows and genes along the columns (i.e. opposite of SingleCellExperiment).
So a few questions:
Thanks, I'm eager to hear your thoughts
Pete
I have use cases for building matrices of TPM/counts where the length(txId) != length(raw[[txIdCol]])
:
I've created a patch that allows the user to combine input files with a subset of common features. Maybe this is of general interest? Sorry about the random edits in the diff - if you update the repository and Github I could do a proper PR.
From 1cf014290593e9fceec4decfc364af9c34ecf905 Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 14:30:36 -0400
Subject: [PATCH 1/4] Match rows during Salmon import, supporting import of
sub/superset transcriptomes.
---
R/tximport.R | 75 ++++++++++++++++++++++++++++++------------------------------
1 file changed, 38 insertions(+), 37 deletions(-)
diff --git a/R/tximport.R b/R/tximport.R
index 93c1dc4..a80a4bc 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -4,7 +4,7 @@
#' external software and optionally summarizes abundances, counts, and transcript lengths
#' to the gene-level (default) or outputs transcript-level matrices
#' (see \code{txOut} argument).
-#' While \code{tximport} summarizes to the gene-level by default,
+#' While \code{tximport} summarizes to the gene-level by default,
#' the user can also perform the import and summarization steps manually,
#' by specifing \code{txOut=TRUE} and then using the function \code{summarizeToGene}.
#' Note however that this is equivalent to \code{tximport} with
@@ -17,7 +17,7 @@
#' \item avoid gene-level summarization by specifying \code{txOut=TRUE}
#' \item set \code{geneIdCol} to an appropriate column in the files
#' }
-#'
+#'
#' See \code{vignette('tximport')} for example code for generating a
#' \code{tx2gene} data.frame from a \code{TxDb} object.
#' Note that the \code{keys} and \code{select} functions used
@@ -41,16 +41,16 @@
#' @param countsFromAbundance character, either "no" (default), "scaledTPM", or "lengthScaledTPM",
#' for whether to generate estimated counts using abundance estimates scaled up to library size
#' (scaledTPM) or additionally scaled using the average transcript length over samples and
-#' the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM,
+#' the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM,
#' then the counts are no longer correlated with average transcript length,
#' and so the length offset matrix should not be used.
#' @param tx2gene a two-column data.frame linking transcript id (column 1) to gene id (column 2).
-#' the column names are not relevant, but this column order must be used.
+#' the column names are not relevant, but this column order must be used.
#' this argument is required for gene-level summarization for methods
#' that provides transcript-level estimates only
#' (kallisto, Salmon, Sailfish)
#' @param reader a function to replace read.delim in the pre-set importer functions,
-#' for example substituting read_tsv from the readr package will substantially
+#' for example substituting read_tsv from the readr package will substantially
#' speed up tximport
#' @param geneIdCol name of column with gene id. if missing, the gene2tx argument can be used
#' @param txIdCol name of column with tx id
@@ -66,7 +66,7 @@
#' (default FALSE)
#' @param txi list of matrices of trancript-level abundances, counts, and
#' lengths produced by \code{tximport}, only used by \code{summarizeToGene}
-#'
+#'
#' @return a simple list with matrices: abundance, counts, length.
#' A final element 'countsFromAbundance' carries through
#' the character argument used in the tximport call.
@@ -77,20 +77,20 @@
#' @describeIn tximport Import tx-level quantifications and summarize
#' abundances, counts and lengths to gene-level (default)
#' or simply output tx-level matrices
-#'
+#'
#' @references
#'
#' Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015):
#' Differential analyses for RNA-seq: transcript-level estimates
#' improve gene-level inferences. F1000Research.
#' \url{http://dx.doi.org/10.12688/f1000research.7563.1}
-#'
+#'
#' @examples
#'
#' # load data for demonstrating tximport
#' # note that the vignette shows more examples
#' # including how to read in files quickly using the readr package
-#'
+#'
#' library(tximportData)
#' dir <- system.file("extdata", package="tximportData")
#' samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
@@ -124,7 +124,7 @@ tximport <- function(files,
countsFromAbundance <- match.arg(countsFromAbundance, c("no","scaledTPM","lengthScaledTPM"))
stopifnot(all(file.exists(files)))
if (!txIn & txOut) stop("txOut only an option when transcript-level data is read in (txIn=TRUE)")
-
+
# kallisto presets
if (type == "kallisto") {
geneIdCol="gene_id"
@@ -134,7 +134,7 @@ tximport <- function(files,
lengthCol <- "eff_length"
importer <- reader
}
-
+
# salmon/sailfish presets
if (type %in% c("salmon","sailfish")) {
geneIdCol="gene_id"
@@ -142,9 +142,9 @@ tximport <- function(files,
abundanceCol <- "TPM"
countsCol <- "NumReads"
lengthCol <- "EffectiveLength"
- importer <- function(x) reader(x, comment='#')
+ importer <- function(x) reader(x, comment='#')
}
-
+
# rsem presets
if (type == "rsem") {
txIn <- FALSE
@@ -154,11 +154,11 @@ tximport <- function(files,
lengthCol <- "effective_length"
importer <- reader
}
-
+
if (type == "cufflinks") {
stop("reading from collated files not yet implemented")
}
-
+
# if input is tx-level, need to summarize abundances, counts and lengths to gene-level
if (txIn) {
message("reading in files")
@@ -171,7 +171,7 @@ tximport <- function(files,
if ((i == 1) &
(type %in% c("salmon","sailfish")) &
!("EffectiveLength" %in% names(raw))) {
- lengthCol <- "Length"
+ lengthCol <- "Length"
# because the comment lines have the same comment character
# as the header, need to name the column names
importer <- function(x) {
@@ -183,7 +183,7 @@ tximport <- function(files,
raw <- as.data.frame(importer(files[i]))
}
#####################################################################
-
+
# does the table contain gene association or was an external tx2gene table provided?
if (is.null(tx2gene) & !txOut) {
# e.g. Cufflinks includes the gene ID in the table
@@ -204,11 +204,11 @@ tximport <- function(files,
}
} else {
# e.g. Salmon and kallisto do not include the gene ID, need an external table
- stopifnot(all(c(lengthCol, abundanceCol) %in% names(raw)))
+ stopifnot(any(c(lengthCol, abundanceCol) %in% names(raw)))
if (i == 1) {
txId <- raw[[txIdCol]]
} else {
- stopifnot(all(txId == raw[[txIdCol]]))
+ stopifnot(any(txId == raw[[txIdCol]]))
}
}
# create empty matrices
@@ -220,9 +220,11 @@ tximport <- function(files,
countsMatTx <- mat
lengthMatTx <- mat
}
- abundanceMatTx[,i] <- raw[[abundanceCol]]
- countsMatTx[,i] <- raw[[countsCol]]
- lengthMatTx[,i] <- raw[[lengthCol]]
+ idx <- match(txId, raw[[txIdCol]])
+ abundanceMatTx[,i] <- raw[idx, abundanceCol]
+ countsMatTx[,i] <- raw[idx, countsCol]
+ lengthMatTx[,i] <- raw[idx, lengthCol]
+ lengthMatTx[,i] <- raw[idx, lengthCol]
}
message("")
@@ -236,12 +238,12 @@ tximport <- function(files,
txi[["countsFromAbundance"]] <- NULL
txiGene <- summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance)
- return(txiGene)
-
+ return(txiGene)
+
# e.g. RSEM already has gene-level summaries
# just combine the gene-level summaries across files
} else {
-
+
message("reading in files")
for (i in seq_along(files)) {
message(i," ",appendLF=FALSE)
@@ -259,7 +261,7 @@ tximport <- function(files,
countsMat[,i] <- raw[[countsCol]]
lengthMat[,i] <- raw[[lengthCol]]
}
- }
+ }
message("")
return(list(abundance=abundanceMat, counts=countsMat, length=lengthMat,
countsFromAbundance="no"))
@@ -283,11 +285,11 @@ summarizeToGene <- function(txi,
abundanceMatTx <- txi$abundance
countsMatTx <- txi$counts
lengthMatTx <- txi$length
-
+
txId <- rownames(abundanceMatTx)
stopifnot(all(txId == rownames(countsMatTx)))
stopifnot(all(txId == rownames(lengthMatTx)))
-
+
# need to associate tx to genes
# potentially remove unassociated transcript rows and warn user
if (!is.null(tx2gene)) {
@@ -309,20 +311,20 @@ summarizeToGene <- function(txi,
txId <- txId[sub.idx]
geneId <- tx2gene$gene[match(txId, tx2gene$tx)]
}
-
+
# summarize abundance and counts
message("summarizing abundance")
abundanceMat <- rowsum(abundanceMatTx, geneId)
message("summarizing counts")
countsMat <- rowsum(countsMatTx, geneId)
message("summarizing length")
-
- # the next lines calculate a weighted average of transcript length,
+
+ # the next lines calculate a weighted average of transcript length,
# weighting by transcript abundance.
# this can be used as an offset / normalization factor which removes length bias
# for the differential analysis of estimated counts summarized at the gene level.
weightedLength <- rowsum(abundanceMatTx * lengthMatTx, geneId)
- lengthMat <- weightedLength / abundanceMat
+ lengthMat <- weightedLength / abundanceMat
# pre-calculate a simple average transcript length
# for the case the abundances are all zero for all samples.
@@ -332,14 +334,14 @@ summarizeToGene <- function(txi,
aveLengthSampGene <- tapply(aveLengthSamp, geneId, mean)
stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat)))
-
+
# check for NaN and if possible replace these values with geometric mean of other samples.
# (the geometic mean here implies an offset of 0 on the log scale)
- # NaN come from samples which have abundance of 0 for all isoforms of a gene, and
+ # NaN come from samples which have abundance of 0 for all isoforms of a gene, and
# so we cannot calculate the weighted average. our best guess is to use the average
# transcript length from the other samples.
lengthMat <- replaceMissingLength(lengthMat, aveLengthSampGene)
-
+
if (countsFromAbundance != "no") {
countsSum <- colSums(countsMat)
if (countsFromAbundance == "lengthScaledTPM") {
@@ -350,7 +352,7 @@ summarizeToGene <- function(txi,
newSum <- colSums(newCounts)
countsMat <- t(t(newCounts) * (countsSum/newSum))
}
-
+
return(list(abundance=abundanceMat, counts=countsMat, length=lengthMat,
countsFromAbundance=countsFromAbundance))
}
@@ -383,4 +385,3 @@ replaceMissingLength <- function(lengthMat, aveLengthSampGene) {
## dimnames=list(levels(f), colnames(m)))
## }
## }
-
--
2.5.4 (Apple Git-61)
From 26ce2bd0d703df9d210aff0115e0888c5343b1d2 Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 14:41:25 -0400
Subject: [PATCH 2/4] Test for any membership.
---
R/tximport.R | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/R/tximport.R b/R/tximport.R
index a80a4bc..d01bd68 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -208,7 +208,7 @@ tximport <- function(files,
if (i == 1) {
txId <- raw[[txIdCol]]
} else {
- stopifnot(any(txId == raw[[txIdCol]]))
+ stopifnot(any(txId %in% raw[[txIdCol]]))
}
}
# create empty matrices
--
2.5.4 (Apple Git-61)
From 005267ccf9c4df054c2b92a034b7bea12a54c6ee Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 15:13:19 -0400
Subject: [PATCH 3/4] Add argument exposing allowMissingIds.
---
R/tximport.R | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/R/tximport.R b/R/tximport.R
index d01bd68..2fc53c6 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -118,7 +118,8 @@ tximport <- function(files,
lengthCol,
importer,
collatedFiles,
- ignoreTxVersion=FALSE) {
+ ignoreTxVersion=FALSE,
+ allowMissingIds=FALSE) {
type <- match.arg(type, c("none","kallisto","salmon","sailfish","rsem"))
countsFromAbundance <- match.arg(countsFromAbundance, c("no","scaledTPM","lengthScaledTPM"))
@@ -159,6 +160,9 @@ tximport <- function(files,
stop("reading from collated files not yet implemented")
}
+ # choose whether to allow missing transcripts
+ if (allowMissingIds) { allAny <- any } else { allAny <- all }
+
# if input is tx-level, need to summarize abundances, counts and lengths to gene-level
if (txIn) {
message("reading in files")
@@ -196,19 +200,20 @@ tximport <- function(files,
")
}
- stopifnot(all(c(lengthCol, abundanceCol) %in% names(raw)))
+ stopifnot(allAny(c(lengthCol, abundanceCol) %in% names(raw)))
if (i == 1) {
geneId <- raw[[geneIdCol]]
} else {
- stopifnot(all(geneId == raw[[geneIdCol]]))
+ stopifnot(allAny(geneId %in% raw[[geneIdCol]]))
}
} else {
# e.g. Salmon and kallisto do not include the gene ID, need an external table
- stopifnot(any(c(lengthCol, abundanceCol) %in% names(raw)))
+ stopifnot(allAny(c(lengthCol, abundanceCol) %in% names(raw)))
if (i == 1) {
txId <- raw[[txIdCol]]
} else {
- stopifnot(any(txId %in% raw[[txIdCol]]))
+ stopifnot(allAny(txId %in% raw[[txIdCol]]))
+ idx <- match(txId, raw[[txIdCol]])
}
}
# create empty matrices
@@ -220,7 +225,6 @@ tximport <- function(files,
countsMatTx <- mat
lengthMatTx <- mat
}
- idx <- match(txId, raw[[txIdCol]])
abundanceMatTx[,i] <- raw[idx, abundanceCol]
countsMatTx[,i] <- raw[idx, countsCol]
lengthMatTx[,i] <- raw[idx, lengthCol]
--
2.5.4 (Apple Git-61)
From 2870a1d82d1ffc4a3a576d1fe8bcb19e65cff2d0 Mon Sep 17 00:00:00 2001
From: Matt Shirley <[email protected]>
Date: Sun, 10 Apr 2016 15:20:04 -0400
Subject: [PATCH 4/4] Fix index generation logic.
---
R/tximport.R | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/R/tximport.R b/R/tximport.R
index 2fc53c6..7816f74 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -162,7 +162,7 @@ tximport <- function(files,
# choose whether to allow missing transcripts
if (allowMissingIds) { allAny <- any } else { allAny <- all }
-
+
# if input is tx-level, need to summarize abundances, counts and lengths to gene-level
if (txIn) {
message("reading in files")
@@ -213,7 +213,6 @@ tximport <- function(files,
txId <- raw[[txIdCol]]
} else {
stopifnot(allAny(txId %in% raw[[txIdCol]]))
- idx <- match(txId, raw[[txIdCol]])
}
}
# create empty matrices
@@ -225,6 +224,7 @@ tximport <- function(files,
countsMatTx <- mat
lengthMatTx <- mat
}
+ idx <- match(txId, raw[[txIdCol]])
abundanceMatTx[,i] <- raw[idx, abundanceCol]
countsMatTx[,i] <- raw[idx, countsCol]
lengthMatTx[,i] <- raw[idx, lengthCol]
--
2.5.4 (Apple Git-61)
I just wanted to indicate, that the currently linked bioconductor vignette is not the same as the tximport.Rmd in this repository.
Especially the edgeR-part in the vignette seems to be outdated in comparison to the .Rmd file (the changes were committed before the 26th of June, which is the date of the linked vignette).
Or am I missing something?
Hi, I need to figure out which approach is more appropriate regarding filtering lowly expressed genes. According to tximport manual, it is recommended to follow following commands for EdgeR analysis:
library(edgeR)
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)
and to continue with y as a DGE object. In my analysis I filtered out the lowly expressed genes based on the cpm value (for instance, cpm value is greater than 1 in at least the number of small group of samples) using "keep.lib.sizes=FALSE" after doing above mentioned normalization.
I am now confused if my approach is appropriate and if I should do the normalization after filtering?
Thanks for your help.
Best,
It seems that ignoreAfterBar
is not used if transcript-level output is requested. Am I overlooking a reason why this should not be supported (i.e., return only the part of the transcript ID before the first |)?
Hello! I was using the following command to import salmon quantification with bootstraps
txi <- tximport(files, type="salmon", txOut=TRUE, dropInfReps=FALSE, varReduce=FALSE)
problem is the output lost the transcript names, just 1,2,3...for row names.
It was fine when I use the following command, without bootstraps
txi <- tximport(files, type="salmon", txOut=TRUE, dropInfReps=TRUE)
Has anyone had similar issue?
I wanted to use tximport to read in a bunch of sailfish output files then write out abundances/counts for both genes and transcripts. I ended up doing the import twice, once for txOut=TRUE and once where it was FALSE. Would be nice to allow usage where txOut=FALSE, then apply a later function that would perform the summarization as a separate step.
e.g.
txdata <- tximport(files, txOut=FALSE)
gndata <- txGeneSum(txdata)
I quantified bulk RNA-seq with kallisto quant with default arguments (therefore, no bootstraps) and ten threads.
kallisto quant --threads 10 --index {input.idx} --output-dir {output.kout} {input.fqs}
I then imported the h5 files via tximport.
tximport("abundance.h5", tx2gene=tx2gene, type="kallisto", countsFromAbundance="scaledTPM")
The results of tximport are saved to TSV. If I repeat this process again with no changes to get another TSV, I find that the two TSVs are different.
$ md5sum scounts.tsv.gz old.scounts.tsv.gz
35c170054847cd6af28d35262022fc85 scounts.tsv.gz
1c5b5477e03c64f298363a48baecacdb old.scounts.tsv.gz
$ zcat scounts.tsv.gz | sort | md5sum
c661db690d5efd537745312044f334df -
$ zcat old.scounts.tsv.gz | sort | md5sum
4eb4196d95072574eb8ab5c6fb04106e -
If I set kallisto to single-threaded execution, making no other changes, I get a deterministic result: the same TSV every run.
I report this issue here rather than with kallisto, because kallisto authors have already responded to threading and determinism here: pachterlab/kallisto#236 (comment). They say that the multi-threaded kallisto output only looks different due to randomness in which threads finish first.
Plans to move the Alevin import code to C++, big improvements in speed shown by @k3yavi
Code will probably go into fishpond as it is designed to be a package for Salmon to Bioc methods and utilities.
Hi, @mikelove
I tried to perform tximport against the result files of salmon,
but it seems not properly be performed.
library("tximport")
srr <- paste0("SRR", 3670977:3670992)
file_salmon_options <- file.path("analysis/salmon_options", srr, "quant.sf")
names(file_salmon_options) <- srr
txi.salmon_options <- tximport(
file_salmon_options, type="salmon", txOut=TRUE)
# reading in files with read_tsv
# 1 Error in readBin(bootCon, "integer", n = expected.n) :
# invalid 'n' argument
The job of salmon is performed with the option "--numBootstraps 100",
and I think this part might be related to the error,
because the error is related to a metadata file (meta_info.json),
which is saved in the directory containing the result of salmon,
and specifying the option changes the value of "num_bootstraps" as 100.
https://github.com/mikelove/tximport/blob/8373c95d36fd342f2f53598ef6bc984021a982b2/R/infReps.R#L75
Could you confirm that this error is reproduced in your environment?
The script of the salmon I performed is this:
#!/bin/bash
srrs=({77..92})
for srr in ${srrs[@]}; do
tools/salmon-latest_linux_x86_64/bin/salmon \
quant \
-i data/transcripts_index_salmon \
-p 4 \
-l A \
--numBootstraps 100 \
--seqBias \
--gcBias \
--posBias \
-r data/SRR36709${srr}.fastq.gz \
-o analysis/salmon_options/SRR36709${srr}
done
and the results of salmon I performed is here↓
https://www.dropbox.com/s/qw495oagjj5wb8l/analysis.zip?dl=0
The result of sessionInfo is as follows.
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /home/koki/Software/R-3.5.0/lib/libRblas.so
LAPACK: /home/koki/Software/R-3.5.0/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.3.1 jsonlite_1.6 tximport_1.8.0
loaded via a namespace (and not attached):
[1] compiler_3.5.0 R6_2.4.0 hms_0.4.2 tools_3.5.0
[5] rhdf5_2.24.0 pillar_1.4.1 tibble_2.1.3 Rcpp_0.12.19
[9] crayon_1.3.4 pkgconfig_2.0.2 rlang_0.3.0.1 Rhdf5lib_1.2.1
Stupidly made a PR against the BioC mirror https://github.com/Bioconductor-mirror/tximport/pull/1. Same line doesn't seem to exist in vignette in this repo, so unsure if advice not updated or has been removed
Hi,mikelove
I used the kallisto for RNA-seq, then import the result "*.h5" by the tximport tools, but I get the following error:
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.
I am not sure if this is an issue with tximport or with my own data.
Currently the check for the presence of rhdf5
precedes the check of whether there is actually an abundance.h5
file to read (in readInfRepKallisto
). This might cause errors in situations where one has only the abundance.tsv
file, rhdf5
is not available, and one doesn't set dropInfReps=TRUE
.
this line won't work, number of lines depends on call
https://github.com/mikelove/tximport/blob/master/R/tximport.R#L82
if users don't provide a tx2gene
argument, this could either be a mistake -- they didn't know they needed to tell the function which tx belong to which genes -- or they could want to have tx-level output, the current warning is:
Error: all(c(geneIdCol, lengthCol, abundanceCol) %in% names(raw)) is not TRUE
this could say, either provide tx2gene
or specify txOut=TRUE
Alevin returns a whitelist.txt
file - add option to import only barcodes listed therein.
I used this line of code to process RSEM ".isoforms.results" files:
txi.RSEM <- tximport(files, type = "rsem", reader = read_tsv, txIn = TRUE, txOut = TRUE)
The rownames of the resulting list objects, are gene ID's, which is not what I expected, since I put the txOut option to true.
When I explicitly define the column names like in the line of code below, it works like expected (the rownames of the resulting list objects are now transcript ID's):
txi.RSEM <- tximport(files, type = "none", geneIdCol = "gene_id", txIdCol = "transcript_id",
abundanceCol = "TPM", countsCol = "expected_count", lengthCol = "length", reader =
read_tsv, txIn = TRUE, importer = read.delim, txOut = TRUE)
instead of using the length column in quant.sf which is actual tx length, use the effective length in stats.tsv
Hi! I just want to import the kallisto abundance.h5/abundance.tsv by tximport, but I failed.
~/RNA-seq/Analysis/GSE112055/quant_out/SRR6868519]$ cat run_info.json
{
"n_targets": 141398,
"n_bootstraps": 100,
"n_processed": 59161050,
"n_pseudoaligned": 50375628,
"n_unique": 18126755,
"p_pseudoaligned": 85.1,
"p_unique": 30.6,
"kallisto_version": "0.46.2",
"index_version": 10,
"start_time": "Mon May 4 17:44:24 2020",
"call": "kallisto quant -i /home2/ymwang/linqin/RNA-seq/fasta/release99/transcripts_Mus.idx -o quant_out/SRR6868519 -t 24 -b 100 download_GSE112055/SRR6868519.sra_1.fastq download_GSE112055/SRR6868519.sra_2.fastq"
}
~/RNA-seq/Analysis/GSE112055/quant_out/SRR6868519]$ head abundance.tsv
target_id length eff_length est_counts tpm
ENSMUST00000177564.1 16 8.71429 0 0
ENSMUST00000196221.1 9 8.66667 0 0
ENSMUST00000179664.1 11 8.25 0 0
ENSMUST00000178537.1 12 9.25 0 0
ENSMUST00000178862.1 14 8 0 0
ENSMUST00000179520.1 11 8.25 0 0
ENSMUST00000179883.1 16 8.71429 0 0
ENSMUST00000195858.1 10 9.66667 0 0
ENSMUST00000179932.1 12 9.25 0 0
> library(tximport)
> library(rhdf5)
> sample_names = c("SRR6868519",paste0("SRR686852", 0:8))
> quant_files = file.path("/home2/ymwang/linqin/RNA-seq/Analysis/GSE112055/quant_out", sample_names, "abundance.h5")
> file.exists(quant_files)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> txi = tximport(files = quant_files, type = "kallisto", txOut = TRUE)
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate
> quant_files = file.path("/home2/ymwang/linqin/RNA-seq/Analysis/GSE112055/quant_out", sample_names, "abundance.tsv")
> file.exists(quant_files)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> txi = tximport(files = quant_files, type = "kallisto", txOut = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate
In addition: Warning messages:
1: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
2: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
>
Then, I reproduced the example(https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html). Interestingly, it had different results for two kallisto files in package tximportData.
#kallisto_boot dir
> files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5")
> names(files) <- paste0("sample", 1:6)
> all(file.exists(files))
[1] TRUE
> txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate
In addition: Warning messages:
1: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
2: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
> files <- file.path(dir, "kallisto_boot", samples$run, "abundance.tsv.gz")
> names(files) <- paste0("sample", 1:6)
> all(file.exists(files))
[1] TRUE
> txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
Error: Unable to read dataset.
Not all required filters available.
Missing filters: deflate
In addition: Warning messages:
1: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
2: In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
#kallisto dir
> files <- file.path(dir, "kallisto", samples$run, "abundance.tsv.gz")
> file.exists(files)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
> txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = TRUE)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6
summarizing abundance
summarizing counts
summarizing length
Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
> txi.kallisto.tsv <- tximport(files, type = "kallisto", txOut=T)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1 2 3 4 5 6
Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
What's the problem with it? does tximport need special outputs from kallisto with special arguments?
If you have a question about usage of the software, do not post here, instead post to the Bioconductor support site:
https://support.bioconductor.org
This GitHub repo is to facilitate browsing of the code, and pull requests. For non-minor pull requests, it's a very good idea to first post to the support site link above, where we can discuss beforehand. I probably will not accept major PR that haven't been discussed beforehand, because I would need to have time to ensure that it won't cause unintended consequences on the user base.
I was wondering whether it would possible to support sparse matrices as output of tximport.
This might make tximport more attractive for people working with scRNA-seq data.
I'm currently running into problems when importing >400 single cells because of memory limits.
Since the memory issues occur while importing the cells, converting the data to sparse format based on the SummarizedExperiment
would not help because I can't get up to that point yet.
A current solution is to import the data in parts, and convert those to sparse matrices.
The rsem table doesn't have integers so it is common to use tximport before running differential gene expression downstream analysis. I am wondering what tximport do in rsem table? what kind of modification?
I am trying to use SARTools (deseq2 method) to get my differential expressed genes but the input count table needs to be an integer so I was thinking of just using round() as I am not sure what tximport modification is doing for the rsem table.
I am working with public data that's been quantified (using kallisto
) on the transcript level. However, I am rather interested in a gene-level analysis and was hoping to aggregate the data to the gene level. I only have the processed transcript-level estimated counts matrix available, and an object to link transcripts to genes. I should also be able to get the transcript lengths.
However, I did not find documentation on how to do the aggregation using tximport
if we can't get our hands on the quant files. Suppose we have the count matrix, tx2gene object and the transcript lengths, is it recommended to try and recreate the list structure created as output from tximport
?
I can't find a way to do the featureCounts tximport. I have this error although I chose countsFromAbundance = "no".
# path <- file.path(targets$sample_label, "abundance.tsv")
# all(file.exists(path))
#
# Tx_1 <- roslin.annotation[, c(8, 24)]
# # colnames(Tx_1)
# names(Tx_1) <- c("target_id", "gene_name")
# Tx_1 <- as_tibble(Tx_1)
gene_counts <- tximport(path,
type = "none",
tx2gene = Tx_1,
txOut = FALSE,
countsFromAbundance = "no",
importer = read_tsv, # did this as I was getting an error > reading in files with read_tsv
1 Error in importer(files[i]) : could not find function "importer"
countsCol = "counts",
lengthCol = "Length",
txIdCol = "target_id",
ignoreTxVersion = TRUE,
ignoreAfterBar = TRUE)
Rows: 45713 Columns: 3
── Column specification ────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): target_id
dbl (2): Length, counts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Error in tximport(path, type = "none", tx2gene = Tx_1, txOut = FALSE, :
argument "abundanceCol" is missing, with no default
Since the DESeqDataSetFromTximport
function is only available in the development version of DESeq2, does it make sense to move the function into tximport until the next DESeq2 stable release?
Hi
I'm working with some Salmon files using mouse RNA-Seq. I have followed the steps of the vignette and it seems to work, until I reach the tximport step when I run the following command:
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, reader = read_tsv)
I get the following error:
Error: all(file.exists(files)) is not TRUE
I am not sure if this is an issue with tximport or with my own data.
My tx2gene looks like this:
TXNAME,GENEID
ENSMUSG00000064372,mt-Tp
ENSMUSG00000064371,mt-Tt
ENSMUSG00000064370,mt-Cytb
ENSMUSG00000064369,mt-Te
and my files looks like this:
sample_1
"/Documents/data/samples/deconv//salmon/sample_1/quant.genes.sf"/Documents/data/samples/deconv//salmon/sample_2/quant.genes.sf"
sample_2
"
sample_3
"~/Documents/data/samples/deconv//salmon/sample_3/quant.genes.sf"
Thanks in advance for pointing out to a solution or a glaring error from my side.
Regards
Desired behavior is to return FPKMs when abundanceCol='FPKM' and throw an error if an invalid value is set - TPMs are returned regardless of its value currently without a warning to the user.
e.g.
tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE, abundanceCol = 'FPKM')
tximport_1.18.0
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.