russel88 / DAtest
Compare different differential abundance and expression methods
License: GNU General Public License v3.0
Hi,
Is there a way to run one post-hoc analysis that includes both the predictor and a covariate?
I use:
filt.DA.full <- DA.aov(filt, predictor = "Group",covars = "Collection_day",allResults = T)
I tried
DA.TukeyHSD(filt.DA.full, variable = "Collection_day", covar = "Predictor")
but this just ignores the covar.
Best,
Martin
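Until something like that is supported directly, a base-R sketch of a post-hoc that includes both factors (this is outside DAtest; the data frame and column names below are simulated purely for illustration, mirroring the variables in the question):

```r
# Base-R sketch (not a DAtest call): a post-hoc over both factors.
# `df` is simulated here just so the example is self-contained.
set.seed(1)
df <- data.frame(
  abundance      = rnorm(24),
  Group          = rep(c("A", "B"), each = 12),
  Collection_day = rep(c("d1", "d2", "d3"), times = 8)
)
fit <- aov(abundance ~ Group * Collection_day, data = df)
# Tukey post-hoc for the main effects and the interaction:
TukeyHSD(fit, which = c("Group", "Collection_day", "Group:Collection_day"))
```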
Good afternoon :)
First of all, this is a really useful package. It's a bit overwhelming trying to identify which package to use for differential abundance analysis, and everything I read, and everyone I talk to, has a different opinion on what is best.
The list you have is amazing, and the addition of ANCOM and ANCOM-BC (for 16s data) would be really great.
https://github.com/FrederickHuangLin/ANCOMBC
If this is not feasible/has already been excluded for a reason I apologise.
Ben
+ Name/description of method: ANCOM-BC (and ANCOM)
(from the ANCOM-BC abstract) estimates the unknown sampling fractions and corrects the bias induced by their differences among samples. The absolute abundance data are modeled using a linear regression framework. This formulation makes a fundamental advancement in the field because, unlike the existing methods, it (a) provides a statistically valid test with appropriate p-values, (b) provides confidence intervals for the differential abundance of each taxon, (c) controls the False Discovery Rate (FDR), (d) maintains adequate power, and (e) is computationally simple to implement.
+ Links to publications:
ANCOM-BC = https://www.nature.com/articles/s41467-020-17041-7
ANCOM = https://www.tandfonline.com/doi/full/10.3402/mehd.v26.27663
+ Is the method already implemented in R? Yes.
+ Links to code repositories:
ANCOMBC = https://github.com/FrederickHuangLin/ANCOMBC
ANCOM = https://github.com/FrederickHuangLin/ANCOM
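For context, a hedged sketch of how ANCOM-BC is typically called via the ANCOMBC package linked above (argument names may differ across package versions; `physeq` is an assumed phyloseq object and "Group" an assumed sample_data column):

```r
# Sketch of an ANCOM-BC call, assuming the ANCOMBC package is installed.
# `physeq` (phyloseq object) and the "Group" variable are assumptions.
library(ANCOMBC)
out <- ancombc(phyloseq = physeq, formula = "Group",
               p_adj_method = "fdr", group = "Group", alpha = 0.05)
res <- out$res  # per-taxon effect sizes, p-values and adjusted p-values
```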
Hi everyone,
yes, this is not really an issue. I just wanted to ask if you have an idea of what the runtime of a normal analysis could be.
I started a test run (first time I'm trying this tool, so I'm not sure what I'm doing) with admittedly pretty big input data (2.5 million genes, 8 samples), and it is currently still running (22 h).
Is this normal/possible, or should I be worried?
The code and output are currently as follows, and according to top my R instance is still running:
> x <- read.delim("/exports/mm-hpc/bacteriologie/bastian/absolute_counts_per_gene.csv.only_save_patients_before_after.csv",row.names="contig",sep = "\t",header=TRUE)
> library(DAtest)
DAtest version 2.6.2
> mytest <- testDA(x, c(0,1,0,1,0,1,0,1),cores = 1)
Seed is set to 123
1722860 empty features removed
predictor is assumed to be a continuous/quantitative variable
  |                                                              |   0%
Estimating sequencing depths...
Resampling to get new data matrices...
perm= 1
perm= 2
[...]
perm= 99
perm= 100
Number of thresholds chosen (all possible thresholds) = 944
Getting all the cutoffs for the thresholds...
Getting number of false positives in the permutation...
Unrelated: Greetings to Martin M., from whose poster I saw the link :).
There is a work-around in this thread, maybe it should be the default for now?
Hi Russel,
Please have a look at the error information I pasted below; I cannot install DAtest using the command "install_github("Russel88/DAtest")". Could you please let me know how to fix it? Thanks.
Best regards,
Peng
install_github("Russel88/DAtest")
Downloading GitHub repo Russel88/DAtest@master
from URL https://api.github.com/repos/Russel88/DAtest/zipball/master
Installing DAtest
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL
'/private/var/folders/rg/2w1f9r6s6gv6fg1m3zj5xrl00000gn/T/Rtmp1Gw7Ah/devtools91d67aa0169/Russel88-DAtest-fb3a1af'
--library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' --install-tests
Dear all,
I am analyzing a mouse experiment in which we followed the gut microbiota over time. Mice (except a control group) received DSS to induce severe colitis, and the mice then received 4 different treatments (Fecal Microbiota Transplants) (+ some controls) in order to test whether they helped recovery.
I would like to test for OTUs that are influenced by treatment at all the time points. Hence, starting from a phyloseq object, I have started to use the following DAtest workflow:
physeq %>%
preDA(min.samples = 200,
min.reads = 10,
min.abundance = 0) %>%
DA.neb(predictor = "treatment",
paired = "mouse_label",
covars = c("day"),
allResults = TRUE,
relative = FALSE) -> results
Then, in order to get the comparisons for all the different treatment and time contrasts, I am exploring the results using:
results %>%
DA.lsmeans(variable = c("day", "predictor"))
or
emmeans(results[[i]],
pairwise ~ predictor | day,
adjust = "FDR", type = "response")$contrasts %>%
as.data.frame()
However, I am not sure this is correct, since according to the formula in DA.neb() the interaction is not taken into account, i.e. + is used instead of *:
tryCatch(fit <- lme4::glmer.nb(as.formula(paste("x ~ predictor+offset(log(libSize)) + (1|paired)+",
paste(names(covars), collapse = "+"), sep = "")),
...), error = function(x) fit <- NULL)
Do you know if this is correct? Would it be possible to add * in the formula?
If you have any suggestions to help solve my question, feel free!
Thanks a ton.
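Until such an option exists, one might fit the interaction model for a single feature outside DA.neb(). This is a hedged sketch, not the package's own code: `one_feature_df` (with columns `x`, `predictor`, `day`, `libSize`, `paired`) is an assumed per-feature data frame mirroring the variable names in the internals quoted above.

```r
# Sketch: fit predictor*day interaction manually with glmer.nb, since the
# DA.neb() internal formula joins covariates with `+`. The data frame
# `one_feature_df` and its column names are assumptions for illustration.
library(lme4)
fit <- glmer.nb(x ~ predictor * day + offset(log(libSize)) + (1 | paired),
                data = one_feature_df)
# Interaction-aware contrasts could then be extracted, e.g.:
# emmeans::emmeans(fit, pairwise ~ predictor | day, adjust = "fdr")
```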
Hi,
I am a semi-new user of the DAtest package, and I am a phyloseq user. I run testDA first to find the method that is best for my data (llm2 in this case). I use groupSig to determine the differences, but I am now having some issues with the DA.aov and DA.TukeyHSD functions.
I use the code: (BB_GF=phyloseq object)
results <- DA.lao(BB_GF, predictor="Mouse_classification", allResults = TRUE)
$OTU_166
Call:
aov(formula = as.formula(form))
Terms:
predictor Residuals
Sum of Squares 8.187160e-07 1.721421e-05
Deg. of Freedom 1 27
Residual standard error: 0.0007984756
Estimated effects may be unbalanced
#then use the TukeyHSD
res.tukey <- DA.TukeyHSD(results, variable = "predictor")
Error in `colnames<-`(`*tmp*`, value = rownames(as.data.frame(TukeyHSD(results[[k]], :
  attempt to set 'colnames' on an object with less than two dimensions
I hope you can understand my issue even though I have not explained it that well.
What am I doing wrong? (I know I am definitely doing something wrong). Actually I am feeling a bit stupid!
UpSetDA
propose implementation:
library(ComplexHeatmap)
UpSetDA <- function(x, tests = names(x$results), alpha = 0.01, ...) {
  # significant features per test, at the chosen alpha
  lt <- lapply(x$results[tests], function(result) result[result$pval.adj <= alpha, "Feature"])
  m <- ComplexHeatmap::make_comb_mat(lt)
  ComplexHeatmap::UpSet(m, ...,
                        top_annotation = ComplexHeatmap::upset_top_annotation(m, add_numbers = TRUE),
                        right_annotation = ComplexHeatmap::upset_right_annotation(m, add_numbers = TRUE))
}
(More?) Useful wherever vennDA might be used.
Example:
testa <- allDA(df, predictor = vec, relative = FALSE, core.check = FALSE)
compareTest <- c("lrm", "lim", "aov")
vennDA(testa, tests = compareTest, 0.001)
UpSetDA(testa, compareTest, 0.001)
Hello,
I want to test the difference between two species at the Phylum level. Based on their relative abundance plots, I do not expect them to be different. When I try to run DAtest at the Phylum level I get the following error:
Set to spike all features. Can't calculate FPR or AUC. Change k argument
In addition: Warning message:
In testDA(p1, predictor = "Species") : Dataset contains very few features
If I try this at Class or a different taxonomic level, I don't get the error. Should I assume there aren't differences at the Phylum level?
Thank you,
Mia
Hi,
Thank you for the great package.
What does it mean when all of the tests have a score of 0? That there are essentially no features that differ in abundance, so the function cannot calculate a score?
This has happened to me with two different datasets, and particularly since I started using dada2 and ASV methods, where far fewer features are shared among samples.
In one of those instances there was a clear difference, in the PCoA plot and the accompanying statistics, between the two treatments I was looking at. My question is: why does that matter? Wouldn't you be interested in which features differ between treatments precisely because you see differences in beta diversity? I'm missing why you shouldn't move forward with the analysis in that case. You have this listed under the assumptions in the introduction section of the GitHub README. It reads:
'Assumptions
The method assumes that most features are NOT associated with the predictor. It is therefore advised to first run an ordination (PCA/PCoA) or similar and if there is clear separation associated with the predictor, DAtest should not be used. In this case, only the False Positive Rate (FPR) can be used.'
What would you do with the FPR in that case? Would you use that to rank which test to use instead of the score?
In this last time I used DAtest, there was no clear separation in treatments in PCoA, but I still received a score of all 0s for the tests.
Thank you for any guidance.
-Carly
Describe the bug
Hello, a collaborator recommended I try the DAtest package for differential abundance analysis on some phyloseq data of mine. I'm still learning how to use the package, so the error is most likely due to a problem with how I've set up my data. When I try to run testDA() on one of my datasets, I get the error "Error in { : task 1 failed - "task 1 failed - "subscript out of bounds." This has happened both when I use the transposed OTU table and a vector of my sample group (i.e. plant tissue type) levels as the predictor, as well as when I use the source phyloseq object and the sample groups from my metadata. Any suggestions on how to resolve this problem are much appreciated!
To Reproduce
Here's the code I tried, and the resulting error: (csv file is attached for reproducibility)
seed.stigma.raw.asv.csv
seed.stigma.raw.asv <- otu_table(seed.stigma.raw.phy) %>% data.frame()
seed.stigma.raw.asv.t <- t(seed.stigma.raw.asv) %>% data.frame()
vec <- c("stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "seed_pooled",
"seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma",
"seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "seed_pooled", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma",
"stigma", "stigma", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "stigma", "stigma",
"stigma", "stigma", "stigma", "stigma", "stigma", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "seed_pooled", "seed_pooled")
test <- testDA(seed.stigma.raw.asv.t, predictor = vec)
test2 <- testDA(seed.stigma.raw.phy, predictor = "tissue_type")
options(error=recover)
Seed is set to 123
predictor is assumed to be a categorical variable with 2 levels: seed_pooled, stigma
|=================================================================================================| 100%
neb, per, adx, vli, qpo, poi, wil, ttt, ltt, ltt2, ds2, ds2x, lim, lli, lli2, lrm, llm, llm2 were excluded due to failure
Error in { : task 1 failed - "task 1 failed - "subscript out of bounds""
In addition: Warning messages:
1: In tests != c("lim", "lli", "lli2", "vli") :
longer object length is not a multiple of shorter object length
2: In tests != c("lim", "lli", "lli2", "vli") :
longer object length is not a multiple of shorter object length
Hello,
I have a question about the min.samples argument in the preDA function. I am fairly new to using DAtest and was wondering if there is a specific minimum number of samples a feature must be present in that is considered "good"? I was going through the supplementary script attached to your paper [Russel et al. 2018, DAtest: a framework for choosing differential abundance or expression method], and you chose these parameters for the preDA function: preDA(Sub, min.samples = 4, min.reads = 100). Why exactly were those parameters chosen? I am just trying to get an understanding so I can apply it to my own data.
Thank you so much for your help!!
Morgan
Hi @Russel88
I am a new R user. After performing a statistical significance test, I would like to show the p-values of the microbiota abundance graphs on top of the compared groups or near the figure legends. Could you please let me know how I can do this?
Thanks,
Saif
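One common approach (an assumption on my part, not a DAtest feature) is to annotate abundance boxplots with p-values via the ggpubr package; `plot_df` with columns `Group` and `Abundance` is hypothetical here:

```r
# Sketch: put a comparison p-value on an abundance boxplot with ggpubr.
# `plot_df` and its `Group`/`Abundance` columns are hypothetical names.
library(ggplot2)
library(ggpubr)
ggplot(plot_df, aes(x = Group, y = Abundance)) +
  geom_boxplot() +
  stat_compare_means(method = "wilcox.test")  # draws the p-value on the plot
```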
Hi,
I saw that it is recommended we provide either (1) raw counts & compositional data, or (2) externally normalized or absolute abundance data.
Perhaps I have been using the terminology differently this entire time, but my understanding is that raw counts are equivalent to absolute abundances. In particular, for microbiome data, where each sample represents a snapshot of the entire microbiome, a true "absolute abundance" is technically impossible to obtain. I have come to understand that the raw counts are considered the absolute abundances of the sample, and that relative abundance data represent proportion-normalized abundances.
Could I please get clarification on this? I'd like to make sure I am using the package correctly. Thanks a bunch for all your efforts on the convenience of this package!
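For concreteness, a minimal base-R sketch of the distinction being asked about: raw sequencing counts versus the proportion-normalized ("relative") abundances derived from them (toy numbers, invented for illustration):

```r
# Toy count matrix: 2 taxa x 2 samples (numbers invented for illustration).
counts <- matrix(c(10, 90, 40, 60), nrow = 2,
                 dimnames = list(c("taxon1", "taxon2"), c("s1", "s2")))
# Relative abundances: divide each column by its sample's total reads,
# so every sample (column) sums to 1.
rel <- sweep(counts, 2, colSums(counts), "/")
rel
```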
Trying to run t <- allDA(sub, predictor = conds), and it seems to get to 100% completed, but at the last moment I get the error: Error in tax_table(data, errorIfNULL = FALSE) : could not find function "tax_table"
By the way this is fantastic work thanks so much for this
Hi! Not an issue at all. I was just hoping to start a dialog on best practices. If there is a more appropriate venue for this discussion (e.g., a google group) please let me know.
In brief I am working with 16S rRNA data--5 host species, 10 individuals per species. I am interested in identifying differentially abundant OTUs across host species. I previously used LEfSe but am interested in exploring DAtest. I processed the data using dada2 and now have it in a phyloseq object.
I would really appreciate feedback on whether I am on the right track and how best to identify OTUs that are differentially abundant by host species.
Here is my DAtest workflow.
First I pre-process the data to reduce the number of features tested:
data.new <- preDA(gpt_gg_no_cyano, min.samples = 5, min.reads = 100, min.abundance = 0)
Then I use host species as the predictor to run the test.
mytest <- testDA(data.new, predictor = "Sp", effectSize = 20, out.all = TRUE)
Results:
summary(mytest)
Method                       AUC    FPR    Spike.detect.rate  Rank
DESeq2 man. geoMeans (ds2)   0.768  0.005  0.017              1.25
Log Linear reg. 2 (llm2)     0.690  0.019  0.025              1.50
DESeq2 (ds2x)                0.736  0.010  0.017              1.75
ANOVA (aov)                  0.732  0.023  0.017              2.25
Log LIMMA 2 (lli2)           0.674  0.024  0.017              3.25
Log ANOVA 2 (lao2)           0.663  0.027  0.008              5.00
Log LIMMA (lli)              0.574  0.033  0.000              6.75
Log ANOVA (lao)              0.569  0.034  0.000              7.50
Log Linear reg. (llm)        0.569  0.034  0.000              7.50
Kruskal-Wallis (kru)         0.500  0.021  0.000              8.25
The rest were NA.
I then select the top-ranked methods, llm2 and aov, and compare all 5 methods:
res.allData <- allDA(data.new, predictor = "Sp", paired = NULL, covars = NULL,
                     tests = c("ds2", "llm2", "aov", "lli2", "ds2x"), relative = TRUE,
                     cores = (detectCores() - 1), rng.seed = 123, p.adj = "fdr",
                     args = list(), out.all = NULL, alpha = 0.1, core.check = TRUE)
I end up with a table of p-values by feature. My thinking here is to select OTUs that are significant for several (>3?) of the 5 methods.
The other issue I am having is determining which host species the OTUs are enriched in. My understanding is that ds2 and ds2x only do one-vs-all comparisons? Is this correct?
Thanks!
Hello
I am trying to analyse RNA-seq data with the testDA function, but I get this output:
606 empty features removed
Running on 3 cores
predictor is assumed to be a categorical variable with 6 levels: ct1, ct2, ct3, rst1, rst2, rst3
Spikeing...
Error in rowSums(count_table[, predictor == 1]) :
'x' must be an array of at least two dimensions
Hi
Is it possible to do abundance comparisons between groups at a higher taxonomic rank (e.g. family) instead of at the level of specific OTUs?
Thanks, Ditte
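One way this is often done (a hedged sketch, assuming the data live in a phyloseq object named `physeq`): agglomerate the counts to the desired rank with phyloseq before handing them to DAtest.

```r
# Sketch: collapse counts to Family level before differential abundance testing.
# `physeq` (a phyloseq object with a tax_table containing a "Family" rank)
# is an assumption for illustration.
library(phyloseq)
physeq_fam <- tax_glom(physeq, taxrank = "Family")
# then, e.g.: testDA(physeq_fam, predictor = "Group")
```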
Hi Jakob,
Good afternoon. I have an issue running the aldex method; could you please let me know how to fix it? Please have a look at the error information:
results <- DA.adx(TS, predictor = "Genotype")
[1] "aldex.clr: generating Monte-Carlo instances and clr values"
[1] "operating in serial mode"
[1] "computing center with all features"
Error in validObject(.Object) :
invalid class “aldex.clr” object: invalid object for slot "conds" in class "aldex.clr": got class "factor", should be or extend class "character"
Best regards,
Peng
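The error message itself says the "conds" slot must be character rather than factor, so one hedged workaround (an assumption, untested against this dataset) is to coerce the predictor column to character before calling DA.adx:

```r
# Sketch: the aldex.clr error complains that "conds" is a factor but should
# be character. Coercing the predictor column first may avoid it.
# `TS` is assumed to be a phyloseq object, as in the call above.
library(phyloseq)
sample_data(TS)$Genotype <- as.character(sample_data(TS)$Genotype)
results <- DA.adx(TS, predictor = "Genotype")
```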
Hi, I am having trouble installing this on a Mac (Mojave) using RStudio 1.2.1335 and R v3.6. I think it has something to do with compilation errors.
I run into "library not found for -lgfortran" for either minqa or RcppEigen.
I am able to solve this by adding a Makevars file that points to a Homebrew installation of gcc.
However, when I do the Makevars workaround, I encounter a cross-compilation error for nloptr.
I hope you can advise me on this.
Hi,
I tried to install the package for different methods. But I got errors showing below.
package ‘testthat’ successfully unpacked and MD5 sums checked
Error: Failed to install 'DAtest' from GitHub:
  (converted from warning) cannot remove the prior installation of package ‘testthat’
I also tried to remove the package 'testthat', but I don't have it.
remove.packages("testthat")
Removing package from ‘C:/Users/lqwnc/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
I'm wondering, do you have other suggestions? Thank you very much!
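One hedged workaround for this kind of "cannot remove the prior installation" error: the old install often survives as a leftover 00LOCK directory in the library path, and removing it from a fresh R session (with no packages loaded) may clear the error. The path below is inferred from the log above and should be checked with .libPaths() first.

```r
# Sketch of a possible fix, not a verified solution for this machine.
# Run from a fresh R session so 'testthat' is not loaded anywhere.
.libPaths()  # confirm the actual library location first
unlink("C:/Users/lqwnc/Documents/R/win-library/4.0/00LOCK-testthat",
       recursive = TRUE)
install.packages("testthat")
devtools::install_github("Russel88/DAtest")
```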