russel88 / datest Goto Github PK

Compare different differential abundance and expression methods

License: GNU General Public License v3.0

R 100.00%

microbiome rnaseq relative-abundances compare-methods differential-abundance-methods otu marker-gene-analysis 16s proteomics metabolomics

datest's Introduction

DAtest

DAtest is a package for comparing different differential abundance methods used in microbial marker-gene (e.g. 16S rRNA), RNA-seq and protein/metabolite abundance analysis.

There are many methods for testing differential abundance and no gold standard, but results can vary a lot between the different statistical methods. This package aims at aiding the analyst in choosing a method for a specific dataset based on empirical testing.

Check the wiki for details on installation and usage of this package

datest's People

Contributors

Stargazers

Watchers

Forkers

askerdb msmortensen bsalehe ssmufer liguangshuo crlero khemlalnirmalkar marcelladane

datest's Issues

Preform comparisons on rank (i.e. family) level instead of OTU

Is it possible to do abundance comparisons between groups on rank (i.e. family) level instead of specific OTU.

Thanks, Ditte

Addition of ANCOM and ANCOM-BC

Good afternoon :)

First of all this is a really useful package. Its a bit overwhelming trying to identify which "package" to use for differential abundance analysis and everything I read/people I talk to have different opinions on what is the best.

The list you have is amazing, and the addition of ANCOM and ANCOM-BC (for 16s data) would be really great.
https://github.com/FrederickHuangLin/ANCOMBC

If this is not feasible/has already been excluded for a reason I apologise.

Ben

+Name/description of method
(FROM ABSTRACT IF PAPER) estimates the unknown sampling fractions and corrects the bias induced by their differences among samples. The absolute abundance data are modeled using a linear regression framework. This formulation makes a fundamental advancement in the field because, unlike the existing methods, it (a) provides statistically valid test with appropriate p-values, (b) provides confidence intervals for differential abundance of each taxon, (c) controls the False Discovery Rate (FDR), (d) maintains adequate power, and (e) is computationally simple to implement.

+Link to publication if it exists -
ANCOM-BC = https://www.nature.com/articles/s41467-020-17041-7
ANCOM = https://www.tandfonline.com/doi/full/10.3402/mehd.v26.27663

+Is the method already implemented in R? - yes

+Link to code repository if relevant (GitHub, Bitbucket or similar)
ANCOMBC = https://github.com/FrederickHuangLin/ANCOMBC
ANCOM = https://github.com/FrederickHuangLin/ANCOM

showing p-value on the graphs

Hi @Russel88
I am a new R user. After performing a statistical significance test, I would like to show the p-value of the microbiota abundant graphs on top of the compared groups or near to the figure legends. Could you please let me know how can I do this?
Thanks,
Saif

Clarification on terminology of "raw counts" vs. "absolute abundance"

Hi,

I saw that it is recommended we provide either (1) raw counts & compositional data, or (2) externally normalized or absolute abundance data.

Perhaps I have been using the terminology differently this entire time but my understanding is that raw counts is equivalent to the absolute abundances. Particularly, for microbiome data, where each sample represents a snapshot of the entire microbiome, actual "absolute abundance" is technically impossible. I have come to understand that the raw counts were considered the absolute abundances of the sample, and that relative abundance data represents proportion-normalized abundances.

Could I please get clarification on this? I'd like to make sure I am using the package correctly. Thanks a bunch for all your efforts on the convenience of this package!

DA.aov/DA.lao and DA.TukeyHSD

Hi,
I am a semi new user to the DAtest package and I am a phyloseq user. I use the DA.test first and find the method that is the best for my data (llm2) in this case. I ude the groupSig and determine the difference but are now are having some issues with the DA.aov and DA.TukeyHSD package.

I use the code: (BB_GF=phyloseq object)
results <- DA.lao(BB_GF, predictor="Mouse_classification", allResults = TRUE)

output for one OTU##

$OTU_166
Call:
aov(formula = as.formula(form))

Terms:
predictor Residuals
Sum of Squares 8.187160e-07 1.721421e-05
Deg. of Freedom 1 27

Residual standard error: 0.0007984756
Estimated effects may be unbalanced

#then use the TukeyHSD
res.tukey <- DA.TukeyHSD(results, variable = "predictor")
Error in colnames<-(*tmp*, value = rownames(as.data.frame(TukeyHSD(results[[k]], :
attempt to set 'colnames' on an object with less than two dimensions

I hope you can understand my issue even though it is not that well explained by me.
What am I doing wrong? (I know I am definitely doing something wrong). Actually I am feeling a bit stupid!

Adding interaction terms in DA.neb()

Dear all,

I am analyzing a mice experiment where we followed gut microbiota over time. Mice (expet a control) recieved a DSS stress to induce severe colitis and mices recived 4 different treatments (Fecal Microbial Transplants) (+ some controls) in order to test if it helped to recover.

I would like to test OTU which are influenced by treatment at all the time points. Hence,starting from a phyloseq object, I have started to use the following DAtest workflow:


physeq %>%
        preDA(min.samples = 200,
                  min.reads = 10, 
                  min.abundance = 0) %>%
        DA.neb(predictor = "treatment", 
                  paired = "mouse_label",  
                  covars = c("day"),
                  allResults = TRUE,
                  relative = FALSE) -> results

Then I am exploring the results using, in order to have the comparaisons for all the different contrasts treatment and time:

results %>%
  DA.lsmeans(variable = c("day", "predictor"))

emmeans(results[[i]], 
        pairwise ~ predictor | day,
          adjust = "FDR", type = "response")$contrasts %>% 
  as.data.frame()

However, I am not sure this is correct since according to the formula in DA.neb() the interaction is not taken into account i.e., + instead of *


tryCatch(fit <- lme4::glmer.nb(as.formula(paste("x ~ predictor+offset(log(libSize)) + (1|paired)+", 
                                                        paste(names(covars), collapse = "+"), sep = "")), 
                                       ...), error = function(x) fit <- NULL)

Do you know if this is correct? Would it be possible to add * in the formula ?
If you have any suggestion to help solving my question, feel free !

Thanks a ton.

Score = 0 for all tests and elaboration on running PCoA prior to DAtest

Hi,

Thank you for the great package.

What does it mean when all of the tests have a score = 0. That there are essentially no features that differ in abundance, so the function cannot calculate a score?

This has happened to me with two different datasets, and particularly since I started using dada2 and ASV methods, where far fewer features are shared among samples.

In one of those instances there was a clear difference in the PCoA plot and accompanying statistics between the two treatments I was looking at. My question here is why does that matter? Wouldn't you be interested in what features are differing between treatments because you see differences in beta diversity? I'm missing why you shouldn't move forward with analyses in that case. You have this listed as your assumptions in the introduction section of the github. It reads:

'Assumptions
The method assumes that most features are NOT associated with the predictor. It is therefore advised to first run an ordination (PCA/PCoA) or similar and if there is clear separation associated with the predictor, DAtest should not be used. In this case, only the False Positive Rate (FPR) can be used.'

What would you do with the FPR in that case? Would you use that to rank which test to use instead of the score?

In this last time I used DAtest, there was no clear separation in treatments in PCoA, but I still received a score of all 0s for the tests.

Thank you for any guidance.
-Carly

Any ideas about the runtime?

Hi everyone,

yes, this is not really an issue. Just wanted to ask if you have an idea what the runtime a normal analysis could be.
I started a test run (first time I'm trying this tool, so not sure what I'm doing), with admittedly pretty big input data (2.5 mil genes, 8 samples), and currently it's still running (22h).
Is this normal/possible, or should I be worried?

Code and output is currently this, and according to top my R instance is still running:

> x <- read.delim("/exports/mm-hpc/bacteriologie/bastian/absolute_counts_per_gene.csv.only_save_patients_before_after.csv",row.names="contig",sep = "\t",header=TRUE)
> library(DAtest)
DAtest version 2.6.2
> mytest <- testDA(x, c(0,1,0,1,0,1,0,1),cores = 1)
Seed is set to 123
1722860 empty features removed
predictor is assumed to be a continuous/quantitative variable

  |                                                                            
  |                                                                      |   0%Estimating sequencing depths...
Resampling to get new data matrices...
perm= 1
perm= 2
perm= 3
perm= 4
perm= 5
perm= 6
perm= 7
perm= 8
perm= 9
perm= 10
perm= 11
perm= 12
perm= 13
perm= 14
perm= 15
perm= 16
perm= 17
perm= 18
perm= 19
perm= 20
perm= 21
perm= 22
perm= 23
perm= 24
perm= 25
perm= 26
perm= 27
perm= 28
perm= 29
perm= 30
perm= 31
perm= 32
perm= 33
perm= 34
perm= 35
perm= 36
perm= 37
perm= 38
perm= 39
perm= 40
perm= 41
perm= 42
perm= 43
perm= 44
perm= 45
perm= 46
perm= 47
perm= 48
perm= 49
perm= 50
perm= 51
perm= 52
perm= 53
perm= 54
perm= 55
perm= 56
perm= 57
perm= 58
perm= 59
perm= 60
perm= 61
perm= 62
perm= 63
perm= 64
perm= 65
perm= 66
perm= 67
perm= 68
perm= 69
perm= 70
perm= 71
perm= 72
perm= 73
perm= 74
perm= 75
perm= 76
perm= 77
perm= 78
perm= 79
perm= 80
perm= 81
perm= 82
perm= 83
perm= 84
perm= 85
perm= 86
perm= 87
perm= 88
perm= 89
perm= 90
perm= 91
perm= 92
perm= 93
perm= 94
perm= 95
perm= 96
perm= 97
perm= 98
perm= 99
perm= 100
Number of thresholds chosen (all possible thresholds) = 944
Getting all the cutoffs for the thresholds...
Getting number of false positives in the permutation...

Unrelated: Greetings to Martin M., from whose poster I saw the link :).

min.sample argument in preDA function

Hello,

I have a question about the min.sample argument in the preDA function. I am fairly new to using DAtest and was wondering if there is a specific amount of minimum samples that the feature is present in that is considered "good"? I was going through the supplementary script attached to your paper [Russel et al. 2018 DAtest: a framework for choosing differential abundance or expression method] and you chose these parameters for the preDA function: preDA(Sub, min.samples = 4, min.reads = 100). Why exactly were those parameters chosen? Just trying to get an understanding so I can apply it to my own data.

Thank you so much for your help!!
Morgan

Implement UpsetDA

Name/description of method UpSetDA
Link to publication if it exists https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html
Is the method already implemented in R? Yes

propose implementation:

library(ComplexHeatmap)
UpSetDA<-function(x,tests=names(x$results),alpha=0.01,...) {
  lt<-lapply(x$results[tests],function(result) result[result$pval.adj<=alpha,'Feature'])
  m = ComplexHeatmap::make_comb_mat(lt)
  ComplexHeatmap::UpSet(m,...,
                        top_annotation = upset_top_annotation(m, add_numbers = TRUE),
                        right_annotation = upset_right_annotation(m, add_numbers = TRUE)
                        )
}

(More?) Useful wherever vennDA might be used.
Example:

testa <- allDA(df, predictor = vec,relative=FALSE,       core.check=FALSE)
compareTest<-c("lrm","lim","aov")
vennDA(testa, tests = compareTest,0.001)
UpSetDA(testa,compareTest,0.001)

Best Practices for >2 treatments

Hi! Not an issue at all. I was just hoping to start a dialog on best practices. If there is a more appropriate venue for this discussion (e.g., a google group) please let me know.

In brief I am working with 16S rRNA data--5 host species, 10 individuals per species. I am interested in identifying differentially abundant OTUs across host species. I previously used LEfSe but am interested in exploring DAtest. I processed the data using dada2 and now have it in a phyloseq object.

I would really appreciate feedback on whether I am on the right track and how best to identify OTUs that are differentially abundant by host species.

Here is my DAtest workflow.

first I pre-process the data to reduce the number of features tested
data.new <- preDA(gpt_gg_no_cyano, min.samples = 5, min.reads = 100, min.abundance = 0)
Then I use host species as the predictor to run the test.
mytest <- testDA(data.new, predictor = "Sp", effectSize = 20, out.all = TRUE)
Results:
summary(mytest)

-Method AUC FPR Spike.detect.rate Rank
-DESeq2 man. geoMeans (ds2) 0.768 0.005 0.017 1.25
-Log Linear reg. 2 (llm2) 0.690 0.019 0.025 1.50
-DESeq2 (ds2x) 0.736 0.010 0.017 1.75
-ANOVA (aov) 0.732 0.023 0.017 2.25
-Log LIMMA 2 (lli2) 0.674 0.024 0.017 3.25
-Log ANOVA 2 (lao2) 0.663 0.027 0.008 5.00
-Log LIMMA (lli) 0.574 0.033 0.000 6.75
-Log ANOVA (lao) 0.569 0.034 0.000 7.50
-Log Linear reg. (llm) 0.569 0.034 0.000 7.50
-Kruskal-Wallis (kru) 0.500 0.021 0.000 8.25

The rest were NA

I focus here on the top 5 methods. I then run posthoc analysis on llm2 and aov and compare all 5 methods:

res.allData <- allDA(data.new, predictor = "Sp", paired = NULL, covars = NULL, tests = c("ds2", "llm2", "aov", "lli2", "ds2x"), relative = TRUE, cores = (detectCores() - 1), rng.seed = 123, p.adj = "fdr", args = list(), out.all = NULL, alpha = 0.1, core.check = TRUE)

I end up with a table of p-values by feature. My thinking here is to select OTUs that are significant for several (>3?) of the 5 methods.
the other issue I am having is which host species the OTUs are enriched in. My understanding is that ds2 and ds2x only do one vs. all comparisons? Is this correct?

Thanks!

allDA(): Undeclared reference to function

Trying to run t <- allDA(sub,predictor=conds) and it seems to get to 100% completed but I get an error at the last moment of Error in tax_table(data, errorIfNULL = FALSE) : could not find function "tax_table"

By the way this is fantastic work thanks so much for this

Phylum level - warning

Hello,

I want to test the difference between two species at the Phylum level. Based on their relative abundance plots, I do not expect them to be different. When I try to run DA test at the phylum level I get the following error:

Set to spike all features. Can't calculate FPR or AUC. Change k argument
In addition: Warning message:
In testDA(p1, predictor = "Species") : Dataset contains very few features

If I try this at Class or a different taxonomic level, I don't get the error. Should I assume there aren't differences at the Phylum level?

Thank you,
Mia

Posthoc for aov with predictor and covar?

Hi,

Is there a way to run one posthoc analysis that includes both predictor and covar?
I use:
filt.DA.full <- DA.aov(filt, predictor = "Group",covars = "Collection_day",allResults = T)

I tried
DA.TukeyHSD(filt.DA.full, variable = "Collection_day", covar = "Predictor")
but this just ignores the covar.

Best,
Martin

Parallelization does not work in R 4.0 + Rstudio

There is a work-around in this thread, maybe it should be the default for now?

rstudio/rstudio#6692

Fail to install the DAtest

Hi Russel,
Please have a look at the error information I pasted in the below, I can not install the DAtest using the command "install_github("Russel88/DAtest")". Could you please let me know how to fix it? Thanks.
Best regards,
Peng

install_github("Russel88/DAtest")
Downloading GitHub repo Russel88/DAtest@master
from URL https://api.github.com/repos/Russel88/DAtest/zipball/master
Installing DAtest
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL
'/private/var/folders/rg/2w1f9r6s6gv6fg1m3zj5xrl00000gn/T/Rtmp1Gw7Ah/devtools91d67aa0169/Russel88-DAtest-fb3a1af'
--library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' --install-tests

installing source package ‘DAtest’ ...
** R
** preparing package for lazy loading
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called ‘pscl’
ERROR: lazy loading failed for package ‘DAtest’
removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/DAtest’
Installation failed: Command failed (1)

Getting error "subscript out of bounds" when trying to run testDA()

Describe the bug
Hello, a collaborator recommended I try the DAtest package for differential abundance analysis on some phyloseq data of mine. I'm still learning how to use the package, so the error is most likely due to a problem with how I've set up my data. When I try to run testDA() on one of my datasets, I get the error "Error in { : task 1 failed - "task 1 failed - "subscript out of bounds." This has happened both when I use the transposed OTU table and a vector of my sample group (i.e. plant tissue type) levels as the predictor, as well as when I use the source phyloseq object and the sample groups from my metadata. Any suggestions on how to resolve this problem are much appreciated!

To Reproduce
Here's the code I tried, and the resulting error: (csv file is attached for reproducibility)
seed.stigma.raw.asv.csv

seed.stigma.raw.asv <- otu_table(seed.stigma.raw.phy) %>% data.frame()
seed.stigma.raw.asv.t <- t(seed.stigma.raw.asv) %>% data.frame()
vec <- c("stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "seed_pooled",
"seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma",
"seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "seed_pooled", "stigma", "stigma", "stigma", "stigma", "stigma", "stigma",
"stigma", "stigma", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "seed_pooled", "stigma", "stigma",
"stigma", "stigma", "stigma", "stigma", "stigma", "seed_pooled", "seed_pooled", "seed_pooled",
"seed_pooled", "seed_pooled", "seed_pooled")
test <- testDA(seed.stigma.raw.asv.t, predictor = vec)
test2 <- testDA(seed.stigma.raw.phy, predictor = "tissue_type")
options(error=recover)

Seed is set to 123
predictor is assumed to be a categorical variable with 2 levels: seed_pooled, stigma
|=================================================================================================| 100%neb, per, adx, vli, qpo, poi, wil, ttt, ltt, ltt2, ds2, ds2x, lim, lli, lli2, lrm, llm, llm2 were excluded due to failure
Error in { : task 1 failed - "task 1 failed - "subscript out of bounds""
In addition: Warning messages:
1: In tests != c("lim", "lli", "lli2", "vli") :
longer object length is not a multiple of shorter object length
2: In tests != c("lim", "lli", "lli2", "vli") :
longer object length is not a multiple of shorter object length

OS and package versions (please complete the following information):

OS: MacOS Ventura 13.3.1
DAtest version: 2.7.5
Versions of other packages related to the bug: DESeq2_1.40.2

adx was excluded due to failure

Hi Jakob,

Good afternoon. I have one issue for running the aldex method, could you please let me know how to figure out it? please have a look at the error information:

results <- DA.adx (TS, predictor = "Genotype")
[1] "aldex.clr: generating Monte-Carlo instances and clr values"
[1] "operating in serial mode"
[1] "computing center with all features"
Error in validObject(.Object) :
invalid class “aldex.clr” object: invalid object for slot "conds" in class "aldex.clr": got class "factor", should be or extend class "character"

Best regards,
Peng

Error in rowSums

Hello

I am trying to analyse RNA-seq data with testDA function but i get this return :

606 empty features removed
Running on 3 cores
predictor is assumed to be a categorical variable with 6 levels: ct1, ct2, ct3, rst1, rst2, rst3
Spikeing...
Error in rowSums(count_table[, predictor == 1]) :
'x' must be an array of at least two dimensions

Add corncob

Please provide the following:

Name/description of method
Link to publication if it exists
Is the method already implemented in R?
Link to code repository if relevant (GitHub, Bitbucket or similar)

Cannot install

Hi，
I tried to install the package for different methods. But I got errors showing below.
package ‘testthat’ successfully unpacked and MD5 sums checked Error: Failed to install 'DAtest' from GitHub: (converted from warning) cannot remove the prior installation of package ‘testthat’
I also tried to remove the package 'testthat', but I don't have it.
remove.packages("testthat") Removing package from ‘C:/Users/lqwnc/Documents/R/win-library/4.0’ (as ‘lib’ is unspecified)

Wondering do you have other suggestions? Thank you very much!

Errors in Mac Installation

Hi, I am having trouble installing this on a Mac (Mojave) using Rstudio 1.2.1335 and R v 3.6. I think it has something to do with compilation errors.

I either run into "library not found for -lgfortran" for either minqa or RcppEigen:
I am able to solve this by adding a makevars file that points to a homebrew installation of gcc

However when I do the makevars workaround, I encounter a cross-compilation error for nloptr

I hope you can advise me on this