fgcz / prolfqua
Differential expression analysis toolbox: an R package for omics data

Home Page: https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.2c00441

License: MIT License

Languages: R 94.13%, Dockerfile 0.31%, TeX 5.54%, Shell 0.02%
Topics: proteomics-data-analysis, rstats-package, hypothesis-testing, quality-control, r-package, differential-expression-analysis, protein-quantification

prolfqua's People

Contributors: cpanse, jjgg, wolski

prolfqua's Issues

Use strategy pattern in contrasts_linfct

I added the model name and the contrast function to the model functions. Hence, the if statement that selects the contrast function inside contrasts_linfct can now be removed.
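A minimal sketch of the strategy pattern in this setting, with hypothetical names: each model object carries its own contrast function, so contrasts_linfct dispatches through the object instead of an if/else chain.

# each model constructor attaches the matching contrast strategy
make_model <- function(model_name, contrast_fun) {
  list(model_name = model_name, contrast_fun = contrast_fun)
}

contrasts_linfct <- function(model, linfct) {
  # delegate to the function stored on the model object
  model$contrast_fun(model, linfct)
}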

Citation points to bioRxiv

I just wanted to let you know that citation("prolfqua") in R still points to the bioRxiv article and not to the published article in J. Proteome Res. 2023, 22, 4, 1092–1104. The bioRxiv preprint also still seems to lack the note that the article has since been published there.
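For reference, a sketch of an updated inst/CITATION entry; the journal details are taken from above, while the title and author fields are best guesses from context and would need to be checked against the journal record:

citHeader("To cite prolfqua in publications use:")
bibentry(
  bibtype = "Article",
  title   = "prolfqua: A Comprehensive R-Package for Proteomics Differential Expression Analysis",  # check against the journal record
  author  = "Witold E. Wolski and others",  # placeholder author list
  journal = "Journal of Proteome Research",
  year    = "2023",
  volume  = "22",
  number  = "4",
  pages   = "1092--1104",
  doi     = "10.1021/acs.jproteome.2c00441"
)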

FragPipe v.19.1 combined_protein output not working with prolfqua::tidy_FragPipe_combined_protein

Hello and thank you for the open access package. Quick question about a problem I'm running into.

Describe the bug
With FragPipe v19.1 combined_protein.tsv output, calling prolfqua::tidy_FragPipe_combined_protein gives the error:

Warning: argument is not being debugged
Error in 1:which(cnam == "Combined Total Spectral Count") :
  argument of length 0

To Reproduce

library(tibble)
library(dplyr)
library(prolfqua)

datadir <- file.path(find.package("prolfquadata"), "quantdata")

# read input annotation
annotation <- readxl::read_xlsx("myannotationfile.xlsx")
# read input protein table
protein <- tibble::as_tibble(
  read.csv("~combined_protein.tsv", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
)

undebug(prolfqua::tidy_FragPipe_combined_protein)  # source of the "argument is not being debugged" warning
protein <- prolfqua::tidy_FragPipe_combined_protein(protein)
protein <- protein |> dplyr::filter(unique.stripped.peptides > 1)
merged <- dplyr::inner_join(annotation, protein)

Additional context
FragPipe v19.1 only includes one type of intensity per experimental replicate (LFQ); could this be the reason for the error? I am including my combined TSV file for your reference.
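A quick check that may help narrow this down: the failing expression has the form 1:which(cnam == "Combined Total Spectral Count"), and if that column header is absent, which() returns integer(0), so 1:integer(0) fails with "argument of length 0". A one-line sanity check on the input (a sketch):

# TRUE if the header tidy_FragPipe_combined_protein expects is present
"Combined Total Spectral Count" %in% colnames(protein)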

Thanks again for any help.
combined_protein.tsv.txt

Improve handling of rank deficient models

x$linfct_A
                               (Intercept) celltype_lncap celltype_pc3 celltype_pnt1
celltype_pc3 - celltype_pnt1             0              0            1            -1
celltype_lncap - celltype_pnt1           0              1            0            -1

 lme4::fixef(resXX$res_modelling$modellingResult$modelProtein$linear_model[[408]])
 (Intercept) celltype_pc3 
   -4.262731     3.987698 

debug408.txt

Currently the function my_contest returns NA for rank-deficient models. Should it not be possible to somehow estimate a contrast for the conditions where measurements were observed?
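One possible direction, sketched with a hypothetical helper: keep only the contrast rows whose non-zero entries refer to coefficients the rank-deficient fit could actually estimate, and compute those instead of returning NA for everything.

# linfct: contrast matrix as in x$linfct_A; fit: the (rank-deficient) model
estimable_rows <- function(linfct, fit) {
  est <- names(lme4::fixef(fit))  # coefficients present in the fit
  keep <- apply(linfct, 1, function(row) all(names(row)[row != 0] %in% est))
  linfct[keep, , drop = FALSE]
}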

Issue with setup_analysis in comparing2groups vignette

Hello,

I am attempting to follow the script outlined in the Comparing Two Groups with prolfqua vignette; however, I keep receiving the error "Error in basename(data[[table$fileName]]) : a character vector argument expected" when running setup_analysis to create sampleName from the fileName column. Do you have any advice on what the issue may be? Our data from MaxQuant should be in the same format as the one used in the example.
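(One check that may help narrow this down, sketched under the assumption that the input table and configuration are named startdata and config as in the vignette: the error means basename() received something other than a character vector, so it is worth inspecting the column the configuration points to.)

# fileName must be a character column; a factor or an absent column
# would trigger this error
is.character(startdata[[config$table$fileName]])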

Thank you,

Juliana

Error at normalisation step

Hi,

I'm running through the vignette (here: https://fgcz.github.io/prolfqua/articles/Comparing2Groups.html)

I get to here:

lfqdata$to_wide()$data[1:3,1:7]

and I get the error:

Error in `tidyr::spread()`:
! Each row of output must be identified by a unique combination of keys.
ℹ Keys are shared for 6767 rows
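(For reference, a minimal way to locate the duplicated keys that break the wide pivot, assuming the identifier columns are named protein_Id and sampleName, which may differ in your configuration:)

library(dplyr)
lfqdata$data |>
  dplyr::count(protein_Id, sampleName) |>  # hypothetical key columns
  dplyr::filter(n > 1)                     # rows sharing the same keys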

This is also the same error I get when I try running the normalisation (robscale):

lt <- lfqdata$get_Transformer()
transformed <- lt$log2()$robscale()$lfq

I then tried to normalise via the approach described here: https://fgcz.github.io/prolfqua/reference/LFQDataTransformer.html#examples

lfqTrans <- lfqdata$get_Transformer()
x <- lfqTrans$intensity_array(log2)
x$lfq$config$table$is_response_transformed
x <- x$intensity_matrix(robust_scale)
plotter <- x$lfq$get_Plotter()
plotter$intensity_distribution_density()

However, I get the response:

Warning message:
In x$intensity_matrix(robust_scale) :
  data already transformed. If you still want to log2 tranform, set force = TRUE

Checking whether the two steps have actually run: the robust_scale request has made no changes to the data; there is no change to the log2-transformed column and no additional robust_scale column in the tibble.

The warning above suggests that I should run with force = TRUE, but states that this is to allow another log2 transformation, which is not required since the intensities have already been log-transformed.

Running with force = TRUE does add another column with robust_scale appended to the column name, but has this been log-transformed again?

Any help greatly appreciated.

add in-silico talk

<iframe width="516" height="290" src="https://www.youtube.com/embed/acDiXq2xbOw" title="prolfqua: A comprehensive R package for protein differential expression analysis" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

volcano plot with fixed ylim (scales) cannot be "freed up" with scales="free"

Hi Witold,

As you may have seen in the comments, one user (SH) realized that the volcano plots have a fixed ylim by default.
Relaxing this with the scales parameter does not work; in all of the following calls, the y-axis keeps the same fixed range:

pl$volcano(scales = c("fixed"))
pl$volcano_plotly(scales = "free")
pl$volcano(scales = "free")
pl$volcano(scales = "free_y")
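A sketch of where the fix likely lives, assuming the plots are built with ggplot2 facets (an assumption; column and facet names here are illustrative): the user-supplied scales value has to reach facet_wrap(), because if it is hard-coded to "fixed" inside the plotting method, "free" is silently ignored.

p <- ggplot2::ggplot(data, ggplot2::aes(x = diff, y = -log10(p.value))) +
  ggplot2::geom_point() +
  ggplot2::facet_wrap(~contrast, scales = scales)  # pass through, not scales = "fixed"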

Regards & see you on Tuesday
j

add ropeca type protein level p-value calculation

For protein-level inference of differential expression, the median of the peptide-level p-values is used as a score for each protein, taking the direction of change into account. The protein-level significance of the detection is then calculated using the beta distribution. Under the null hypothesis, the p-values of the peptides follow the uniform distribution U(0,1), and the order statistics of a U(0,1) sample follow a beta distribution.
More specifically, the i-th order statistic of a sample of size n has a beta distribution B(gamma, delta) with parameters gamma = i and delta = n - i + 1. The significance of the median p-value for a protein with n peptides is hence calculated using the cumulative distribution function of this beta distribution:

p_protein = F_Beta(P_m; gamma, delta)

where P_m is the observed median p-value of the peptides belonging to the protein.
Finally, the FDR is calculated using the Benjamini-Hochberg procedure.
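A minimal sketch of this calculation in R (the choice i = ceiling(n/2) for the order-statistic index of the median is one possible convention, not taken from the text):

ropeca_protein_pvalue <- function(peptide_p) {
  n <- length(peptide_p)
  i <- ceiling(n / 2)              # order-statistic index of the median
  p_m <- stats::median(peptide_p)  # observed median peptide p-value P_m
  stats::pbeta(p_m, shape1 = i, shape2 = n - i + 1)  # Beta CDF, gamma = i, delta = n - i + 1
}

# protein-level FDR across all proteins via Benjamini-Hochberg:
# p.adjust(protein_pvalues, method = "BH")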

Show missingness per condition using a Venn diagram

Possible code (note that the result of missingness_per_condition() has to be assigned to tmp, and dplyr/tidyr/UpSetR are needed):

library(dplyr)   # %>%
library(tidyr)   # pivot_wider
library(UpSetR)  # upset

tmp <- sum$missingness_per_condition()
tr <- 2  # minimum number of measured values per condition
nrMiss <- tmp$data %>% pivot_wider(id_cols = "protein_Id", names_from = "bait", values_from = nrMeasured)
# binarize: 1 if measured at least tr times in the condition, else 0
nrMiss[, 2:4][nrMiss[, 2:4] < tr] <- 0
nrMiss[, 2:4][nrMiss[, 2:4] >= tr] <- 1

upset(as.data.frame(nrMiss), order.by = "freq")

Glm support

Add glm binomial model for missingness analysis.
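A minimal sketch of what this could look like, assuming a long-format table protein_long with a logical isMissing column and a condition factor (hypothetical names):

# model presence/absence of a protein measurement as a binomial GLM
fit <- glm(isMissing ~ condition, family = binomial(), data = protein_long)
summary(fit)  # condition coefficients indicate missingness enrichment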

Reduce package size

  • remove test data and replace with simulated data
  • move SAINT binaries to a separate package, which can depend on prolfqua.

hkeysLevel

Review usage, especially whether the parameters rev and names are needed.

add moderated p.values

Based on the discussion below, I understand that you have fitted the same linear model to a few hundred separate datasets (with each subject being a separate dataset). To apply limma empirical Bayes moderation you need to have three vectors storing information from the separate fits:

A vector sigma containing the residual standard errors from the separate fits. This vector is not obtainable from your data.frame. The residual standard error is the square root of the residual mean square and is shown in the output from summary(fit) for each linear model. If you save the summary to out <- summary(fit) then the residual standard error is stored as out$sigma.

A vector df.residual containing the residual degrees of freedom for each regression, which will not all be the same if some of the fits have missing values. This will be the same as df in your data.frame.

A vector tstat containing the unmoderated t-statistics from the separate linear models. This will be the same as t-value in your data.frame.

Then you can compute moderated t-statistics and p-values by

library(limma)

# shrink the per-fit residual variances towards a common prior
sv <- squeezeVar(sigma^2, df = df.residual)
# rescale the ordinary t-statistics by the moderated standard errors
moderated.t <- tstat * sigma / sqrt(sv$var.post)
df.total <- df.residual + sv$df.prior
p.value <- 2 * pt(abs(moderated.t), df = df.total, lower.tail = FALSE)
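For completeness, a sketch of collecting the three vectors from a list of per-dataset lm fits (the fits list and the coefficient name "groupB" are hypothetical):

summaries <- lapply(fits, summary)
sigma <- vapply(summaries, function(s) s$sigma, numeric(1))  # residual standard errors
df.residual <- vapply(fits, stats::df.residual, numeric(1))  # residual degrees of freedom
tstat <- vapply(summaries, function(s) s$coefficients["groupB", "t value"], numeric(1))  # unmoderated t-statistics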

Error with setup_analysis()

Hi,

I would like to try the vignette. The first steps go well until this one:
adata <- setup_analysis(startdata, config)

I get the following error:
"creating sampleName
Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after], :
missing value where TRUE/FALSE needed"

It seems that my data contains missing values but I don't know how to handle them.
I tried:

library(dplyr)   # %>%
library(tidyr)   # drop_na

data <- read.table("path_to/proteinGroups.txt", sep = "\t", na.strings = c(""), header = TRUE)
data <- data %>% drop_na(c("Protein.IDs", "Gene.names", "Intensity"))
startdata <- prolfqua::tidyMQ_ProteinGroups(data)
...steps up to:
adata <- setup_analysis(startdata, config)

but I still get the same error.

Here is the link to my protein groups file: https://www.swisstransfer.com/d/34536145-ba5e-4994-8364-fa2a5aca223a

IonQuant error: Index out of bounds for length 10001

Dear FragPipe Team

I am running FragPipe via the X-window GUI on a Linux machine. I get an IonQuant error for "index out of bounds".
Do you have any idea what went wrong?

There are 15 raw files involved and a huge metaproteomics DB of 8.2 GB (with which I did manage to run the full FragPipe-MBR workflow successfully for 3 test files).

Best regards
jonas

///
2021-02-18 09:27:27 [INFO] - matched run percentage: 0.1495889380762246
2021-02-18 09:27:27 [INFO] - Fitting a mixture model...
2021-02-18 09:27:27 [INFO] - Estimating match-between-runs FDR...
2021-02-18 09:27:30 [ERROR] - Index 11095 out of bounds for length 10001
Process 'IonQuant' finished, exit code: 1
Process returned non-zero exit code, stopping
///

DiaNN read report function

Hi, thanks for developing an excellent framework for LFQ. I just wanted to highlight something that I've found.

The function "diann_read_output" does not result in the required data frame per the parameters supplied.

There are two separate issues:

  1. Supplying the nrPeptides parameter does not filter the data frame during import, i.e. setting nrPeptides = 2 has no effect.

In this section of the R code:

filter_PG <- function(PG, nrPeptides = 2, Q.Value = 0.01){
    # the nrPeptides column masks the nrPeptides argument here,
    # so this filter compares the column against itself
    PG <- PG |> dplyr::filter(nrPeptides >= nrPeptides)
    PG <- PG |> dplyr::filter(.data$Lib.PG.Q.Value < Q.Value)
    PG <- PG |> dplyr::filter(.data$PG.Q.Value < Q.Value)
  }

The variable nrPeptides set in the function call is masked by the nrPeptides column of the data frame, so the filter appears to be checking the column against itself. Would it be possible to change the parameter name to nrPeptides_min or something other than nrPeptides?

  2. The filtering by Q.Value is performed after the nrPeptides column is calculated, so the resulting data frame will contain proteins quantified by a single peptide.

The above section of code ideally needs to be separated, so that the peptides are filtered by Q.Value before the nrPeptides column is calculated; the data frame can then be filtered by the minimum peptide threshold afterwards. A possible rewrite is sketched below.
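A sketch of a possible fix (nrPeptides_min is the renamed parameter suggested above; recomputing nrPeptides from the Q-value-filtered peptide table would still have to happen before this function is called):

filter_PG <- function(PG, nrPeptides_min = 2, Q.Value = 0.01){
    # filter by Q-values first ...
    PG <- PG |> dplyr::filter(.data$Lib.PG.Q.Value < Q.Value)
    PG <- PG |> dplyr::filter(.data$PG.Q.Value < Q.Value)
    # ... then by peptide count; the renamed argument no longer
    # collides with the nrPeptides column
    PG |> dplyr::filter(.data$nrPeptides >= nrPeptides_min)
}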

I hope this makes sense.

Best,

Craig

add heatmap with only sample-wise clustering

Something between the current heatmap and the raster plot in LFQData.
When clustering row-wise, we must remove proteins with few observations (otherwise, distance computation fails).
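A minimal sketch, assuming a wide intensity matrix mat and the pheatmap package (which may not be what LFQData uses internally):

# cluster samples (columns) only; skipping row-wise clustering avoids
# the failing distance computation for proteins with few observations
pheatmap::pheatmap(mat, cluster_rows = FALSE, cluster_cols = TRUE)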

review ContrastsMissing get_contrasts

  • get_contrasts
      • get_imputed_contrasts
          • missigness_impute_factors_interactions(pepIntensity, config, value = "long")
              • .missigness_impute_interactions
                  • interaction_missing_stats
          • get_contrast(ungroup(imp), config$table$hierarchy_keys(), Contr)
          • aggregate_contrast(ungroup(imputed2), subject_Id = config$table$hierarchy_keys_depth())
      • summarize_stats
