fgcz / prolfqua Goto Github PK
View Code? Open in Web Editor NEWDifferential Expression Analysis tool box R lang package for omics data
Home Page: https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.2c00441
License: MIT License
Differential Expression Analysis tool box R lang package for omics data
Home Page: https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.2c00441
License: MIT License
The MA plot axis labels are quite cryptic for unexperienced users.
I did add the model name and the contrast function to the model functions. Hence, now I can remove the if statement selecting the contrast function from the contrasts_linfct
.
Idea is to create MA plots.
avoid switching between Camel, Pascal, snake case and mixtures thereof.
I just wanted to let you know that using citation("prolfqua") in R still points to the bioRxiv article and not the published article in J. Proteome Res. 2023, 22, 4, 1092–1104. The bioRxiv article also seems to still lack the information that the article is now published in J. Proteome Res. 2023, 22, 4, 1092–1104.
Hei Witold,
As discussed today it would be a meaningful extenstion to the simple Impute function if the log2FC would be estimated at 0 if the mean is smaller than probs (default at 0.1).
Now the log2FC is estimated in the wrong direction!
gruess - jonas
p26828_BlandAltmann_feedOneVSfeedTwo_withImputation.pdf
Hello and thank you for the open access package. Quick question about a problem I'm running into.
Describe the bug
FragPipe v.19.1 combined_protein.tsv output gives the error when prolfqua::tidy_FragPipe_combined_protein is called:
Warning: argument is not being debuggedError in 1:which(cnam == "Combined Total Spectral Count") :
argument of length 0
To Reproduce
library(tibble)
library(dplyr)
library(prolfqua)
datadir<- file.path(find.package("prolfquadata"),"quantdata")
#read input annotation
annotation <- readxl::read_xlsx("myannotationfile.xlsx")
#read input protein
protein <- tibble::as_tibble(read.csv("~combined_protein.tsv", header=TRUE, sep="\t", stringsAsFactors = FALSE))
undebug(prolfqua::tidy_FragPipe_combined_protein)
protein <- prolfqua::tidy_FragPipe_combined_protein(protein)
protein <- protein |> dplyr::filter(unique.stripped.peptides > 1)
merged <- dplyr::inner_join(annotation, protein)
Additional context
FragPipe v19.1 only includes one type of Intensity per experimental replicate and LFQ; could this be the reason for the error? I am including my combined TSV file for your reference.
Thanks again for any help.
combined_protein.tsv.txt
since we added plot_precision_recall we can remove plot_FDPvsTPR, which is quite similar but worse.
x$linfct_A
(Intercept) celltype_lncap celltype_pc3 celltype_pnt1
celltype_pc3 - celltype_pnt1 0 0 1 -1
celltype_lncap - celltype_pnt1 0 1 0 -1
lme4::fixef(resXX$res_modelling$modellingResult$modelProtein$linear_model[[408]])
(Intercept) celltype_pc3
-4.262731 3.987698
currently the function my_contest returns NA for rank defficient models. Should it not be possible to somehow estimate a contrast for the conditions where measurements were observed?
Hello,
I am attempting to follow the script outlined in the Comparing Two Groups with prolfqua vignette, however I keep receiving the error "Error in basename(data[[table$fileName]]) : a character vector argument expected" when running setup_analysis to create sampleName from fileName column. Do you have any advice on what the issue may be? Our data from MaxQuant should be in the same format as the one used in the example.
Thank you,
Juliana
check if usefull and if it performs better than medpolish
why not GPLv3?
The idea would be to move it to the model class, because some of the Contrast classes do not use linear functions.
implement the option to color lines in intensity_distribution_density using the factor variable from the lfqdata$config
https://journal.r-project.org/archive/2009/RJ-2009-019/RJ-2009-019.pdf
Generate normalized matrices in addition.
Hi,
I'm running through the vignette (here: https://fgcz.github.io/prolfqua/articles/Comparing2Groups.html)
I get to here:
lfqdata$to_wide()$data[1:3,1:7]
and I get the error:
Error in `tidyr::spread()`:
! Each row of output must be identified by a unique combination of keys.
ℹ Keys are shared for 6767 rows
This is also the same error I get when try running the normalisation (robscale)
lt <- lfqdata$get_Transformer()
transformed <- lt$log2()$robscale()$lfq
I then tried to normalise via that described here: https://fgcz.github.io/prolfqua/reference/LFQDataTransformer.html#examples
lfqTrans <-lfqdata$get_Transformer()
x <- lfqTrans$intensity_array(log2)
x$lfq$config$table$is_response_transformed
x <- x$intensity_matrix(robust_scale)
plotter <- x$lfq$get_Plotter()
plotter$intensity_distribution_density()
However, I get the response:
Warning message:
In x$intensity_matrix(robust_scale) :
data already transformed. If you still want to log2 tranform, set force = TRUE
And checking the two steps have "run", the robust_scale request has made no changes to the data, there are no changes to the log2 transformation column or an additional robust_ scale column in the tibble.
The warning response above suggests that I should run with force=T, but states that this is to allow another log2 transformation, which is not required as the intensities have already been log transformed.
Running with force=T, does add another column with robust_scale appended to the column name. But has this been log transformed again?
Any help greatly appreciated.
Greetings,
That you for providing this opensource package. By any chance would you be able to direct me to the location where the sample RScript is to process FragPipe LFQMBR output file?
Regards,
Ben
Integrate this function into the ContrastPlotter class.
upSet() - shows up-set plot
barplot() - showing the number of significant proteins per contrast.
Hoi Witold,
As you may have seen in the comments, one user SH realized that the volcanos are by default fixed on ylim.
Relaxing this with the scales parameter does not work .
-> in all these volcanos, the y-axis always has the same fixed length.
pl$volcano(scales = c("fixed"))
pl$volcano_plotly(scales = "free")
pl$volcano(scales = "free")
pl$volcano(scales = "free_y")
Gruess & bis am Dienstag
j
For protein-level inference of differential expression, the median of peptide-level p-values is used as a score for each protein taking the direction of change into account. The protein-level significance of the detection is then calculated using beta distribution. Under the null hypothesis, the p-values of the peptides follow the uni-form distribution U(0,1). Furthermore, the order statistics from U(0,1) distribution follow a beta distribution.
More specifically, the i th order statistic of sample size n has a beta distribution B(gamma,delta) with parameters gamma = i and delta = n − i + 1 . The significance of the median p-value for a protein with n peptides is hence calculated using the cumulative distribution function of the beta distribution’s probability density function.
where P_m is the observed median p-value of peptides belonging to the protein.
Finally, the FDR is calculated using the Benjamini-Hochberg procedure.
Possible code
sum$missingness_per_condition()
tr <- 2
nrMiss <- tmp$data %>% pivot_wider(id_cols = "protein_Id", names_from = "bait", values_from = nrMeasured)
nrMiss[,2:4][nrMiss[,2:4] < tr] <- 0
nrMiss[,2:4][nrMiss[,2:4] >= tr] <- 1
upset(as.data.frame(nrMiss), order.by = "freq")
To avoid confusion, since we do not do imputation but are estimating the contrasts in the presence of missing data, we rename this contrast class.
Add glm binomial model for missingness analysis.
Hei Witold,
As discussed, when not specifying the factors or model correctly. It would be great to get a more telling error message instead of no results.
Gruess jonas
Add proDA
as a further modeling option in prolfqua
.
Review usage especially if parameters rev and names are needed.
Based on the discussion below, I understand that you have fitted the same linear model to a few hundred separate datasets (with each subject being a separate dataset). To apply limma empirical Bayes moderation you need to have three vectors storing information from the separate fits:
A vector sigma containing the residual standard errors from the separate fits. This vector is not obtainable from your data.frame. The residual standard error is the square root of the residual mean square and is shown in the output from summary(fit) for each linear model. If you save the summary to out <- summary(fit) then the residual standard error is stored as out$sigma.
A vector df.residual containing the residual degrees of freedom for each regression, which will not all be the same some of the fits have missing values. This will be the same as df in your data.frame.
A vector tstat containing the unmoderated t-statistics from the separate linear models. This will be the same as t-value in your data.frame.
Then you can compute moderated t-statistics and p-values by
sv <- squeezeVar(sigma^2, df=df.residual)
moderated.t <- tstat * sigma / sqrt(sv$var.post)
df.total <- df.residual + sv$df.prior
p.value <- 2*pt( abs(moderated.t), df=df.total, lower.tail=FALSE)
Hi,
I would like to try the vignette. The first steps go well until this one:
adata <- setup_analysis(startdata, config)
I get the following error:
"creating sampleName
Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after], :
missing value where TRUE/FALSE needed"
It seems that my data contains missing values but I don't know how to handle them.
I tried :
data <- read.table("path_to/proteinGroups.txt", sep="\t",na.strings = c(""), header=T)
data <- data %>% drop_na(c("Protein.IDs", "Gene.names", "Intensity"))
startdata <- prolfqua::tidyMQ_ProteinGroups(data)
...steps up to:
adata <- setup_analysis(startdata, config)
but I still get the same error.
Here is the link to my protein groups file: https://www.swisstransfer.com/d/34536145-ba5e-4994-8364-fa2a5aca223a
Is your feature request related to a problem? Please describe.
investigate on how to apply covr to prolfqua examples
Describe the solution you'd like
https://cran.r-project.org/package=covr
Vignette text needs spellchecking and grammar correction
The article pages under https://fgcz.github.io/prolfqua/articles/ contain a link to their source below the title and these links seem to be broken.
e.g. on https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html the link [Source: ../vignettes/CreatingConfigurations.Rmd
points to https://github.com/wolski/prolfqua/blob/vignettes/CreatingConfigurations.Rmd which gives a 404 error.
Dear Fragpipe Team
I am running FragPipe via Xwindow-GUI on a Linux machine. I get an IonQuant error for "index out of bounds".
Do you have any idea what went wrong?
There are 15 raw files in the game and a huge metaproteomicsDB of 8.2G (which I manage to run successfully to run the full FragPipe-MBR workflow for 3 test files).
Best regards
jonas
///
2021-02-18 09:27:27 [INFO] - matched run percentage: 0.1495889380762246
2021-02-18 09:27:27 [INFO] - Fitting a mixture model...
2021-02-18 09:27:27 [INFO] - Estimating match-between-runs FDR...
2021-02-18 09:27:30 [ERROR] - Index 11095 out of bounds for length 10001
Process 'IonQuant' finished, exit code: 1
Process returned non-zero exit code, stopping
///
Example run script.R where LFQService processes MQanalysis and MSStats file output.
As discussed in the article https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
the precision-recall plot can be more informative than the ROC curve. Therefore we will be adding this plot to the Benchmark class.
+#' zip_archive <- system.file("samples/maxquant_txt/twoGroup3Reps.zip", package="LFQService")
+#' # res <- tidyMQ_PeptideProtein(zip_archive)
Is not working
Hi Thanks for developing an excellent framework for LFQ. I just wanted to highlight something that I've found.
The function "diann_read_output" does not result in the required data frame per the parameters supplied.
There are two separate issues:
In the section of the R code:
filter_PG <- function(PG, nrPeptides = 2, Q.Value = 0.01){
PG <- PG |> dplyr::filter(nrPeptides >= nrPeptides)
PG <- PG |> dplyr::filter(.data$Lib.PG.Q.Value < Q.Value)
PG <- PG |> dplyr::filter(.data$PG.Q.Value < Q.Value)
}
The variable nrPeptides set in the function call is being overridden by the local nrPeptide from the data frame and appears to be checking against itself. Would it be possible to change the parameter name to nrPeptides_min or something other than nrPeptides?
The above section of code ideally needs to be separated, whereby the peptides are filtered by Q.value before the nrPeptides column is calculated. The data frame can be filtered by the minimum peptide threshold afterwards.
I hope this makes sense.
Best,
Craig
Something between the current heatmap and the raster plot in LFQData.
When clustering row-wise, we must remove proteins with few observations (otherwise, distance computation fails).
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.