fgcz / prolfqua
Differential expression analysis toolbox: an R package for omics data

Home Page: https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.2c00441

License: MIT License

Languages: R 94.13%, Dockerfile 0.31%, TeX 5.54%, Shell 0.02%
Topics: proteomics-data-analysis, rstats-package, hypothesis-testing, quality-control, r-package, differential-expression-analysis, protein-quantification

prolfqua's People

Contributors: cpanse, jjgg, wolski

prolfqua's Issues

Use strategy pattern in contrasts_linfct

I added the model name and the contrast function to the model functions. Hence, the if statement that selects the contrast function inside contrasts_linfct can now be removed.
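A minimal sketch of the strategy pattern in this setting, with hypothetical names: each model object carries its own contrast function, so contrasts_linfct dispatches through the object instead of an if/else chain.

# each model constructor attaches the matching contrast strategy
make_model <- function(model_name, contrast_fun) {
  list(model_name = model_name, contrast_fun = contrast_fun)
}

contrasts_linfct <- function(model, linfct) {
  # delegate to the function stored on the model object
  model$contrast_fun(model, linfct)
}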

Citation points to bioRxiv

I just wanted to let you know that citation("prolfqua") in R still points to the bioRxiv article and not to the published article in J. Proteome Res. 2023, 22, 4, 1092–1104. The bioRxiv preprint also still seems to lack the note that the article has since been published there.
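For reference, a sketch of an updated inst/CITATION entry; the journal details are taken from above, while the title and author fields are best guesses from context and would need to be checked against the journal record:

citHeader("To cite prolfqua in publications use:")
bibentry(
  bibtype = "Article",
  title   = "prolfqua: A Comprehensive R-Package for Proteomics Differential Expression Analysis",  # check against the journal record
  author  = "Witold E. Wolski and others",  # placeholder author list
  journal = "Journal of Proteome Research",
  year    = "2023",
  volume  = "22",
  number  = "4",
  pages   = "1092--1104",
  doi     = "10.1021/acs.jproteome.2c00441"
)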

FragPipe v.19.1 combined_protein output not working with prolfqua::tidy_FragPipe_combined_protein

Hello and thank you for the open access package. Quick question about a problem I'm running into.

Describe the bug
With FragPipe v19.1 combined_protein.tsv output, calling prolfqua::tidy_FragPipe_combined_protein gives the error:

Warning: argument is not being debugged
Error in 1:which(cnam == "Combined Total Spectral Count") :
  argument of length 0

To Reproduce

library(tibble)
library(dplyr)
library(prolfqua)

datadir <- file.path(find.package("prolfquadata"), "quantdata")

# read input annotation
annotation <- readxl::read_xlsx("myannotationfile.xlsx")
# read input protein table
protein <- tibble::as_tibble(
  read.csv("~combined_protein.tsv", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
)

undebug(prolfqua::tidy_FragPipe_combined_protein)  # source of the "argument is not being debugged" warning
protein <- prolfqua::tidy_FragPipe_combined_protein(protein)
protein <- protein |> dplyr::filter(unique.stripped.peptides > 1)
merged <- dplyr::inner_join(annotation, protein)

Additional context
FragPipe v19.1 only includes one type of intensity per experimental replicate (LFQ); could this be the reason for the error? I am including my combined TSV file for your reference.
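A quick check that may help narrow this down: the failing expression has the form 1:which(cnam == "Combined Total Spectral Count"), and if that column header is absent, which() returns integer(0), so 1:integer(0) fails with "argument of length 0". A one-line sanity check on the input (a sketch):

# TRUE if the header tidy_FragPipe_combined_protein expects is present
"Combined Total Spectral Count" %in% colnames(protein)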

Thanks again for any help.
combined_protein.tsv.txt

Improve handling of rank deficient models

x$linfct_A
                               (Intercept) celltype_lncap celltype_pc3 celltype_pnt1
celltype_pc3 - celltype_pnt1             0              0            1            -1
celltype_lncap - celltype_pnt1           0              1            0            -1

 lme4::fixef(resXX$res_modelling$modellingResult$modelProtein$linear_model[[408]])
 (Intercept) celltype_pc3 
   -4.262731     3.987698 

debug408.txt

Currently the function my_contest returns NA for rank-deficient models. Should it not be possible to somehow estimate a contrast for the conditions where measurements were observed?
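One possible direction, sketched with a hypothetical helper: keep only the contrast rows whose non-zero entries refer to coefficients the rank-deficient fit could actually estimate, and compute those instead of returning NA for everything.

# linfct: contrast matrix as in x$linfct_A; fit: the (rank-deficient) model
estimable_rows <- function(linfct, fit) {
  est <- names(lme4::fixef(fit))  # coefficients present in the fit
  keep <- apply(linfct, 1, function(row) all(names(row)[row != 0] %in% est))
  linfct[keep, , drop = FALSE]
}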

Issue with setup_analysis in comparing2groups vignette

Hello,

I am attempting to follow the script outlined in the Comparing Two Groups with prolfqua vignette; however, I keep receiving the error "Error in basename(data[[table$fileName]]) : a character vector argument expected" when running setup_analysis to create sampleName from the fileName column. Do you have any advice on what the issue may be? Our data from MaxQuant should be in the same format as the one used in the example.
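(One check that may help narrow this down, sketched under the assumption that the input table and configuration are named startdata and config as in the vignette: the error means basename() received something other than a character vector, so it is worth inspecting the column the configuration points to.)

# fileName must be a character column; a factor or an absent column
# would trigger this error
is.character(startdata[[config$table$fileName]])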

Thank you,

Juliana

Error at normalisation step

Hi,

I'm running through the vignette (here: https://fgcz.github.io/prolfqua/articles/Comparing2Groups.html)

I get to here:

lfqdata$to_wide()$data[1:3,1:7]

and I get the error:

Error in `tidyr::spread()`:
! Each row of output must be identified by a unique combination of keys.
ℹ Keys are shared for 6767 rows
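(For reference, a minimal way to locate the duplicated keys that break the wide pivot, assuming the identifier columns are named protein_Id and sampleName, which may differ in your configuration:)

library(dplyr)
lfqdata$data |>
  dplyr::count(protein_Id, sampleName) |>  # hypothetical key columns
  dplyr::filter(n > 1)                     # rows sharing the same keys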

This is also the same error I get when I try running the normalisation (robscale):

lt <- lfqdata$get_Transformer()
transformed <- lt$log2()$robscale()$lfq

I then tried to normalise via the approach described here: https://fgcz.github.io/prolfqua/reference/LFQDataTransformer.html#examples

lfqTrans <- lfqdata$get_Transformer()
x <- lfqTrans$intensity_array(log2)
x$lfq$config$table$is_response_transformed
x <- x$intensity_matrix(robust_scale)
plotter <- x$lfq$get_Plotter()
plotter$intensity_distribution_density()

However, I get the response:

Warning message:
In x$intensity_matrix(robust_scale) :
  data already transformed. If you still want to log2 tranform, set force = TRUE

Checking whether the two steps have actually run: the robust_scale request has made no changes to the data; there is no change to the log2-transformed column and no additional robust_scale column in the tibble.

The warning above suggests that I should run with force = TRUE, but states that this is to allow another log2 transformation, which is not required since the intensities have already been log-transformed.

Running with force = TRUE does add another column with robust_scale appended to the column name, but has this been log-transformed again?

Any help greatly appreciated.

add in-silico talk

<iframe width="516" height="290" src="https://www.youtube.com/embed/acDiXq2xbOw" title="prolfqua: A comprehensive R package for protein differential expression analysis" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

volcano plot with fixed ylim (scales) cannot be "freed up" with scales="free"

Hi Witold,

As you may have seen in the comments, one user (SH) realized that the volcano plots have a fixed ylim by default.
Relaxing this with the scales parameter does not work; in all of the following calls, the y-axis keeps the same fixed range:

pl$volcano(scales = c("fixed"))
pl$volcano_plotly(scales = "free")
pl$volcano(scales = "free")
pl$volcano(scales = "free_y")
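A sketch of where the fix likely lives, assuming the plots are built with ggplot2 facets (an assumption; column and facet names here are illustrative): the user-supplied scales value has to reach facet_wrap(), because if it is hard-coded to "fixed" inside the plotting method, "free" is silently ignored.

p <- ggplot2::ggplot(data, ggplot2::aes(x = diff, y = -log10(p.value))) +
  ggplot2::geom_point() +
  ggplot2::facet_wrap(~contrast, scales = scales)  # pass through, not scales = "fixed"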

Regards & see you on Tuesday
j

add ropeca type protein level p-value calculation

For protein-level inference of differential expression, the median of the peptide-level p-values is used as a score for each protein, taking the direction of change into account. The protein-level significance of the detection is then calculated using the beta distribution. Under the null hypothesis, the p-values of the peptides follow the uniform distribution U(0,1), and the order statistics of a U(0,1) sample follow a beta distribution.
More specifically, the i-th order statistic of a sample of size n has a beta distribution B(gamma, delta) with parameters gamma = i and delta = n - i + 1. The significance of the median p-value for a protein with n peptides is hence calculated using the cumulative distribution function of this beta distribution:

p_protein = F_Beta(P_m; gamma, delta)

where P_m is the observed median p-value of the peptides belonging to the protein.
Finally, the FDR is calculated using the Benjamini-Hochberg procedure.
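A minimal sketch of this calculation in R (the choice i = ceiling(n/2) for the order-statistic index of the median is one possible convention, not taken from the text):

ropeca_protein_pvalue <- function(peptide_p) {
  n <- length(peptide_p)
  i <- ceiling(n / 2)              # order-statistic index of the median
  p_m <- stats::median(peptide_p)  # observed median peptide p-value P_m
  stats::pbeta(p_m, shape1 = i, shape2 = n - i + 1)  # Beta CDF, gamma = i, delta = n - i + 1
}

# protein-level FDR across all proteins via Benjamini-Hochberg:
# p.adjust(protein_pvalues, method = "BH")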

Show missingness per condition using a Venn diagram

Possible code (note that the result of missingness_per_condition() has to be assigned to tmp, and dplyr/tidyr/UpSetR are needed):

library(dplyr)   # %>%
library(tidyr)   # pivot_wider
library(UpSetR)  # upset

tmp <- sum$missingness_per_condition()
tr <- 2  # minimum number of measured values per condition
nrMiss <- tmp$data %>% pivot_wider(id_cols = "protein_Id", names_from = "bait", values_from = nrMeasured)
# binarize: 1 if measured at least tr times in the condition, else 0
nrMiss[, 2:4][nrMiss[, 2:4] < tr] <- 0
nrMiss[, 2:4][nrMiss[, 2:4] >= tr] <- 1

upset(as.data.frame(nrMiss), order.by = "freq")

Glm support

Add glm binomial model for missingness analysis.
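A minimal sketch of what this could look like, assuming a long-format table protein_long with a logical isMissing column and a condition factor (hypothetical names):

# model presence/absence of a protein measurement as a binomial GLM
fit <- glm(isMissing ~ condition, family = binomial(), data = protein_long)
summary(fit)  # condition coefficients indicate missingness enrichment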

Reduce package size

  • remove test data and replace with simulated data
  • move SAINT binaries to a separate package, which can depend on prolfqua.

hkeysLevel

Review usage, especially whether the parameters rev and names are needed.

add moderated p.values

Based on the discussion below, I understand that you have fitted the same linear model to a few hundred separate datasets (with each subject being a separate dataset). To apply limma empirical Bayes moderation you need to have three vectors storing information from the separate fits:

A vector sigma containing the residual standard errors from the separate fits. This vector is not obtainable from your data.frame. The residual standard error is the square root of the residual mean square and is shown in the output from summary(fit) for each linear model. If you save the summary to out <- summary(fit) then the residual standard error is stored as out$sigma.

A vector df.residual containing the residual degrees of freedom for each regression, which will not all be the same if some of the fits have missing values. This will be the same as df in your data.frame.

A vector tstat containing the unmoderated t-statistics from the separate linear models. This will be the same as t-value in your data.frame.

Then you can compute moderated t-statistics and p-values by

library(limma)

# shrink the per-fit residual variances towards a common prior
sv <- squeezeVar(sigma^2, df = df.residual)
# rescale the ordinary t-statistics by the moderated standard errors
moderated.t <- tstat * sigma / sqrt(sv$var.post)
df.total <- df.residual + sv$df.prior
p.value <- 2 * pt(abs(moderated.t), df = df.total, lower.tail = FALSE)
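For completeness, a sketch of collecting the three vectors from a list of per-dataset lm fits (the fits list and the coefficient name "groupB" are hypothetical):

summaries <- lapply(fits, summary)
sigma <- vapply(summaries, function(s) s$sigma, numeric(1))  # residual standard errors
df.residual <- vapply(fits, stats::df.residual, numeric(1))  # residual degrees of freedom
tstat <- vapply(summaries, function(s) s$coefficients["groupB", "t value"], numeric(1))  # unmoderated t-statistics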

Error with setup_analysis()

Hi,

I would like to try the vignette. The first steps go well until this one:
adata <- setup_analysis(startdata, config)

I get the following error:
"creating sampleName
Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after], :
missing value where TRUE/FALSE needed"

It seems that my data contains missing values but I don't know how to handle them.
I tried:

library(dplyr)   # %>%
library(tidyr)   # drop_na

data <- read.table("path_to/proteinGroups.txt", sep = "\t", na.strings = c(""), header = TRUE)
data <- data %>% drop_na(c("Protein.IDs", "Gene.names", "Intensity"))
startdata <- prolfqua::tidyMQ_ProteinGroups(data)
...steps up to:
adata <- setup_analysis(startdata, config)

but I still get the same error.

Here is the link to my protein groups file: https://www.swisstransfer.com/d/34536145-ba5e-4994-8364-fa2a5aca223a

IonQuant error: Index out of bounds for length 10001

Dear FragPipe Team

I am running FragPipe via the X-window GUI on a Linux machine. I get an IonQuant error for "index out of bounds".
Do you have any idea what went wrong?

There are 15 raw files involved and a huge metaproteomics DB of 8.2 GB (with which I did manage to run the full FragPipe-MBR workflow successfully for 3 test files).

Best regards
jonas

///
2021-02-18 09:27:27 [INFO] - matched run percentage: 0.1495889380762246
2021-02-18 09:27:27 [INFO] - Fitting a mixture model...
2021-02-18 09:27:27 [INFO] - Estimating match-between-runs FDR...
2021-02-18 09:27:30 [ERROR] - Index 11095 out of bounds for length 10001
Process 'IonQuant' finished, exit code: 1
Process returned non-zero exit code, stopping
///

DiaNN read report function

Hi, thanks for developing an excellent framework for LFQ. I just wanted to highlight something that I've found.

The function "diann_read_output" does not result in the required data frame per the parameters supplied.

There are two separate issues:

  1. Supplying the nrPeptides parameter does not filter the data frame during import, i.e. setting nrPeptides = 2 has no effect.

In this section of the R code:

filter_PG <- function(PG, nrPeptides = 2, Q.Value = 0.01){
    # the nrPeptides column masks the nrPeptides argument here,
    # so this filter compares the column against itself
    PG <- PG |> dplyr::filter(nrPeptides >= nrPeptides)
    PG <- PG |> dplyr::filter(.data$Lib.PG.Q.Value < Q.Value)
    PG <- PG |> dplyr::filter(.data$PG.Q.Value < Q.Value)
  }

The variable nrPeptides set in the function call is masked by the nrPeptides column of the data frame, so the filter appears to be checking the column against itself. Would it be possible to change the parameter name to nrPeptides_min or something other than nrPeptides?

  2. The filtering by Q.Value is performed after the nrPeptides column is calculated, so the resulting data frame will contain proteins quantified by a single peptide.

The above section of code ideally needs to be separated, so that the peptides are filtered by Q.Value before the nrPeptides column is calculated; the data frame can then be filtered by the minimum peptide threshold afterwards. A possible rewrite is sketched below.
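A sketch of a possible fix (nrPeptides_min is the renamed parameter suggested above; recomputing nrPeptides from the Q-value-filtered peptide table would still have to happen before this function is called):

filter_PG <- function(PG, nrPeptides_min = 2, Q.Value = 0.01){
    # filter by Q-values first ...
    PG <- PG |> dplyr::filter(.data$Lib.PG.Q.Value < Q.Value)
    PG <- PG |> dplyr::filter(.data$PG.Q.Value < Q.Value)
    # ... then by peptide count; the renamed argument no longer
    # collides with the nrPeptides column
    PG |> dplyr::filter(.data$nrPeptides >= nrPeptides_min)
}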

I hope this makes sense.

Best,

Craig

add heatmap with only sample-wise clustering

Something between the current heatmap and the raster plot in LFQData.
When clustering row-wise, we must remove proteins with few observations (otherwise, distance computation fails).
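A minimal sketch, assuming a wide intensity matrix mat and the pheatmap package (which may not be what LFQData uses internally):

# cluster samples (columns) only; skipping row-wise clustering avoids
# the failing distance computation for proteins with few observations
pheatmap::pheatmap(mat, cluster_rows = FALSE, cluster_cols = TRUE)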

review ContrastsMissing get_contrasts

  • get_contrasts
      • get_imputed_contrasts
          • missigness_impute_factors_interactions(pepIntensity, config, value = "long")
              • .missigness_impute_interactions
                  • interaction_missing_stats
          • get_contrast(ungroup(imp), config$table$hierarchy_keys(), Contr)
          • aggregate_contrast(ungroup(imputed2), subject_Id = config$table$hierarchy_keys_depth())
      • summarize_stats
