vitek-lab / msstatstmt

R-based package for detecting differentially abundant proteins in shotgun mass spectrometry-based proteomic experiments with tandem mass tag (TMT) labeling

Home Page: https://Vitek-Lab.github.io/MSstatsTMT

msstatstmt's Introduction

MSstatsTMT

MSstatsTMT is an R-based package for detecting differentially abundant proteins in shotgun mass spectrometry-based proteomic experiments with tandem mass tag (TMT) labeling. It is applicable to isobaric labeling quantitative proteomics, including iTRAQ and TMT. The official webpage is http://msstats.org/msstatstmt/.

msstatstmt's People

Contributors

deril2605, devonjkohler, huang704, jwokaty, kayla-morrell, meenachoi, mstaniak, nturaga, sichengh, vobencha

msstatstmt's Issues

unclear columns in annotation.pd

I am trying to load some data into MSstatsTMT and the meaning of the columns in annotation.pd is very unclear to me (I use both TMT and bioinformatics software regularly).

"Mixture : Mixture of samples labeled with different TMT reagents, which can be analyzed in a
single mass spectrometry experiment. If the channel doesn’t have sample, please add ‘Empty’
under Condition."

TMT is always a mixture of samples, what is this column supposed to convey?

"TechRepMixture : Technical replicate of one mixture. One mixture may have multiple technical replicates. For example, if ‘TechRepMixture’ = 1, 2 are the two technical replicates of
one mixture, then they should match with same ‘Mixture’ value."

"Fraction : Fraction ID. One technical replicate of one mixture may be fractionated into multiple fractions to increase the analytical depth. Then one technical replicate of one mixture
should correspond to multiple fractions. For example, if ‘Fraction’ = 1, 2, 3 are three fractions of the first technical replicate of one TMT mixture of biological subjects, then they
should have same ‘TechRepMixture’ and ‘Mixture’ value"

I would suggest you describe a specific example to make it clear. E.g. I have a 16plex experiment, with 16 samples in 4 conditions:

condition 1: channels 1:4
condition 2: channels 5:8
condition 3: channels 9:12
condition 4: channels 13:16

After the samples are labelled and pooled, the pool is fractionated into 4 fractions. Each fraction is run individually and produces a separate raw file, and the raw files are co-analysed (specified as fractions in PD) to produce a single PSM.csv. What should the columns of annotation.pd say?
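One way to read the three quoted definitions for this design, as a minimal sketch rather than an authoritative answer: one pooled mixture, one technical replicate, four fractions, and one annotation row per run and channel. The raw file names, condition names and BioReplicate labels below are hypothetical placeholders.

```r
# Channel labels of a TMTpro 16plex experiment
channels <- c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N",
              "130C", "131N", "131C", "132N", "132C", "133N", "133C", "134N")
# Hypothetical raw file names, one per fraction
runs <- paste0("pool_fraction_", 1:4, ".raw")

annotation.pd <- data.frame(
  Run            = rep(runs, each = length(channels)),
  Fraction       = rep(1:4, each = length(channels)),
  TechRepMixture = 1,   # a single technical replicate of the pool
  Mixture        = 1,   # a single pooled mixture
  Channel        = rep(channels, times = length(runs)),
  # channels 1:4, 5:8, 9:12, 13:16 map to the four conditions
  Condition      = rep(rep(paste0("condition", 1:4), each = 4), times = length(runs)),
  # hypothetical sample labels, identical across fractions
  BioReplicate   = rep(paste0("sample", 1:16), times = length(runs)),
  stringsAsFactors = FALSE
)

nrow(annotation.pd)  # 4 runs x 16 channels
```

In this reading, Fraction is the only column that differs between the four raw files; Mixture and TechRepMixture stay constant because all files come from the same pool.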

Error when using TMP for protein summarization.

Hi,
I've been playing with MSstatsTMT for a while, using MaxQuant as the search tool. When I run the protein.summarization function with method='MedianPolish', it reports the following error: Error in inds_combine(.vars, ind_list) : Position must be between 0 and n.
This error only occurs with MedianPolish, not with the other two methods.
I hope to get some clues from you.

Best,
Weixian

annotation file issues for output of Proteome Discoverer

Hi,

This is my first time analysing proteomic data (but well versed in R). I have been trying to use MSstatsTMT to study a certain condition. However, I am facing issues trying to create the annotation file required for my situation. The data I am looking at is contained in the following link : https://www.ebi.ac.uk/pride/archive/projects/PXD020296

I'd be extremely glad if you could help me create an annotation file in situations involving High-pH off-line fractionation.

Thanks,
Uday

Error with groupComparisonTMT

I recently updated to R version 4.1.0 and used devtools to install the GitHub version of MSstatsTMT (1.99.0 or 2.0.0).
I encountered a problem with the .checkGroupComparisonInput() function during groupComparisonTMT. I tried to rerun old code that worked fine before I updated R to the most recent version 2 weeks ago.

> test_contrast_test <- groupComparisonTMT(data = input.pd,
+                                           contrast.matrix = 'pairwise',
+                                           moderated = TRUE,
+                                           adj.method = 'BH',
+                                           remove_norm_channel = TRUE,
+                                           remove_empty_channel = TRUE)
Error in .checkGroupComparisonInput(input) : 
  Please check the required input. ** columns : Protein, BioReplicate, Abundance, Run, Channel, Condition, TechRepMixture, Mixture , are missed.

When I checked my input.pd columns, they seem to be normal.

> colnames(input.pd)
[1] "Run"            "Protein"        "Abundance"      "Channel"        "BioReplicate"   "Condition"     
[7] "TechRepMixture" "Mixture" 

If I manually run .checkGroupComparisonInput(input) from the source code, it returns the data frame without complaint.

.checkGroupComparisonInput = function(input) {
  required_cols = c("Protein", "BioReplicate", "Abundance", "Run", 
                    "Channel", "Condition", "TechRepMixture", "Mixture")
  if (!all(required_cols %in% colnames(input))) {
    missing_cols = !(required_cols %in% colnames(input))
    missing_msg = paste(required_cols[missing_cols],
                        collapse = ", ")
    if (sum(missing_cols) == 1) {
      stop(paste("Please check the required input. ** columns :",
                 missing_msg, "is missed."))
    } else {
      stop(paste("Please check the required input. ** columns :",
                 missing_msg, ", are missed."))
    }
  }
  if (data.table::uniqueN(input$Condition) < 2){
    stop(paste("Please check the Condition column in annotation file.", 
               "There must be at least two conditions!"))
  }
  input
}

output <- .checkGroupComparisonInput(input.pd)
head(output == input.pd)
     Run Protein Abundance Channel BioReplicate Condition TechRepMixture Mixture
[1,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[2,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[3,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[4,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[5,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[6,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
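One possible explanation, offered as an assumption and not a confirmed diagnosis: if the newer groupComparisonTMT() expects the list returned by proteinSummarization() and extracts a protein-level element from it, then passing a bare data frame would leave that element NULL, and every required column would be reported as missing. The element name ProteinLevelData below is assumed for illustration. The sketch only demonstrates the base R behavior that would produce such a message.

```r
required_cols <- c("Protein", "BioReplicate", "Abundance", "Run",
                   "Channel", "Condition", "TechRepMixture", "Mixture")

# A data frame that has every required column (values are placeholders)
input.pd <- as.data.frame(setNames(rep(list(1), length(required_cols)),
                                   required_cols))

# `$` on a data frame returns NULL for a column that does not exist
extracted <- input.pd$ProteinLevelData   # hypothetical element name
is.null(extracted)

# colnames(NULL) is NULL, so no required column can be found in it
missing_cols <- required_cols[!(required_cols %in% colnames(extracted))]
missing_cols
```

If that is what happens, passing the complete output of proteinSummarization() instead of a hand-built data frame would be the thing to try.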

Here is part of my sessionInfo():

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] devtools_2.4.1       usethis_2.0.1        ComplexHeatmap_2.8.0 mclust_5.4.7         expss_0.10.7        
 [6] robustbase_0.93-8    psych_2.1.3          ggrepel_0.9.1        data.table_1.14.0    MSstatsTMT_1.99.0   
[11] readxl_1.3.1         forcats_0.5.1        stringr_1.4.0        dplyr_1.0.6          purrr_0.3.4         
[16] readr_1.4.0          tidyr_1.1.3          tibble_3.1.2         ggplot2_3.3.3        tidyverse_1.3.1     
[21] edgeR_3.34.0         limma_3.48.0  

Thank you very much
Vasin

MSstatsTMT does not work after dplyr update to 1.0.2.

I updated dplyr to 1.0.2 and also updated MSstatsTMT. The error it returns is:

** Negative log2 intensities were replaced with NA.
Summarizing for Run : all.raw ( 1 of 1 )
Error in `[.data.table`(raw, , require.col) :
j (the 2nd argument inside [...]) is a single symbol but column name 'require.col' is not found. Perhaps you intended DT[, ..require.col]. This difference to data.frame is deliberate and explained in FAQ 1.1.
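The message points at a documented data.table behavior: a bare symbol in `j` is looked up as a column, not as a variable holding column names. Whether this is the root cause inside the package is not confirmed by the report. A minimal sketch with made-up data, showing the two selection forms the message and the data.table FAQ refer to:

```r
library(data.table)

# Made-up table; only the selection mechanics matter here
raw <- data.table(ProteinName = c("P1", "P2"),
                  Intensity   = c(10, 20),
                  Extra       = c("a", "b"))
require.col <- c("ProteinName", "Intensity")

# Option 1: the `..` prefix tells data.table to look the name up
# in the calling environment instead of among the columns
sub1 <- raw[, ..require.col]

# Option 2: with = FALSE switches j to data.frame-style column selection
sub2 <- raw[, require.col, with = FALSE]

names(sub1)
```

Both forms return a data.table that contains only the two requested columns.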

The same code works with dplyr 0.8.3.

onattach and onload

Check whether there are possible problems with logging when using MSstatsTMT
(based on a Slack conversation).

Question about TechRepMixture

Hi @huang704

Is it possible to specify TechRepMixture across two or more Mixtures in a continuously increasing manner?

e.g.
[Image: screenshot, 2020-06-24 13:09:23]

Or would that imply a different design in comparison to a repeated/matching specification (1,2,3; 1,2,3; ...)?

which() not working with data.table

input <- input[, which(colnames(input) %in% c(which.pro, which.NumProteins, 'Annotated.Sequence', 'Charge',

If input is a data.table rather than a data.frame, which() won't index correctly! Either the class of input must initially be set to data.frame, or the indexing method should be changed.
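The difference described here can be reproduced on a toy table. This is a sketch with made-up columns, not the package's own code; it contrasts data.frame and data.table indexing and shows one way (`with = FALSE`) to keep the `which()` idiom working for both classes.

```r
library(data.table)

df <- data.frame(Charge = c(2, 3),
                 Annotated.Sequence = c("AAK", "CCR"),
                 Other = c(1, 2))
dt <- as.data.table(df)
keep <- c("Annotated.Sequence", "Charge")

# data.frame: integer positions in j select columns
df_sub <- df[, which(colnames(df) %in% keep)]

# data.table: j is evaluated as an expression, so the positions
# themselves are returned instead of the columns
dt_j <- dt[, which(colnames(dt) %in% keep)]

# with = FALSE restores column selection by position
dt_sub <- dt[, which(colnames(dt) %in% keep), with = FALSE]
```

Converting the input with as.data.frame() before indexing is the other option the report mentions.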

FragPipe

FragPipe has been giving significantly better TMT data than MaxQuant for us. What is the best way to import FragPipe data into MSstatsTMT and MSstatsTMTPTM?

MaxQtoMSstatsTMTFormat failing to use which.proteinid = "Gene.names"

Hi MSstats team,

First of all, congratulations on making such a nice package!

I'm very happy with the overall experience, but I may have found a bug. I used the previous version (1.8.2), where the MaxQtoMSstatsTMTFormat function was able to generate a table using the gene names. However, after updating to version 2.2, it only generates a table with the protein ID, even though I am setting which.proteinid = "Gene.names".

I have very limited knowledge of R, but I tried debugging the MaxQtoMSstatsTMTFormat function to see if I could find any fix. The only thing that stood out to me is that the MSstatsConvert::MSstatsImport function is removing the dots in the column names of the proteinGroups and evidence objects. I tried using "Genenames" as an input but it didn't solve the issue.

Here is my sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.4   ggrepel_0.9.1    ggpubr_0.4.0     ggplot2_3.3.5    MSstatsTMT_2.2.0

loaded via a namespace (and not attached):
 [1] MSstats_4.1.1         gtools_3.9.2          tidyselect_1.1.1      purrr_0.3.4           splines_4.1.1        
 [6] lmerTest_3.1-3        lattice_0.20-45       log4r_0.4.2           carData_3.0-4         colorspace_2.0-2     
[11] vctrs_0.3.8           generics_0.1.1        utf8_1.2.2            survival_3.2-13       marray_1.72.0        
[16] rlang_0.4.12          pillar_1.6.4          nloptr_1.2.2.3        glue_1.5.0            withr_2.4.2          
[21] DBI_1.1.1             plyr_1.8.6            lifecycle_1.0.1       stringr_1.4.0         munsell_0.5.0        
[26] ggsignif_0.6.3        gtable_0.3.0          caTools_1.18.2        fansi_0.5.0           preprocessCore_1.56.0
[31] broom_0.7.10          Rcpp_1.0.7            KernSmooth_2.23-20    scales_1.1.1          backports_1.3.0      
[36] checkmate_2.0.0       limma_3.50.0          abind_1.4-5           lme4_1.1-27.1         gplots_3.1.1         
[41] stringi_1.7.5         rstatix_0.7.0         dplyr_1.0.7           numDeriv_2016.8-1.1   grid_4.1.1           
[46] MSstatsConvert_1.4.0  tools_4.1.1           bitops_1.0-7          magrittr_2.0.1        tibble_3.1.6         
[51] car_3.0-12            tidyr_1.1.4           crayon_1.4.2          pkgconfig_2.0.3       MASS_7.3-54          
[56] ellipsis_0.3.2        Matrix_1.3-4          data.table_1.14.2     assertthat_0.2.1      minqa_1.2.4          
[61] R6_2.5.1              boot_1.3-28           nlme_3.1-153          compiler_4.1.1

Best,
Luiz

missing value handling question

Some of the values are missing in TMT experiments and are represented by NA after processing using PD. I believe that in my applications, the reason the values are missing is that they are genuinely 0 (meaning no signal is present in the channel). MSstatsTMT imputes these values, since 0s are not allowed in protein summarization and statistical tests.

Would it, however, not be better to multiply all the values by a large number (say 1e5) and then assign missing values a value of something like 10?
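To make the arithmetic of this proposal concrete, here is a small sketch with made-up intensities. It shows only two properties of the log2 scale: a common scale factor cancels out of any fold change between observed values, while the fold change against a filled-in value depends entirely on the chosen constants. It does not describe what MSstatsTMT does internally.

```r
# Hypothetical reporter intensities for two channels
intensity    <- c(a = 200, b = 800)
scale_factor <- 1e5
floor_value  <- 10   # value assigned to a missing channel after scaling

# Fold change between observed channels, before and after scaling
log2fc_raw    <- log2(intensity[["b"]]) - log2(intensity[["a"]])
log2fc_scaled <- log2(intensity[["b"]] * scale_factor) -
                 log2(intensity[["a"]] * scale_factor)

# Fold change of an observed channel against the filled-in floor:
# equals log2(intensity) + log2(scale_factor) - log2(floor_value)
log2fc_vs_floor <- log2(intensity[["b"]] * scale_factor) - log2(floor_value)

c(raw = log2fc_raw, scaled = log2fc_scaled, vs_floor = log2fc_vs_floor)
```

The first two numbers agree, so the scaling by itself changes nothing; the third is set by the ratio of the scale factor to the floor.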

Suppression of ratios during proteinSummarization

I am performing an IP-MS/MS with TMTPro. There are 4 cell lines

  1. channels 1:4 (126:128N). No proteins are tagged.
  2. channels 5:8 (128C:130N). Q9H9L4 is flag tagged.
  3. channels 9:12 (130C:132N). Q9P2N6 is flag tagged.
  4. channels 13:16 (132C:134N). Q9H7Z6 is flag tagged.

IP is performed on all cell lines with anti-FLAG beads.

I ran the samples in MS3 SPS mode and the quantitation performed as expected: Q9H9L4 is almost exclusively present in 128C:130N, Q9P2N6 in 130C:132N and Q9H7Z6 in 132C:134N channels. This is seen on level of peptides after PDtoMSstatsTMTFormat conversion:

[Figure: baits]

The black lines represent individual PSMs, the red line is the average of all PSMs as computed by ggplot2.

However, when the quantitation is rolled up to protein level, the ratios become extremely suppressed and now the baits are "present" in all channels.

[Figure: baits_and_protein_level]

The black lines represent individual PSMs, the red line is the average of all PSMs as computed by ggplot2, and the blue line is quantitation on protein level by MSstatsTMT.

Thus the ratios computed by proteinSummarization() are extremely suppressed compared to ratios I would get by simply finding the average of all peptides.

Below is the data:

example_data.zip

and code

library(MSstatsTMT)
library(dplyr)
library(ggplot2)

# Bait proteins of interest
NSLComplex <- c("Q9H9L4", "Q9P2N6", "Q9H7Z6")
PSMTable <- read.csv("example_data.csv", stringsAsFactors = FALSE)

# TMTpro 16plex channel labels, in channel order
channels <- c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N",
              "130C", "131N", "131C", "132N", "132C", "133N", "133C", "134N")

# One run, one mixture, four conditions with four channels each
annotation.pd <- data.frame(Run = rep("NSL_tmt_unfr_6hours.raw", 16),
                            Channel = channels,
                            Condition = c(rep("control", 4), rep("K2", 4), rep("K3", 4), rep("K8", 4)),
                            Fraction = rep(1, 16),
                            BioReplicate = "empty", Mixture = 1, TechRepMixture = 1,
                            stringsAsFactors = FALSE)

# Convert the PD output, then summarize to protein level
PSMTMTStats <- PDtoMSstatsTMTFormat(PSMTable, annotation.pd, rmPSM_withMissing_withinRun = TRUE)
protTMTStats <- proteinSummarization(PSMTMTStats, method = "msstats", global_norm = FALSE)

# Use the same column names at PSM level and protein level
PSMTMTStats <- rename(PSMTMTStats, "Protein" = "ProteinName", "Abundance" = "Intensity")

# Scale each PSM profile and each protein profile so that it sums to 1 across channels
NSLComplexPSM <- PSMTMTStats %>% filter(Protein %in% NSLComplex) %>% group_by(Protein, PSM) %>% mutate(Abundance = Abundance / sum(Abundance))
NSLComplexProt <- protTMTStats %>% filter(Protein %in% NSLComplex) %>% group_by(Protein) %>% mutate(Abundance = Abundance / sum(Abundance))

# Keep the channels in label order on the x axis
NSLComplexPSM$Channel <- factor(NSLComplexPSM$Channel, levels = channels)
NSLComplexProt$Channel <- factor(NSLComplexProt$Channel, levels = channels)

# Black: individual PSMs; red: mean over PSMs; blue: protein-level summary
ggplot(NSLComplexPSM, aes(Channel, Abundance, group = PSM)) + geom_line() +
  facet_wrap(~Protein) +
  stat_summary(aes(y = Abundance), fun.y = mean, colour = "red", geom = "line", group = 1, size = 1) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_line(data = NSLComplexProt, aes(Channel, Abundance, group = 1), color = "blue")
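One thing that may contribute to the apparent compression, assuming (as the package documentation describes) that proteinSummarization() reports Abundance on the log2 scale while the PSM-level Intensity is on the linear scale: dividing log2 values by their sum gives a much flatter profile than dividing linear intensities by their sum. A sketch with made-up numbers, not the attached data:

```r
# Hypothetical linear intensities: low in 12 channels, high in 4
linear     <- c(rep(100, 12), rep(10000, 4))
log2_abund <- log2(linear)

# The normalization from the code above, applied on each scale
share_linear <- linear / sum(linear)
share_log2   <- log2_abund / sum(log2_abund)

# Ratio between a "high" and a "low" channel after normalization
ratio_linear <- share_linear[13] / share_linear[1]   # 10000 / 100
ratio_log2   <- share_log2[13] / share_log2[1]       # log2(10000) / log2(100)

# Back-transforming before normalizing recovers the linear-scale profile
share_back <- 2^log2_abund / sum(2^log2_abund)

c(linear = ratio_linear, log2 = ratio_log2)
```

If the assumption holds, applying 2^Abundance to the protein-level values before the sum normalization would put the blue line on the same scale as the PSM lines.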

Some samples missing from output of protein summarization

Hello,

I've started to play around with MSstats TMT and just have a query about the output of proteinSummarization. I have 20 replicates so would expect 20 rows per protein (1 row for each replicate). This is the case for the majority of my proteins. However, in some cases I have less than 20 rows (abundance values per protein). For example, for one protein I have 17 rows of protein abundance corresponding to 17 samples. Some of the samples' abundance is listed as NA, whereas three of my samples just aren't listed at all for this particular protein.

What is the difference between a sample not being listed and a sample having an NA abundance? I had assumed that if it was not listed, then that protein was simply missing in that sample. But then what is the meaning of NA?
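The sketch below does not answer what the package means by each case; it only shows one way to tabulate, per protein, which samples are absent from the output and which are present with NA, so the two situations can be inspected separately. The sample names and values are made up.

```r
# Hypothetical: 5 expected samples, output rows for only 3 of them
expected <- data.frame(BioReplicate = paste0("S", 1:5),
                       stringsAsFactors = FALSE)
observed <- data.frame(BioReplicate = c("S1", "S2", "S3"),
                       Abundance    = c(10.2, NA, 9.8),
                       stringsAsFactors = FALSE)

# Left join onto the full sample list; absent rows get NA as well,
# so the status is derived from membership, not from the NA alone
full <- merge(expected, observed, by = "BioReplicate", all.x = TRUE)
full$Status <- ifelse(!(full$BioReplicate %in% observed$BioReplicate),
                      "row absent",
                      ifelse(is.na(full$Abundance),
                             "row present, NA",
                             "quantified"))

table(full$Status)
```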

I have attached an example of two proteins from my output file, the first shows 20 abundance values for that particular protein, whilst the second protein only has 17 abundance values, some of which (those belonging to a 2nd TMT mixture) are NA.
MSstats proteinSummarization example.xlsx

Thanks for your help with this!

Error in proteinSummarization - possible annotation issue?

I'm running into a weird error when using the "msstats" method in proteinSummarization(). I believe this may have to do with my annotation file as I am not using a converter and doing this manually.

Here is an example of the error I encounter:

quant.msstats <- proteinSummarization(msstatsinput,
+                                       method="msstats", 
+                                       global_norm=TRUE,
+                                       reference_norm=TRUE,
+                                       remove_norm_channel = TRUE,
+                                       remove_empty_channel = TRUE)
INFO  [2021-12-08 16:14:23] ** MSstatsTMT - proteinSummarization function
INFO  [2021-12-08 16:15:09] Summarizing for Run : TMT_V46_P1E ( 1  of  194 )
  |=                                                  |   2%Aggregate function missing, defaulting to 'length'
<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>
Error in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE],  : 
  Elements listed in `by` must be valid column names in x and y
In addition: Warning message:
In merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE],  :
  You are trying to join data.tables where 'y' argument is 0 columns data.table.

I'm hoping someone can point me in the right direction where I'm going wrong. Thanks!

Fraction variable in PD annotation

Hello, I am facing the following issue:
I have an experiment of TMTpro 16 plex in 4 runs with 58 samples + 6 (internal control). The experiment was fractionated in gradient length (table and experiment information attached).
In order to create the AnnotationPD file should I consider fractions with the same values or not? In that case, which one is more adequate to use (same fractions values or different ones)? And what should be the impact if I use both cases to compare?

TMT16_HiRIEF_Materials and Methods.pdf

dataProcess breaks proteinSummarization in TMT analysis (MSstatsTMT 1.0.1)

output.msstats <- dataProcess(sub_data,

In line 668 of MSstats::dataProcess, a data frame ("tempmissingwork") is to be generated from objects with different dimensions. In my data, "nameID" has dimensions (12, 7) and "tempfeatureID" has dimensions (1345, 4). This causes the following command to fail:

tempmissingwork <- data.frame(tempfeatureID, LABEL = "L", GROUP_ORIGINAL = nameID$GROUP_ORIGINAL, SUBJECT_ORIGINAL = nameID$SUBJECT_ORIGINAL, RUN = nameID$RUN, GROUP = nameID$GROUP, SUBJECT = nameID$SUBJECT, SUBJECT_NESTED = nameID$SUBJECT_NESTED, INTENSITY = NA, ABUNDANCE = NA, FRACTION = nameID$FRACTION)

The nameID is:

SUBJECT_ORIGINAL GROUP_ORIGINAL GROUP SUBJECT SUBJECT_NESTED RUN FRACTION
1.1.126 mdx 3 1 3.1 1_126 1
1.2.126 mdx 3 11 3.21 1_126 1
1.3.126 mdx 3 21 3.31 1_126 1
1.4.126 mdx 3 31 3.41 1_126 1
1.5.126 mdx 3 41 3.51 1_126 1
1.6.126 mdx 3 51 3.61 1_126 1
1.7.126 mdx 3 61 3.71 1_126 1
1.8.126 mdx 3 71 3.81 1_126 1
1.9.126 mdx 3 81 3.91 1_126 1
1.10.126 mdx 3 91 3.101 1_126 1
1.11.126 mdx 3 101 3.111 1_126 1
1.12.126 mdx 3 111 3.121 1_126 1

and the tempfeatureID:

PROTEIN PEPTIDE TRANSITION FEATURE
Q6PGF7 [R].eGSGTGEEGk.[Q]_2 NA_NA [R].eGSGTGEEGk.[Q]_2_NA_NA
Q9CZ13 [R].NALVSHLDGTTPVcEDIGR.[S]_3 NA_NA [R].NALVSHLDGTTPVcEDIGR.[S]_3_NA_NA
Q7TSK3 [R].dVNDHAPR.[F]_3 NA_NA [R].dVNDHAPR.[F]_3_NA_NA
Q8VE33 [R].LNLGEEVPVIIHR.[D]_3 NA_NA [R].LNLGEEVPVIIHR.[D]_3_NA_NA
... ... ... ...

The complete dataset consists of thousands of proteins, however I tested the MSstatsTMT workflow with a subset of 20 and 50 proteins, but the code still breaks at this point.

I assume the code should be something like this:
tempmissingwork <- merge(tempfeatureID, data.frame( LABEL = "L", GROUP_ORIGINAL = nameID$GROUP_ORIGINAL, SUBJECT_ORIGINAL = nameID$SUBJECT_ORIGINAL, RUN = nameID$RUN, GROUP = nameID$GROUP, SUBJECT = nameID$SUBJECT, SUBJECT_NESTED = nameID$SUBJECT_NESTED, INTENSITY = NA, ABUNDANCE = NA, FRACTION = nameID$FRACTION))
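The difference between the two calls can be seen on a cut-down example. This sketch uses three features and two samples with shortened, made-up feature labels; it relies only on documented base R behavior: data.frame() refuses inputs whose row counts cannot be recycled, while merge() on tables with no shared column names returns the Cartesian product.

```r
tempfeatureID <- data.frame(PROTEIN = c("Q6PGF7", "Q9CZ13", "Q7TSK3"),
                            FEATURE = c("f1", "f2", "f3"),  # shortened labels
                            stringsAsFactors = FALSE)
nameID <- data.frame(SUBJECT_ORIGINAL = c("1.1.126", "1.2.126"),
                     RUN = "1_126",
                     stringsAsFactors = FALSE)

# 3 rows against a length-2 vector: data.frame() raises an error
res <- try(data.frame(tempfeatureID,
                      SUBJECT_ORIGINAL = nameID$SUBJECT_ORIGINAL),
           silent = TRUE)
inherits(res, "try-error")

# No common column names, so merge() pairs every feature with every sample
tempmissingwork <- merge(tempfeatureID, nameID)
nrow(tempmissingwork)  # 3 features x 2 samples
```

Whether a full feature-by-sample grid is what dataProcess intends at that line is an assumption taken from the proposal above.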

input.mq does not extract Condition information from annotation file

Hi there,

I am using MSstatsTMT to process proteinGroups and evidence output files from MaxQuant. MaxQtoMSstatsTMTFormat() finished successfully, but with all NAs in the Condition, Mixture and BioReplicate columns. This led to a failure in the subsequent comparison test. Could you suggest how to resolve this problem? Please find my input files attached. Thanks.

MaxQuant_Protein_Groups_for_b1_sample.txt
b1MSstatsTMT_annotation_file_sample.txt
MaxQuant_Evidence_for_b1_sample.txt

contrast matrix definition

The online tutorial describes the contrast matrix as follows:

contrast.matrix : Comparison between conditions of interests. 1) default is pairwise, which compare all possible pairs between two conditions. 2) Otherwise, users can specify the comparisons of interest. Based on the levels of conditions, specify 1 or -1 to the conditions of interests and 0 otherwise. The levels of conditions are sorted alphabetically.

I am not familiar with what a contrast matrix is. Is limma used for the statistical tests, and is the contrast matrix functionality part of limma?
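Independently of which test is run on it, a contrast matrix in the sense of the quoted text is a table of coefficients: one row per comparison, one column per condition in alphabetical order, with 1 and -1 marking the two conditions being compared and 0 elsewhere. A minimal sketch with hypothetical condition names:

```r
# Hypothetical condition labels, sorted alphabetically as the tutorial states
conditions <- sort(c("treated_low", "control", "treated_high"))
conditions  # "control" "treated_high" "treated_low"

# Each row is one comparison: +1 for the numerator condition,
# -1 for the denominator condition, 0 for conditions not involved
contrast.matrix <- rbind(
  "treated_high-control" = c(-1, 1, 0),
  "treated_low-control"  = c(-1, 0, 1)
)
colnames(contrast.matrix) <- conditions

contrast.matrix
```

A matrix built this way would be passed as the contrast.matrix argument in place of the default 'pairwise'.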

MaxQtoMSstatsTMTFormat fails

Hi, when running:
MaxQtoMSstatsTMTFormat(evidence = raw.mq, proteinGroups = proteinGroups.mq, annotation = annotation.mq,allow.cartesian=TRUE)

I get the error below. I reran with "allow.cartesian=TRUE", but the error persists.
Any advice on this? Attached is my annotation file.
annotation_towers.txt

Help appreciated.
GPR

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 1854844 rows; more than 1650276 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
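The message asks to check for duplicate key values. One way to do that on an annotation table is sketched below with made-up rows; which columns the converter actually joins on is an assumption here (Run and Channel are used for illustration).

```r
# Made-up annotation with one accidental duplicate (run1 / channel.2)
annotation <- data.frame(
  Run       = c("run1", "run1", "run1", "run2"),
  Channel   = c("channel.1", "channel.2", "channel.2", "channel.1"),
  Condition = c("A", "B", "B", "A"),
  stringsAsFactors = FALSE
)

key_cols <- c("Run", "Channel")   # assumed join keys

# Flag every row whose key combination occurs more than once
is_dup <- duplicated(annotation[, key_cols]) |
          duplicated(annotation[, key_cols], fromLast = TRUE)
dup_rows <- annotation[is_dup, ]

dup_rows
```

An empty result would suggest that the duplication comes from the other table in the join, not from the annotation file.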

annotation.pd file not accepted

Hi everyone.
I am trying to run MSstatsTMT for a large-scale proteomics study but am running into the error below with R 4.3.1 and the newest version of RStudio.

Error in .mergeAnnotation(input, annotation) :
** Please check the annotation file. The channel name must be matched with that in input data.

I checked the annotation file and am using the same format as in the example file. I tried multiple ways of annotating the channels, but receive the same error.

The file size is too big to upload as attachment. Is there another way I can share with you?

I was using a previous MSstatsTMT version, but decided to upgrade. It is likely an easy fix, but I cannot seem to find the problem.

Thanks for your help.

Using MSstatsTMT with a more complicated experimental design

Hi MSstats group,

I have struggled to get my data into the MSstatsTMT pipeline. I'm working with data from Proteome Discoverer and using the MSstatsTMT::PDtoMSstatsTMTFormat function to coerce my data to MSstats format.

I have traced the problem to two sources:

  1. My PSM report from Proteome Discoverer was saved as an Excel document and thus its column names differ from what is expected by MSstatsTMT.

The column names expected by MSstats have had all spaces and special characters replaced with ., e.g.
"Spectrum File" == "Spectrum.File". This is only a nuisance however, as I can replace these characters with . myself

NOTE: the column 'Charge' is also required when importing the data from PD into MSstatsTMT, but is not mentioned in the documentation. If there is no column 'Charge' MSstats stops with an error.

  2. MSstats expects that the Spectrum.File column from the PSM report should match Run in the user-provided annotation file.

Run : MS run ID. It should be the same as Spectrum.File info in raw.pd.
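For the column-name point in item 1, base R's make.names() performs the kind of substitution described there, replacing spaces and other invalid characters with dots. The sketch below uses a few header names as they might appear in an Excel export (hypothetical, chosen to match the examples in the text) and also checks for the Charge column mentioned in the note.

```r
# Hypothetical headers as exported to Excel
excel_names <- c("Spectrum File", "Annotated Sequence",
                 "Master Protein Accessions", "Charge")

# Replace spaces and other invalid characters with "."
cleaned <- make.names(excel_names)
cleaned

# Check that the columns named in this report are present after cleaning
needed <- c("Spectrum.File", "Charge")
all(needed %in% cleaned)
```

On a real table the same call would be applied as colnames(x) <- make.names(colnames(x)); headers starting with symbols or digits may come out differently from PD's own export names and would need checking by hand.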

My problem is this:

# There are 36 unique MS runs:
length(unique(input.pd$Spectrum.File)) ==  36

# There are 48 unique MS samples:
length(unique(annotation.pd$Run)) == 48

We performed 3 TMT experiments, in each analyzing 16-TMT labeled samples concatenated into a single Mixture.
This mixture was fractionated into 12 samples (Fractions) to increase analytical depth, so there were 12 LC-MS injections for each TMT experiment. That gives 3 x 12 = 36 injections and Spectrum Files. However, there are 16 x 3 = 48 TMT samples.

Because each Spectrum.File annotation corresponds to measurements made from all 16 labeled samples in an experiment, a Spectrum.File matches multiple sample Runs and cannot be mapped to a single Condition as is required by the annotation.pd file passed to PDtoMSstatsTMTFormat.
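Under the column definitions quoted elsewhere on this page, one reading (not confirmed in this thread) is that the annotation has one row per run and channel, so each spectrum file appears in 16 rows and the sample is identified by the Mixture and Channel pair. The sketch below only checks the row arithmetic for this design; file names are hypothetical, and Condition and BioReplicate would still have to be filled in per Mixture and Channel.

```r
channels <- c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N",
              "130C", "131N", "131C", "132N", "132C", "133N", "133C", "134N")

# Every combination of channel, fraction and mixture
design <- expand.grid(Channel  = channels,
                      Fraction = 1:12,
                      Mixture  = paste0("Mixture", 1:3),
                      stringsAsFactors = FALSE)

# Hypothetical file naming: one raw file per mixture and fraction
design$Run <- paste0(design$Mixture, "_F", design$Fraction, ".raw")
design$TechRepMixture <- 1

nrow(design)                                           # annotation rows
length(unique(design$Run))                             # spectrum files
length(unique(paste(design$Mixture, design$Channel)))  # TMT samples
```

With this layout the 36 files and the 48 samples coexist in a 576-row table, and Run takes only 36 distinct values.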

MSstatsTMT within the shiny tool with MQ data crashes

Hi,

I recently attended the MSstats workshop and have so far been using the desktop Shiny tool to analyse MSFragger output with MSstatsTMT.
However, MSFragger does not offer a function important for my analysis, whereas MQ does, so I would also like to process MQ data.
I've been trying to upload my files, but the tool keeps crashing with the following error message:

Reached in evidence
Reached in proteins_group
Reached in maxq annot
Reached in get_data
File type is maxq
Reached in maxq
Warning: Error in data.table::fread: File 'Protein.Accessions' does not exist or is non-readable. getwd()=='C:/Software/MSstats-Shiny-main/MSstats-Shiny-main'

I've changed the Protein ID field for Proteins, Leading.razor.protein, Gene.names etc. But the error is always the same. I've tried everything. As far as I understand, Protein.Accessions is a name of the column in msstats output generated by MSFragger (and also the converter tool?) Am I missing some libraries/tools to run this?

difficulties in creating annotation.pd

Hey,

I am trying to create the annotation.pd file and need some help understanding certain things. The experimental design is as follows:

[Image: experimental design]

There were 27 human post-mortem tissues from the dorsolateral prefrontal cortex, which were digested and labeled with 3 batches of 11-plex TMT reagents. Two pooled global internal standards (GIS) were labeled with channels 126 and 131C in each batch. There were 9 samples from 3 groups (control, AsymAD and AD) randomized and labeled in the remaining channels. The labeled sample mixture was fractionated by off-line high-pH reversed-phase chromatography into 24 fractions. All fractions were analyzed on an Orbitrap Fusion Lumos Tribrid mass spectrometer.

Sample trait file is as contained in the attached file : sampletraits.xlsx

The raw files are co-analysed (specified as fractions in PD) to produce a single PSM.csv. How many rows and what should the columns of annotation.pd be?

Uday

Run column changing to duplicate of Mixture after using PDtoMSstatsTMTFormat()

After using PDtoMSstatsTMTFormat(), my output dataframe has a duplicate of the mixture name where the run information should be.

My PD dataframe (named peptides):
PSMs.Workflow.ID PSMs.Peptide.ID Checked Confidence Identifying.Node PSM.Ambiguity Annotated.Sequence
1 -22 1902915 False High Sequest HT (A2) Unambiguous [R].eQLTEGEEIAQEIDGR.[F]
2 -22 1129169 False High MS Amanda 2.0 (A4) Unambiguous [K].gSFSEQGINEFLR.[E]
3 -22 1129152 False High MS Amanda 2.0 (A4) Unambiguous [R].gFAFVTFDDHDSVDk.[I]
4 -22 1129148 False High MS Amanda 2.0 (A4) Unambiguous [R].aVSWTFSEENVIR.[E]
5 -22 3199194 False High Sequest HT (A2) Unambiguous [R].sLGYAYVNFQQPADAER.[A]
6 -22 3199177 False High Sequest HT (A2) Unambiguous [K].qLEAIDQLHLEYAkR.[A]
Modifications Number.of.Protein.Groups Number.of.Proteins Master.Protein.Accessions
1 N-Term(TMT6plex) 1 1 Q9NRY4
2 N-Term(TMT6plex) 1 5 Q15084-2
3 N-Term(TMT6plex); K15(TMT6plex) 2 4 P09651-2; P09651
4 N-Term(TMT6plex) 1 2 Q9Y320
5 N-Term(TMT6plex) 2 5 Q13310-3; P11940
6 N-Term(TMT6plex); K14(TMT6plex) 1 3 O43707
Protein.Accessions Number.of.Missed.Cleavages Charge Delta.Score Delta.Cn Rank Search.Engine.Rank
1 Q9NRY4 0 3 0.3475 0 1 1
2 Q15084-2; Q15084-5; Q15084-3; Q15084-4; Q15084 0 3 0.7133 0 1 1
3 P09651-2; Q32P51; P09651-3; P09651 0 3 0.6073 0 1 1
4 Q9Y320; Q9Y320-2 0 2 0.7547 0 1 1
5 P11940-2; Q13310-2; Q13310-3; Q13310; P11940 0 2 0.6507 0 1 1
6 O43707-3; O43707-2; O43707 1 4 0.6108 0 1 1
mz.in.Da MHplus.in.Da Theo.MHplus.in.Da Delta.M.in.ppm Delta.mz.in.Da Activation.Type MS.Order Isolation.Interference.in.Percent
1 682.6784 2046.021 2046.019 0.91 0.00062 CID MS2 13.265330
2 571.6313 1712.879 1712.881 -0.73 -0.00042 CID MS2 51.300690
3 720.0338 2158.087 2158.086 0.52 0.00038 CID MS2 9.272033
4 883.9666 1766.926 1766.927 -0.82 -0.00073 CID MS2 0.000000
5 1079.5407 2158.074 2158.077 -1.19 -0.00129 CID MS2 0.000000
6 572.0807 2285.301 2285.302 -0.38 -0.00022 CID MS2 0.000000
Average.Reporter.SN Ion.Inject.Time.in.ms RT.in.min First.Scan Spectrum.File File.ID Abundance.126
1 63.0 3.211 12.7153 8732 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw F1.6 33.0
2 49.6 1.174 11.1366 7464 20200116_UVPD_SartoriRodriguesM_Fraction_39.raw F1.38 77.6
3 33.2 0.848 11.1594 7481 20200116_UVPD_SartoriRodriguesM_Fraction_39.raw F1.38 22.4
4 5.2 1.181 11.1605 7482 20200116_UVPD_SartoriRodriguesM_Fraction_39.raw F1.38 5.7
5 7.1 1.929 9.2801 5952 20200116_UVPD_SartoriRodriguesM_Fraction_40.raw F1.39 12.9
6 82.2 0.525 9.2778 5950 20200116_UVPD_SartoriRodriguesM_Fraction_40.raw F1.39 149.9
Abundance.127N Abundance.127C Abundance.128N Abundance.128C Abundance.129N Abundance.129C Abundance.130N Abundance.130C Quan.Info
1 119.5 75.2 96.5 66.5 55.5 45.1 90.8 40.6
2 93.9 76.8 75.0 69.6 27.1 21.9 25.4 15.3
3 50.6 27.2 39.2 32.1 28.1 43.5 60.4 38.6
4 8.9 6.3 6.5 4.8 4.5 5.9 5.0 5.5
5 13.4 9.4 11.8 11.5 4.8 4.5 2.4 2.1
6 154.9 104.7 152.3 136.0 39.6 53.9 22.0 22.5
Amanda.Score CharmeRT.Combined.Score Search.Space MS.Amanda.Rank Search.Depth XCorr Percolator.q.Value Percolator.PEP
1 NA NA 0 0 0 4.00 0 2.410e-06
2 209.45 209.45 641 1 1 NA 0 2.339e-06
3 180.59 180.59 1082 1 1 NA 0 1.247e-06
4 163.13 163.13 648 1 1 NA 0 1.887e-05
5 NA NA 0 0 0 6.07 0 6.123e-07
6 NA NA 0 0 0 4.83 0 8.321e-06

The annotation dataframe (named annotation_df):
Run Fraction TechRepMixture Channel Condition BioReplicate Mixture
1 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 126 RTT_NPC 1 Mixture1
2 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 127N WT_NPC 2 Mixture1
3 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 127C RTT_NPC 2 Mixture1
4 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 128N WT_NPC 3 Mixture1
5 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 128C RTT_NPC 3 Mixture1
6 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 129N WT_NEU 1 Mixture1

I ran the following command:
TMTpeptides <- PDtoMSstatsTMTFormat(peptides, annotation_df, rmPSM_withMissing_withinRun = TRUE, useNumProteinsColumn = F)

The resulting dataframe:
ProteinName PeptideSequence Charge PSM Mixture TechRepMixture Run Channel
1 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 126
2 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 127C
3 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 127N
4 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 128C
5 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 128N
6 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 129C
BioReplicate Condition Intensity
1 1 RTT_NPC 85.2
2 2 RTT_NPC 79.8
3 2 WT_NPC 103.4
4 3 RTT_NPC 71.7
5 3 WT_NPC 71.0
6 1 RTT_NEU 76.6

All of my Run names have been changed to Mixture1_1. Is there an easy way to fix this? Thank you!
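
One possible explanation, stated here as an assumption rather than confirmed behavior: the converter may combine the fractions of a mixture into a single run and label it `<Mixture>_<TechRepMixture>`, which would give `Mixture1_1`. If so, the original file names can still be recovered from the annotation. The sketch below uses a toy stand-in for `annotation_df` (the Fraction_7 file name is hypothetical) and builds a lookup from the combined label back to the raw files.

```r
# Toy stand-in for annotation_df; the Fraction_7 file name is hypothetical.
annotation_df <- data.frame(
  Run = c("20200116_UVPD_SartoriRodriguesM_Fraction_6.raw",
          "20200116_UVPD_SartoriRodriguesM_Fraction_6.raw",
          "20200116_UVPD_SartoriRodriguesM_Fraction_7.raw"),
  Fraction = c(6, 6, 7),
  TechRepMixture = c(1, 1, 1),
  Channel = c("126", "127N", "126"),
  Mixture = "Mixture1",
  stringsAsFactors = FALSE
)

# Assumed naming scheme for the combined run: <Mixture>_<TechRepMixture>.
annotation_df$CombinedRun <- paste(annotation_df$Mixture,
                                   annotation_df$TechRepMixture, sep = "_")

# One row per raw file, keyed by the combined run label.
run_lookup <- unique(annotation_df[, c("CombinedRun", "Fraction", "Run")])
rownames(run_lookup) <- NULL
print(run_lookup)
```

The lookup table can then be kept next to the converted data to trace a combined run back to its fractions.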

Support for Fragpipe

Hi Ting,

I'm wondering whether TMT quant output from FragPipe is supported yet?

Best,
Weixian

Fractionated Data Not Handled With Data Frame Input to proteinSummarization()

Hello,

I'm processing data from a fractionated sample and I think I've found a bug in proteinSummarization(). I've built a data frame with all of the necessary columns, and proteinSummarization() reads it just fine and returns a result data frame. However, when I use a fractionated sample and include the Fraction column, protein summaries are returned per fraction/run rather than across all the fractions for the plex.

I dug through the code and, as best I can tell, the attached patch fixes the issue. I was also able to work around it by calling MSstatsConvert::MSstatsBalancedDesign directly with the default handle_fractions option, the way the conversion functions shipped with MSstatsTMT do. I'm not sure whether there's a reason handle_fractions was set to FALSE for the plain data frame code path through proteinSummarization(), though. Do you have any guidance on how to handle this? Thank you!

  • David Nusinow

protein_summary_fractions_data_frame.patch
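
The symptom described above can be checked with a small diagnostic. This is a minimal base R sketch on made-up data. The column names (`Protein`, `Mixture`, `TechRepMixture`, `Run`) are assumptions about the shape of the summarization result, and `protein_summary` is a hypothetical stand-in for it.

```r
# Toy stand-in for a protein-level result; all values are made up.
protein_summary <- data.frame(
  Protein = c("P1", "P1", "P1", "P1"),
  Mixture = "Mixture1",
  TechRepMixture = 1,
  Run = c("frac1.raw", "frac1.raw", "frac2.raw", "frac2.raw"),
  Channel = c("126", "127N", "126", "127N"),
  Abundance = c(10.1, 10.4, 9.8, 10.0),
  stringsAsFactors = FALSE
)

# Count distinct Run values per protein, mixture and technical replicate.
runs_per_group <- aggregate(
  protein_summary["Run"],
  by = protein_summary[c("Protein", "Mixture", "TechRepMixture")],
  FUN = function(x) length(unique(x))
)
names(runs_per_group)[names(runs_per_group) == "Run"] <- "n_runs"

# More than one run in a group suggests the fractions were not combined.
not_combined <- runs_per_group[runs_per_group$n_runs > 1, ]
print(not_combined)
```

An empty `not_combined` would indicate one summary per plex, which is the behavior the report expects.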

Problem with annotation file for proteome discoverer

Hello,

I currently want to run the MSstatsTMT pipeline on the Galaxy server, which has worked in the past for MaxQuant data. Now I want to process the same data set with Proteome Discoverer and MSstatsTMT. With MSstatsTMT as implemented on the Galaxy server, I always get the same error message: " ** Please check the annotation file. The channel name must be matched with that in input data." Following the documentation, I named the channels in the annotation file (126, ..., 131) according to the columns in the PSM file (Abundance: 126, ..., Abundance: 131). I then tried several other combinations, but none of them has worked so far, so I would be really thankful for any advice.

I currently use the PSM output from a TMT6plex data set processed with Proteome Discoverer 2.5.0 on the Galaxy server. A shortened PSM.txt file and the annotation file are attached to this message.

Thank you for your help!
M. Maldacker

TMT10_PSMs_short.txt
Diff_Row_PD_annotation.txt
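
A quick way to narrow down a channel-name mismatch is to compare the channel labels derived from the PSM column headers with the Channel column of the annotation. The base R sketch below uses hypothetical column names and channel values; the exact header form after import (for example `Abundance..126` versus `Abundance: 126`) is an assumption and depends on how the file is read.

```r
# Hypothetical PSM column names as R might show them after import.
psm_columns <- c("Protein.Accessions", "Abundance..126", "Abundance..127",
                 "Abundance..128", "Abundance..129", "Abundance..130",
                 "Abundance..131")

# Hypothetical Channel column of the annotation file.
annotation_channels <- c("126", "127", "128", "129", "130", "131")

# Keep the abundance columns and strip the prefix plus any separators.
abundance_cols <- grep("^Abundance", psm_columns, value = TRUE)
psm_channels <- sub("^Abundance[.: ]*", "", abundance_cols)

# Channels present on one side only; both are empty when the names match.
missing_in_annotation <- setdiff(psm_channels, annotation_channels)
missing_in_psm <- setdiff(annotation_channels, psm_channels)
print(list(missing_in_annotation = missing_in_annotation,
           missing_in_psm = missing_in_psm))
```

Any label listed in either result points at the name that needs to be changed in the annotation file.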

Logging options

Logging: add an option to suppress the MSstats log from dataProcess while keeping the logs for MSstatsTMT proteinSummarization.

Exporting the protein-level data

Hello!
I want to export the protein-level data after proteinSummarization. I tried getSummarizedTMT, but it didn't work. Could you please let me know how I can export the protein-level data?
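
A minimal sketch of one way to write such a table to disk with base R. It assumes the summarization result is either a plain data frame or a list holding the protein-level table in an element named `ProteinLevelData`. That element name is an assumption about recent package versions, and `quant` is a toy stand-in with made-up values.

```r
# Toy stand-in for the object returned by proteinSummarization().
# The element name ProteinLevelData is an assumption about recent versions.
quant <- list(
  ProteinLevelData = data.frame(
    Protein = c("P1", "P1"),
    Channel = c("126", "127N"),
    Abundance = c(10.1, 10.4),
    stringsAsFactors = FALSE
  )
)

# Use the object itself if it is already a plain data frame.
protein_level <- if (is.data.frame(quant)) quant else quant$ProteinLevelData

# Write the table to a CSV file in the session's temporary directory.
out_file <- file.path(tempdir(), "protein_level_data.csv")
write.csv(protein_level, out_file, row.names = FALSE)

# Read the file back to confirm the export.
exported <- read.csv(out_file, stringsAsFactors = FALSE)
print(exported)
```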

What is the 'raw.mq' file supposed to be?

Hello,

I'd love to use this tool but I'm stuck on a very early step, building the annotation file for MaxQuant (v2.1.0.0) output.
Following the 'annotation.mq' documentation in the manual, I see references to the 'raw.mq' file - but I'm at a loss to find any file by that name.

Is it possible the file name has been changed since the documentation was written?

Pieter

old version of MSstatsTMT

Hello MSstats team,

First of all, I really appreciate this amazing package.
I have one quick question about MSstatsTMT.
As I am not familiar with R code or packages, it might be a basic question.
Where can I find an older version of the MSstatsTMT package?
I would like to match the results with previous ones that were produced before the update (I ran the first data around May 2020).

I appreciate your help.

Sincerely,
