The msstatstmt from vitek-lab

Some samples missing from output of protein summarization

Hello,

I've started to play around with MSstatsTMT and have a question about the output of proteinSummarization. I have 20 replicates, so I would expect 20 rows per protein (one row per replicate). This is the case for the majority of my proteins. However, in some cases I have fewer than 20 rows (abundance values per protein). For example, for one protein I have 17 rows of protein abundance corresponding to 17 samples. Some of the samples' abundances are listed as NA, whereas three of my samples just aren't listed at all for this particular protein.

What is the difference between a sample not being listed and a sample having an NA abundance? I had assumed that if a sample was not listed, the protein was simply missing in that sample. But then what is the meaning of NA?

I have attached an example of two proteins from my output file, the first shows 20 abundance values for that particular protein, whilst the second protein only has 17 abundance values, some of which (those belonging to a 2nd TMT mixture) are NA.
MSstats proteinSummarization example.xlsx

Thanks for your help with this!
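
To see which samples are absent for a protein and which are present with an NA abundance, the summarized table can be compared against the full set of Run and Channel combinations. This is a minimal sketch in base R, assuming the protein-level output is a data frame with `Protein`, `Run`, `Channel` and `Abundance` columns. The toy data and the helper name `classify_samples` are made up for illustration and are not part of MSstatsTMT.

```r
# Toy protein-level table: protein P2 lacks one sample row entirely
# and has one row whose Abundance is NA.
prot <- data.frame(
  Protein   = c("P1", "P1", "P1", "P1", "P2", "P2", "P2"),
  Run       = c("mix1", "mix1", "mix2", "mix2", "mix1", "mix1", "mix2"),
  Channel   = c("126", "127", "126", "127", "126", "127", "126"),
  Abundance = c(10.1, 10.4, 9.8, 9.9, 11.2, 11.0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label every Protein x Run x Channel combination as
# "quantified", "NA" (row present, abundance NA) or "absent" (no row at all).
classify_samples <- function(prot) {
  samples <- unique(prot[, c("Run", "Channel")])
  proteins <- data.frame(Protein = unique(prot$Protein),
                         stringsAsFactors = FALSE)
  full <- merge(proteins, samples, by = NULL)   # all combinations
  prot$present <- TRUE
  out <- merge(full, prot, by = c("Protein", "Run", "Channel"), all.x = TRUE)
  out$status <- ifelse(is.na(out$present), "absent",
                       ifelse(is.na(out$Abundance), "NA", "quantified"))
  out[order(out$Protein, out$Run, out$Channel),
      c("Protein", "Run", "Channel", "status")]
}

status <- classify_samples(prot)
print(table(status$Protein, status$status))
```

The table only separates the two cases; it does not say why a row is absent or NA, which depends on the filtering and summarization settings used.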

Check if nrow(input) > 0 before merging annotation

And add a helpful error message
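
A rough sketch of what such a guard could look like. The function `merge_annotation_checked` is hypothetical and is not the package's internal `.mergeAnnotation()`; the merge keys `Run` and `Channel` are also an assumption.

```r
# Hypothetical wrapper: stop early with a descriptive message when the
# input has no rows left, instead of failing later inside the merge.
merge_annotation_checked <- function(input, annotation,
                                     by = c("Run", "Channel")) {
  if (nrow(input) == 0) {
    stop("Input has 0 rows before merging with the annotation. ",
         "Check earlier filtering steps and the input file.",
         call. = FALSE)
  }
  shared <- intersect(colnames(input), colnames(annotation))
  missing_cols <- setdiff(by, shared)
  if (length(missing_cols) > 0) {
    stop("Columns needed for the merge are missing: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  merge(input, annotation, by = by)
}

empty_input <- data.frame(Run = character(0), Channel = character(0),
                          Intensity = numeric(0))
annotation <- data.frame(Run = "run1.raw", Channel = c("126", "127N"),
                         Condition = c("A", "B"))

# The empty input triggers the descriptive error.
msg <- tryCatch(merge_annotation_checked(empty_input, annotation),
                error = function(e) conditionMessage(e))
print(msg)
```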

annotation.pd file not accepted

Hi everyone.
I am trying to run MSstatsTMT for a large-scale proteomics study but am running into the error below in R 4.3.1 with the newest version of RStudio.

Error in .mergeAnnotation(input, annotation) :
** Please check the annotation file. The channel name must be matched with that in input data.

I checked the annotation file and am using the same format as in the example file. I tried multiple ways of annotating the channels, but receive the same error.

The file size is too big to upload as attachment. Is there another way I can share with you?

I was using a previous MSstatsTMT version, but decided to upgrade. It is likely an easy fix, but I cannot seem to find the problem.

Thanks for your help.
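
A quick way to narrow down this kind of error is to list the channel labels that appear on only one side. This is a sketch in base R, assuming a Proteome Discoverer style PSM table whose abundance columns are named like `Abundance.126`; the toy values, including the deliberately broken labels, are made up.

```r
# Toy PD-style PSM table and annotation; values are made up.
psm <- data.frame(Spectrum.File = "run1.raw",
                  Abundance.126 = 1, Abundance.127N = 2, Abundance.127C = 3)
annotation <- data.frame(Run = "run1.raw",
                         Channel = c("126", "127n", "127C "),
                         stringsAsFactors = FALSE)

# Channel labels implied by the input column names.
abundance_cols <- grep("^Abundance", colnames(psm), value = TRUE)
input_channels <- sub("^Abundance[^0-9]*", "", abundance_cols)

# Labels present on one side only; both should be empty when they match.
only_in_input      <- setdiff(input_channels, annotation$Channel)
only_in_annotation <- setdiff(annotation$Channel, input_channels)
print(only_in_input)

# Quoting the labels exposes trailing spaces and case differences.
print(sprintf('"%s"', only_in_annotation))
```

If both sets come back empty and the error persists, the mismatch is probably somewhere other than the channel labels themselves.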

input.mq does not extract Condition information from annotation file

Hi there,

I am using MSstatsTMT to process the proteinGroups and evidence output files from MaxQuant. MaxQtoMSstatsTMTFormat() finished successfully, but with all NAs in the Condition, Mixture and BioReplicate columns. This led to a failure in the subsequent comparison test. Could you suggest how to track down the problem? Attached please find my input files. Thanks.

MaxQuant_Protein_Groups_for_b1_sample.txt
b1MSstatsTMT_annotation_file_sample.txt
MaxQuant_Evidence_for_b1_sample.txt

Norm channel doesn't handle capitalization correctly.

If "Norm" channel is named in all capitals it does not get removed from the summarized data (even with remove_norm_channel selected). This should handle any capitalization of Norm.
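
For illustration, a case-insensitive comparison could look like the sketch below. The helper `drop_norm_channel` is hypothetical and only shows the matching logic; it is not the package's implementation.

```r
# Toy protein-level table with the reference channel labelled "NORM".
prot <- data.frame(
  Protein   = "P1",
  Channel   = c("126", "127N", "131"),
  Condition = c("Control", "Treated", "NORM"),
  Abundance = c(10.2, 11.0, 10.6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows whose Condition is "Norm" in any
# capitalization, ignoring surrounding whitespace.
drop_norm_channel <- function(data, label = "norm") {
  condition <- tolower(trimws(as.character(data$Condition)))
  data[condition != tolower(label), , drop = FALSE]
}

cleaned <- drop_norm_channel(prot)
print(cleaned$Condition)
```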

onattach and onload

check if there are possible problems with logging when using MSstatsTMT
(based on slack convo)

MaxQtoMSstatsTMTFormat failing to use which.proteinid = "Gene.names"

Hi MSstats team,

First of all, congratulations on making such a nice package!

I'm very happy with the overall experience, but I may have found a bug. I used the previous version (1.8.2), where the MaxQtoMSstatsTMTFormat function was able to generate a table using the gene names. However, after updating to version 2.2, it only generates a table with the protein ID, even though I am setting which.proteinid = "Gene.names".

I have very limited knowledge of R, but I tried debugging the MaxQtoMSstatsTMTFormat function to see if I could find any fix. The only thing that stood out to me is that the MSstatsConvert::MSstatsImport function is removing the dots in the column names of the proteinGroups and evidence objects. I tried using "Genenames" as an input but it didn't solve the issue.

Here is my sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.4   ggrepel_0.9.1    ggpubr_0.4.0     ggplot2_3.3.5    MSstatsTMT_2.2.0

loaded via a namespace (and not attached):
 [1] MSstats_4.1.1         gtools_3.9.2          tidyselect_1.1.1      purrr_0.3.4           splines_4.1.1        
 [6] lmerTest_3.1-3        lattice_0.20-45       log4r_0.4.2           carData_3.0-4         colorspace_2.0-2     
[11] vctrs_0.3.8           generics_0.1.1        utf8_1.2.2            survival_3.2-13       marray_1.72.0        
[16] rlang_0.4.12          pillar_1.6.4          nloptr_1.2.2.3        glue_1.5.0            withr_2.4.2          
[21] DBI_1.1.1             plyr_1.8.6            lifecycle_1.0.1       stringr_1.4.0         munsell_0.5.0        
[26] ggsignif_0.6.3        gtable_0.3.0          caTools_1.18.2        fansi_0.5.0           preprocessCore_1.56.0
[31] broom_0.7.10          Rcpp_1.0.7            KernSmooth_2.23-20    scales_1.1.1          backports_1.3.0      
[36] checkmate_2.0.0       limma_3.50.0          abind_1.4-5           lme4_1.1-27.1         gplots_3.1.1         
[41] stringi_1.7.5         rstatix_0.7.0         dplyr_1.0.7           numDeriv_2016.8-1.1   grid_4.1.1           
[46] MSstatsConvert_1.4.0  tools_4.1.1           bitops_1.0-7          magrittr_2.0.1        tibble_3.1.6         
[51] car_3.0-12            tidyr_1.1.4           crayon_1.4.2          pkgconfig_2.0.3       MASS_7.3-54          
[56] ellipsis_0.3.2        Matrix_1.3-4          data.table_1.14.2     assertthat_0.2.1      minqa_1.2.4          
[61] R6_2.5.1              boot_1.3-28           nlme_3.1-153          compiler_4.1.1

Best,
Luiz

What is the 'raw.mq' file supposed to be?

Hello,

I'd love to use this tool but I'm stuck on a very early step, building the annotation file for MaxQuant (v2.1.0.0) output.
Following the 'annotation.mq' documentation in the manual, I see references to the 'raw.mq' file - but I'm at a loss to find any file by that name.

Is it possible the file name has been changed since the documentation was written?

Pieter

MSstatsTMT within the shiny tool with MQ data crashes

Hi,

I recently attended the MSstats workshop and have so far been using the desktop Shiny tool to analyse MSFragger output with MSstatsTMT.
However, MSFragger does not offer a function important for my analysis, whereas MQ does, so I would also like to process MQ data.
I've been trying to upload my files, but the tool keeps crashing with the following error message:

Reached in evidence
Reached in proteins_group
Reached in maxq annot
Reached in get_data
File type is maxq
Reached in maxq
Warning: Error in data.table::fread: File 'Protein.Accessions' does not exist or is non-readable. getwd()=='C:/Software/MSstats-Shiny-main/MSstats-Shiny-main'

I've changed the Protein ID field to Proteins, Leading.razor.protein, Gene.names etc., but the error is always the same. I've tried everything. As far as I understand, Protein.Accessions is the name of a column in the MSstats output generated by MSFragger (and also by the converter tool?). Am I missing some libraries or tools needed to run this?

Run column changing to duplicate of Mixture after using PDtoMSstatsTMTFormat()

After using PDtoMSstatsTMTFormat(), my output dataframe has a duplicate of the mixture name where the run information should be.

My PD dataframe (named peptides):
PSMs.Workflow.ID PSMs.Peptide.ID Checked Confidence Identifying.Node PSM.Ambiguity Annotated.Sequence
1 -22 1902915 False High Sequest HT (A2) Unambiguous [R].eQLTEGEEIAQEIDGR.[F]
2 -22 1129169 False High MS Amanda 2.0 (A4) Unambiguous [K].gSFSEQGINEFLR.[E]
3 -22 1129152 False High MS Amanda 2.0 (A4) Unambiguous [R].gFAFVTFDDHDSVDk.[I]
4 -22 1129148 False High MS Amanda 2.0 (A4) Unambiguous [R].aVSWTFSEENVIR.[E]
5 -22 3199194 False High Sequest HT (A2) Unambiguous [R].sLGYAYVNFQQPADAER.[A]
6 -22 3199177 False High Sequest HT (A2) Unambiguous [K].qLEAIDQLHLEYAkR.[A]
Modifications Number.of.Protein.Groups Number.of.Proteins Master.Protein.Accessions
1 N-Term(TMT6plex) 1 1 Q9NRY4
2 N-Term(TMT6plex) 1 5 Q15084-2
3 N-Term(TMT6plex); K15(TMT6plex) 2 4 P09651-2; P09651
4 N-Term(TMT6plex) 1 2 Q9Y320
5 N-Term(TMT6plex) 2 5 Q13310-3; P11940
6 N-Term(TMT6plex); K14(TMT6plex) 1 3 O43707
Protein.Accessions Number.of.Missed.Cleavages Charge Delta.Score Delta.Cn Rank Search.Engine.Rank
1 Q9NRY4 0 3 0.3475 0 1 1
2 Q15084-2; Q15084-5; Q15084-3; Q15084-4; Q15084 0 3 0.7133 0 1 1
3 P09651-2; Q32P51; P09651-3; P09651 0 3 0.6073 0 1 1
4 Q9Y320; Q9Y320-2 0 2 0.7547 0 1 1
5 P11940-2; Q13310-2; Q13310-3; Q13310; P11940 0 2 0.6507 0 1 1
6 O43707-3; O43707-2; O43707 1 4 0.6108 0 1 1
mz.in.Da MHplus.in.Da Theo.MHplus.in.Da Delta.M.in.ppm Delta.mz.in.Da Activation.Type MS.Order Isolation.Interference.in.Percent
1 682.6784 2046.021 2046.019 0.91 0.00062 CID MS2 13.265330
2 571.6313 1712.879 1712.881 -0.73 -0.00042 CID MS2 51.300690
3 720.0338 2158.087 2158.086 0.52 0.00038 CID MS2 9.272033
4 883.9666 1766.926 1766.927 -0.82 -0.00073 CID MS2 0.000000
5 1079.5407 2158.074 2158.077 -1.19 -0.00129 CID MS2 0.000000
6 572.0807 2285.301 2285.302 -0.38 -0.00022 CID MS2 0.000000
Average.Reporter.SN Ion.Inject.Time.in.ms RT.in.min First.Scan Spectrum.File File.ID Abundance.126
1 63.0 3.211 12.7153 8732 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw F1.6 33.0
2 49.6 1.174 11.1366 7464 20200116_UVPD_SartoriRodriguesM_Fraction_39.raw F1.38 77.6
3 33.2 0.848 11.1594 7481 20200116_UVPD_SartoriRodriguesM_Fraction_39.raw F1.38 22.4
4 5.2 1.181 11.1605 7482 20200116_UVPD_SartoriRodriguesM_Fraction_39.raw F1.38 5.7
5 7.1 1.929 9.2801 5952 20200116_UVPD_SartoriRodriguesM_Fraction_40.raw F1.39 12.9
6 82.2 0.525 9.2778 5950 20200116_UVPD_SartoriRodriguesM_Fraction_40.raw F1.39 149.9
Abundance.127N Abundance.127C Abundance.128N Abundance.128C Abundance.129N Abundance.129C Abundance.130N Abundance.130C Quan.Info
1 119.5 75.2 96.5 66.5 55.5 45.1 90.8 40.6
2 93.9 76.8 75.0 69.6 27.1 21.9 25.4 15.3
3 50.6 27.2 39.2 32.1 28.1 43.5 60.4 38.6
4 8.9 6.3 6.5 4.8 4.5 5.9 5.0 5.5
5 13.4 9.4 11.8 11.5 4.8 4.5 2.4 2.1
6 154.9 104.7 152.3 136.0 39.6 53.9 22.0 22.5
Amanda.Score CharmeRT.Combined.Score Search.Space MS.Amanda.Rank Search.Depth XCorr Percolator.q.Value Percolator.PEP
1 NA NA 0 0 0 4.00 0 2.410e-06
2 209.45 209.45 641 1 1 NA 0 2.339e-06
3 180.59 180.59 1082 1 1 NA 0 1.247e-06
4 163.13 163.13 648 1 1 NA 0 1.887e-05
5 NA NA 0 0 0 6.07 0 6.123e-07
6 NA NA 0 0 0 4.83 0 8.321e-06

The annotation dataframe (named annotation_df):
Run Fraction TechRepMixture Channel Condition BioReplicate Mixture
1 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 126 RTT_NPC 1 Mixture1
2 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 127N WT_NPC 2 Mixture1
3 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 127C RTT_NPC 2 Mixture1
4 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 128N WT_NPC 3 Mixture1
5 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 128C RTT_NPC 3 Mixture1
6 20200116_UVPD_SartoriRodriguesM_Fraction_6.raw 6 1 129N WT_NEU 1 Mixture1

I ran the following command:
TMTpeptides <- PDtoMSstatsTMTFormat(peptides, annotation_df, rmPSM_withMissing_withinRun = TRUE, useNumProteinsColumn = F)

The resulting dataframe:
ProteinName PeptideSequence Charge PSM Mixture TechRepMixture Run Channel
1 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 126
2 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 127C
3 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 127N
4 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 128C
5 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 128N
6 Q9UK76-3; Q9UK76; Q9UK76-2 [-K].mASNIFGTPEENQASWAk.[S] 2 [-K].mASNIFGTPEENQASWAk.[S]_2 Mixture1 1 Mixture1_1 129C
BioReplicate Condition Intensity
1 1 RTT_NPC 85.2
2 2 RTT_NPC 79.8
3 2 WT_NPC 103.4
4 3 RTT_NPC 71.7
5 3 WT_NPC 71.0
6 1 RTT_NEU 76.6

All of my Run names have been changed to Mixture1_1. Is there an easy way to fix this? Thank you!
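
In the output above, `Run` looks like `Mixture` and `TechRepMixture` pasted together (`Mixture1_1`), which would be consistent with the fractions of one mixture being combined into a single run. That reading is an assumption based only on the values shown. Under that assumption, the original raw files can still be looked up from the annotation, as in this sketch with shortened, made-up file names:

```r
# Toy annotation with two fractions of the same mixture.
annotation_df <- data.frame(
  Run            = rep(c("Fraction_6.raw", "Fraction_39.raw"), each = 2),
  Fraction       = rep(c(6, 39), each = 2),
  TechRepMixture = 1,
  Channel        = rep(c("126", "127N"), times = 2),
  Mixture        = "Mixture1",
  stringsAsFactors = FALSE
)

# Rebuild the combined run label the way it appears in the converted output.
annotation_df$CombinedRun <- paste(annotation_df$Mixture,
                                   annotation_df$TechRepMixture, sep = "_")

# One row per raw file, showing which combined run it belongs to.
run_map <- unique(annotation_df[, c("CombinedRun", "Run", "Fraction")])
print(run_map)
```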

.fitModelTMT calls wrong model if multiple subject and mixtures but one technical replicate

check if .getRunsMedian is OK

Interface for outputs of major functions

as in Vitek-Lab/MSstats#62

Exporting the protein-level data

Hello!
I want to export the protein-level data after proteinSummarization. I tried getSummarizedTMT but it didn't work. Could you please let me know how I can export the protein-level data?
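
A minimal sketch of one way to write the result to a CSV file with base R. It assumes the object returned by proteinSummarization is either a data frame or a list that holds the protein-level data frame in an element called `ProteinLevelData`; that element name is an assumption about newer package versions, and `export_protein_level` is a hypothetical helper.

```r
# Hypothetical helper: write protein-level data to CSV whether the
# summarization result is a data frame or a list holding one.
export_protein_level <- function(result, file) {
  if (is.data.frame(result)) {
    prot <- result
  } else if (is.list(result) && is.data.frame(result$ProteinLevelData)) {
    prot <- result$ProteinLevelData   # element name is an assumption
  } else {
    stop("Could not find a protein-level data frame in 'result'.")
  }
  write.csv(prot, file, row.names = FALSE)
  invisible(prot)
}

# Toy stand-in for the output of proteinSummarization().
toy <- list(ProteinLevelData = data.frame(Protein = c("P1", "P2"),
                                          Run = "run1", Channel = "126",
                                          Abundance = c(10.5, 9.7)))
out_file <- file.path(tempdir(), "protein_level.csv")
export_protein_level(toy, out_file)
print(read.csv(out_file))
```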

Fractionated Data Not Handled With Data Frame Input to proteinSummarization()

Hello,

I'm processing data from a fractionated sample and I think I've found a bug in proteinSummarization(). I've built a data frame with all of the necessary columns, and proteinSummarization() reads it just fine and returns a result data frame. However, when I use a fractionated sample and include the Fraction column, protein summaries are returned per fraction/run rather than across all the fractions for the plex.

I dug through the code and, as best I can tell, the attached patch fixes the issue. I was also able to work around it by calling MSstatsConvert::MSstatsBalancedDesign directly, using the default for the handle_fractions option, the way it's done in the various conversion functions shipped with MSstatsTMT. I'm not sure if there's a reason why handle_fractions was set to FALSE for the plain data frame code path through proteinSummarization, though. Do you have any guidance on how to handle this? Thank you!

David Nusinow

protein_summary_fractions_data_frame.patch

Fragpipe

FragPipe has been giving significantly better TMT data than MaxQuant for us. What is the best way to import FragPipe data into MSstatsTMT and MSstatsTMTPTM?

Error when using TMP for protein summarization.

Hi,
I've been playing with MSstatsTMT for a while, using MaxQuant as the search tool. When I run the protein.summarization function with method='MedianPolish', it reports the following error: Error in inds_combine(.vars, ind_list) : Position must be between 0 and n.
This error occurs only with MedianPolish, not with the other two methods.
I hope you can give me some clues.

Best,
Weixian

difficulties in creating annotation.pd

Hey,

I am trying to create the annotation.pd file and need some help understanding certain things. The experimental design is as follows:

There were 27 human post-mortem tissues from the dorsolateral prefrontal cortex, which were digested and labeled with 3 batches of 11-plex TMT reagents. Two pooled global internal standards (GIS) were labeled with channels 126 and 131C in each batch. There were 9 samples from 3 groups (control, AsymAD and AD) randomized and labeled in the remaining channels. The labeled sample mixture was fractionated by off-line high-pH reversed-phase chromatography into 24 fractions. All fractions were analyzed on an Orbitrap Fusion Lumos Tribrid mass spectrometer.

Sample trait file is as contained in the attached file : sampletraits.xlsx

The raw files are co-analysed (specified as fractions in PD) to produce a single PSM.csv. How many rows and what should the columns of annotation.pd be?

Uday
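
As a rough sketch of the layout for a design like this one, assuming one annotation row per raw file and channel, and assuming the pooled GIS channels are marked with the `Norm` label as reference channels. The raw file names are placeholders, and the sample-to-channel mapping has to come from the sample trait file, so `Condition` and `BioReplicate` are left as NA for the sample channels.

```r
channels <- c("126", "127N", "127C", "128N", "128C", "129N",
              "129C", "130N", "130C", "131N", "131C")

# One row per raw file x channel. File names are placeholders and must be
# replaced with the real raw file names.
design <- expand.grid(Channel = channels, Fraction = 1:24, Mixture = 1:3,
                      stringsAsFactors = FALSE)
annotation <- data.frame(
  Run            = sprintf("batch%d_fraction%02d.raw",
                           design$Mixture, design$Fraction),
  Fraction       = design$Fraction,
  TechRepMixture = 1,
  Channel        = design$Channel,
  Condition      = NA_character_,   # fill from the sample trait file
  BioReplicate   = NA_character_,   # fill from the sample trait file
  Mixture        = paste0("Mixture", design$Mixture),
  stringsAsFactors = FALSE
)

# The two pooled GIS channels are treated as reference channels in every batch.
is_gis <- annotation$Channel %in% c("126", "131C")
annotation$Condition[is_gis]    <- "Norm"
annotation$BioReplicate[is_gis] <- "Norm"

nrow(annotation)   # 3 batches x 24 fractions x 11 channels
```

Under these assumptions the file has 792 rows and the seven columns Run, Fraction, TechRepMixture, Channel, Condition, BioReplicate and Mixture.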

which() not working with data.table

MSstatsTMT/R/PDtoMSstatsTMTFormat.R

Line 62 in f8018fa

    
           input <- input[, which(colnames(input) %in% c(which.pro, which.NumProteins, 'Annotated.Sequence', 'Charge',

If input is a data.table rather than a data.frame, which() won't index correctly! Either the class of input must initially be set to data.frame, or the indexing method should be changed.
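
One possible form of the first option, sketched in base R. The helper name `select_columns` is made up; converting with `as.data.frame()` is one choice, and for a data.table the other option would be indexing with `with = FALSE`.

```r
# Hypothetical helper: select columns by name in a way that behaves the same
# for a data.frame and for classes that inherit from it, such as data.table.
select_columns <- function(input, keep) {
  input <- as.data.frame(input)   # fall back to plain data.frame indexing
  input[, colnames(input) %in% keep, drop = FALSE]
}

psm <- data.frame(Annotated.Sequence = c("PEPTIDEK", "SAMPLER"),
                  Charge = c(2, 3),
                  XCorr = c(4.0, 3.1))
subset_psm <- select_columns(psm, c("Annotated.Sequence", "Charge"))
print(colnames(subset_psm))
```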

Suppression of ratios during proteinSummarization

I am performing an IP-MS/MS with TMTPro. There are 4 cell lines

channels 1:4 (126:128N). No proteins are tagged.
channels 5:8 (128C:130N). Q9H9L4 is flag tagged.
channels 9:12 (130C:132N). Q9P2N6 is flag tagged.
channels 13:16 (132C:134N). Q9H7Z6 is flag tagged.

IP was performed on all cell lines with anti-FLAG beads.

I ran the samples in MS3 SPS mode and the quantitation performed as expected: Q9H9L4 is almost exclusively present in the 128C:130N channels, Q9P2N6 in 130C:132N and Q9H7Z6 in 132C:134N. This is seen at the peptide level after PDtoMSstatsTMTFormat conversion:

The black lines represent individual PSMs, the red line is the average of all PSMs as computed by ggplot2.

However, when the quantitation is rolled up to protein level, the ratios become extremely suppressed and now the baits are "present" in all channels.

The black lines represent individual PSMs, the red line is the average of all PSMs as computed by ggplot2 and the blue line is quantitation on protein level by MSStatsTMT.

Thus the ratios computed by proteinSummarization() are extremely suppressed compared to ratios I would get by simply finding the average of all peptides.

Below is the data:

example_data.zip

and code

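library(MSstatsTMT)  # PDtoMSstatsTMTFormat(), proteinSummarization()
library(dplyr)       # rename(), filter(), group_by(), mutate(), %>%
library(ggplot2)     # plotting

NSLComplex <- c("Q9H9L4" , "Q9P2N6", "Q9H7Z6")
PSMTable <- read.csv ("example_data.csv" , stringsAsFactors = F)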

annotation.pd= data.frame (Run = rep ("NSL_tmt_unfr_6hours.raw", 16), 
                           Channel = c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N", "130C", "131N", "131C", "132N", "132C", "133N",  "133C",  "134N"), 
                           Condition = c(rep("control", 4),  rep("K2", 4), rep("K3", 4), rep("K8", 4)), 
                           Fraction = rep(1, 16), 
                           BioReplicate = "empty", Mixture = 1, TechRepMixture = 1, 
                           stringsAsFactors = FALSE)

PSMTMTStats <- PDtoMSstatsTMTFormat(PSMTable, annotation.pd,  rmPSM_withMissing_withinRun = TRUE)
protTMTStats <- proteinSummarization(PSMTMTStats, method="msstats", global_norm = FALSE)

PSMTMTStats <- rename (PSMTMTStats, "Protein" = "ProteinName", "Abundance" = "Intensity")
  
NSLComplexPSM <-  PSMTMTStats  %>% filter (Protein %in% NSLComplex) %>% group_by(Protein, PSM) %>% mutate (Abundance = Abundance/sum (Abundance)) 
NSLComplexProt <- protTMTStats %>% filter (Protein %in% NSLComplex) %>% group_by (Protein) %>% mutate (Abundance = Abundance/sum(Abundance))

NSLComplexPSM$Channel <- factor (NSLComplexPSM$Channel, levels = c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N", "130C", "131N", "131C", "132N", "132C", "133N",  "133C",  "134N"))
NSLComplexProt$Channel <- factor (NSLComplexProt$Channel, levels = c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N", "130C", "131N", "131C", "132N", "132C", "133N",  "133C",  "134N"))

ggplot (NSLComplexPSM, aes (Channel,  Abundance, group = PSM)) + geom_line () + 
  facet_wrap (~Protein) + 
  stat_summary(aes(y = Abundance), fun=mean, colour="red", geom="line",group=1, size = 1) + 
  theme(axis.text.x = element_text(angle = 90)) +
  geom_line(data=NSLComplexProt, aes(Channel, Abundance,group=1), color="blue")
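
One thing that may be worth checking, on the assumption (not stated in this report) that the `Abundance` returned by proteinSummarization() is on the log2 scale while the PSM `Intensity` is on the raw scale: dividing log2 values by their sum compresses differences between channels, so the protein-level curve would look suppressed even if the underlying quantities agree. The numbers below are made up to show the effect.

```r
# Made-up intensities: two background channels and two bait channels,
# 100-fold apart on the raw scale.
raw <- c(ch1 = 10, ch2 = 10, ch3 = 1000, ch4 = 1000)

# Share of signal per channel on the raw scale.
share_raw <- raw / sum(raw)

# The same calculation applied to log2 values looks strongly compressed.
log2_values <- log2(raw)
share_log <- log2_values / sum(log2_values)

# Back-transforming before normalizing restores the raw-scale picture.
share_back <- 2^log2_values / sum(2^log2_values)

print(round(rbind(share_raw, share_log, share_back), 3))
```

If that assumption holds for the data in this issue, applying `2^Abundance` to the protein-level values before the `Abundance/sum(Abundance)` step would put both curves on the same scale.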

MSstatsTMT does not work after dplyr update to 1.0.2.

I updated dplyr to 1.0.2 and also updated MSstatsTMT. The error it returns is:

** Negative log2 intensities were replaced with NA.
Summarizing for Run : all.raw ( 1 of 1 )
Error in [.data.table(raw, , require.col) :
j (the 2nd argument inside [...]) is a single symbol but column name 'require.col' is not found. Perhaps you intended DT[, ..require.col]. This difference to data.frame is deliberate and explained in FAQ 1.1.

The same code works with dplyr 0.8.3.

MaxQtoMSstatsTMTFormat fails

Hi, when running:
MaxQtoMSstatsTMTFormat(evidence = raw.mq, proteinGroups = proteinGroups.mq, annotation = annotation.mq,allow.cartesian=TRUE)

I get the error below. I reran with "allow.cartesian=TRUE", but the error persists.
Any advice on this? Attached is my annotation file.
annotation_towers.txt

Help appreciated.
GPR

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 1854844 rows; more than 1650276 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
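
The message points at duplicate key values on one side of the join. A quick check for repeated Run and Channel pairs in the annotation could look like this sketch; the assumption that the join key is Run plus Channel is mine, and the toy rows are placeholders.

```r
# Toy annotation with one Run/Channel pair listed twice.
annotation <- data.frame(
  Run       = c("run1", "run1", "run1", "run2"),
  Channel   = c("channel.0", "channel.1", "channel.1", "channel.0"),
  Condition = c("A", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Flag every row whose Run/Channel pair occurs more than once.
key <- annotation[, c("Run", "Channel")]
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
print(annotation[dup, ])
```

An empty result here would suggest that the duplicates sit in the evidence table or in a different key column.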

annotation file issues for output of Proteome Discoverer

Hi,

This is my first time analysing proteomic data (but well versed in R). I have been trying to use MSstatsTMT to study a certain condition. However, I am facing issues trying to create the annotation file required for my situation. The data I am looking at is contained in the following link : https://www.ebi.ac.uk/pride/archive/projects/PXD020296

I'd be extremely glad if you could help me create an annotation file in situations involving High-pH off-line fractionation.

Thanks,
Uday

MaxQtoMSstatsTMTFormat function doesn't take correct run identifier from annotation file

Hi MSStatsTMT team,

I noticed that in the newest version of MSstatsTMT, 0.99.9, the MaxQtoMSstatsTMTFormat function wrongly takes the run identifier from the Mixture column instead of the Run column. As a result, group comparison fails with an error about duplicate identifiers from multiple rows.

Best,
Weixian

Logging options

Logging: suppress MSstats log from dataProcess while keeping logs for MSstatsTMT proteinSummarization

dataProcess breaks proteinSummarization in TMT analysis (MSstatsTMT 1.0.1)

MSstatsTMT/R/proteinSummarization.functions.R

Line 68 in a66c093

output.msstats <- dataProcess(sub_data,

In line 668 of MSstats::dataProcess, a data frame ("tempmissingwork") is built from objects with different dimensions. In my data, "nameID" has dimensions (12, 7) and "tempfeatureID" has dimensions (1345, 4). This causes the following command to fail:

tempmissingwork <- data.frame(tempfeatureID, LABEL = "L", GROUP_ORIGINAL = nameID$GROUP_ORIGINAL, SUBJECT_ORIGINAL = nameID$SUBJECT_ORIGINAL, RUN = nameID$RUN, GROUP = nameID$GROUP, SUBJECT = nameID$SUBJECT, SUBJECT_NESTED = nameID$SUBJECT_NESTED, INTENSITY = NA, ABUNDANCE = NA, FRACTION = nameID$FRACTION)

The nameID is:

SUBJECT_ORIGINAL	GROUP_ORIGINAL	GROUP	SUBJECT	SUBJECT_NESTED	RUN	FRACTION
1.1.126	mdx	3	1	3.1	1_126	1
1.2.126	mdx	3	11	3.21	1_126	1
1.3.126	mdx	3	21	3.31	1_126	1
1.4.126	mdx	3	31	3.41	1_126	1
1.5.126	mdx	3	41	3.51	1_126	1
1.6.126	mdx	3	51	3.61	1_126	1
1.7.126	mdx	3	61	3.71	1_126	1
1.8.126	mdx	3	71	3.81	1_126	1
1.9.126	mdx	3	81	3.91	1_126	1
1.10.126	mdx	3	91	3.101	1_126	1
1.11.126	mdx	3	101	3.111	1_126	1
1.12.126	mdx	3	111	3.121	1_126	1

and the tempfeatureID:

PROTEIN	PEPTIDE	TRANSITION	FEATURE
Q6PGF7	[R].eGSGTGEEGk.[Q]_2	NA_NA	[R].eGSGTGEEGk.[Q]_2_NA_NA
Q9CZ13	[R].NALVSHLDGTTPVcEDIGR.[S]_3	NA_NA	[R].NALVSHLDGTTPVcEDIGR.[S]_3_NA_NA
Q7TSK3	[R].dVNDHAPR.[F]_3	NA_NA	[R].dVNDHAPR.[F]_3_NA_NA
Q8VE33	[R].LNLGEEVPVIIHR.[D]_3	NA_NA	[R].LNLGEEVPVIIHR.[D]_3_NA_NA
...	...	...	...

The complete dataset consists of thousands of proteins, however I tested the MSstatsTMT workflow with a subset of 20 and 50 proteins, but the code still breaks at this point.

I assume the code should be something like this:
tempmissingwork <- merge(tempfeatureID, data.frame( LABEL = "L", GROUP_ORIGINAL = nameID$GROUP_ORIGINAL, SUBJECT_ORIGINAL = nameID$SUBJECT_ORIGINAL, RUN = nameID$RUN, GROUP = nameID$GROUP, SUBJECT = nameID$SUBJECT, SUBJECT_NESTED = nameID$SUBJECT_NESTED, INTENSITY = NA, ABUNDANCE = NA, FRACTION = nameID$FRACTION))
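
The difference between the two calls can be reproduced on a small scale in base R. The object names below are shortened stand-ins for `tempfeatureID` and `nameID`, with 3 and 2 rows instead of 1345 and 12.

```r
feature_id <- data.frame(PROTEIN = c("Q6PGF7", "Q9CZ13", "Q7TSK3"),
                         FEATURE = c("f1", "f2", "f3"))
name_id <- data.frame(RUN = "1_126", SUBJECT = c(1, 11))

# Column-binding 3 rows with 2 rows fails because the lengths do not recycle.
bind_error <- tryCatch(
  data.frame(feature_id, RUN = name_id$RUN, SUBJECT = name_id$SUBJECT),
  error = function(e) conditionMessage(e)
)
print(bind_error)

# merge() with no shared column names returns every feature x sample pair.
all_pairs <- merge(feature_id, name_id)
dim(all_pairs)
```

Whether the full feature by sample grid is what dataProcess intends at this point is for the maintainers to confirm.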

missing value handling question

Some of the values are missing in TMT experiments and are represented by NA after processing using PD. I believe that in my applications, the reason the values are missing is that they are genuinely 0 (meaning no signal is present in the channel). MSstatsTMT imputes these values, since 0s are not allowed in protein summarization and statistical tests.

Would it, however, not be better to multiply all the values by a large number (say 1e5) and then assign missing values a value of something like 10?

Error in proteinSummarization - possible annotation issue?

I'm running into a weird error when using the "msstats" method in proteinSummarization(). I believe this may have to do with my annotation file as I am not using a converter and doing this manually.

Here is an example of the error I encounter:

quant.msstats <- proteinSummarization(msstatsinput,
+                                       method="msstats", 
+                                       global_norm=TRUE,
+                                       reference_norm=TRUE,
+                                       remove_norm_channel = TRUE,
+                                       remove_empty_channel = TRUE)
INFO  [2021-12-08 16:14:23] ** MSstatsTMT - proteinSummarization function
INFO  [2021-12-08 16:15:09] Summarizing for Run : TMT_V46_P1E ( 1  of  194 )
  |=                                                  |   2%Aggregate function missing, defaulting to 'length'
<simpleError in .Primitive("length")(newABUNDANCE, keep = TRUE): 2 arguments passed to 'length' which requires 1>
Error in merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE],  : 
  Elements listed in `by` must be valid column names in x and y
In addition: Warning message:
In merge.data.table(input[, colnames(input) != "newABUNDANCE", with = FALSE],  :
  You are trying to join data.tables where 'y' argument is 0 columns data.table.

I'm hoping someone can point me in the right direction where I'm going wrong. Thanks!
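
The `Aggregate function missing, defaulting to 'length'` message typically appears when a long-to-wide reshape finds more than one value for the same cell. For manually built input, that would point to repeated rows for the same feature, run and channel. This is a guess from the log, not a confirmed diagnosis. A base R check along those lines, with made-up rows and the column names assumed to follow the MSstatsTMT input format:

```r
# Toy input with the 127N channel listed twice for the same PSM and run.
msstatsinput <- data.frame(
  ProteinName     = "P1",
  PeptideSequence = "PEPTIDEK",
  Charge          = 2,
  PSM             = "PEPTIDEK_2",
  Run             = "TMT_V46_P1E",
  Channel         = c("126", "127N", "127N"),
  Intensity       = c(1000, 1200, 1250),
  stringsAsFactors = FALSE
)

# Count rows per feature x run x channel; anything above 1 is a duplicate.
key_cols <- c("ProteinName", "PSM", "Run", "Channel")
counts <- aggregate(list(n = rep(1L, nrow(msstatsinput))),
                    by = msstatsinput[, key_cols], FUN = sum)
print(counts[counts$n > 1, ])
```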

unclear columns in annotation.pd

I am trying to load some data into MSstatsTMT and the meaning of the columns in annotation.pd is very unclear to me (I use both TMT and bioinformatics software regularly).

"Mixture : Mixture of samples labeled with different TMT reagents, which can be analyzed in a
single mass spectrometry experiment. If the channal doesn’t have sample, please add ‘Empty’
under Condition."

TMT is always a mixture of samples, what is this column supposed to convey?

"TechRepMixture : Technical replicate of one mixture. One mixture may have multiple technical replicates. For example, if ‘TechRepMixture’ = 1, 2 are the two technical replicates of
one mixture, then they should match with same ‘Mixture’ value."

"Fraction : Fraction ID. One technical replicate of one mixture may be fractionated into multiple fractions to increase the analytical depth. Then one technical replicate of one mixture
should correspond to multuple fractions. For example, if ‘Fraction’ = 1, 2, 3 are three fractions of the first technical replicate of one TMT mixture of biological subjects, then they
should have same ‘TechRepMixture’ and ‘Mixture’ value"

I would suggest you describe a specific example to make it clear. E.g. I have a 16plex experiment, with 16 samples in 4 conditions:

condition 1: channels 1:4
condition 2: channels 5:8
condition 3: channels 9:12
condition 4: channels 13:16

after the samples are labelled and pooled, the pool is fractionated into 4 fractions, each of them is run individually and produces a separate raw file, the raw files are co-analysed (specified as fractions in PD) to produce a single PSM.csv. What should the columns of annotation.pd say?
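
For concreteness, here is how the example above might be written down, going by the quoted column descriptions: one row per raw file and channel, with all four fractions sharing the same `Mixture` and `TechRepMixture`. This is a sketch under that reading, not an official template; the raw file names, condition labels and sample IDs are placeholders.

```r
channels <- c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N",
              "130C", "131N", "131C", "132N", "132C", "133N", "133C", "134N")
raw_files <- sprintf("pool_fraction%d.raw", 1:4)   # placeholder file names

annotation.pd <- data.frame(
  Run            = rep(raw_files, each = length(channels)),
  Fraction       = rep(1:4, each = length(channels)),
  TechRepMixture = 1,
  Channel        = rep(channels, times = length(raw_files)),
  Condition      = rep(rep(paste0("condition", 1:4), each = 4),
                       times = length(raw_files)),
  BioReplicate   = rep(paste0("sample", 1:16), times = length(raw_files)),
  Mixture        = "Mixture1",
  stringsAsFactors = FALSE
)

dim(annotation.pd)   # 64 rows: 4 raw files x 16 channels
```

Each channel keeps the same Condition and BioReplicate in every fraction, because the fractions come from the same pooled sample.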

Problem with annotation file for proteome discoverer

Hello,

I currently want to run the MSStatsTMT pipeline on the Galaxy Server and this has worked in the past for MaxQuant data. Now, I want to process the same data set with Proteome Discoverer and MSStatsTMT. While using MSStatsTMT implemented on the Galaxy server, I always faced the same error messages: " ** Please check the annotation file. The channel name must be matched with that in input data." Following the documentation, I named the channels in the annotation file (126, ..., 131) according to the columns in the PSM file (Abundance: 126, ..., Abundance 131). Afterwards, I tried some different combinations but none of them has worked yet, thus I would be really thankful for any advice.

I currently use the PSM output from a TMT6plex data set processed with Proteome Discoverer 2.5.0 on the Galaxy server. Attached to this message are a shortened PSM.txt file and the annotation file.

Thank you for your help!
M. Maldacker

TMT10_PSMs_short.txt
Diff_Row_PD_annotation.txt

Fraction variable in PD annotation

Hello, I am facing the following issue:
I have an experiment of TMTpro 16 plex in 4 runs with 58 samples + 6 (internal control). The experiment was fractionated in gradient length (table and experiment information attached).
To create the annotation.pd file, should the fractions share the same values or not? Which option is more appropriate (same fraction values or different ones)? And what would the impact be if I ran both and compared them?

TMT16_HiRIEF_Materials and Methods.pdf

Error with groupComparisonTMT

I recently updated to R version 4.1.0 and used devtools to install the Github version of MSstatsTMT 1.99.0 or 2.0.0
I encountered a problem with .checkGroupComparisonInput() function, during groupComparisonTMT. I tried to rerun my old code that worked fine before I updated R to the most recent version 2 weeks ago.

> test_contrast_test <- groupComparisonTMT(data = input.pd,
+                                           contrast.matrix = 'pairwise',
+                                           moderated = TRUE,
+                                           adj.method = 'BH',
+                                           remove_norm_channel = TRUE,
+                                           remove_empty_channel = TRUE)
Error in .checkGroupComparisonInput(input) : 
  Please check the required input. ** columns : Protein, BioReplicate, Abundance, Run, Channel, Condition, TechRepMixture, Mixture , are missed.

When I checked my input.pd columns, they seem to be normal.

> colnames(input.pd)
[1] "Run"            "Protein"        "Abundance"      "Channel"        "BioReplicate"   "Condition"     
[7] "TechRepMixture" "Mixture"

If I manually run .checkGroupComparisonInput(input) from the source code, it returns the data frame okay.

.checkGroupComparisonInput = function(input) {
  required_cols = c("Protein", "BioReplicate", "Abundance", "Run", 
                    "Channel", "Condition", "TechRepMixture", "Mixture")
  if (!all(required_cols %in% colnames(input))) {
    missing_cols = !(required_cols %in% colnames(input))
    missing_msg = paste(required_cols[missing_cols],
                        collapse = ", ")
    if (sum(missing_cols) == 1) {
      stop(paste("Please check the required input. ** columns :",
                 missing_msg, "is missed."))
    } else {
      stop(paste("Please check the required input. ** columns :",
                 missing_msg, ", are missed."))
    }
  }
  if (data.table::uniqueN(input$Condition) < 2){
    stop(paste("Please check the Condition column in annotation file.", 
               "There must be at least two conditions!"))
  }
  input
}

output <- .checkGroupComparisonInput(input.pd)
head(output == input.pd)
     Run Protein Abundance Channel BioReplicate Condition TechRepMixture Mixture
[1,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[2,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[3,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[4,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[5,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
[6,] TRUE    TRUE      TRUE    TRUE         TRUE      TRUE           TRUE    TRUE
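
One possible explanation, which is an assumption and not confirmed anywhere in this report: if the updated groupComparisonTMT expects the list returned by the updated proteinSummarization and reads an element such as `ProteinLevelData` from it, then passing a plain data frame would hand the check a NULL, and every required column would be reported as missing even though the data frame itself is fine. The sketch below reproduces only that mechanism in base R; the element name is a guess.

```r
required_cols <- c("Protein", "BioReplicate", "Abundance", "Run",
                   "Channel", "Condition", "TechRepMixture", "Mixture")
input.pd <- data.frame(Protein = "P1", BioReplicate = "b1", Abundance = 10,
                       Run = "r1", Channel = "126", Condition = "A",
                       TechRepMixture = 1, Mixture = "M1")

# Reading a list element that does not exist from a plain data frame gives
# NULL, and NULL has no column names at all.
extracted <- input.pd$ProteinLevelData
missing_cols <- required_cols[!(required_cols %in% colnames(extracted))]
print(missing_cols)    # every required column is reported as missing

# Wrapping the data frame in a list with that element name avoids the NULL.
wrapped <- list(ProteinLevelData = input.pd)
all(required_cols %in% colnames(wrapped$ProteinLevelData))
```

If this is the cause, rerunning proteinSummarization with the updated package and passing its output unchanged would be the cleaner route.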

Here are some of my session_Info()

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] devtools_2.4.1       usethis_2.0.1        ComplexHeatmap_2.8.0 mclust_5.4.7         expss_0.10.7        
 [6] robustbase_0.93-8    psych_2.1.3          ggrepel_0.9.1        data.table_1.14.0    MSstatsTMT_1.99.0   
[11] readxl_1.3.1         forcats_0.5.1        stringr_1.4.0        dplyr_1.0.6          purrr_0.3.4         
[16] readr_1.4.0          tidyr_1.1.3          tibble_3.1.2         ggplot2_3.3.3        tidyverse_1.3.1     
[21] edgeR_3.34.0         limma_3.48.0

Thank you very much
Vasin

dataProcessPlotsTMT : flexible order of condition

MSstatsTMT/R/dataProcessPlotsTMT.R

Line 150 in bff897c

tempGroupName = unique(processed[, list(Condition, xorder, Run, Channel)])

Can we change it to

    tempGroupName = unique(processed[, list(xorder, Condition, Run, Channel)])
    tempGroupName = tempGroupName[order(xorder), ]

This would make it possible to change the order of conditions in dataProcessPlotsTMT.
Thanks!
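The effect of the proposed two lines can be illustrated with a base-R analogue on a plain data frame (the package itself uses data.table). The toy `processed` table and its values are hypothetical, and `xorder` is assumed here to be a numeric plotting position.

```r
# Hypothetical processed data; xorder is assumed to be the x-axis position
processed <- data.frame(
  Condition = c("Treated", "Control", "Treated", "Control", "Treated"),
  xorder    = c(3, 1, 2, 1, 3),
  Run       = "run1",
  Channel   = c("128N", "126", "127N", "126", "128N"),
  Abundance = c(11.0, 9.5, 10.1, 9.7, 11.2),
  stringsAsFactors = FALSE
)

# Keep one row per (xorder, Condition, Run, Channel) combination ...
tempGroupName <- unique(processed[, c("xorder", "Condition", "Run", "Channel")])

# ... and sort by xorder, so labels follow the plotting order
# rather than the order in which rows first appear
tempGroupName <- tempGroupName[order(tempGroupName$xorder), ]
```

Without the explicit sort, the labels would follow first appearance in the data (here 128N, 126, 127N) rather than the intended x positions.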

Using MSstatsTMT with a more complicated experimental design

Hi MSstats group,

I have struggled to get my data into the MSstatsTMT pipeline. I'm working with data from Proteome Discoverer and using the MSstatsTMT::PDtoMSstatsTMTFormat function to coerce my data to MSstats format.

I have traced the problem to two sources:

My PSM report from Proteome Discoverer was saved as an Excel document, so its column names differ from what MSstatsTMT expects.

The column names expected by MSstats have all spaces and special characters replaced with ".", e.g.
"Spectrum File" becomes "Spectrum.File". This is only a nuisance, however, as I can replace these characters with "." myself.

NOTE: the column 'Charge' is also required when importing the data from PD into MSstatsTMT, but is not mentioned in the documentation. If there is no column 'Charge' MSstats stops with an error.
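One way to do this renaming in bulk is base R's `make.names()`, which applies the same rule that `read.csv()` uses by default when it cleans up headers. This is a minimal sketch; the example column names are assumptions about a typical PSM export, not a list of what the converter requires.

```r
# Column names as they might appear in an Excel export (illustrative only)
excel_names <- c("Spectrum File", "Master Protein Accessions",
                 "# Proteins", "Charge")

# make.names() replaces invalid characters with "." and
# prepends "X" when a name does not start with a letter or "."
r_names <- make.names(excel_names)
print(r_names)

# Applied to a data frame read from Excel, this would be:
# colnames(raw.pd) <- make.names(colnames(raw.pd))
```

After renaming, a check such as `"Charge" %in% colnames(raw.pd)` confirms the column is present before calling the converter.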

MSstats expects the Spectrum.File column of the PSM report to match Run in the user-provided annotation file.

Run : MS run ID. It should be the same as Spectrum.File info in raw.pd.

My problem is this:

# There are 36 unique MS runs:
length(unique(input.pd$Spectrum.File)) ==  36

# There are 48 unique MS samples:
length(unique(annotation.pd$Run)) == 48

We performed 3 TMT experiments, each analyzing 16 TMT-labeled samples combined into a single Mixture.
This mixture was fractionated into 12 samples (Fractions) to increase analytical depth. Thus, there were 12 LC-MS injections for each TMT experiment, giving 3 x 12 = 36 injections and Spectrum Files. However, there are 16 x 3 = 48 TMT samples.

Because each Spectrum.File annotation corresponds to measurements made from all 16 labeled samples in an experiment, a Spectrum.File matches multiple sample Runs and cannot be mapped to a single Condition as is required by the annotation.pd file passed to PDtoMSstatsTMTFormat.
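One possible reading of the annotation format is that each row describes a Run and Channel pair rather than a sample, in which case a design like this one needs 36 x 16 = 576 rows, with the 48 samples identified by Mixture plus Channel. The sketch below builds such a table under that assumption. The file names, the Control/Treated split, and the BioReplicate labels are all hypothetical placeholders.

```r
# Hypothetical file names: 3 mixtures x 12 fractions = 36 runs
runs <- sprintf("mix%d_frac%02d.raw", rep(1:3, each = 12), rep(1:12, times = 3))

# 16-plex channel labels
channels <- c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N",
              "130C", "131N", "131C", "132N", "132C", "133N", "133C", "134N")

# One row per Run x Channel combination
annotation <- expand.grid(Channel = channels, Run = runs,
                          stringsAsFactors = FALSE)

# Mixture and Fraction are parsed from the (hypothetical) file names
annotation$Mixture  <- paste0("Mixture", sub("^mix([0-9]+)_.*$", "\\1", annotation$Run))
annotation$Fraction <- as.integer(sub("^.*_frac([0-9]+)\\.raw$", "\\1", annotation$Run))
annotation$TechRepMixture <- 1

# A sample is one channel within one mixture: 3 x 16 = 48 samples
annotation$BioReplicate <- paste(annotation$Mixture, annotation$Channel, sep = "_")

# Placeholder condition assignment: first 8 channels vs. last 8
annotation$Condition <- ifelse(match(annotation$Channel, channels) <= 8,
                               "Control", "Treated")
```

With this layout the annotation has 36 unique Run values, matching the 36 Spectrum.File values, while the 48 samples appear as 48 unique BioReplicate values, each repeated across the 12 fractions of its mixture.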

contrast matrix definition

The online tutorial describes the contrast matrix as follows:

contrast.matrix : Comparison between conditions of interests. 1) default is pairwise, which compare all possible pairs between two conditions. 2) Otherwise, users can specify the comparisons of interest. Based on the levels of conditions, specify 1 or -1 to the conditions of interests and 0 otherwise. The levels of conditions are sorted alphabetically.

I am not familiar with what a contrast matrix is. Is limma used for the statistical tests, and is the contrast matrix a limma feature?
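To make the description above concrete, here is a minimal sketch of a hand-built contrast matrix in base R. Each row is one comparison, each column is one condition level in alphabetical order, and the entries are 1, -1, or 0. The condition names and comparison labels are hypothetical.

```r
# Hypothetical condition names, sorted alphabetically as the tutorial describes
conditions <- sort(c("Treated_High", "Control", "Treated_Low"))

# Two comparisons of interest, both against Control
contrast_matrix <- matrix(
  0, nrow = 2, ncol = length(conditions),
  dimnames = list(c("Treated_High-Control", "Treated_Low-Control"), conditions)
)

# 1 for the condition in the numerator, -1 for the one in the denominator
contrast_matrix["Treated_High-Control", c("Treated_High", "Control")] <- c(1, -1)
contrast_matrix["Treated_Low-Control",  c("Treated_Low",  "Control")] <- c(1, -1)

print(contrast_matrix)
```

Each row sums to zero, which is what makes it a comparison between conditions rather than an estimate of a single condition. A matrix like this would be supplied as the contrast.matrix argument in place of the default pairwise setting.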

Support for Fragpipe

Hi Ting,

I'm wondering whether TMT quantification output from FragPipe is supported yet?

Best,
Weixian

old version of MSstatsTMT

Hello MSstats team

First of all, I really appreciate this amazing package.
I have one quick question about MSstatsTMT.
As I am not familiar with R code or packages, it might be a basic question.
Where can I find older versions of the MSstatsTMT package?
I would like to reproduce results from an analysis that was run before the update (I ran the first data set around May 2020).

I appreciate your help.

Sincerely,

.make.contrast.single bug when groups contain the string "Group"

I ran into a bizarre issue where MSstatsTMT completely failed to produce output.

I tracked it to this line in .make.contrast.single:

group_c <- tempcontrast[gsub("Group", "", temp)]

Suggestion: use "^Group" in the gsub call.
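The difference between the two patterns is easy to reproduce in base R. The sketch below assumes that `temp` holds condition names with a "Group" prefix added and that `tempcontrast` is a vector named by the original condition names. Both objects here are hypothetical stand-ins for the package internals.

```r
# Hypothetical contrast vector, named by the original condition names
tempcontrast <- c(Control = -1, SubGroupA = 1, AgeGroup2 = 0)

# Assumed form of temp: the same names with a "Group" prefix
temp <- paste0("Group", names(tempcontrast))

# Current pattern: removes every occurrence of "Group", mangling the names
stripped_all <- gsub("Group", "", temp)

# Suggested pattern: removes only the leading prefix
stripped_prefix <- gsub("^Group", "", temp)

# Lookup by mangled names yields NA for the affected conditions
lookup_all    <- tempcontrast[stripped_all]
lookup_prefix <- tempcontrast[stripped_prefix]
```

With the unanchored pattern, "GroupSubGroupA" becomes "SubA" and the lookup returns NA. The anchored pattern recovers "SubGroupA" and the lookup succeeds.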

Question about TechRepMixture

Hi @huang704

Is it possible to specify TechRepMixture across two or more Mixtures in a continuously increasing manner?

e.g.

Or would that imply a different design in comparison to a repeated/matching specification (1,2,3; 1,2,3; ...)?
