lmsimp / bioc-proloc-hyperlopit-workflow
A Bioconductor workflow for processing and analysing spatial proteomics data
@ClaireMulvey had a very good question regarding the code chunk where we update the feature variable names:
fvarLabels(hyperLOPIT2015ms3r1)[3:5] <- paste0(fvarLabels(hyperLOPIT2015ms3r1)[3:5], 1)
fvarLabels(hyperLOPIT2015ms3r2)[3:5] <- paste0(fvarLabels(hyperLOPIT2015ms3r2)[3:5], 2)
fData(hyperLOPIT2015ms3r1) <- fData(hyperLOPIT2015ms3r1)[1:5]
fData(hyperLOPIT2015ms3r2) <- fData(hyperLOPIT2015ms3r2)[3:5]
where she asks if one could use `updateFvarLabels` instead, using, for example

updateFvarLabels(hyperLOPIT2015ms3r1, label = "rep1")

The reason we don't do that is that `updateFvarLabels` updates all the names, including `EntryName` and `ProteinDescription`, which would end up as `EntryName_Rep1` and `ProteinDescription_Rep1` (a bit ugly, but it would work). I have however changed the code chunk to use `updateFvarLabels` for the second replicate.
Thanks for the suggestion!
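For the second replicate, the change could look something like this (a sketch; check `?updateFvarLabels` for the exact arguments, in particular the separator inserted between the name and the label, which I assume here can be set):

```r
library("MSnbase")
library("pRolocdata")
data(hyperLOPIT2015ms3r2)
## Append "2" to all feature variable names of replicate 2,
## e.g. EntryName becomes EntryName2 (separator assumed settable via sep)
hyperLOPIT2015ms3r2 <- updateFvarLabels(hyperLOPIT2015ms3r2,
                                        label = "2", sep = "")
fvarLabels(hyperLOPIT2015ms3r2)
```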
We could prepare an overview figure of the workflow and refer to the specific sections.
Comment from @ClaireMulvey
Can you use `filterNA()` to allow for some MVs? E.g. allow for 2 missing values per replicate? How would you do this?

The easiest way to do that is to filter before combining, setting the proportion of allowed missing values to 2/10 (using `pNA` to control the percentage of NA values).
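A minimal sketch of that approach, assuming two 10-plex replicates `r1` and `r2` prior to combining (`pNA` being the maximum allowed proportion of missing values per feature):

```r
library("MSnbase")
## Allow at most 2 missing values out of 10 channels in each replicate
r1 <- filterNA(r1, pNA = 2/10)
r2 <- filterNA(r2, pNA = 2/10)
## then combine the filtered replicates
hl <- combine(r1, r2)
```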
Is it possible to say that such filtering to allow for a proportion of missing values should ideally be done on a PSM level and then show the code you would use to reassemble from the PSM to the protein level file?
Yes, ideally filtering and imputation should be done at the PSM/peptide level. That is because when combining into proteins, there is an implicit imputation that takes place (using zero, or the mean) that is often not acknowledged but might not be a good option at all. For details, see

Lazar C, Gatto L, Ferro M, Bruley C, Burger T. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies. J Proteome Res. 2016 Apr 1;15(4):1116-25. PMID: 26906401.
FeaturesOfInterest
plot3D
QSep
On figure 7, we plot the profiles of the mitochondrial and peroxisome markers to highlight the differences in profiles between these two sets of markers along the 6th and 7th channels, as represented above along the 7th PC on the PCA plot. This is confusing, as the channels and the indices on the y axis of `plotDist` don't match. Also, I have the feeling that it's not really along the 12th and 13th columns (TMT channels 6 and 7), but the 13th and 14th columns that the data differ most.
We should set up Travis CI to automatically build and render the workflow.
options:

1. Run `phenoDisco`, TL and SVM classification, save the results to a new object called `hl` and add it to `pRolocdata`, then load it using the `data` function? The bad side of this is that we then have multiple datasets in `pRolocdata` containing the same quantitation information as `hyperLOPIT2015`. I would rather not do this, as the `fData` gets bigger and then there are many similar columns.
2. Save an `rda` or other file type to `extdata` in `pRolocdata` and load it using the `dir` function with an appropriate read function? I think this is the best option? @lgatto, or other suggestions?
The manuscript/vignette should be built using the recent (released yesterday) stable Bioconductor version 3.3, as that is what readers/users will be assumed to use. To make use of that version, you will need R 3.3.
In the `tl` object (but see #15), we at times use `SVM.marker.set` and at times `markers`, as produced by `addMarkers`. The latter contains ER and Golgi as separate groups, and the former is all over the place. Two possibilities:
markers
When submitting, create a `v1` tag at that specific commit.
Highlight where TL is useful at the beginning of the section. We say towards the end that it is particularly useful for datasets with low cluster resolution; it would be good to emphasise this where TL is first introduced.
Question from @ClaireMulvey
How would you adapt this code to plot all or individual organelle markers in the same colour as they appear on the PCA plot, against the grey background of all unselected proteins (and unknowns), potentially with a protein of interest highlighted?
Basically, I mean exactly what can be done easily in the pRolocGUI, but using R instead, as it isn't possible to export high res figs from the GUI?
library(pRoloc)
library(pRolocdata)
data(hyperLOPIT2015)
o <- order(hyperLOPIT2015$Iodixonal.Density) ## reorder fractions
cl <- getMarkerClasses(hyperLOPIT2015)       ## class names
cols <- getStockcol()                        ## colours, as used by plot2D
par(mfrow = c(4, 4))
for (i in seq_along(cl)) {
    plotDist(hyperLOPIT2015[, o],
             markers = getMarkers(hyperLOPIT2015, verbose = FALSE) == cl[i],
             mcol = cols[i],
             type = "l")
    title(cl[i])
}

![allplotdists](https://cloud.githubusercontent.com/assets/384198/20843994/ac3bb5f2-b8b5-11e6-9c72-bca03d93f6ea.png)
Nearly all figures do not follow on from the code chunks. Lots of code chunks are in the wrong place and don't follow on from the text.
library('pRoloc')
library('pRolocdata')
## Create new `MSnSet`
f0 <- dir(system.file("extdata", package = "pRolocdata"), full.names = TRUE,
pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")
lopit2015 <- readMSnSet2(f0, ecol = c(8:27), fnames = 1, skip = 1)
## Clean `fData` cols
fData(lopit2015) <- fData(lopit2015)[, c(2, 8, 11)]
## get markers
mrk <- pRolocmarkers(species = "mums")
lopit2015 <- addMarkers(lopit2015, markers = mrk)
Error in pRolocmarkers(species = "mums") :
Available species: atha, dmel, ggal, hsap, mmus, scer_sgd, scer_uniprot. See pRolocmarkers() for details.
## Of course `featureNames` do not match, but trying to update `featureNames` is not possible because of protein grouping and unique IDs
featureNames(lopit2015) <- fData(lopit2015)[, 1]
Error in `row.names<-.data.frame`(`*tmp*`, value = c(1185L, 3057L, 3399L, :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘ACTB_MOUSE’, ‘AP1B1_MOUSE’, ‘AT1A1_MOUSE’, ‘AT2A2_MOUSE’, ‘BAIP2_MOUSE’, ‘CLAP2_MOUSE’, ‘CNNM4_MOUSE’, ‘CTNA1_MOUSE’, ‘CTNB1_MOUSE’, ‘CTND1_MOUSE’, ‘CYFP1_MOUSE’, ‘DAAM1_MOUSE’, ‘DNM1L_MOUSE’, ‘DOCK6_MOUSE’, ‘DYST_MOUSE’, ‘E41L3_MOUSE’, ‘EPHB2_MOUSE’, ‘EPHB4_MOUSE’, ‘EPN3_MOUSE’, ‘GANAB_MOUSE’, ‘GBB2_MOUSE’, ‘glutamine-hydrolyzing’, ‘GNAI3_MOUSE’, ‘GNAS2_MOUSE’, ‘HNRPC_MOUSE’, ‘HNRPK_MOUSE’, ‘IMMT_MOUSE’, ‘KC1G1_MOUSE’, ‘KPYM_MOUSE’, ‘MACF1_MOUSE’, ‘MPRIP_MOUSE’, ‘MYO1C_MOUSE’, ‘MYOF_MOUSE’, ‘NADP’, ‘NCAM1_MOUSE’, ‘PKP4_MOUSE’, ‘PLAK_MOUSE’, ‘PLXA1_MOUSE’, ‘PVRL2_MOUSE’, ‘RAB1A_MOUSE’, ‘RAB6A_MOUSE’, ‘RADI_MOUSE’, ‘RALA_MOUSE’, ‘RAP1B_MOUSE’, ‘RAP2B_MOUSE’, ‘RAP2C_MOUSE’, ‘RASH_MOUSE’, ‘RASN_MOUSE’, ‘RB11B_MOUSE’, ‘RRAS2_MOUSE’, ‘RS27A_MOUSE’ [... truncated]
@lgatto This is annoying. Now if I want to use `addMarkers`, I will need to write a few more lines of code to make the identifiers unique. This is not ideal for the workflow. What do you think? Not use `addMarkers`, but highlight that it's available?
@lgatto Annoyingly, the combined hyperLOPIT data is not a model dataset to use in terms of default names and functions to create an `MSnSet`. For example, I cannot use `getEcols` as there are two header rows in the .csv file. Also, there is no column called `markers`. Of course, this does not matter, and it may be the case that one has several headers. I wonder whether we should change `getEcols` so that we allow reading column names from a second or other header row?
f0 <- dir(system.file("extdata", package = "pRolocdata"),
full.names = TRUE,
pattern = "hyperLOPIT-SIData-ms3-rep12-intersect")
## getEcols does not work for this dataset
strsplit(readLines(f0, 2)[2], ",")[[1]]
lopit2015 <- readMSnSet2(f0, ecol = c(7:26), row.names = 1, skip = 1,
header = TRUE, stringsAsFactors=FALSE)
plot2D(lopit2015, fcol = "SVM.marker.set")
I think we should clarify fractions (along the gradient) and channels (the final columns in the data).
See the QSep manuscript for an example. There, I have a bibtex entry with the repo.
Breckels et al. have written a very nice piece on analysing proteomics data for subcellular localisation. I particularly like the "workshop characteristics" of the text, which allow a novice but interested reader to work through the analysis stepwise and reproduce the results described therein. The authors took great care in keeping this ideal up throughout their text, and this is also where I have my greatest reservation about the manuscript in its present form: a reader cannot work through all the code presented in the manuscript, since there are at least two situations where a readily available HPC and quite some time are required. This leaves a dent in my impression; however, given this can be resolved, as well as some typos, the workflow report is superb.
Major comments:
We have added a paragraph to the 'Visualising markers' section of the manuscript reiterating the purpose of PCA and motivating the choice of looking at PCs 1 and 7. Figure 9 now follows on from this (now Figure 8), along with the corresponding code and an explanation of the `plotDist` function.
Subsetting MSnSetList to their common feature names
5032 features in common
Remapping data to the same PC space
Error in (function (od, vd) :
object and replacement value dimnames differ
Error in pRolocVis_compare(object, ...) : object 'idDT' not found
We cannot reproduce this error. Have you updated to the latest version of R and the latest version of `pRolocGUI`? If you still get this error message, could you please post it as an issue on the `pRolocGUI` GitHub page along with your `sessionInfo()`, and we will certainly attempt to solve it.
The results are already available as an `RDS` file stored in `pRolocdata` for users. This is what is called in the manuscript under the hood:
f0 <- dir(extdatadir, full.names = TRUE, pattern = "bpw-pdres.rds")
pdres <- readRDS(f0)
hl <- addMarkers(hl, pdres, mcol = "pd", verbose = FALSE)
We have made this code available in the manuscript in an appendix so users can continue to produce the exact plots as they see in this workflow.
As for the `phenoDisco` analysis and `svm`, the TL results are stored as an `RDS` file in `pRolocdata` and are loaded in the background. We have added the required code to the appendix so that users can load the results directly.
To address the above comment on suitability we have added a few additional points on the challenges of using clustering for this type of data.
We generally find supervised learning more suited to the task of protein localisation prediction, in which we use high-quality curated marker proteins to build a classifier, instead of using an entirely unsupervised approach to look for clusters and then look for enrichment of organelles and complexes. In the latter we do not make good use of valuable prior knowledge, and in our experience unsupervised clustering can be extremely difficult due to (i) the loose definition of what constitutes a cluster (for example whether it is defined by the quantitative data or the localisation information), (ii) the influence of the algorithm's assumptions on the cluster identification (for example parametric or non-parametric) and (iii) poor estimates of the number of clusters that may appear in the data.
Minor comments:
We didn't experience any problems installing `pRolocdata` on Windows. Please re-try the installation, and let us know if you still have any issues by opening an issue on GitHub or posting on the Bioconductor support site.
In this version we currently can't find the missing 'to'.
This has been changed to read "We can impute missing data..."
This was an editing mistake and has now been rectified
A footnote has been added here to tell users that the package `rgl` may need to be installed with `install.packages("rgl")`, and that Mac users may need to install XQuartz if it is not already installed.
This has now been rectified.
These typos have been rectified.
This typo has been rectified.
We would prefer to keep the code as it is and not introduce more noise with calls to other functions such as `layout`. The workflow is not aimed at teaching R; users should have some basic knowledge of R before tackling this tutorial.
I would prefer links to referred sections of the text, but that may be personal taste…
This is a comment for F1000. We cannot control the linking of sections in the final version.
Page 23: "One should note that the decreasing the GS, and increasing the …" – at least one 'the' too many, probably two.
We have reworded this sentence as requested.
This typo has been rectified.
We have now referenced Figure 16 in the text and made sure that the code chunks and figures follow inline where they are referenced in the text.
Thank you, yes this is a typo and has now been changed to 'quartile'.
This typo has been rectified.
This sentence is not needed here and so it has now been removed as it essentially reiterates what is said in the above paragraph.
This typo has been rectified.
This typo has been rectified.
We have changed the title of this section to 'Session information and getting help' to clarify this section of the tutorial.
Did this because it was confusing to call that variable `lopit2016` when the two replicates, used to illustrate `combine`, are called `hyperLOPIT2015`. And `hl` is shorter to type.
@lmsimp - please close the issue once read.
Reminder to check the use of `hl` and `hyperLOPIT` as examples. One is combined after loading two independent replicates, the other is loaded from a csv that has already been combined.
@ClaireMulvey asked about using a seed when optimising the parameters. This would be done as shown below. I used `123` as the seed (and would take note of it), but any integer can be used (and there's no need to sample that one at random).
set.seed(123)
params <- svmOptimisation(hl, ...)
Saving the `params` saves me from re-running the optimisation routine and allows me to inspect them again at a later date. If I wanted to re-run it and get the same results, I would first set the seed to `123` again.
I don't think that this is necessary, however. The optimisation routine is repeated 100 times (that's the default) and we assume the actual folds don't matter. If they did, i.e. if different seeds gave different and incompatible best parameters, we would have an issue, and would need to increase the number of iterations and make sure the results converge. We have never observed such an issue.
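One could check that the folds don't matter with a sketch along these lines (the `...` stands for the elided optimisation arguments; `getParams` extracts the chosen best parameters):

```r
library("pRoloc")
## Two runs with different seeds; if the best parameters agree,
## the choice of folds (and hence the seed) is not an issue
set.seed(123)
p1 <- svmOptimisation(hl, ...)
set.seed(456)
p2 <- svmOptimisation(hl, ...)
## compare getParams(p1) and getParams(p2)
```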
This manuscript describes a Bioconductor workflow for analyzing subcellular proteomics data. It is very detailed and comprehensive and will be useful for others in the field.
A few comments:
Some clearer statement early on would help to clarify for readers what types of data this works with. I know that the authors indicate that the example they use is 10-plex TMT and that it can be used with label-free or other labels, but that is not what I am referring to. Rather, it is the structure of the experiment: one needs systematic quantitative data on all the different relevant fractions from a cell, as opposed to someone who perhaps did a differential centrifugation experiment to isolate a couple of fractions and then wants to apply this (my understanding is that this latter example would not be usable).
How do the authors recommend collapsing replicates? This could be covered in the section dedicated to the Compare function. Two replicates will (almost) never agree 100% so how are discrepancies handled?
The un-annotated PCA plot in the QC section could use transparency or the new `hexbin` method.
Should we include in the workflow @lgatto?
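A sketch of both suggestions (assuming a `pRoloc` version where `plot2D` supports `method = "hexbin"`):

```r
library("pRoloc")
library("pRolocdata")
data(hyperLOPIT2015)
## (1) transparency: make unassigned proteins semi-transparent grey
setUnknowncol("#00000010")
plot2D(hyperLOPIT2015)
## (2) hexagonal binning of all features
plot2D(hyperLOPIT2015, method = "hexbin")
```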
We probably need a paragraph about replicates, whether they are combined, analysed/classified independently, how are they normalised, ...
@lmsimp - could you update the `acknowledgement.md` file with the correct grant code.
@lgatto Do you think it would be nice to use this in this workflow? It would fit in nicely in the markers section, where I add markers from (1) `pRolocmarkers` and then (2) show the curated markers. Of course, the restriction is that this is not in the stable version of `pRoloc`, and it's a lot of code to backport... not to mention not good practice.
In the manuscript we load the results of the phenoDisco
and TL algorithms under the hood:
PD
f0 <- dir(extdatadir, full.names = TRUE,
pattern = "bpw-pdres.rds")
TL
tlfile <- dir(extdatadir, full.names = TRUE,
pattern = "bpw-tlopt.rds")
SVM
svmf <- dir(extdatadir, full.names = TRUE,
pattern = "bpw-svmopt.rds")
I do agree with Daniel that it is annoying that, when you follow the manuscript, you can't produce the relevant plots yourself, because the objects are not directly accessible by following the code.
What can we do @lgatto? Do you think we should add a column to the hyperLOPIT `MSnSet` in `pRolocdata` for each of these results? Or directly show how to load these results using the `dir` command? The latter is messy. Also, loading the optimisations is tricky; they can't be stored in the `MSnSet`, at least not the TL results.
Should we mention this? This will always be a problem. Should we mention the advantages of combining in Proteome Discoverer vs `MSnbase`?
Says something about Software Carpentry - I think this is a mistake?
@lgatto can you regenerate this figure so that the x-axis labels on the LHS plot use `las = 2` and the same font size as the plot on the RHS?
@ClaireMulvey's comment
hl <- normalise(hl, method = "sum")
"`featureNames` (see `head(featureNames(lopit2016))`)" – should this `MSnSet` be called `hl` instead of `lopit2016`?

pRolocVis
Mention somewhere that we can do clustering, but that a totally unsupervised approach is not an appropriate/adequate data analysis for spatial proteomics. It is still useful... for clustering of markers, ref `mrkHClust`.
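A minimal sketch of that marker clustering:

```r
library("pRoloc")
library("pRolocdata")
data(hyperLOPIT2015)
## Hierarchical clustering of the average marker profiles
mrkHClust(hyperLOPIT2015)
```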
Note to have a chat with @lgatto tomorrow about updating 'pRolocmarkers'
The latest `pRoloc` version (added in 2b878e604bb1835f0b93f88a2227f35a9c4eedc4) chunks the input before querying Biomart. It's probably worth repeating the TL analysis. Maybe we could use a few more classes, possibly complementing those that don't have 13 markers with some assignments from the paper.
What about preparing short films instead of screenshots for the Interactive visualisation section?
plot3D(hl, dims = c(1, 2, 7))
Page 19 - there is an erroneous `plot3D ...` line of code at the end of this section. It needs to be deleted.
At the bottom of page 19, the output of the call to `foi13s` is missing and has been erroneously placed on page 20 at the end of the section. It should follow after `foi13s`, e.g.
foi13s
## Traceable object of class "FeaturesOfInterest"
## Created on Tue May 22 16:07:42 2018
## Description:
## 13S condensin
## 4 features of interest:
## Q8CG48, Q8CG47, Q8K2Z4, Q8C156