
msresolve's Introduction


INTRODUCTION TO THE MSRESOLVE PROGRAM:

A program for converting mass spectrometry signals to concentrations. Baseline corrections, smoothing, solving overlapping patterns, mass spectrometer tuning, and more.

The Need for MSRESOLVE and What It Offers

During “solving” of collected mass spectrometry spectra to extract concentrations, there are several sources of challenges:

  A) For time-dependent signal collection, pre-processing may be required (such as baseline corrections, smoothing, etc.).
  B) There may be overlapping signals, making solution difficult.
  C) In many cases, a particular molecule’s calibration may not be possible or practical at the instrument where the collection is being done. This creates two challenges: first, one may need to rely upon an externally collected reference pattern that may not exactly match how the molecule fragments in one’s own instrument; second, one requires a method for converting the signals for that molecule into (approximate) concentrations even in the absence of a calibration.

MSRESOLVE addresses each of the above challenges as follows:

  A) MSRESOLVE has various pre-processing functions, including baseline correction and smoothing.
  B) MSRESOLVE has several methods for resolving the signals: Sequential Linear Subtraction (SLS), the Matrix Inverse Method, and also a brute-force grid (regression). For the SLS and inverse methods, it is possible to extract error bars for the final solved concentrations.
  C) MSRESOLVE is able to convert mass spectrometry signals into a common concentration scale even for uncalibrated signals, provided that reference concentration patterns are available. Furthermore, MSRESOLVE can also correct for differences in mass spectrometer tuning.


Suggested Procedure for Solving Time Series Data of Mass Signals

  1. Create the files for the reference spectra and collected data – don’t delete any signals.
  2. Create an MSRESOLVE run with SLS Unique and see which masses MSRESOLVE chooses, via ExportedSLSUniqueMassesUsedInSolvingMolecules.
  3. If unsatisfied, start narrowing things down with chosen masses.
  4. Also start using some reference file threshold filtering:
     a. UserChoices['applyReferenceMassFragmentsThresholds']['referenceMassFragmentFilterThreshold'] = [1.0] #this is what I am suggesting that you use.
  5. If the application warrants doing so, include more sophisticated features of MSRESOLVE, such as mass spectrum tuning correction.

There is a manual in the documentation directory, and also there is an example analysis under ExampleAnalysis.


To install dependencies, use 'pip install -r requirements.txt'

msresolve's People

Contributors

adityasavara, akraetz, aroger34, cdunn6754, charleswatt, lanelee44, wattcl


msresolve's Issues

making SLS more efficient by "gaussian elimination"

From: Caspar Lant (https://github.com/caspar)

Hi Ashi,

I think the sequential linear subtraction you independently invented for your MS Resolve program is also called Gaussian Elimination (and back-substitution). I thought you might like to know.

Warmest holiday wishes,


Thank you very much! I had spent time searching for something like this; I figured it had to exist in standard matrix algebra, but a summer student and I couldn’t find it after a couple of days.

It seems in the future, we may switch to using some kind of numpy based code rather than my (slower) code for that part of the solving.
https://www.bing.com/search?q=numpy+gaussian+elimination&pc=MOZI&form=MOZLBR
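
For illustration, a minimal sketch (hypothetical numbers, not MSRESOLVE code) of the numpy-based route; numpy.linalg.solve performs the elimination and back-substitution internally via an LU factorization:

import numpy
referenceIntensities = numpy.array([[1.0, 0.2],
                                    [0.0, 0.9]])  #hypothetical masses x molecules pattern matrix
signals = numpy.array([1.2, 0.9])                 #hypothetical signals at those masses for one time point
concentrations = numpy.linalg.solve(referenceIntensities, signals)
print(concentrations)  # -> [1. 1.]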

Though at present, that will not be a priority. What I’ll do is make a github issue so someone else can implement it when this is released, acknowledging you, and saying you should be acknowledged by whomever implements it.

Note: perhaps we should call it "Fangcheng ... ... "

making uncertainties compatible with more choices

--> extracting reference pattern needs to ADD to uncertainties if they exist already.
--> need to add in a feature for reading uncertainties for each data point (experimental).
--> should add uncertainties interpolation to the interpolator.
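
A minimal sketch (hypothetical names and values, not the project's code) of the third point, carrying uncertainties through interpolation alongside the signals:

import numpy
oldAbscissa   = numpy.array([0.0, 1.0, 2.0])    #hypothetical times
signals       = numpy.array([10.0, 12.0, 11.0]) #hypothetical signals
uncertainties = numpy.array([0.5, 0.6, 0.5])    #hypothetical uncertainties
newAbscissa   = numpy.array([0.5, 1.5])
interpolatedSignals = numpy.interp(newAbscissa, oldAbscissa, signals)
interpolatedUncertainties = numpy.interp(newAbscissa, oldAbscissa, uncertainties)  #a simple first approximation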

Logical error in negative analyzer

The negative analyzer tries to figure out which molecule is the biggest contributor to each negative molecule's negative value. But it uses the molecule with the biggest concentration, which is not necessarily the one with the biggest signal at that moment. So I have added the below FIXME.

#FIXME: Below is using the biggest concentration value. That's not the actual biggest contributor; it should be by simulated signal. So we should simulate each molecule's contribution separately to this mass, and then find the one which has the maximum contribution. That will give the right choice for correction2index.

It would also be good to add a flag allowing the investigation of more than just 1 molecule within presentmoleculeslist, because a different molecule could be causing the problem without being the largest contributor at that moment in time.
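
A minimal sketch (hypothetical numbers, not the project's code) of the fix described in the FIXME, simulating each molecule's contribution at the problem mass and taking the maximum:

import numpy
intensitiesAtThisMass = numpy.array([0.05, 0.90, 0.30])  #each molecule's reference intensity at this mass
concentrations        = numpy.array([10.0, 2.0, 1.0])    #currently solved concentrations
simulatedContributions = intensitiesAtThisMass * concentrations  # -> [0.5, 1.8, 0.3]
correction2index = numpy.argmax(simulatedContributions)  #index 1: biggest simulated signal, even though index 0 has the biggest concentration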

parsingUserInput functionalization of subcode

this should become a module, and should have functions like this:

#If it receives a string, makes it the only item in a list of length one. If it receives a non-string, tries to cast it as a list.
def listCast(inputObj):
    if isinstance(inputObj, str):
        inputObj = [inputObj]
    else:
        inputObj = list(inputObj)
    return inputObj

#This *requires* a list or array type object as the inputObj to work correctly.
def parallelVectorize(inputObj, desiredLength, desiredArrayType="list"):
    if isinstance(inputObj, str):
        print("Warning: parallelVectorize is not designed to work with string inputs.")
    if len(inputObj) == desiredLength:
        return inputObj
    elif len(inputObj) == 1:
        if desiredArrayType == "list":
            parallelizedList = []
            for index in range(desiredLength): #repeat the single value desiredLength times
                parallelizedList.append(inputObj[0])
            return parallelizedList
        if desiredArrayType == "array":
            print("The array feature of parallelVectorize has not been implemented yet, but would not be hard to implement.")
            #return parallelizedArray
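
A usage sketch of the two helpers above (hypothetical values):

listCast("28")                   # -> ["28"]
listCast((1.0, 2.0))             # -> [1.0, 2.0]
parallelVectorize([1.0E-10], 3)  # -> [1.0E-10, 1.0E-10, 1.0E-10]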
          
   

collectedFileUncertainties should have UserInput validation and vectorization

Has a few options. Can be a list:
UserChoices['uncertainties']['collectedFileUncertainties'] = [1.37E-10,0,0,0,1.24E-10,2.01E-10,2.97E-10,3.73E-10,1.51E-10,5.52E-10,0,1.22E-14,1.03E-12,5.37E-11,3.39E-11,2.84E-10,0,3.63E-15,0,9.72E-15,0]

Can be the word "none", "auto", "file", or even an integer.

So user validation must check if it's a string, an integer, or a list. And if it's a list, check if the length is one to vectorize across all masses.
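
A minimal sketch (hypothetical function name, not existing code) of the validation and vectorization described above:

def validateCollectedFileUncertainties(value, numberOfMasses):
    if isinstance(value, str):           #"none", "auto", or "file" passes through unchanged
        return value
    if isinstance(value, (int, float)):  #a single number is applied to every mass
        return [value] * numberOfMasses
    value = list(value)
    if len(value) == 1:                  #a length-one list is vectorized across all masses
        return value * numberOfMasses
    return value                         #a full-length list is used as-is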

Add a try statement and warning for smoothing

When using timerange, there can be situations with not enough points to smooth. Then we see something like this:


  File "C:\Users\fvs\Desktop\Temp\Junk\m\MSRESOLVE\200226MoskowitzChallenge\v8.3\XYYYDataFunctionsSG.py", line 96, in DataSmootherPolynomialSmoothing
    smoothedPoints = numpy.polyval(numpy.polyfit(timeslist,currentWindowAsArray,polynomialOrder), currentTime)

  File "C:\Users\fvs\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\polynomial.py", line 555, in polyfit
    raise TypeError("expected non-empty vector for x")

TypeError: expected non-empty vector for x

As I noted in an email to Ben...
I’ve seen this error before. It means there is a situation where it could not find enough points to smooth somewhere along the way.
I found that I could make the smoother radius 50 and then things ran.
Alternatively, I could make things work by changing timerange to pointrange.
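
A minimal sketch (not the project's actual code) of the suggested try statement and warning, using the names from the traceback above; the empty arrays stand in for a window with no points:

import numpy
timeslist = numpy.array([])            #hypothetically empty: no points fell within the timerange window
currentWindowAsArray = numpy.array([])
polynomialOrder, currentTime = 1, 0.0
try:
    smoothedPoints = numpy.polyval(numpy.polyfit(timeslist, currentWindowAsArray, polynomialOrder), currentTime)
except TypeError:
    print("Warning: not enough points within the smoothing timerange to fit a polynomial. "
          "Consider increasing the smoother radius or changing timerange to pointrange.")
    smoothedPoints = currentWindowAsArray  #fall back to the unsmoothed values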

exporting improvements (reference pattern and other things)

As of Nov 10th, 2021, after the inclusion of TuningCorrectorGasMixture and additional tuning corrector exporting features, there are now too many exported files.

The exporting should be upgraded as follows.

  • (1) Instead of having files like Exported0ExtractedReferencePattern, we should put them into a subdirectory like this: Exported\0ExtractedReferencePattern (then the deletion will delete that directory).
  • (2) We should move files like "PreprocessedData" into that Exported directory. Only final files like the final reference pattern(s) and the ScaledConcentrations should be in the running directory.
  • (3) For the tuning corrector, the ratios at each mass are exported, but 3 files are made because the abscissa is exported separately. These should be hstacked and exported as two files (see the sketch after this list).
  • (4) Export the final reference pattern as a full reference file, described below.
  • (5) During tuning correction, the absolute uncertainties of the pattern should be exported (because they're not a constant).
  • (5b) Though not about exporting, a continuation of 5: when an ExistingReferencePattern, StandardReferencePattern, or GasMixture pattern is being read in, MSRESOLVE should search for accompanying files that have the suffix "_uncertainties" and then use them if they exist. Otherwise the uncertainty propagation will not happen correctly after the mixed reference pattern has been made.
  • Make MSRESOLVE export the total uncertainty (sqrt(a^2+b^2) of the standard error and the absolute uncertainty). --> this suggestion is from Nov 2020; it might already be done.
  • Related issue (but is marked as wontfix): https://github.com/AdityaSavara/MSRESOLVESG/issues/202
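
A minimal sketch (hypothetical array and file names) of the hstack suggested in (3), so that the abscissa and the ratios go out together:

import numpy
massAbscissa = numpy.array([27.0, 28.0, 39.0])    #hypothetical mass fragments
ratios       = numpy.array([[1.1], [0.9], [1.3]]) #hypothetical correction ratios at those masses
combined = numpy.hstack((massAbscissa.reshape(-1, 1), ratios))  #mass column followed by ratio column(s)
numpy.savetxt("TuningCorrectorRatios.csv", combined, delimiter=",")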

Currently the final reference patterns are exported with numbers, but these are just the patterns, not a full reference file.

The final pre-processed reference file should probably be exported with a number (this is particularly useful for tuning correction cases).

This may require making some kind of modification to the data exporting function, since currently it does not have a way to export reference files rather than patterns. The easiest way is probably going to be to simply have that data exporting function call the reference file exporting function. [And it should probably be a call from within that same function, in order to most easily keep track of the numbering properly.]

Tuning Correction Improvements

  • createMixedReferencePatterns should be set to false if no standard or existing (external) file is provided.
  • exporting improvements as noted here: https://github.com/AdityaSavara/MSRESOLVESG/issues/267
  • if UserChoices['measuredReferenceYorN']['on'] = 'no', then UserChoices['measuredReferenceYorN']['referenceFileStandardTuning'] = ['GasMixtureNISTRef.csv'] needs to be set to blank during user input validation. Same thing for the existing pattern. (See the sketch after this list.)
  • The variables referenceFileStandardTuning and referenceFileExistingTuning in the user input need to be changed to say "andForm" at the end of the variable name. Same for referenceFileDesiredTuning.
  • With the gas mixture tuning corrector, there is a lot of printing of "moleculechooser" that is probably not intended. An optional argument or a temporary turning off of verbose should be implemented for the relevant line(s).
  • After making a mixed reference pattern with the tuning corrector, we need to remove molecules again if chosen molecules is on and they are not in chosen molecules. Undesired behavior was observed in 211105BenDataAnalysisGasMixtureOneExperiment\7.
  • In workshop directory 8 (https://github.com/AdityaSavara/MSRESOLVE_Workshop), there is a problem where the resolveConcentrations are not getting plotted. (Maybe not calculated at all? Or maybe calculated wrong? May need to print out matching_abscissa and other things to trace this issue.)
  • In workshop directory 8, there is a warning printed out about uncertainties not being completely programmed for the tuning corrector.
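
A minimal sketch of the blanking described in the third bullet, using the keys quoted there (the existing-pattern key name is an assumption):

UserChoices = {'measuredReferenceYorN': {'on': 'no',
                                         'referenceFileStandardTuning': ['GasMixtureNISTRef.csv'],
                                         'referenceFileExistingTuning': ['SomeExistingRef.csv']}}  #hypothetical values
if UserChoices['measuredReferenceYorN']['on'] == 'no':
    UserChoices['measuredReferenceYorN']['referenceFileStandardTuning'] = []  #blank out the standard tuning pattern
    UserChoices['measuredReferenceYorN']['referenceFileExistingTuning'] = []  #assumed analogous key for the existing pattern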

add feature of ionization correction factor & make concentration scaling factors a list

this feature would take a coefficient (1.0 by default) and apply it to the scaled concentrations as soon as a line of raw signals is converted into scaled concentrations. The scaled concentrations will be divided by this factor. There will be one factor per chosen molecule. Note that if the factor is 0.5 then the concentration for that molecule will be doubled, which is appropriate because it would reflect that the molecule is half as likely to be ionized and therefore there is twice as much of it actually present.
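
A minimal sketch (hypothetical names and values) of applying the factor; dividing by 0.5 doubles that molecule's concentration, as described above:

import numpy
ionizationCorrectionFactors = numpy.array([1.0, 0.5, 1.25])  #one factor per chosen molecule, default 1.0
scaledConcentrations = numpy.array([2.0, 3.0, 4.0])          #hypothetical scaled concentrations at one time point
correctedConcentrations = scaledConcentrations / ionizationCorrectionFactors  # -> [2.0, 6.0, 3.2]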

This feature is needed to improve accuracy based on the Hiden note, but also based on the fact that we can use this for when concentration scaling coefficients are extracted. Note that we can provide multiple molecules for concentration scaling coefficients, and then we can define the first molecule as the one to which the other molecular concentration scaling coefficients get normalized if they do not have a custom instrumentally measured one provided for them. For example, if the molecule we are converting to torr with is acetaldehyde, the concentration scaling factor is 0.3, and the ionization factor is 1.25, that means that the other molecules which have 1.0 as their ionization factor will get a 0.3 scaling factor, while those with an ionization factor of 2.5 will implicitly get scaling factors of 1 * 0.3 * 2.5 (note that the 2.5 already happens implicitly when the relative concentrations are calculated). Species which have custom-defined calibrations would, in this example, have the 0.3 replaced with something else. They could have an original ionization factor of 1.25 and an ionization correction factor of 0.6; that would mean 0.75 for the corrected ionization factor, and then we'd end up with 0.75 * 0.3 * 1.25 implicitly. This factor would be calculated on the fly, exported, and would supplement any user-provided factor for ionization efficiency. We need to make some good unit tests for this feature before working on it. It should not be too hard to do so. One can even easily test it by making molecules that are actually identical other than the efficiency: "Acetaldehyde_Normal" and "Acetaldehyde_EasyToIonize", for example.
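
A numeric sketch of the example in the paragraph above (names hypothetical, values taken from the text):

referenceScalingFactor = 0.3                              #measured concentration scaling factor for acetaldehyde
impliedScaling_plain = 1 * referenceScalingFactor         #molecules with ionization factor 1.0 get 0.3
impliedScaling_2p5   = 1 * referenceScalingFactor * 2.5   #ionization factor 2.5: 1 * 0.3 * 2.5
correctedIonizationFactor = 1.25 * 0.6                    #original factor 1.25 with correction 0.6 gives 0.75
impliedScaling_custom = 0.75 * 0.3 * 1.25                 #the custom-calibrated case from the text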

The 1.25 factor would be FamilyIonizationFactorNormalizedToNitrogen [FIFNN or whatever] and the 0.75 factor would be empiricalIonizationScalingFactorAfterFamilyIonizationFactorNormalizedToNitrogen (EISFAFIFNN, which would be 1.0 for Acetaldehyde_Normal).

making reading of data more robust against blanks

During one analysis, I initially got an error. While trying to figure things out, I removed extra rows and columns from the input file. I think that may have been the issue. I have not double-checked that it really was, but I think it was: blanks are not numbers, and numpy was trying to convert them into floats.

Add back in uncertainties and signals from filtered peaks for SLSUnique

For small reference peaks that are filtered to 0, we need to keep track of them, see if any other molecule used them for solving, and then add that amount back on as uncertainty in the solved peak in a post-processing step. Simple way: take the predicted signal for the molecule that had filtering and get its absolute uncertainty. Get the predicted signal for the molecule that had SLS unique on it, and get its uncertainty. Subtract the "filtered" molecule's contribution, considering the uncertainties as orthogonal.

Then, scale both the previously solved concentration and the previously solved relative uncertainty based on the new values.
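
A minimal sketch (hypothetical values) of the simple way above, treating the two uncertainties as orthogonal:

import numpy
predictedSignalFiltered, uncertaintyFiltered = 0.02, 0.005  #molecule whose small reference peak was filtered to 0
predictedSignalSolved,   uncertaintySolved   = 1.00, 0.050  #molecule solved by SLS unique at this mass
correctedSignal = predictedSignalSolved - predictedSignalFiltered
combinedUncertainty = numpy.sqrt(uncertaintySolved**2 + uncertaintyFiltered**2)  #orthogonal (quadrature) combination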

solve __var_list__ issue for iterative

a "pre" tag was used to allow the __ to remain in this post.
option 1: remove __ from the export/import module's variable name of __var_list__ (and in the input files)
option 2: use the default user input as G, then import variables from the test file into that namespace (good idea, but how?)
option 3: find a way to import __ variables (good idea, but how?). Maybe as simple as typing __var_list__ = G.__var_list__? That should work, I would think.
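
A minimal sketch of option 3. Note that "from module import *" skips names that begin with an underscore, which is why __var_list__ goes missing; explicit attribute access still works:

import test_1_Input as G                    #the test's input file, as named elsewhere on this page
__var_list__ = G.__var_list__               #explicit attribute access retrieves the double-underscore name
__var_list__ = getattr(G, '__var_list__')   #equivalent, without hard-coding the name in the expression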

print statements every data point?

There seems to be a print statement occurring at every data point in the newest version. I don't know if it is in def ExportMSData(self) or somewhere else. We should track it down and turn it off or make it optional.

Pre-processed data masses should have a version with an abscissa that matches ExperimentData.mass_fragment_numbers and RawSignalsSimulation

RawSignalsSimulation is not including all of the chosen masses that were originally selected.
Consider a case where you have chosen 8 masses and 3 molecules.
If only 7 of those masses are present for those 3 molecules, then ExperimentData.mass_fragment_numbers will be trimmed down to only those 7. This makes sense for solving. However, it turns out that a consequence is that signal simulation now does not include the 8th mass as a simulated mass. I think things are different in iterative, so one option would be to just do an iterative analysis and single iteration if one wants all 8 simulated masses.

However, in that situation the feature implicitSLScorrection can't be used.

This is probably okay, but then the preprocessed data export should also export a version of the signals with only these downselected masses. Right now signal simulation data can't be compared directly to the pre-processed data as easily because of this.

try statement inside userInputValidityFunctions.py

try:
    SettingsVDictionary['referenceCorrectionCoefficients_cov']    = UserChoices['measuredReferenceYorN']['referenceCorrectionCoefficients_cov']
except:
    SettingsVDictionary['referenceCorrectionCoefficients_cov']    = [0,0,0] #TODO: This is to keep some old unit tests running. Ideally they should be fixed.

The way to solve this would be to define the variable referenceCorrectionCoefficients_cov in all of the unit tests. But this will never be worth the lead developer's trouble to do.
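
An equivalent, tighter form (a sketch, not a committed change) would use dict.get with a default, which removes the try statement entirely:

SettingsVDictionary['referenceCorrectionCoefficients_cov'] = UserChoices['measuredReferenceYorN'].get('referenceCorrectionCoefficients_cov', [0, 0, 0])  #default keeps the old unit tests running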

deprecation warnings

genfromtxt needs to get argument of encoding=None

This text:
import test_1_Input as G, imp; imp.reload(G)
Needs to become:
import test_1_Input as G, importlib; importlib.reload(G)

make genfromtxt work with more versions by numpy check

Sometimes see something like this, depending on numpy version:

File "C:/Users/fvs/Desktop/Temp/Junk/m/MSRESOLVE/200226MoskowitzChallenge/v8.3/MSRESOLVE.py", line 1619, in exportSimulatedSignalsSoFar
simulatedSignalsFromIterative = numpy.genfromtxt(simulatedSignalsFromIterativeAbsoluteFileName,delimiter=',',dtype=None, encoding='latin1').astype(str) #Get the data from the iteration's simultaed raw signals file (use astype(str) to keep b' from showing up before each entry)

TypeError: genfromtxt() got an unexpected keyword argument 'encoding'
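
A minimal sketch (not the project's code) of such a version check; genfromtxt gained the encoding keyword around numpy 1.14, so older versions need the call without it:

import numpy
fileName = "SimulatedRawSignals.csv"  #hypothetical file name
numpyMajorMinor = tuple(int(part) for part in numpy.__version__.split('.')[:2])
if numpyMajorMinor >= (1, 14):  #the encoding keyword exists
    data = numpy.genfromtxt(fileName, delimiter=',', dtype=None, encoding='latin1').astype(str)
else:
    data = numpy.genfromtxt(fileName, delimiter=',', dtype=None).astype(str)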

Tuning Correction for tuningCorrectionIntensity (Correction Values) should be checked for cases where pattern max changes

In some cases, the tuning correction pattern feature can result in a change in which mass fragment is the maximum. After that, when the pattern becomes standardized, does this cause a problem for the sensitivities?

The feature of tuningCorrectionIntensity exists, and maybe it is sufficient to correct even when the pattern gets "re-standardized", but it should be checked that no adjustments need to be made for cases as described above.

There is an example in the attached directory, where the file "Before and After for 13butadiene (Ben's Data).xlsx" shows that Butadiene after the correction has a different most intense mass, which causes masses like m39 or so to seem to "go down" after standardization of the tuning corrected file, when in fact they actually went up.

To test the above concern, a fictional data file should be made. Then the qualitative behavior can be checked.

v1 (Working).zip

Look into what simulated raw signals is putting out during iterative

Is it subtracting the signal, then adding it back in? Or is it overwriting what was in that column originally? The solving is being done correctly, but this matters for interpreting the raw signals files that are exported between iterations. For example, consider if iterations one and three have two different molecules that each have mass 28.
