
msresolve's Introduction


INTRODUCTION TO THE MSRESOLVE PROGRAM:

A program for converting mass spectrometry signals to concentrations. Baseline corrections, smoothing, solving overlapping patterns, mass spectrometer tuning, and more.

The Need for MSRESOLVE and What It Offers

During “solving” of collected mass spectrometry spectra to extract concentrations, there are several sources of challenges:

  A) For time-dependent signal collection, pre-processing may be required (such as baseline corrections, smoothing, etc.).
  B) There may be overlapping signals, making solution difficult.
  C) In many cases, a particular molecule’s calibration may not be possible or practical at the instrument where the collection is being done. This creates two challenges: first, one may need to rely upon an externally collected reference pattern that may not exactly match how the molecule fragments in one’s own instrument; second, one requires a method for converting the signals for that molecule into (approximate) concentrations even in the absence of a calibration.

MSRESOLVE addresses each of the above challenges as follows:

  A) MSRESOLVE has various pre-processing functions, including baseline correction and smoothing.
  B) MSRESOLVE has several methods for resolving the signals: Sequential Linear Subtraction (SLS), the Matrix Inverse Method, and also a brute-force grid (regression). For the SLS and inverse methods, it is possible to extract error bars for the final solved concentrations.
  C) MSRESOLVE is able to convert mass spectrometry signals into a common concentration scale even for uncalibrated signals, provided that reference concentration patterns are available. Furthermore, MSRESOLVE can also correct for differences in mass spectrometer tuning.


Suggested Procedure for Solving Time Series Data of Mass Signals

  1. Create the files for the reference spectra and collected data – don’t delete any signals.
  2. Create an MSRESOLVE run with SLS Unique and see which masses MSRESOLVE chooses, via ExportedSLSUniqueMassesUsedInSolvingMolecules.
  3. If unsatisfied, start narrowing things down with chosen masses.
  4. Also start using some reference file threshold filtering:
     a. UserChoices['applyReferenceMassFragmentsThresholds']['referenceMassFragmentFilterThreshold'] = [1.0] #this is what I am suggesting that you use.
  5. If the application warrants doing so, include more sophisticated features of MSRESOLVE, such as mass spectrum tuning correction.

There is a manual in the documentation directory, and also there is an example analysis under ExampleAnalysis.


To install dependencies, use 'pip install -r requirements.txt'

msresolve's People

Contributors

adityasavara, akraetz, aroger34, cdunn6754, charleswatt, lanelee44, wattcl


msresolve's Issues

making SLS more efficient by "gaussian elimination"

From: Caspar Lant (https://github.com/caspar)

Hi Ashi,

I think the sequential linear subtraction you independently invented for your MS Resolve program is also called Gaussian Elimination (and back-substitution). I thought you might like to know.

Warmest holiday wishes,


Thank you very much! I had spent time searching for something like this; I figured it had to exist in standard matrix algebra, but a summer student and I couldn’t find it after a couple of days.

It seems in the future, we may switch to using some kind of numpy based code rather than my (slower) code for that part of the solving.
https://www.bing.com/search?q=numpy+gaussian+elimination&pc=MOZI&form=MOZLBR
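
For illustration, a minimal sketch (hypothetical numbers, not MSRESOLVE code) of the numpy-based route; numpy.linalg.solve performs the elimination and back-substitution internally via an LU factorization:

import numpy
referenceIntensities = numpy.array([[1.0, 0.2],
                                    [0.0, 0.9]])  #hypothetical masses x molecules pattern matrix
signals = numpy.array([1.2, 0.9])                 #hypothetical signals at those masses for one time point
concentrations = numpy.linalg.solve(referenceIntensities, signals)
print(concentrations)  # -> [1. 1.]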

Though at present, that will not be a priority. What I’ll do is make a github issue so someone else can implement it when this is released, acknowledging you, and saying you should be acknowledged by whomever implements it.

Note: perhaps we should call it "Fangcheng ... ... "

making uncertainties compatible with more choices

--> extracting reference pattern needs to ADD to uncertainties if they exist already.
--> need to add in a feature for reading uncertainties for each data point (experimental).
--> should add uncertainties interpolation to the interpolator.
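
A minimal sketch (hypothetical names and values, not the project's code) of the third point, carrying uncertainties through interpolation alongside the signals:

import numpy
oldAbscissa   = numpy.array([0.0, 1.0, 2.0])    #hypothetical times
signals       = numpy.array([10.0, 12.0, 11.0]) #hypothetical signals
uncertainties = numpy.array([0.5, 0.6, 0.5])    #hypothetical uncertainties
newAbscissa   = numpy.array([0.5, 1.5])
interpolatedSignals = numpy.interp(newAbscissa, oldAbscissa, signals)
interpolatedUncertainties = numpy.interp(newAbscissa, oldAbscissa, uncertainties)  #a simple first approximation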

Logical error in negative analyzer

The negative analyzer tries to figure out which molecule is the biggest contributor to each negative molecule's negative value. But it uses the molecule with the biggest concentration, which is not necessarily the one with the biggest signal at that moment. So I have added the below FIXME.

#FIXME: Below is using the biggest concentration value. That's not the actual biggest contributor; it should be by simulated signal. So we should simulate each molecule's contribution separately to this mass, and then find the one which has the maximum contribution. That will give the right choice for correction2index.

It would also be good to add a flag allowing the investigation of more than just 1 molecule within presentmoleculeslist, because a different molecule could be causing the problem without being the largest contributor at that moment in time.
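
A minimal sketch (hypothetical numbers, not the project's code) of the fix described in the FIXME, simulating each molecule's contribution at the problem mass and taking the maximum:

import numpy
intensitiesAtThisMass = numpy.array([0.05, 0.90, 0.30])  #each molecule's reference intensity at this mass
concentrations        = numpy.array([10.0, 2.0, 1.0])    #currently solved concentrations
simulatedContributions = intensitiesAtThisMass * concentrations  # -> [0.5, 1.8, 0.3]
correction2index = numpy.argmax(simulatedContributions)  #index 1: biggest simulated signal, even though index 0 has the biggest concentration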

parsingUserInput functionalization of subcode

this should become a module, and should have functions like this:

#If it receives a string, makes it the only item in a list of length one. If it receives a non-string, tries to cast it as a list.
def listCast(inputObj):
    if isinstance(inputObj, str):
        inputObj = [inputObj]
    else:
        inputObj = list(inputObj)
    return inputObj

#This *requires* a list or array type object as the inputObj to work correctly.
def parallelVectorize(inputObj, desiredLength, desiredArrayType="list"):
    if isinstance(inputObj, str):
        print("Warning: parallelVectorize is not designed to work with string inputs.")
    if len(inputObj) == desiredLength:
        return inputObj
    elif len(inputObj) == 1:
        if desiredArrayType == "list":
            parallelizedList = []
            for index in range(desiredLength): #repeat the single value desiredLength times
                parallelizedList.append(inputObj[0])
            return parallelizedList
        if desiredArrayType == "array":
            print("The array feature of parallelVectorize has not been implemented yet, but would not be hard to implement.")
            #return parallelizedArray
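
A usage sketch of the two helpers above (hypothetical values):

listCast("28")                   # -> ["28"]
listCast((1.0, 2.0))             # -> [1.0, 2.0]
parallelVectorize([1.0E-10], 3)  # -> [1.0E-10, 1.0E-10, 1.0E-10]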
          
   

collectedFileUncertainties should have UserInput validation and vectorization

Has a few options. Can be a list:
UserChoices['uncertainties']['collectedFileUncertainties'] = [1.37E-10,0,0,0,1.24E-10,2.01E-10,2.97E-10,3.73E-10,1.51E-10,5.52E-10,0,1.22E-14,1.03E-12,5.37E-11,3.39E-11,2.84E-10,0,3.63E-15,0,9.72E-15,0]

Can be the word "none", "auto", "file", or even an integer.

So user validation must check if it's a string, an integer, or a list. And if it's a list, check if the length is one to vectorize across all masses.
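
A minimal sketch (hypothetical function name, not existing code) of the validation and vectorization described above:

def validateCollectedFileUncertainties(value, numberOfMasses):
    if isinstance(value, str):           #"none", "auto", or "file" passes through unchanged
        return value
    if isinstance(value, (int, float)):  #a single number is applied to every mass
        return [value] * numberOfMasses
    value = list(value)
    if len(value) == 1:                  #a length-one list is vectorized across all masses
        return value * numberOfMasses
    return value                         #a full-length list is used as-is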

Add a try statement and warning for smoothing

When using timerange, there can be situations with not enough points to smooth. Then we see something like this:


  File "C:\Users\fvs\Desktop\Temp\Junk\m\MSRESOLVE\200226MoskowitzChallenge\v8.3\XYYYDataFunctionsSG.py", line 96, in DataSmootherPolynomialSmoothing
    smoothedPoints = numpy.polyval(numpy.polyfit(timeslist,currentWindowAsArray,polynomialOrder), currentTime)

  File "C:\Users\fvs\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\polynomial.py", line 555, in polyfit
    raise TypeError("expected non-empty vector for x")

TypeError: expected non-empty vector for x

As I noted in an email to Ben...
I’ve seen this error before. It means there is a situation where it could not find enough points to smooth somewhere along the way.
I found that I could make the smoother radius 50 and then things ran.
Alternatively, I could make things work by changing timerange to pointrange.
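
A minimal sketch (not the project's actual code) of the suggested try statement and warning, using the names from the traceback above; the empty arrays stand in for a window with no points:

import numpy
timeslist = numpy.array([])            #hypothetically empty: no points fell within the timerange window
currentWindowAsArray = numpy.array([])
polynomialOrder, currentTime = 1, 0.0
try:
    smoothedPoints = numpy.polyval(numpy.polyfit(timeslist, currentWindowAsArray, polynomialOrder), currentTime)
except TypeError:
    print("Warning: not enough points within the smoothing timerange to fit a polynomial. "
          "Consider increasing the smoother radius or changing timerange to pointrange.")
    smoothedPoints = currentWindowAsArray  #fall back to the unsmoothed values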

exporting improvements (reference pattern and other things)

As of Nov 10th, 2021, after the inclusion of TuningCorrectorGasMixture and additional tuning corrector exporting features, there are now too many exported files.

The exporting should be upgraded as follows.

  • (1) Instead of having files like Exported0ExtractedReferencePattern, we should put them into a subdirectory like this: Exported\0ExtractedReferencePattern (then the deletion will delete that directory).
  • (2) We should move files like "PreprocessedData" into that Exported directory. Only final files like the final reference pattern(s) and the ScaledConcentrations should be in the running directory.
  • (3) For the tuning corrector, the ratios at each mass are exported, but 3 files are made because the abscissa is exported separately. These should be hstacked and exported as two files (see the sketch after this list).
  • (4) Export the final reference pattern as a full reference file, described below.
  • (5) During tuning correction, the absolute uncertainties of the pattern should be exported (because they're not a constant).
  • (5b) Though not about exporting, a continuation of 5: when an ExistingReferencePattern, StandardReferencePattern, or GasMixture pattern is being read in, MSRESOLVE should search for accompanying files that have the suffix "_uncertainties" and then use them if they exist. Otherwise the uncertainty propagation will not happen correctly after the mixed reference pattern has been made.
  • Make MSRESOLVE export the total uncertainty (sqrt(a^2+b^2) of the standard error and the absolute uncertainty). --> this suggestion is from Nov 2020; it might already be done.
  • Related issue (but is marked as wontfix): https://github.com/AdityaSavara/MSRESOLVESG/issues/202
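
A minimal sketch (hypothetical array and file names) of the hstack suggested in (3), so that the abscissa and the ratios go out together:

import numpy
massAbscissa = numpy.array([27.0, 28.0, 39.0])    #hypothetical mass fragments
ratios       = numpy.array([[1.1], [0.9], [1.3]]) #hypothetical correction ratios at those masses
combined = numpy.hstack((massAbscissa.reshape(-1, 1), ratios))  #mass column followed by ratio column(s)
numpy.savetxt("TuningCorrectorRatios.csv", combined, delimiter=",")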

Currently the final reference patterns are exported with numbers, but these are just the patterns, not a full reference file.

The final pre-processed reference file should probably be exported with a number (this is particularly useful for tuning correction cases).

This may require making some kind of modification to the data exporting function, since currently it does not have a way to export reference files rather than patterns. The easiest way is probably going to be to simply have that data exporting function call the reference file exporting function. [And it should probably be a call from within that same function, in order to most easily keep track of the numbering properly.]

Tuning Correction Improvements

  • createMixedReferencePatterns should be set to false if no standard or existing (external) file is provided.
  • exporting improvements as noted here: https://github.com/AdityaSavara/MSRESOLVESG/issues/267
  • if UserChoices['measuredReferenceYorN']['on'] = 'no', then UserChoices['measuredReferenceYorN']['referenceFileStandardTuning'] = ['GasMixtureNISTRef.csv'] needs to be set to blank during user input validation. Same thing for the existing pattern. (See the sketch after this list.)
  • The variables referenceFileStandardTuning and referenceFileExistingTuning in the user input need to be changed to say "andForm" at the end of the variable name. Same for referenceFileDesiredTuning.
  • With the gas mixture tuning corrector, there is a lot of printing of "moleculechooser" that is probably not intended. An optional argument or a temporary turning off of verbose should be implemented for the relevant line(s).
  • After making a mixed reference pattern with the tuning corrector, we need to remove molecules again if chosen molecules is on and they are not in chosen molecules. Undesired behavior was observed in 211105BenDataAnalysisGasMixtureOneExperiment\7.
  • In workshop directory 8 (https://github.com/AdityaSavara/MSRESOLVE_Workshop), there is a problem where the resolveConcentrations are not getting plotted. (Maybe not calculated at all? Or maybe calculated wrong? May need to print out matching_abscissa and other things to trace this issue.)
  • In workshop directory 8, there is a warning printed out about uncertainties not being completely programmed for the tuning corrector.
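
A minimal sketch of the blanking described in the third bullet, using the keys quoted there (the existing-pattern key name is an assumption):

UserChoices = {'measuredReferenceYorN': {'on': 'no',
                                         'referenceFileStandardTuning': ['GasMixtureNISTRef.csv'],
                                         'referenceFileExistingTuning': ['SomeExistingRef.csv']}}  #hypothetical values
if UserChoices['measuredReferenceYorN']['on'] == 'no':
    UserChoices['measuredReferenceYorN']['referenceFileStandardTuning'] = []  #blank out the standard tuning pattern
    UserChoices['measuredReferenceYorN']['referenceFileExistingTuning'] = []  #assumed analogous key for the existing pattern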

add feature of ionization correction factor & make concentration scaling factors a list

this feature would take a coefficient (1.0 by default) and apply it to the scaled concentrations as soon as a line of raw signals is converted into scaled concentrations. The scaled concentrations will be divided by this factor. There will be one factor per chosen molecule. Note that if the factor is 0.5 then the concentration for that molecule will be doubled, which is appropriate because it would reflect that the molecule is half as likely to be ionized and therefore there is twice as much of it actually present.
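
A minimal sketch (hypothetical names and values) of applying the factor; dividing by 0.5 doubles that molecule's concentration, as described above:

import numpy
ionizationCorrectionFactors = numpy.array([1.0, 0.5, 1.25])  #one factor per chosen molecule, default 1.0
scaledConcentrations = numpy.array([2.0, 3.0, 4.0])          #hypothetical scaled concentrations at one time point
correctedConcentrations = scaledConcentrations / ionizationCorrectionFactors  # -> [2.0, 6.0, 3.2]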

This feature is needed to improve accuracy based on the Hiden note, but also based on the fact that we can use this for when concentration scaling coefficients are extracted. Note that we can provide multiple molecules for concentration scaling coefficients, and then we can define the first molecule as the one to which the other molecular concentration scaling coefficients get normalized if they do not have a custom instrumentally measured one provided for them. For example, if the molecule we are converting to torr with is acetaldehyde, the concentration scaling factor is 0.3, and the ionization factor is 1.25, that means that the other molecules which have 1.0 as their ionization factor will get a 0.3 scaling factor, while those with an ionization factor of 2.5 will implicitly get scaling factors of 1 * 0.3 * 2.5 (note that the 2.5 already happens implicitly when the relative concentrations are calculated). Species which have custom-defined calibrations would, in this example, have the 0.3 replaced with something else. They could have an original ionization factor of 1.25 and an ionization correction factor of 0.6; that would mean 0.75 for the corrected ionization factor, and then we'd end up with 0.75 * 0.3 * 1.25 implicitly. This factor would be calculated on the fly, exported, and would supplement any user-provided factor for ionization efficiency. We need to make some good unit tests for this feature before working on it. It should not be too hard to do so. One can even easily test it by making molecules that are actually identical other than the efficiency: "Acetaldehyde_Normal" and "Acetaldehyde_EasyToIonize", for example.
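
A numeric sketch of the example in the paragraph above (names hypothetical, values taken from the text):

referenceScalingFactor = 0.3                              #measured concentration scaling factor for acetaldehyde
impliedScaling_plain = 1 * referenceScalingFactor         #molecules with ionization factor 1.0 get 0.3
impliedScaling_2p5   = 1 * referenceScalingFactor * 2.5   #ionization factor 2.5: 1 * 0.3 * 2.5
correctedIonizationFactor = 1.25 * 0.6                    #original factor 1.25 with correction 0.6 gives 0.75
impliedScaling_custom = 0.75 * 0.3 * 1.25                 #the custom-calibrated case from the text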

The 1.25 factor would be FamilyIonizationFactorNormalizedToNitrogen [FIFNN or whatever] and the 0.75 factor would be empiricalIonizationScalingFactorAfterFamilyIonizationFactorNormalizedToNitrogen (EISFAFIFNN, which would be 1.0 for Acetaldehyde_Normal).

making reading of data more robust against blanks

During one analysis, I initially got an error. While trying to figure things out, I removed extra rows and columns from the input file. I think that may have been the issue. I have not double-checked that it really was, but I think it was: blanks are not numbers, and numpy was trying to convert them into floats.

Add back in uncertainties and signals from filtered peaks for SLSUnique

For small reference peaks that are filtered to 0, we need to keep track of them, see if any other molecule used them for solving, and then add that amount back on as uncertainty in the solved peak in a post-processing step. Simple way: take the predicted signal for the molecule that had filtering and get its absolute uncertainty. Get the predicted signal for the molecule that had SLS unique on it, and get its uncertainty. Subtract the "filtered" molecule's contribution, considering the uncertainties as orthogonal.

Then, scale both the previously solved concentration and the previously solved relative uncertainty based on the new values.
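
A minimal sketch (hypothetical values) of the simple way above, treating the two uncertainties as orthogonal:

import numpy
predictedSignalFiltered, uncertaintyFiltered = 0.02, 0.005  #molecule whose small reference peak was filtered to 0
predictedSignalSolved,   uncertaintySolved   = 1.00, 0.050  #molecule solved by SLS unique at this mass
correctedSignal = predictedSignalSolved - predictedSignalFiltered
combinedUncertainty = numpy.sqrt(uncertaintySolved**2 + uncertaintyFiltered**2)  #orthogonal (quadrature) combination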

solve __var_list__ issue for iterative

a "pre" tag was used to allow the __ to remain in this post.
option 1: remove __ from the export/import module's variable name of __var_list__ (and in the input files)
option 2: use the default user input as G, then import variables from the test file into that namespace (good idea, but how?)
option 3: find a way to import __ variables (good idea, but how?). Maybe as simple as typing __var_list__ = G.__var_list__? That should work, I would think.
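
A minimal sketch of option 3. Note that "from module import *" skips names that begin with an underscore, which is why __var_list__ goes missing; explicit attribute access still works:

import test_1_Input as G                    #the test's input file, as named elsewhere on this page
__var_list__ = G.__var_list__               #explicit attribute access retrieves the double-underscore name
__var_list__ = getattr(G, '__var_list__')   #equivalent, without hard-coding the name in the expression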

print statements every data point?

There seems to be a print statement occurring at every data point in the newest version. I don't know if it is in def ExportMSData(self) or somewhere else. We should track it down and turn it off or make it optional.

Pre-processed data masses should have a version with an abscissa that matches ExperimentData.mass_fragment_numbers and RawSignalsSimulation

RawSignalsSimulation is not including all of the chosen masses that were originally selected.
Consider a case where you have chosen 8 masses and 3 molecules.
If only 7 of those masses are present for those 3 molecules, then ExperimentData.mass_fragment_numbers will be trimmed down to only those 7. This makes sense for solving. However, it turns out that a consequence is that signal simulation now does not include the 8th mass as a simulated mass. I think things are different in iterative, so one option would be to just do an iterative analysis and single iteration if one wants all 8 simulated masses.

However, in that situation the feature implicitSLScorrection can't be used.

This is probably okay, but then the preprocessed data export should also export a version of the signals with only these downselected masses. Right now signal simulation data can't be compared directly to the pre-processed data as easily because of this.

try statement inside userInputValidityFunctions.py

try:
    SettingsVDictionary['referenceCorrectionCoefficients_cov']    = UserChoices['measuredReferenceYorN']['referenceCorrectionCoefficients_cov']
except:
    SettingsVDictionary['referenceCorrectionCoefficients_cov']    = [0,0,0] #TODO: This is to keep some old unit tests running. Ideally they should be fixed.

The way to solve this would be to define the variable referenceCorrectionCoefficients_cov in all of the unit tests. But this will never be worth the lead developer's trouble to do.
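
An equivalent, tighter form (a sketch, not a committed change) would use dict.get with a default, which removes the try statement entirely:

SettingsVDictionary['referenceCorrectionCoefficients_cov'] = UserChoices['measuredReferenceYorN'].get('referenceCorrectionCoefficients_cov', [0, 0, 0])  #default keeps the old unit tests running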

deprecation warnings

genfromtxt needs to get argument of encoding=None

This text:
import test_1_Input as G, imp; imp.reload(G)
Needs to become:
import test_1_Input as G, importlib; importlib.reload(G)

make genfromtxt work with more versions by numpy check

Sometimes see something like this, depending on numpy version:

File "C:/Users/fvs/Desktop/Temp/Junk/m/MSRESOLVE/200226MoskowitzChallenge/v8.3/MSRESOLVE.py", line 1619, in exportSimulatedSignalsSoFar
simulatedSignalsFromIterative = numpy.genfromtxt(simulatedSignalsFromIterativeAbsoluteFileName,delimiter=',',dtype=None, encoding='latin1').astype(str) #Get the data from the iteration's simultaed raw signals file (use astype(str) to keep b' from showing up before each entry)

TypeError: genfromtxt() got an unexpected keyword argument 'encoding'
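
A minimal sketch (not the project's code) of such a version check; genfromtxt gained the encoding keyword around numpy 1.14, so older versions need the call without it:

import numpy
fileName = "SimulatedRawSignals.csv"  #hypothetical file name
numpyMajorMinor = tuple(int(part) for part in numpy.__version__.split('.')[:2])
if numpyMajorMinor >= (1, 14):  #the encoding keyword exists
    data = numpy.genfromtxt(fileName, delimiter=',', dtype=None, encoding='latin1').astype(str)
else:
    data = numpy.genfromtxt(fileName, delimiter=',', dtype=None).astype(str)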

Tuning Correction for tuningCorrectionIntensity (Correction Values) should be checked for cases where pattern max changes

In some cases, the tuning correction pattern feature can result in a change in which mass fragment is the maximum. After that, when the pattern becomes standardized, does this cause a problem for the sensitivities?

The feature of tuningCorrectionIntensity exists, and maybe it is sufficient to correct even when the pattern gets "re-standardized", but it should be checked that no adjustments need to be made for cases as described above.

There is an example in the attached directory, where the file "Before and After for 13butadiene (Ben's Data).xlsx" shows that Butadiene after the correction has a different most intense mass, which causes masses like m39 or so to seem to "go down" after standardization of the tuning corrected file, when in fact they actually went up.

To test the above concern, a fictional data file should be made. Then the qualitative behavior can be checked.

v1 (Working).zip

Look into what simulated raw signals is putting out during iterative

Is it subtracting the signal, then adding it back in? Or is it overwriting what was in that column originally? The solving is being done correctly, but this matters for interpreting the raw signals files that are exported between iterations. For example, consider if iterations one and three have two different molecules that each have mass 28.
