edeutsch commented on August 22, 2024

Hi Matt, thanks for these thoughtful comments.

  • You are correct that my attempt at using scan was foolish.
  • In the scan/index field, instead of MS:1000770, can we just use nativeId? Once a receiver of a USI has determined the right file, is the term obvious?
    (I suggest this because asking ordinary users to juggle MS:1000770 seems difficult.)
  • I suggest we still keep scan for Thermo files to avoid complication.
  • For WIFF, can we use nativeId:1,1,123,2?
  • Instead of the scanStart/scanEnd thing, can we use a range in the string (e.g. nativeId:123,456-567)?
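
To make the range idea concrete, here is a minimal sketch of how a parser might read such an index field. It is only an illustration of the proposal, not an agreed format: the function name parse_index and the rule that any comma-separated component may be a start-end range are assumptions made for this example.

```python
from typing import List, Tuple


def parse_index(field: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Parse 'type:a,b,c-d' into (type, [(a, a), (b, b), (c, d)]).

    Hypothetical helper: every component is returned as an inclusive
    (start, end) pair, so a plain number n becomes (n, n).
    """
    # rpartition splits on the LAST colon, so a CV accession such as
    # 'MS:1000770' survives intact as the index type.
    index_type, _, value = field.rpartition(":")
    if not index_type or not value:
        raise ValueError(f"expected 'type:value', got {field!r}")

    components = []
    for part in value.split(","):
        start, dash, end = part.partition("-")
        lo = int(start)
        hi = int(end) if dash else lo
        if hi < lo:
            raise ValueError(f"descending range {part!r}")
        components.append((lo, hi))
    return index_type, components
```

With this sketch, nativeId:1,1,123,2 yields four single-value components, while nativeId:123,456-567 yields (123, 123) followed by (456, 567). Which component is allowed to carry a range would still have to be defined per vendor.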

from usi.

chambm commented on August 22, 2024
  1. Just knowing the filename doesn't tell you what kind of nativeId it is. You can't key off file extension: Bruker and Agilent both use .d as an extension. I guess you could require users to open the file before parsing the nativeId, at which point they should be able to infer the nativeId type, but to me it would feel underspecified without an explicit nativeId type.

  2. Can you elaborate on that? I don't think 0,1,123 is much more complicated than 123. It's an unambiguous abbreviation of the proper nativeId.

  3. That would depend on the final answer to #2. But if nativeId was indeed used without a specific type then that WIFF id would be correct.

  4. I lean toward keeping discrete key-value pairs because it'll be easier for existing parsers to deal with. But using these combined scan nativeId formats at all defers to the other discussion about how to combine IMS scans which has not yet been resolved. The formats will need to be added to the CV for example.


edeutsch commented on August 22, 2024
  1. My concern here is that we are hoping that USIs are something that can be used by all users in the community. Making users put 'MS:1000770' into the USI seems like adding a needless element of mysterious black magic into the USI. Just having "nativeId" seems to me like the limit of what users will put up with. It seems like ProteoWizard can, given a filename, open it and convert it to mzML where it tells you what the nativeId format is. Surely ProteoWizard should be able to open an arbitrary file (be it Bruker or Agilent or mzML), determine what the nativeId format would be if it were to convert that file (it already does it somehow), and then interpret the user input "nativeId:1,1,123,2" in that context. In fact, for a WIFF file, it can only be one thing (I think?), so if a user provides "MS:1000769:1,1,123,2" and that is the wrong nativeId type for a WIFF file, then ProteoWizard might return "nativeId type for this kind of file should be MS:1000770". Why make the user tell the software what the software can determine for itself? Maybe this adds an extra "check digit" confirmation, but it really seems like a needless turn-off for users.

  2. It is true that "0,1,123" is not much more complicated than "123", but we're trying to make this as simple as possible for ordinary users, and I will suggest that the vast majority of users are used to designating Thermo spectra by their "scan number". Why introduce the concept of controllerNumber and controllerType if they're always going to be "0,1"? I will posit that for 99% of Thermo users out there, if you ask them "Please open up file X in Xcalibur and show me scan:123", they'll know exactly what to do. If you ask them "Please open up file X in Xcalibur and show me MS:1000768:0,1,123", you will only get a blank stare. And you and I can write software that can handle either, so why go with the complicated solution?

  3. I am neutral on this one. Creating a whole new set of terms including xxStart and xxEnd doesn't sound ideal, but I'm not against it. Using a range like 1,1,120-130,2 does feel like a bit of a hack. Parsers will need to be updated in either case when faced with such input. For a 4-part key like 1,1,123,2, would we need to have separate terms for each permutation of what could be rangeable? Could you ever have 1,1,120-130,2-3? Presumably not, but if there were, the one term would amplify into 4 terms. Not a big deal, I guess. I'm fine either way.
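
As a rough sketch of points 1 and 2, this is how software could expand a short user-facing id once it knows the file's nativeID format. It assumes the reader (e.g. ProteoWizard) has already determined the format accession for the file; the function expand_native_id is hypothetical, and the field names are the ones I understand the PSI-MS CV to define for MS:1000768 (Thermo), MS:1000769 (Waters) and MS:1000770 (WIFF).

```python
# Field names per nativeID format, as I understand the PSI-MS CV definitions.
NATIVE_ID_FIELDS = {
    "MS:1000768": ("controllerType", "controllerNumber", "scan"),  # Thermo
    "MS:1000769": ("function", "process", "scan"),                 # Waters
    "MS:1000770": ("sample", "period", "cycle", "experiment"),     # WIFF
}


def expand_native_id(file_format: str, index_type: str, value: str) -> str:
    """Expand an abbreviated id into a full nativeID string.

    file_format is the accession the software inferred from the file itself;
    index_type is what the user wrote ('scan', 'nativeId' or an accession).
    """
    fields = NATIVE_ID_FIELDS[file_format]
    if index_type == "scan":
        # Proposed Thermo shorthand: scan:123 stands for 0,1,123.
        if file_format != "MS:1000768":
            raise ValueError("scan shorthand is only defined here for Thermo files")
        value = "0,1," + value
    elif index_type not in ("nativeId", file_format):
        # The optional "check digit": an explicit type that contradicts the file.
        raise ValueError(
            f"nativeId type for this kind of file should be {file_format}"
        )
    parts = value.split(",")
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} components, got {len(parts)}")
    return " ".join(f"{name}={part}" for name, part in zip(fields, parts))
```

Under these assumptions, scan:123 on a Thermo file and nativeId:1,1,123,2 on a WIFF file both resolve without the user typing an accession, and a mismatched explicit accession still produces the error message described in point 1.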


chambm commented on August 22, 2024
  1. It's true that, other than displaying the id in a more user-friendly way, I can't think of a reason to parse most abbreviated ids back into their original format before opening the file. However, there are some exceptions:
    a. What about mzML files? They could have scans like "merged=123" as well as nativeIDs. Granted, we don't have a CV term for that format. So in a USI it might be merged:123, although this would only work for referencing the mzML file which has that specific id (i.e. not the original raw file)
    b. What about MGF files with their just-won't-go-away spectrum titles (which by the way are not necessarily unique)? If you want to talk about user unfriendliness, ask a user to find "index=1234" in an MGF file. :) But then ask them to find the spectrum title when it's not unique. Ah, the wonders of MGF.
    c. What about 3-array ion mobility formats? When we start suggesting (but not mandating, right?) combined 3-array spectra as the recommended representation, the nativeId format will be different depending on whether the spectra are in the 3-array representation or not. So for Waters, uncombined id would be:
    nativeId:1,0,123 (function=1 process=0 scan=123)
    and a combined id would be:
    nativeId:1,1,200 (function=1 scanStart=1 scanEnd=200)
    We could possibly disambiguate this as:
    nativeId:1,0,123
    combinedNativeId:1,1,200

  2. I can only say 2 things about this one:
    a. I don't think USIs are easily human readable (nor do I think they need to be). Taking the scan number out of a big USI mzspec:PXD013210:TTB20160722_ISBHJOMXX001879_r01:scan:19809 doesn't make the USI as a whole easily human readable.
    b. I don't really like giving Thermo special treatment. We have that 0,1 in the id for non-MS spectra. For MS spectra it's always going to be 0,1 (AFAIK). But if a user opened a file with both MS and PDA detectors, and maybe the PDA spectra showed up first instead of MS spectra, saying "find scan:123" is ambiguous. Of course, in the contexts we work in, it's almost certainly meant to be MS scan 123, but I don't see why Thermo should get special license to be ambiguous. Unfortunately our Thermo nativeID format uses numbers instead of strings to identify the detector (e.g. controllerType=0 instead of controllerType=MS, controllerType=5 instead of controllerType=PDA), but that's water under the bridge.

  3. So far each vendor would only have 1 dimension with a start/end. We discussed on the email thread that combining between scan times (frames or blocks) should be considered a processing step rather than a raw data representation. But it's also true that with enough arrays (with repeated values where necessary), an entire run could just be represented as a single spectrum. However, in either case (using ranges or separate start/end terms), new native ID formats might be needed for the combined mobility formats. We really ought to finalize that recommendation.
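
To illustrate the Waters case from 1c, here is a small sketch of why a separate key helps. The combinedNativeId key and the function/scanStart/scanEnd layout are only the tentative forms floated above, not CV terms, and expand_waters_id is a hypothetical helper.

```python
def expand_waters_id(key: str, value: str) -> str:
    """Expand a three-part Waters id according to the key it was given under."""
    first, second, third = (int(part) for part in value.split(","))
    if key == "nativeId":
        # Uncombined spectrum: function, process, scan.
        return f"function={first} process={second} scan={third}"
    if key == "combinedNativeId":
        # Tentative combined ion mobility form: function, scanStart, scanEnd.
        if third < second:
            raise ValueError("scanEnd must not be smaller than scanStart")
        return f"function={first} scanStart={second} scanEnd={third}"
    raise ValueError(f"unknown key {key!r}")
```

The same three numbers mean different things under the two keys: 1,1,200 is function=1 process=1 scan=200 as a nativeId but function=1 scanStart=1 scanEnd=200 as a combinedNativeId, so the value alone cannot tell a reader which representation was intended.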

