proxi-schemas's People

Contributors

edeutsch, jjcarver, orenogithub, traviscibot, ypriverol

proxi-schemas's Issues

Spectra attributes. Verbose or compact?

According to the current schema, a Spectrum from /spectra has something like this:

{
"attributes": [
{
"accession": "MS:1000744",
"cv_param_group": null,
"name": "selected ion m/z",
"value": "473.1234"
},
{
"accession": "MS:1000041",
"cv_param_group": null,
"name": "charge state",
"value": "2"
},
...
}

This is nice, but quite verbose. And what if the value is another CV term?
Over in PSI Spectra libraries format land:
http://proteomecentral.proteomexchange.org/cgi/spectra?usi=mzspec:PXL000001:05-29-2014:index:5001&output_format=json
I started using a more compact notation, e.g.:

{
"attributes": [
[
"MS:1000041|charge state",
"2"
],
[
"MS:1000744|selected ion m/z",
"847.417"
],
[
"MS:1009030|representative spectrum type",
"MS:1009032|consensus spectrum"
],
[
"MS:1009040|number of enzymatic termini",
2,
"1"
],
[
"MS:1001045|cleavage agent name",
"MS:1001251|Trypsin",
"1"
],

The first item in each sublist is the key (accession|name), the second item is the value, and the optional third item is the cv_param_group.

More cryptic for sure. But a lot less verbose and a bit more graceful when the value is the cvParam.
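
To make the comparison concrete, here is a minimal sketch of converting between the two notations. The function names to_compact and to_verbose are hypothetical and not part of any schema; the sketch assumes the key is always "accession|name" and that a missing third item means cv_param_group is null.

```python
from typing import Any, Dict, List


def to_compact(attribute: Dict[str, Any]) -> List[Any]:
    """Convert one verbose attribute dict into the compact list form."""
    key = f"{attribute['accession']}|{attribute['name']}"
    compact = [key, attribute.get("value")]
    group = attribute.get("cv_param_group")
    if group is not None:
        # The third item is only present when the attribute belongs to a group.
        compact.append(group)
    return compact


def to_verbose(compact: List[Any]) -> Dict[str, Any]:
    """Convert one compact list back into the verbose dict form."""
    accession, _, name = compact[0].partition("|")
    return {
        "accession": accession,
        "cv_param_group": compact[2] if len(compact) > 2 else None,
        "name": name,
        "value": compact[1],
    }
```

Under these assumptions the conversion is lossless in both directions, so the choice between the two forms is mostly about readability and payload size.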

What thinks we?

Spectra endpoint with no filter should be avoided

The spectra endpoint accepts the following filters:
usi, accession, scan, file collection

However, we never specified that at least one of them must be provided in the query. As a result, the following query is possible:

http://www.peptideatlas.org/api/proxi/v0.1/spectra?pageSize=100&resultType=compact&responseContentType=json

In practical terms we shouldn't allow this, because not many users will loop over the entire resource to get all the spectra.
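
One way a server could enforce this is sketched below. The parameter names in SPECTRA_FILTERS are assumptions taken from the list in this issue (in particular "fileCollection" is a guess at how "file collection" would be spelled), and has_spectra_filter is a hypothetical helper, not existing code.

```python
from typing import Mapping

# Assumed filter names, taken from the list in this issue; the real
# parameter names in the schema may differ.
SPECTRA_FILTERS = ("usi", "accession", "scan", "fileCollection")


def has_spectra_filter(params: Mapping[str, str]) -> bool:
    """Return True if at least one filter parameter has a non-empty value."""
    return any(params.get(name, "").strip() for name in SPECTRA_FILTERS)
```

A server could answer with a 400 whenever this returns False, so that paging parameters alone are not enough to start a crawl of the whole resource.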

Opinions?

Develop JSON Schema validator

As discussed during today's call, we need a validator for the JSON response that goes further than only checking whether there is a response or not.
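
As a starting point for discussion, here is a minimal hand-rolled check for a /psms response, using only the Python standard library. It is a sketch under assumptions: it takes the response to be an array of Psm objects, with peptideSequence required and accession and usi being strings, as in the Psm definition quoted elsewhere in these issues. The function name validate_psms is hypothetical, and a real validator would more likely be driven by the schema itself.

```python
from typing import Any, List


def validate_psms(response: Any) -> List[str]:
    """Return a list of problems found; an empty list means the response passed."""
    if not isinstance(response, list):
        return ["top-level value is not an array"]
    problems = []
    for i, psm in enumerate(response):
        if not isinstance(psm, dict):
            problems.append(f"item {i}: not an object")
            continue
        if "peptideSequence" not in psm:
            problems.append(f"item {i}: missing required 'peptideSequence'")
        # Optional fields are only type-checked when present.
        for field in ("peptideSequence", "usi", "accession"):
            if field in psm and not isinstance(psm[field], str):
                problems.append(f"item {i}: '{field}' is not a string")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything that is wrong with a given implementation in one pass.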

Responses are all arrays unless id?

In the current schema /datasets returns an array (except for the {identifier} form).
But /spectra and /psms and all the rest are not returning arrays. Shouldn't they all return arrays?

inconsistent count* in Peptidoform

In the definition of Peptidoform, countPSM is singular and countDatasets is plural. I suggest we make them consistent. Plural is probably best. Change to countPSMs?

  countPSM: 
    type: integer
    description: Number of PSMs that support the current Peptidoform 
  countDatasets: 
    type: string 
    description: Number of datasets that support the current Peptidoform

accession input and output

/psms input has:

    - name: accession
      in: query 
      type: string 
      description: Dataset accession 

But Psm output has:

  accession:
     type: string 
     description: Accession of the PSM

This is either confusing or an error.

I also suggest that "accession" is too vague. May I suggest datasetIdentifier?
I suppose we use "accession" everywhere else. But I think this is vague and confusing.
At a bare minimum, we should not use "accession" for a PSM accession.

How should we resolve?

default pageNumber?

in /psms:

    - name: pageNumber
      in: query
      description: Current page to be shown paged psms (default page 1)
      required: false
      type: integer
      default: 0

Is the default 0 or 1? The schema says 0, but the description says 1.
This occurs in multiple other places in the schema.

Also, we should state clearly whether page numbering is 0-based or 1-based, i.e. whether the first page is pageNumber=0 or pageNumber=1.
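
To show why the convention matters, here is a small sketch of how a server might turn pageNumber and pageSize into item offsets. The helper page_bounds and its first_page argument are hypothetical; the point is only that the same pageNumber selects different rows under the two conventions.

```python
from typing import Tuple


def page_bounds(page_number: int, page_size: int, first_page: int = 0) -> Tuple[int, int]:
    """Return (start, end) item offsets for a page; end is exclusive."""
    if page_number < first_page:
        raise ValueError(f"pageNumber must be >= {first_page}")
    if page_size < 1:
        raise ValueError("pageSize must be a positive integer")
    start = (page_number - first_page) * page_size
    return start, start + page_size
```

With first_page=1, pageNumber=1 selects items 0 to 99, whereas with first_page=0 the same request skips the first 100 items, so a client that guesses wrong silently loses a page.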

Framework error responses

Our schema nominally defines an error result as this:

     Error:
       required:
         - code
         - message
       properties:
         code:
           type: integer
           format: int32
         message:
           type: string

We can code that up. But what happens when our frameworks themselves encounter an error, such as a schema violation?

ProteomeCentral:
curl -i -X GET --header 'Accept: application/json' 'http://proteomecentral.proteomexchange.org/api/proxi/v0.1/datasets?pageSize=100&pageNumber=1&resultType=foo'

HTTP/1.1 400 BAD REQUEST
{
  "detail": "'foo' is not one of ['compact', 'full']\n\nFailed validating 'enum' in schema:\n    {'default': 'compact',\n     'description': 'Type of the object to be retrieve Compact or Full '\n                    'dataset',\n     'enum': ['compact', 'full'],\n     'in': 'query',\n     'name': 'resultType',\n     'type': 'string'}\n\nOn instance:\n    'foo'",
  "status": 400,
  "title": "Bad Request",
  "type": "about:blank"
}

After investigating the framework code, it appears to implement this draft RFC (Problem Details for HTTP APIs):
https://tools.ietf.org/html/draft-ietf-appsawg-http-problem-00

PRIDE:
curl -i -X GET --header 'Accept: application/json' 'http://wwwdev.ebi.ac.uk/pride/proxi/archive/v0.1/datasets?pageSize=100&pageNumber=1&resultType=foo'

HTTP/1.1 400
{
"timestamp" : 1581580688155,
"status" : 400,
"error" : "Bad Request",
"message" : "Failed to convert value of type 'java.lang.String' to required type 'uk.ac.ebi.pride.ws.pride.utils.WsContastants$ResultType'; nested exception is org.springframework.core.convert.ConversionFailedException: Failed to convert from type [java.lang.String] to type [@org.springframework.web.bind.annotation.RequestParam uk.ac.ebi.pride.ws.pride.utils.WsContastants$ResultType] for value 'foo'; nested exception is java.lang.IllegalArgumentException: No enum constant uk.ac.ebi.pride.ws.pride.utils.WsContastants.ResultType.foo",
"path" : "/pride/proxi/archive/v0.1/datasets"
}

MassIVE:
curl -i -X GET --header 'Accept: application/json' 'ccms-internal.ucsd.edu/ProteoSAFe/proxi/v0.1/datasets?pageSize=100&pageNumber=1&resultType=foo'

HTTP/1.1 400 Bad Request

<title>Apache Tomcat/6.0.24 - Error report</title><style></style>

HTTP Status 400 - Unrecognized "resultType" parameter value [foo]


type Status report

message Unrecognized "resultType" parameter value [foo]

description The request sent by the client was syntactically incorrect (Unrecognized "resultType" parameter value [foo]).


Apache Tomcat/6.0.24

jPOST seems not to mind the schema violation:
curl -i -X GET --header 'Accept: application/json' 'https://repository.jpostdb.org/proxi/datasets?resultType=foo&accession=PXD005159'

HTTP/1.1 200 OK
[{"accession":[{"name":"jPOST dataset identifier","value":"JPST000200","accession":"MS:1002632","cvLabel":"MS"},{"name":"ProteomeXchange accession number","value":"PXD005159","accession":"MS:1001919","cvLabel":"MS"}],"title":"HeLa standard shotgun DDA analysis using a two-meter C18 monolithic silica column","publications":[{"name":"PubMed identifier","accession":"MS:1000879","value":"","cvLabel":"MS"},{"name":"Reference","accession":"MS:1002866","value":"","cvLabel":"MS"}],"contacts":[[{"name":"dataset submitter","accession":"MS:1002037","cvLabel":"MS"},{"name":"contact name","accession":"MS:1000586","value":"Saki Nambu","cvLabel":"MS"},{"name":"contact email","accession":"MS:1000589","value":"[email protected]","cvLabel":"MS"},{"name":"contact affiliation","accession":"MS:1000590","value":"Kyoto university","cvLabel":"MS"}],[{"name":"lab head","accession":"MS:1002332","cvLabel":"MS"},{"name":"contact name","accession":"MS:1000586","value":"N/A","cvLabel":"MS"},{"name":"contact affiliation","accession":"MS:1000590","value":"N/A","cvLabel":"MS"}]],"species":[[{"name":"taxonomy: scientific name","value":"Homo sapiens (Human)","accession":"MS:1001469","cvLabel":"MS"},{"name":"taxonomy: NCBI TaxID","value":"9606","accession":"MS:1001467","cvLabel":"MS"}]],"instruments":[[{"name":"Q Exactive","accession":"MS:1001911","cvLabel":"MS"}]]}]

How do we feel about these results?
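
If we wanted clients (or a broker) to cope with these differences in the meantime, one option is to normalize whatever comes back into the Error shape from our schema. The sketch below is an assumption about how that could look: normalize_error is a hypothetical helper, and the order in which it tries the "detail", "message", "title" and "error" keys is a guess based only on the responses shown above.

```python
import json
from typing import Dict, Union


def normalize_error(status: int, body: str) -> Dict[str, Union[int, str]]:
    """Map a framework-specific error body onto {"code": ..., "message": ...}."""
    message = body.strip()  # fallback: use the raw body, e.g. an HTML error page
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # not JSON at all
    if isinstance(parsed, dict):
        # Problem-details style bodies carry "detail"/"title";
        # the Spring-style body shown above carries "message"/"error".
        for key in ("detail", "message", "title", "error"):
            value = parsed.get(key)
            if isinstance(value, str) and value:
                message = value
                break
    return {"code": status, "message": message}
```

This does nothing for the jPOST case, where the server answers 200 OK; detecting that would need validation of the request against the schema on the client side.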

In Dataset, why summary?

In the schema definition of Dataset, it seems that we have an attribute "summary" that really is "Description" in PX XML? Can we preserve the name for clarity and call this attribute "description" instead?

Do we need Psm accession?

Psm is defined in the YAML as:

Psm:
  required:
    - peptideSequence
  properties:
    accession:
      type: string
      description: Accession of the PSM
    usi:
      type: string
      description: The USI representation for the PSM
    ...

I like the usi. But what is the accession? Does anyone plan on filling in some other kind of accession for a PSM?

Related, the output does not have datasetIdentifier. All the other components needed to build a USI are part of the output. Except datasetIdentifier. Seems like we should have it. Maybe that's what accession was supposed to be?

Slack channel

@edeutsch @jjcarver Shin, Nuno and Juan, do we want to have an open channel in Slack that enables us to talk about the project on a daily basis, for example? This could also be open to other collaborators so they can interact with the group/project.

We have been using this strategy in other projects such as Biocontainers, where people join to ask questions and propose features for the resource.

How should a spectrum status be used?

The current Spectrum class defines a required attribute:

      status:
        type: string 
        enum: [READABLE, PEAK UNAVAILABLE]
        description: Status of the Spectrum

Can we define these status entries?

What does READABLE mean? Does this mean that the spectrum exists and can be fetched and provided? I suppose this is fine, although it is a strange word, since the antonym is UNREADABLE. But what would UNREADABLE mean? And that isn't an option.

What does "PEAK UNAVAILABLE" mean exactly? Is the first peak unavailable? Any one peak? All peaks? Some peaks? Or does it mean the spectrum is unavailable? How is this different from a 404?

How should this be used? At PeptideAtlas a spectrum is either available and provided or it is not available and just not in the returned list or is a 404. PeptideAtlas doesn't use "PEAK UNAVAILABLE" since I don't know what it should mean or how it should be used.

Should it be used if there is no such spectrum at the repository?
Should it be used if the spectrum is real and valid and should be available, but due to some technical glitch it cannot be fetched from the data store? So not 404. But closer to 500?

We should decide and document this.

What should /psms compact and full look like?

All endpoints have resultType=compact|full.
What should compact for /psms be?
The YAML says:

Psm:
  required:
    - peptideSequence
  properties:

But just a list of peptideSequences is useless. I thought just USIs would be a fine compact form, but peptideSequence is required. So here's a possibility:

http://www.peptideatlas.org/api/proxi/v0.1/psms?resultType=compact&accession=PXD005942
[
{
"peptideSequence": "LSSPATLNSR",
"usi": "mzspec:PXD005942:030219_ywt_sf-39:scan:10:LSSPATLNSR/2"
},
{
"peptideSequence": "LSSPATLNSR",
"usi": "mzspec:PXD005942:030219_ywt_sf-39:scan:13:LSSPATLNSR/2"
},
{
"peptideSequence": "LSSPATLNSR",
"usi": "mzspec:PXD005942:030219_ywt_sf-40:scan:15:LSSPATLNSR/2"
},
...

Do we like that?
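
For implementers, the compact form proposed above amounts to a simple projection of the full record. A minimal sketch, assuming a full Psm is a plain dict and that compact means exactly peptideSequence plus usi (the helper name compact_psm is hypothetical):

```python
from typing import Any, Dict

# Fields kept in the proposed compact form of a Psm.
COMPACT_PSM_FIELDS = ("peptideSequence", "usi")


def compact_psm(psm: Dict[str, Any]) -> Dict[str, Any]:
    """Project a full Psm record onto the proposed compact fields."""
    return {key: psm[key] for key in COMPACT_PSM_FIELDS if key in psm}
```

Keeping the field list in one constant would make it easy to change once we agree on what compact should contain.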

Are startPosition and endPosition really needed?

ProteinIdentification:
  required:
    - proteinAccession
    - startPosition
    - endPosition

Are startPosition and endPosition really required here?
We don't have them trivially available at the moment, so we are lying with -1 and -1.
We can get them and will, I guess.
But I'm questioning whether we really should have these as required. Most proteomics data output doesn't normally capture this?

peptide vs peptidoform conflation

We are totally conflating the terms peptide and peptidoform. From the /peptides YAML doc:
http://www.peptideatlas.org/api/proxi/v0.1/ui/#/

"The peptide entry point returns global peptidoform statistics across an entire resource. Each peptide contains a summary of the statistics of the peptidoform across the entire resource."

We should aim to be clear and precise. If this endpoint is dealing in peptidoforms (and it does because there are ptms there), then I think we should call it:
/peptidoforms

Do we also want to have a /peptides entry point that is scrubbed of all mass mods?
i.e. the /peptides entry point is agnostic to mass mods, while
the /peptidoforms endpoint requires full handling of mass mods.

What do you think?
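
To illustrate the distinction, here is a sketch of how peptidoforms could be collapsed into peptides. It assumes mass mods are written in square brackets, as in the USI examples elsewhere in these issues; other notations are not handled, and both function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumption: every mass modification is written as a [bracketed] group.
_MOD = re.compile(r"\[[^\]]*\]")


def strip_mass_mods(peptidoform: str) -> str:
    """Return the bare peptide sequence for a peptidoform string."""
    return _MOD.sub("", peptidoform)


def group_by_peptide(peptidoforms: Iterable[str]) -> Dict[str, List[str]]:
    """Group peptidoform strings under their mod-free peptide sequence."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for form in peptidoforms:
        groups[strip_mass_mods(form)].append(form)
    return dict(groups)
```

A /peptides endpoint would then report statistics per key of this grouping, while /peptidoforms would report them per individual string.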

Error status for multiple providers

As we discussed last week, we will need a different definition of errors or status when querying all entry points. The broker will need to retrieve multiple statuses for multiple entry points. We have multiple options here:

  • The first option is to return the data array as is, followed by a separate errors object, like:
 [
     {
         "peptideSequence": "LSSPATLNSR",
         "usi": "mzspec:PXD005942:030219_ywt_sf-39:scan:10:LSSPATLNSR/2"
     },
     {
         "peptideSequence": "APLVCLPVFVSR",
         "usi": "mzspec:PXD005942:030219_ywt_sf-39:scan:120:APLVC[Carbamidomethyl]LPVFVSR/2"
     }
 ]
 {
     "errors": []
 }
  • The second option is to encode the data into one part of the object and the errors in another, like:
 {
     "data": [
         {
             "peptideSequence": "LSSPATLNSR",
             "usi": "mzspec:PXD005942:030219_ywt_sf-39:scan:10:LSSPATLNSR/2"
         },
         {
             "peptideSequence": "APLVCLPVFVSR",
             "usi": "mzspec:PXD005942:030219_ywt_sf-39:scan:120:APLVC[Carbamidomethyl]LPVFVSR/2"
         }
     ],
     "errors": []
 }

The second approach defines a global object with two parts, data and errors.
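
Here is a rough sketch of how a broker could build the data/errors object. It is only an illustration: each provider is represented by a plain callable standing in for an HTTP request, and the "provider" and "message" fields inside each error entry are hypothetical, since we have not defined what an error entry looks like.

```python
from typing import Any, Callable, Dict, List


def query_providers(providers: Dict[str, Callable[[], List[Any]]]) -> Dict[str, List[Any]]:
    """Call every provider and merge the results into one data/errors object."""
    data: List[Any] = []
    errors: List[Dict[str, str]] = []
    for name, fetch in providers.items():
        try:
            data.extend(fetch())
        except Exception as exc:
            # One failing provider should not hide results from the others.
            errors.append({"provider": name, "message": str(exc)})
    return {"data": data, "errors": errors}
```

With this shape a client always parses a single JSON object and can tell a partial result (non-empty errors) from a complete one.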

/datasets API endpoint JSON output format

Below is some sample JSON that we would tentatively output from the /datasets API endpoint. The dataset used in this example is live in both MassIVE and ProteomeCentral, and can be found at the following links:

MassIVE dataset: https://massive.ucsd.edu/ProteoSAFe/QueryMSV?id=MSV000081125
MassIVE FTP: ftp://massive.ucsd.edu/MSV000081125
ProteomeCentral dataset: http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=6629
ProteomeCentral dataset XML: http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=6629&outputMode=XML&test=no
This is a "full" record with all files listed out:

{
    "accession": "PXD006629",
    "title": "Mitochondrial H+-ATP synthase in human skeletal muscle: contribution to dyslipidemia and insulin resistance",
    "summary": "Mitochondrial H+-ATP synthase in human skeletal muscle: contribution to dyslipidemia and insulin resistance",
    "species": [
        {"accession": "MS:1001467", "name": "taxonomy: NCBI TaxID", "value": "9606", "cvLabel": "MS"}
    ],
    "instruments": [
        {"accession": "MS:1002416", "name": "Orbitrap Fusion", "cvLabel": "MS"}
    ],
    "modifications": [
        {"accession": "UNIMOD:737", "name": "TMT6plex", "cvLabel": "UNIMOD"},
        {"accession": "UNIMOD:35", "name": "Oxidation", "cvLabel": "UNIMOD"},
        {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "cvLabel": "UNIMOD"}
    ],
    "contacts": [
        {"contactProperties":[
            {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"},
            {"accession": "MS:1000586", "name": "contact name", "value": "John Lapek", "cvLabel": "MS"},
            {"accession": "MS:1000589", "name": "contact email", "value": "[email protected]", "cvLabel": "MS"},
            {"accession": "MS:1000590", "name": "contact affiliation", "value": "UCSD", "cvLabel": "MS"}
        ]},
        {"contactProperties":[
            {"accession": "MS:1002332", "name": "lab head", "cvLabel": "MS"},
            {"accession": "MS:1000586", "name": "contact name", "value": "Laura Formentini", "cvLabel": "MS"},
            {"accession": "MS:1000589", "name": "contact email", "value": "[email protected]", "cvLabel": "MS"},
            {"accession": "MS:1000590", "name": "contact affiliation", "value": "UAM University Madrid", "cvLabel": "MS"}
        ]}
    ],
    "publications": [
        {"accession": "MS:1002853", "name": "Dataset with no associated published manuscript", "cvLabel": "MS"}
    ],
    "keywords": [
        {"accession": "MS:1001925", "name": "submitter keyword", "value": "mitochondria", "cvLabel": "MS"},
        {"accession": "MS:1001925", "name": "submitter keyword", "value": "insulin resistance", "cvLabel": "MS"},
        {"accession": "MS:1001925", "name": "submitter keyword", "value": "ATP synthase", "cvLabel": "MS"}
    ],
    "datasetLink": {"accession": "MS:1002488", "name": "MassIVE dataset URI", "value": "http://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=d6756ac742ed4f13811ddab2843e7d54", "cvLabel": "MS"},
    "dataFiles": [
        {"accession": "MS:1002846", "name": "Associated raw file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/raw/DG000895_Francisco_Normal_Mitos.raw", "cvLabel": "MS"},
        {"accession": "MS:1002850", "name": "Peak list file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/peak/DG000895_Francisco_Normal_Mitos.mzML", "cvLabel": "MS"},
        {"accession": "MS:1002845", "name": "Result file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/result/DG000895_Francisco_Normal_Mitos_PSMs.mzTab", "cvLabel": "MS"},
        {"accession": "MS:1002848", "name": "Result file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/ccms_result/DG000895_Francisco_Normal_Mitos_PSMs.mzTab", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/other/DG000895_Francisco_Normal_Mitos.zip", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/other/Francisco_Normal_Mitos.xlsx", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/ccms_parameters/params.xml", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/ccms_statistics/statistics.tsv", "cvLabel": "MS"}
    ],
    "links": [
        {"rel": "self", "href": "http://massive.ucsd.edu/ProteoSAFe/proxi/datasets/PXD006629"}
    ]
}

Please comment on any potential issues you see with this sample output format.

do we really stipulate a max pageSize?

Current YAML says:

    - name: pageSize
      in: query
      description: How many items to return at one time (default 100, max 100)
      required: false
      type: integer
      default: 100

I'm fine with a default 100 so that a naive query does not return a billion rows. But why should we stipulate a max of 100? If a client wants to pull all 10,000 PSMs from PXD123, why should they have to do it in chunks of 100? How irritating for them. And extra work for my machine too.

I propose we keep the default of 100, but let each implementing site choose what max or limits to impose. If PRIDE only wants to allow 100 at a time, fine. But I don't think we should prevent PeptideAtlas from returning 10,000 rows if the user asks for them. It's not enforceable via the schema anyway, so I propose we strike that.
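
In code, the proposal would look roughly like the sketch below: a shared default of 100, with the maximum left to each site. The names effective_page_size and site_max are hypothetical, and silently clamping an oversized request (rather than rejecting it) is just one possible policy.

```python
from typing import Optional

DEFAULT_PAGE_SIZE = 100  # the default stated in the current YAML


def effective_page_size(requested: Optional[int], site_max: Optional[int] = None) -> int:
    """Apply the shared default and an optional site-specific maximum."""
    size = DEFAULT_PAGE_SIZE if requested is None else requested
    if size < 1:
        raise ValueError("pageSize must be a positive integer")
    if site_max is not None:
        # A site that wants a cap clamps the request; others pass it through.
        size = min(size, site_max)
    return size
```

A site that wants the current behavior would use site_max=100, while a site with site_max=None would honor a request for 10,000 rows.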

Comments?

/peptidoforms still labeled getPeptides

The operationId for /peptidoforms is still getPeptides, which leads to confusing autogenerated code and will cause a problem if we ever create a /peptides endpoint.

/peptidoforms:
  get:
    summary: Get a collection of peptidoforms
    operationId: getPeptides

Way of filtering datasets

It would be great if we can filter datasets by the following fields:

/datasets?pageSize=50&pageNumber=1&resultType=compact
/datasets?pageSize=50&pageNumber=1&resultType=full

  • Filters (filtering result by columns that are returned):
    /datasets?species=human
    /datasets?species=homo*
    /datasets?species=*sapiens
    /datasets?species=homo sapiens
    /datasets?species=homo sapiens&pageSize=50&pageNumber=1&resultType=compact
    /datasets?species=9606
    /datasets?species=taxon:9606
    /datasets?species=human;mouse # decide which of these to use or what is conventional
    /datasets?species=[human,mouse]
    accession=
    instrument=
    contact=
    publication=
    modification=
    search=liver
    search=P12345 # this might be honored by a service, but not mandatory. At some point in the future, we as a consortium might decide to add protein=

Searches: (selecting results based on terms that can apply to any part of the returned records)
/datasets?search=liver
/datasets?species=human&contact=Mann&search=liver

/datasets?species=human&pageSize=50&pageNumber=1&resultType=compact
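
As an illustration of the wildcard forms above (homo*, *sapiens), here is a sketch of one possible matching rule. It assumes that * matches any run of characters, that matching is case-insensitive, and that the whole value must match; none of this is decided yet. It does not resolve synonyms such as human versus Homo sapiens or taxonomy IDs, which would need a lookup on the server side. The name species_matches is hypothetical.

```python
import re


def species_matches(pattern: str, species: str) -> bool:
    """Match a species value against a filter where * is the only wildcard."""
    # Escape everything except *, which becomes "any run of characters".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, species, flags=re.IGNORECASE) is not None
```

Whatever rule we pick, it should be written into the spec so that the same query returns comparable results at every site.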

species as query parameter?

Also in my implementation notes was the idea that "species" should be an input parameter to all of the endpoints. One can imagine wanting to constrain any of those queries to limit results to just one species.

What do you think?

Massive data files do not exist

@jjcarver I was testing the PROXI API today and I realized that the MassIVE endpoint is not returning the files associated with the dataset. Can you make an effort to return that information?

This is important because, if we start implementing clients and tools associated with the API, users will expect as much information as possible.

Comments on current form of Spectrum class

Regarding the current schema:
https://raw.githubusercontent.com/HUPO-PSI/proxi-schemas/master/specs/swagger.yaml

Here is a toy example of a Spectrum object as defined by the current schema:
http://www.peptideatlas.org/api/proxi/v0/spectra/238293
UI: http://www.peptideatlas.org/api/proxi/v0/ui/

  • Comments/questions:
    • usi is great, but what is accession? Anything the repo wants? preferably in usi notation? (like an lsi {local spectrum identifier})
    • charge: this already has a very specific meaning (assuming this means precursor_charge) Do we really need a CV term?
    • mz: this already has a very specific meaning (assuming this means precursor_mz) Do we really need a CV term?
    • If so, what is the proper term? MS:1000744 selected ion m/z is what mzML uses
    • I suggest the very limited number of fixed slots don't need OntologyTerm since they are clearly defined
    • What if there are multiple precursors in the selection window? rarely supported but a common occurrence..
    • In addition to the standard set of attributes, how about a container for additional CV terms
      e.g. spectrumAttributes = [ OntologyTerm, OntologyTerm, ... ]
    • In OntologyTerm, can we omit cvLabel? I know mzML et al. have it, but since our accessions are full CURIEs, the CURIE prefix is the cvLabel
      UNLESS we are defining a cvList in the document header somewhere, which we're not. So how about we remove cvLabel and always use the common CURIE prefixes?
      CURIE prefixes are not completely unambiguous (e.g. PMID: vs PUBMED:), but in our little MS world, effectively they are.
    • What is the desired format of peakList? Array of peaks, yes, but what is each peak? 3-element array? dict of attributes?

PRIDE USI without interpretation not working?

PRIDE PROXI spectra returning mzs not in order

Fetching a spectrum from PRIDE via PROXI such as:
http://wwwdev.ebi.ac.uk/pride/proxi/archive/v0.1/spectra?resultType=full&usi=mzspec:PXD000966:CPTAC_CompRef_00_iTRAQ_12_5Feb12_Cougar_11-10-11.mzML:scan:11850:[UNIMOD:214]YYWGGLYSWDMSK[UNIMOD:214]/2

returns the mzs in random order. This is not against the current spec, which does not specify. But it is breaking the Lorikeet viewer at ProteomeCentral. Seems like many applications may assume mzs in order.

What should be the resolution?

  • Should we update the documentation to allow mzs in any order? Write some code to compensate for out of order mzs in ProteomeCentral and all other applications?
  • Should we update the documentation to require mzs in ascending order? And update the PRIDE implementation?

Does the random order of mzs at PRIDE match the same order of intensities? One risk of separate arrays is that they become unaligned.
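
Whichever way we decide, a client-side workaround is straightforward as long as the two arrays are still aligned. A minimal sketch, assuming the spectrum carries two parallel lists (called mzs and intensities here); the helper name sort_peaks is hypothetical:

```python
from typing import List, Sequence, Tuple


def sort_peaks(mzs: Sequence[float], intensities: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Sort peaks by ascending m/z while keeping each intensity with its m/z."""
    if len(mzs) != len(intensities):
        raise ValueError("mzs and intensities have different lengths")
    # Pair the arrays first so the intensities follow their m/z values.
    pairs = sorted(zip(mzs, intensities), key=lambda pair: pair[0])
    return [mz for mz, _ in pairs], [intensity for _, intensity in pairs]
```

This only helps if the server's arrays are aligned in the first place; if they are not, no client-side fix can recover the correct pairing.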
