- id_prefixes currently says "the identifier of this class or slot must begin with one of the URIs referenced by this prefix".
This implies that a single prefix can reference more than one URI. I'm hoping that we are dealing with a model where every prefix maps to exactly one URI (note, however, that the reverse may not be true: I still need to check whether we guarantee that each URI is mapped by only one prefix).
- when it comes to actually validating data, I would think that the following:
    classes:
      HighClass:
        id_prefixes:
          - NCIt
          - SCT
Would assert that a YAML or JSON representation of the id of an instance of HighClass would necessarily start with "NCIt:" or "SCT:", while an RDF instance would start with https://nci.....org/ncit/... or http://snomed.org/id/.
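As a rough illustration of that reading, here is a minimal Python sketch of the prefix-only check. It assumes every prefix maps to exactly one URI. The `PREFIX_MAP` dictionary, its placeholder URIs, and the function name `has_allowed_prefix` are all hypothetical and are not part of LinkML.

```python
# Hypothetical prefix map: each prefix maps to exactly one URI.
# The URIs are placeholders, not the real NCIt or SNOMED CT namespaces.
PREFIX_MAP = {
    "NCIt": "http://example.org/ncit/",
    "SCT": "http://example.org/sct/",
}


def has_allowed_prefix(identifier, id_prefixes, prefix_map=PREFIX_MAP):
    """Return True if identifier starts with an allowed prefix.

    The identifier may be a CURIE ("NCIt:...") as in YAML or JSON,
    or a fully expanded URI as in RDF.
    """
    for prefix in id_prefixes:
        # CURIE form: the prefix followed by a colon.
        if identifier.startswith(prefix + ":"):
            return True
        # URI form: the single URI that the prefix expands to.
        uri = prefix_map.get(prefix)
        if uri is not None and identifier.startswith(uri):
            return True
    return False
```

For example, `has_allowed_prefix("NCIt:C12345", ["NCIt", "SCT"])` accepts the CURIE form, and the same call with `"http://example.org/sct/123456"` accepts the expanded form under the placeholder map.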
What I would propose, however, is that we extend the definition of id_prefixes to support the following:
    classes:
      HighClass:
        id_prefixes:
          NCIt:
          SCT:
Which would be the same as the above. We would extend the definition slightly, to allow:
    classes:
      HighClass:
        id_prefixes:
          NCIt: ^C\d{5,6}$
          SCT: ^\d{6,18}$
Which would assert that the local name of a CURIE or URI must be "C" followed by 5 or 6 digits if the prefix is NCIt, or a 6 to 18 digit number if the prefix is SCT.
This would be a minimal change to the LinkML model itself and, since the loaders do not yet do anything with ID prefixes, it would not require any changes to existing loader behavior.
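To make the proposed semantics concrete, here is a minimal Python sketch of how a validator might apply the prefix-to-pattern form to a CURIE. It is only an illustration under stated assumptions: the `ID_PREFIXES` dictionary mirrors the YAML above, while the function name `check_curie` and the treatment of an empty value as "any local name" are hypothetical.

```python
import re

# Mirrors the proposed YAML: prefix -> pattern for the local name.
# A value of None stands for a prefix listed with no pattern (assumption).
ID_PREFIXES = {
    "NCIt": r"^C\d{5,6}$",
    "SCT": r"^\d{6,18}$",
}


def check_curie(curie, id_prefixes=ID_PREFIXES):
    """Return True if the CURIE uses an allowed prefix and its local
    name matches that prefix's pattern, when one is given."""
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in id_prefixes:
        return False  # not a CURIE, or the prefix is not allowed
    pattern = id_prefixes[prefix]
    if pattern is None:
        return True  # prefix-only entry: any local name is accepted
    return re.search(pattern, local_name) is not None
```

Under this sketch `check_curie("NCIt:C12345")` passes, while `check_curie("NCIt:C1234")` and `check_curie("SCT:12345")` fail because the local names are too short.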
Questions:
- Do we really need the "^...$" pattern or can we assume them?
- Would we ever want two or more patterns and, if so, would we want something of the form
    id_prefixes:
      NCIt:
        - C\d{5}
        - M\d{7}
or would "(C\d{5}|M\d{7})" be ok?
- It should be noted that SNOMED CT, in particular, includes a check digit and other formatting information that isn't expressible as a simple RE. Should we provide a hook for future use that names an algorithm, or just let it slide?
My suggested answers are: 1) Assume them, 2) single RE is fine and 3) nah - not now
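Suggested answers 1 and 2 can be sketched together in Python. This assumes a validator that matches the whole local name, which is what makes the "^...$" anchors unnecessary, and a single pattern using alternation in place of a list. The names `NCIT_PATTERN` and `local_name_ok` are hypothetical.

```python
import re

# One RE with alternation instead of a list of patterns; no explicit anchors.
NCIT_PATTERN = r"C\d{5}|M\d{7}"


def local_name_ok(local_name, pattern):
    """Return True only if the pattern matches the entire local name.

    re.fullmatch requires the whole string to match, so the pattern
    behaves as if it were wrapped in ^(...)$.
    """
    return re.fullmatch(pattern, local_name) is not None
```

With this approach `local_name_ok("C12345", NCIT_PATTERN)` and `local_name_ok("M1234567", NCIT_PATTERN)` succeed, while `"C123456"` and `"xC12345"` are rejected even though the pattern carries no anchors.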