linkml / linkml

Linked Open Data Modeling Language

Home Page: https://linkml.io/linkml

License: Other

Makefile 0.07% Shell 0.03% Python 83.98% Jupyter Notebook 15.30% Dockerfile 0.04% Jinja 0.58%

rdf modeling linkml json-schema linkml-schema semantic-web schema data-modeling json-ld-context owl

linkml's Introduction

LinkML - Linked Data Modeling Language

LinkML is a linked data modeling language following object-oriented and ontological principles. LinkML models are typically authored in YAML, and can be converted to other schema representation formats such as JSON or RDF.

This repo holds the tools for generating and working with LinkML. For the LinkML schema (metamodel), please see https://github.com/linkml/linkml-model

The complete documentation for LinkML can be found here:

linkml.io/linkml

linkml's Issues

meta:class_uri and skos:mappingRelation IRIs incorrect in rdf-gen

Previously meta:class_uri resolved to the class IRI, for example:

<https://w3id.org/biolink/vocab/Gene> meta:class_uri <http://purl.obolibrary.org/obo/SO_0000704> ;

On the biolink-model master branch it looks like this:

<https://w3id.org/biolink/vocab/Gene> meta:class_uri <https://w3id.org/biolink/vocab/Gene> ;

It would also be useful for skos:mappingRelation to point to the proper IRIs, currently they are:

<https://w3id.org/biolink/vocab/Gene>  skos:mappingRelation <https://w3id.org/biolink/vocab/SIO:010035>,
        <https://w3id.org/biolink/vocab/SO:0000704>,
        <wd:Q7187> ;
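
The expansion this issue asks for can be sketched as follows. This is a minimal illustration, not the rdf-gen implementation: `expand_curie` and the prefix map are hypothetical, and only the `SO` expansion is taken from the IRI quoted above (the `SIO` and `wd` entries are assumptions).

```python
# Hypothetical prefix map; only the SO expansion comes from the issue text.
PREFIXES = {
    "SO": "http://purl.obolibrary.org/obo/SO_",
    "SIO": "http://semanticscience.org/resource/SIO_",  # assumption
    "wd": "http://www.wikidata.org/entity/",            # assumption
}


def expand_curie(curie: str, prefixes: dict) -> str:
    """Expand 'prefix:local' to a full IRI, failing loudly on unknown prefixes."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"Cannot expand CURIE {curie!r}: unknown prefix")
    # Concatenate with the prefix expansion instead of appending the raw
    # CURIE to the default namespace, which is the behaviour reported above.
    return prefixes[prefix] + local


mapping_iris = [
    expand_curie(c, PREFIXES) for c in ("SIO:010035", "SO:0000704", "wd:Q7187")
]
```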

Remove the requirement of default range

Requiring a default range is a bit heavy-handed. We shouldn't bother with it until we encounter an unspecified range and, even then, we may want to hard-code the basic string range.

Ifabsent uri needs to automatically be emitted as a prefix

At the moment, ifabsent_uri.yaml requires emit_prefixes in order to generate Python. This should not be necessary.

Drop "key" from model

The differences between "key" and "identifier" are not totally obvious. It appears that the intent of "key" was to imply some of the behaviors of "identifier" -- unique within a container without the dictionary characteristics. We believe that this behavior will better be met by adding a modifier to the "inlined" property to say whether something should be inlined as a dictionary or a list.

Note that the python is equipped to read both forms at the moment, but will always emit dictionaries. As per issue biolink/biolinkml#186, "key" will be dropped from the model and "inlined_list" will be added, where:

class does not have an identifier slot --> inlined must be true and will be forced to true if not specified. Will inline as a list
class has an identifier slot = True
- inlined = false (default) / inlined_list = false (default) -- list of class identifiers
- inlined = true, inlined_list = false(default) -- Input for python can either be a dict or a list. Yaml and json output will be as a dictionary. JSON Schema will specify dict
- inlined_list = true, inlined = true (if omitted will be set to true) -- input for python can either be a dict or a list. Yaml and JSON output will be as a list. JSON Schema will specify a list
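
The rules above can be read as a small decision function. The sketch below is only an illustration of that reading; `serialization_form` and its return labels are hypothetical names, not part of biolinkml.

```python
def serialization_form(has_identifier: bool,
                       inlined: bool = False,
                       inlined_list: bool = False) -> str:
    """Return how a class-valued slot is written to YAML/JSON under the proposed rules."""
    if not has_identifier:
        # No identifier slot: inlined is forced to true and values go out as a list.
        return "list of objects"
    if inlined_list:
        # inlined_list = true implies inlined = true.
        return "list of objects"
    if inlined:
        return "dict keyed by identifier"
    # Default: reference the objects by their identifiers only.
    return "list of identifiers"
```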

Treatment of slots with range typeof string, iri, etc - treat as object property?

(low priority)

Currently this will be translated to an OP

types:
  label type:
    typeof: string
    description: >-
      A string that provides a human-readable name for a thing

slots:
  name:
    is_a: node property
    aliases: ['label', 'display name']
    domain: named thing
    range: label type
    description: >-
      A human-readable name for a thing
    in_subset:
      - translator_minimal
    mappings:
      - "rdfs:label"

The generated OP has no range constraint. Additional class slots get translated into subClassOf axioms, e.g.

Note that in OWL we are forced to choose AP vs DP vs OP. No punning across these. Use of APs for rdfs:label etc is useful for classes since we are talking about the class but not an individual. However, arguably there is less need for APs in OWL2 now that we can pun on class (use an OP for labels).

There are a number of options here with pros and cons

Treat as OPs and put types in domain of discourse

Here label type in the above schema would be treated as a class, and name as an OP.

this is coherent but is obviously different from how people do this at the moment.

We also clearly have an issue here if we use rdfs:label as the IRI here as we do illegal punning between AP and OP.

One possibility is to model it this way, but to have a defined transform literalify/deliteralify, e.g.

?x blmod:name [a blmod:label type ;
                            blmeta:has-value "fred"]
<=>
?x rdfs:label "fred"

Specify OWL type in model

E.g in the definition of name state that this is owl:AP. Similarly for other slots we may want to declare as DPs.

The advantage is that it introduces no illegal punning if we reuse existing property IRIs from RDFS, OBO, etc

The disadvantage is that such properties can't be used in logical axioms.

Use DPs

Again this would cause illegal punning if we reuse IRIs such as rdfs:label.

aliases and alias metamodel components should be shown on class and slot pages

See https://biolink.github.io/biolink-model/docs/PhenotypicFeature.html as an example (scroll to the bottom)

Question on range of subproperty_of

The range of subproperty_of is currently uriorcurie, meaning that it may or may not reference a slot definition. It appears that some of the biolink-model and markdown code expects it to always reference a slot definition. Should we tighten this definition so that subproperty_of always references a slot?

See: subproperty_of: develops from in biolink-model for example of a non-reference

OWL and RDF test suites need to incorporate a difference calculator

The BNodes in the emitted OWL and RDF make the order of the restrictions non-deterministic. The tests will fail until this is fixed.

Currently the tests have @unittest.expectedFailure decorators

Implement a way of mapping from a triad to a hyperedge

See biolink/biolink-model#269

Want to say:

R has-input C1
R has-output C2
R enabled-by P
R type Reaction

==>

C1 derives-into C2 << catalyzed-by P
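
One way to picture the mapping is as a function over plain triples. The sketch below is hypothetical (`reactions_to_edges` and the output dictionary layout are not from biolinkml) and assumes each reaction has a single value per predicate.

```python
from collections import defaultdict


def reactions_to_edges(triples):
    """Collapse Reaction nodes into 'derives-into' edges qualified by 'catalyzed-by'."""
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s][p] = o  # assumption: one value per predicate

    edges = []
    for props in by_subject.values():
        if props.get("type") != "Reaction":
            continue
        if not all(k in props for k in ("has-input", "has-output", "enabled-by")):
            continue  # incomplete triad, nothing to emit
        edges.append({
            "subject": props["has-input"],
            "predicate": "derives-into",
            "object": props["has-output"],
            "qualifiers": {"catalyzed-by": props["enabled-by"]},
        })
    return edges


edges = reactions_to_edges([
    ("R", "has-input", "C1"),
    ("R", "has-output", "C2"),
    ("R", "enabled-by", "P"),
    ("R", "type", "Reaction"),
])
```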

Document SOP for extending metamodel

Occasionally we want to add new metadata to classes, slots, etc

E.g. adding creator, date_added, date_modified

What is the SOP for adding these?

Could be in a README-developers.md or CONTRIBUTING.md or similar

Inherited lines and what have you different color

Integrate biothings explorer generator code

Talking with @newgene

@kevinxin90 wrote some nice code for converting the biolinkml yaml to the CD2H format, biolink-model here:

http://discovery.biothings.io/bts/

This is nice! We could link this from the main biolink-model site

Would be good to explore integration of code. I think Kevin used the yaml directly and didn't use the biolinkml framework. I also think he hardcoded some filtering, e.g associations don't show up.

It looks like Kevin is missing quite a bit, e.g. http://discovery.biothings.io/bts/ProteinIsoform doesn't show inherited mixins - this is why it's good to use the biolinkml python framework rather than trying to figure the yaml semantics for yourself!!

Need a namespace manager generator, available at runtime vs. compile time.

We need to be able to generate a namespace manager tool
The initial code can be found at:

https://github.com/ncats/translator-modules/blob/ARA_Workflow_API/ncats/translator/identifiers/__init__.py

@RichardBruskiewich - no action required, just letting you know that we've got this

Need unit tests for the prolog generator (lpgen)

Module failures when running examples notebook

When running the examples.ipynb notebook, it failed to successfully load the as_json_obj and the yamlmagic would not work correctly. Specifically:

import jsonasobj.as_json_obj ## ERROR: I'm getting error  No module named 'jsonasobj.as_json_obj'

and

%%yaml --loader DupCheckYamlLoader yaml

## ERROR: running this cell produces 2 errors: 
## 1. Javascript Error: require is not defined
## 2. File "<ipython-input-3-f6f890ca4330>", line 4
##    id: http://example.org/sample/example1
            ^
## SyntaxError: invalid syntax

In order to work around these errors I had to make a number of changes. I am attaching my notebook examples-debug.ipynb (in a zip file)
to show how I did this. Comments marked ## ERROR: show where I had issues.

examples-debug.ipynb.zip

[low priority] confusing error messages if slot_usage specified incorrectly

id: t

license: https://creativecommons.org/publicdomain/zero/1.0/
version: 0.0.1

prefixes:
  t: http://w3id.org/t
  biolinkml: https://w3id.org/biolink/biolinkml/
  
default_prefix: t
default_range: string

imports:
  - biolinkml:types

classes:
  a:
    slot_usage:
      my slot: "I am mistakenly putting a description here"

this produces a confusing error

gen-py-classes t.yaml 
...
 File "<string>", line 43, in __init__
  File "/Users/cjm/repos/ontology-change-language/venv/lib/python3.7/site-packages/biolinkml/meta.py", line 229, in __post_init__
    self.classes[k] = ClassDefinition(name=k, **({} if v is None else v))
  File "<string>", line 42, in __init__
  File "/Users/cjm/repos/ontology-change-language/venv/lib/python3.7/site-packages/biolinkml/meta.py", line 445, in __post_init__
    self.slot_usage[k] = SlotDefinition(name=k, **({} if v is None else v))
TypeError: type object argument after ** must be a mapping, not extended_str

this can be hard to debug in a large yaml file

this similarly produces a confusing error

...
classes:
  a:
    slot_usage:
      - my slot

ideally both cases would report the class id, to make debugging easier
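
A pre-check along these lines could run before the dataclasses are built. This is a sketch over the raw YAML dictionary; `check_slot_usage` is a hypothetical helper, not an existing biolinkml function.

```python
def check_slot_usage(classes: dict) -> list:
    """Return readable messages for malformed slot_usage entries, naming the class."""
    errors = []
    for class_name, class_def in (classes or {}).items():
        if not isinstance(class_def, dict):
            continue  # empty or malformed class body; out of scope for this check
        usage = class_def.get("slot_usage")
        if usage is None:
            continue
        if not isinstance(usage, dict):
            errors.append(
                f"class '{class_name}': slot_usage must be a mapping, "
                f"got {type(usage).__name__}"
            )
            continue
        for slot_name, slot_def in usage.items():
            if slot_def is not None and not isinstance(slot_def, dict):
                errors.append(
                    f"class '{class_name}', slot_usage '{slot_name}': "
                    f"expected a mapping, got {type(slot_def).__name__}"
                )
    return errors


# The two failing inputs from this issue, as already-parsed YAML.
case_string = {"a": {"slot_usage": {"my slot": "I am mistakenly putting a description here"}}}
case_list = {"a": {"slot_usage": ["my slot"]}}
```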

create a sparqlgen emitter that generates queries that detect datamodel violations

A standard pattern for QC over ontologies and KGs is to either hand-craft or generate queries that detect datamodel/QC violations. If the query returns zero rows, the check passes. If one or more rows are returned then there is a violation. The queries can be categorized, e.g. ERROR vs WARN. This is how we do things in OBO, see http://robot.obolibrary.org/report

See also the obo-dashboard: http://obo-dashboard-test.ontodev.com/

For more on the general idea see https://github.com/cmungall/dasher

The general idea would be take as input an instance graph (e.g. as JSON-LD or RDF) plus a blml-specified schema, and generate queries:

The rdf:type should be to an IRI in the model
For any S P O triple, P should be defined in the model
Required: Any required field for a class C would generate a query that returned all instances of C that did not have such a field
cardinality: if a field P is not multivalued, and there exists S P V1, S P V2, then this is an error
regex patterns: see https://w3id.org/biolink/biolinkml/meta/pattern
https://biolink.github.io/biolinkml/docs/id_prefixes - IDs should conform
MANY MORE
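
Two of the checks listed above (required fields and cardinality) can be sketched as query templates. The function names and the example.org IRIs are hypothetical; no sparqlgen emitter exists yet, so this only shows the shape of the generated queries.

```python
def required_query(class_iri: str, slot_iri: str) -> str:
    """Query returning instances of a class that lack a required slot."""
    return (
        "SELECT ?s WHERE {\n"
        f"  ?s a <{class_iri}> .\n"
        f"  FILTER NOT EXISTS {{ ?s <{slot_iri}> ?v }}\n"
        "}"
    )


def single_valued_query(slot_iri: str) -> str:
    """Query returning subjects with two distinct values for a non-multivalued slot."""
    return (
        "SELECT ?s ?v1 ?v2 WHERE {\n"
        f"  ?s <{slot_iri}> ?v1, ?v2 .\n"
        "  FILTER (!sameTerm(?v1, ?v2))\n"
        "}"
    )


# Hypothetical IRIs, for illustration only.
q_required = required_query("http://example.org/Person", "http://example.org/name")
q_single = single_valued_query("http://example.org/name")
```

Any row returned by either query is a violation; zero rows means the check passes.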

@hsolbrig - this is potentially redundant with pyshex, but it may also be convenient to have this generate individual sparql queries that could be executed individually. It may make more sense to do this from the shex nevertheless?

Implementing this at the blml level also is a good way of being explicit about semantics

Additionally we can do OWL reasoning e.g with arachne

Slot `owner` attribute should be removed

Slot owner no longer makes sense. It has been marked as deprecated, but we still need to work through the remainder of the code that references it to get an alternative solution.

Switch from rdflib-jsonld to pyld

rdflib-jsonld isn't planning to move to json-ld 1.1 anytime soon. We need to replace the functionality it supplies in biolink w/ https://github.com/digitalbazaar/pyld -- a simple example of its use can be found at https://github.com/hsolbrig/stupid_jsonld_tricks/blob/master/src/gostaysitv2.py

Inline properties of classes of custom types

When defining custom types, it would be good to have them be folded into the class that references these types.

For example, in the case of class Biosample, instead of having 5 arrows go to other types, would be nice to have them inline:

Prefixes section doesn't support CURIES

Prefixes section doesn't support CURIES. When you supply one, you get:

File "~/.local/share/virtualenvs/tccm-CoY-QjRv/lib/python3.8/site-packages/biolinkml/utils/metamodelcore.py", line 131, in __init__

without a line number reference. We should be able to pass identifiers and the line number generator should be smart enough to know whether it has them or not.

RDF Unit tests take too long

RDF Comparator has some sort of inefficiency -- needs tweaking

value specifications for biolinkml

In biolinkml, can you define/map literal values to an ontology IRI? For example, let's say we want the values "F" to represent a female organism and "M" to represent a male organism. In json-ld you could do something like this:

{
  "@context": {
    "ex": "http://example.com/",
    "sex": {
      "@id": "ex:sex",
      "@type": "@vocab"
    },
    "M": "http://purl.obolibrary.org/obo/CARO_0000027",
    "F": "http://purl.obolibrary.org/obo/CARO_0000028"
  },
  "@graph": [
    {
      "@id": "ex:host1",
      "sex": "F"
    },
    {
      "@id": "ex:host2",
      "sex": "M"
    }
  ]
}

This translates into RDF:

<http://example.com/host1> <http://example.com/sex> <http://purl.obolibrary.org/obo/CARO_0000028> .
<http://example.com/host2> <http://example.com/sex> <http://purl.obolibrary.org/obo/CARO_0000027> .

The larger point I am trying to get at is that we may get a number of different values from multiple data sources that semantically represent the same kinds of entities (e.g., another data source may use the values "female" and "male"). Can we represent such differing value sets?
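
Outside of any biolinkml feature, the idea of several source vocabularies converging on one IRI can be sketched with a plain lookup table. `SEX_VALUE_MAP` and `normalize_sex` are hypothetical names; the IRIs and the source values are the ones used in this issue.

```python
FEMALE = "http://purl.obolibrary.org/obo/CARO_0000028"
MALE = "http://purl.obolibrary.org/obo/CARO_0000027"

# Several source spellings map onto the same ontology IRI.
SEX_VALUE_MAP = {
    "F": FEMALE,
    "female": FEMALE,
    "M": MALE,
    "male": MALE,
}


def normalize_sex(value: str) -> str:
    """Map a source-specific literal to its ontology IRI, rejecting unknown values."""
    try:
        return SEX_VALUE_MAP[value]
    except KeyError:
        raise ValueError(f"Unmapped sex value: {value!r}") from None
```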

Importing entrydescription appears to pull OpaqueData in to the python even if it is never used... why?

graphql mapping could be improved [low priority]

E.g.

https://github.com/biolink/biolink-model/blob/c535a7ad6c89fcdd159dcde9a3a9e91807b5493e/biolink-model.graphql#L891-L908

can we not assign a more specific type to subject?

Also, given we have type inheritance in graphql, is it necessary to repeat everything?

Create formal specification

Default prefix isn't set in Python if it isn't declared in yaml

Namespaces should be local to python module

Currently, all namespaces are generated for a python extension module. Only the namespaces that are specifically referenced in the extension should be emitted, as the rest can be imported as needed.

Incorrect Markdown generation with complex documentation strings

See: attributes section in https://biolink.github.io/biolinkml/docs/SlotDefinition

The multiline description is not properly formatted, which mangles the rest of the text

ensure that name and title are used appropriately in schemas

name is restricted to ncname for schema, we should ensure this is followed

many schemas may have used name where they should have used title

Model Versioning

We need versioning support for models generated via biolinkml. The language supports a version identifier but, at the moment, there is no way to reference anything but the latest version of a given model via perma-id.

It shouldn't be difficult to add a version to the w3id path, but we still need to:
a) determine how a version in a path resolves to the equivalent source in github
b) determine how to identify the "latest" version and dynamically map it to the appropriate path

default_namespace must be specified

Currently, if you omit the default namespace, you don't know that it is needed until you get a generator specific message such as:

    context = ContextGenerator(os.path.join(inputdir, 'uriandcurie.yaml')).serialize()
  File "/Users/solbrig/git/biolink/biolinkml/biolinkml/utils/generator.py", line 77, in serialize
    self.visit_schema(**kwargs)
  File "/Users/solbrig/git/biolink/biolinkml/biolinkml/generators/jsonldcontextgen.py", line 48, in visit_schema
    default_uri = self.namespaces[self.default_ns]
KeyError: None

Default namespace should be tested for (or a good default should be found) earlier in the process
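
An early check of this kind could look like the sketch below. It works on the parsed YAML as a plain dictionary and assumes the `default_prefix` and `prefixes` keys used in schemas elsewhere on this page; `resolve_default_namespace` is a hypothetical helper, not a biolinkml function.

```python
def resolve_default_namespace(schema: dict) -> str:
    """Return the default namespace IRI, or raise a clear error before any generator runs."""
    schema_id = schema.get("id")
    prefixes = schema.get("prefixes") or {}
    default_prefix = schema.get("default_prefix")
    if default_prefix is None:
        raise ValueError(f"schema {schema_id!r}: default_prefix is not set")
    if default_prefix not in prefixes:
        raise ValueError(
            f"schema {schema_id!r}: default_prefix {default_prefix!r} "
            "is not declared in prefixes"
        )
    return prefixes[default_prefix]
```

Failing here gives the schema id and the missing prefix in the message, instead of a bare KeyError: None from deep inside a generator.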

Define rules for resolving paths for imports and implement consistently

Currently if I have a dir structure:

schema/
   a.yaml
   b.yaml

where a imports b, defined like this:

imports:
   - b

and I run commands from the dir containing schema, most of the gen-X commands successfully resolve the import (to ./schema/b.yaml), presumably by treating the dir of the importing file as base.

However generate_uml.py looks for ./b.yaml and fails

We should define what the rules are, document this, and implement consistently
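
One candidate rule, matching what most gen-X commands appear to do, is to resolve each import against the directory of the importing file rather than the current working directory. The sketch below is an assumption about what such a rule could look like; `resolve_import` is a hypothetical helper and the `.yaml` suffix is assumed.

```python
from pathlib import Path


def resolve_import(importing_file: str, import_name: str) -> Path:
    """Resolve an import relative to the importing file's directory, not the CWD."""
    base = Path(importing_file).parent
    return base / f"{import_name}.yaml"


resolved = resolve_import("schema/a.yaml", "b")
```

With this rule, running a command from the directory that contains schema/ still finds schema/b.yaml.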

slot_usage induces new sub-slots, these should be hidden in many exports

blml allows the usage of a slot to be refined (or even defined) on a per-class basis.

E.g The label slot may be generic for any kind of name. For a person class we may add documentation that states this should be a string that is typically first and last concatenated (OK, this would not be a very good rule for many names but you get the idea).

Originally I had conceived of information about a slot always being retrieved via a compound key of (class,slot). If there is a slot usage for a class, use that. Otherwise a superclass. Otherwise generic slot.

In the current implementation rather than compound key, a primary key is synthesized by concatenating class and slot. This is fine as an underlying implementation, but this should be hidden in many cases, as it causes confusion (e.g biolink/biolinkml#228).

here is a test example:

id: https://github.com/biolink/biolinkml/issues/228
name: test228
title: induced slots

types:
  string:
    base: str
    uri: xsd:string

  

classes:

  r1: {}
  r2:
    is_a: r1
  r3:
    is_a: r2
  
  c1:
    slots:
      - s
    slot_usage:
      s:
        description: s in c1
        range: r1
  c2:
    is_a: c1
    slots:
      - s
    slot_usage:
      s:
        description: s in c2
        range: r2
  c3:
    is_a: c2
    slots:
      - s
    slot_usage:
      s:
        description: s in c3
        range: r3
  d:
    slots:
      - s
    slot_usage:
      s:
        required: true

slots:
  s:
    description: >-
      generic s description
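
The compound-key lookup described above can be sketched against this test schema without creating any c1_s-style names. The dictionaries below hand-copy the relevant parts of the YAML, and `induced_slot` is a hypothetical function. It merges property by property with the most specific class winning, which is one possible reading of "use the class's slot_usage, otherwise a superclass, otherwise the generic slot".

```python
CLASSES = {
    "c1": {"is_a": None, "slot_usage": {"s": {"description": "s in c1", "range": "r1"}}},
    "c2": {"is_a": "c1", "slot_usage": {"s": {"description": "s in c2", "range": "r2"}}},
    "c3": {"is_a": "c2", "slot_usage": {"s": {"description": "s in c3", "range": "r3"}}},
    "d": {"is_a": None, "slot_usage": {"s": {"required": True}}},
}
SLOTS = {"s": {"description": "generic s description"}}


def induced_slot(class_name: str, slot_name: str) -> dict:
    """Look up a slot by (class, slot): generic definition overlaid by slot_usage along is_a."""
    lineage = []
    current = class_name
    while current is not None:
        lineage.append(current)
        current = CLASSES[current].get("is_a")

    merged = dict(SLOTS[slot_name])  # start from the generic slot
    for cls in reversed(lineage):    # root first, most specific class last
        merged.update(CLASSES[cls].get("slot_usage", {}).get(slot_name, {}))
    return merged
```

Under this reading, the lookup for (c2, s) yields range r2 and the c2-specific description.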

markdown changes required:

currently this generates pages for c{1,2,3}_s, as well as s. Only pages for s should be generated. c1,2,3 specific usage can be included in the markdown file for s

All usages of the arrow notation in the generated markdown docs should be removed

owlgen changes required:

currently uris are made for c{1,2,3}_s. We should only have a uri for s.

class-specific ranges can be expressed using an owl subclass axiom

python changes:

reported separately by @wdduncan here: biolink/biolinkml#228 -- but I think current behavior may be ok, if a little confusing

json schema changes:

the current json schema generation is almost correct. There are no induced slots created - only s.

Currently gen-json-schema on the above makes:

{
   "$id": "https://github.com/biolink/biolinkml/issues/228",
   "$schema": "http://json-schema.org/draft-07/schema#",
   "definitions": {
      "C1": {
         "additionalProperties": false,
         "description": "",
         "properties": {
            "s": {
               "$ref": "#/definitions/R1",
               "description": "s in c1"
            }
         },
         "required": [],
         "title": "C1",
         "type": "object"
      },
      "C2": {
         "additionalProperties": false,
         "description": "",
         "properties": {
            "s": {
               "description": "generic s description",
               "type": "string"
            }
         },
         "required": [],
         "title": "C2",
         "type": "object"
      },
      "C3": {
         "additionalProperties": false,
         "description": "",
         "properties": {
            "s": {
               "$ref": "#/definitions/R3",
               "description": "s in c3"
            }
         },
         "required": [],
         "title": "C3",
         "type": "object"
      },
      "R1": {
         "additionalProperties": false,
         "description": "",
         "properties": {},
         "required": [],
         "title": "R1",
         "type": "object"
      },
      "R2": {
         "additionalProperties": false,
         "description": "",
         "properties": {},
         "required": [],
         "title": "R2",
         "type": "object"
      },
      "R3": {
         "additionalProperties": false,
         "description": "",
         "properties": {},
         "required": [],
         "title": "R3",
         "type": "object"
      }
   },
   "properties": {},
   "title": "test228",
   "type": "object"
}

This is pretty good. C1 and C3 are perfect: the localized use of s is correct. However, C2 is deferring to the generic slot, which is bizarre.

Add the ability to generate LinkML YAML from shex

Currently we support biolink-model.yaml -> ShEx output.
Would be good to have biolinkml generate a YAML from ShEx.

This would be helpful in taking an existing ShEx and creating a biolinkml compliant YAML, which can then further be refined manually.

Make a Docker container

This could simplify things for schema maintainers.

Also for use in a .travis.yml for a schema repo

Improvements to markdown generator

Differentiating between Entity, Association, Slot, Relation and Property during export from biolinkml. This could be handled by creating folders for each, and exporting the markdown into these folders.
CURIE mappings from YAML should be in markdown export (currently missing).
The title in each markdown should be camelcase instead of sentence case. For example, it should be Class: NamedThing instead of Class: named thing.
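
The title conversion in the last point is small enough to sketch directly. `camelcase` here is a hypothetical helper written for illustration, not a reference to an existing biolinkml function.

```python
def camelcase(name: str) -> str:
    """Turn a space-separated element name into CamelCase, preserving inner capitals."""
    return "".join(word[:1].upper() + word[1:] for word in name.split())


title = f"Class: {camelcase('named thing')}"
```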

Clarify behavior of imports for generated artefacts

It seems the intent with some generators is not to generate information for imported elements. However, the behavior is incomplete and underspecified. We would need to add import declarations for imported artefacts

To be able to use imports we should assume a default of merging the import closure

Mismatch between the model and the generated JSON Schema

YAML:

  biosample:
    is_a: named thing
    aliases: ['sample', 'material sample']
    description: >-
      A material sample. It may be environmental (encompassing many organisms) or isolate or tissue.  
      An environmental sample containing genetic material from multiple individuals is commonly referred to as a biosample.  
    slots:
      - annotations
    slot_usage:
      id:
        description: >-
          The primary identifier for the biosample
      name:
        description: >-
          A human readable name or description of the biosample
      alternate identifiers:
        description: >-
          The same biosample may have distinct identifiers in different databases (e.g. GOLD and EMSL)
      annotations:
        range: annotation

Generated JSON Schema:

      "Biosample": {
         "description": "A material sample. It may be environmental (encompassing many organisms) or isolate or tissue.   An environmental sample containing genetic material from multiple individuals is commonly referred to as a biosample.",
         "properties": {
            "alternate_identifiers": {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            "annotations": {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            "id": {
               "type": "string"
            },
            "name": {
               "type": "string"
            }
         },
         "title": "Biosample",
         "type": "object"
      },

In the case of the annotations property, the items should be of type object instead of string.

As of now, validation via JSON Schema fails because while the data is correct, in terms of modeling, there is a mismatch between the model and the generated JSON schema.

Logging utility still emits warnings when used in Jupyter Notebook

gen = PythonGenerator(model, log_level=ERROR) still emits warnings when executed in a Jupyter environment. The exact same code does not in a native python environment.

Clarify semantics of imports

Should imports behave like an #include, with everything merged into one space? Or more like owl imports, where each schema is retained in its own space?

It seems the current intent is the latter

But what are the implications for this in generators? E.g. should pythongen iterate through the import closure making different files? Is it up to the client to do this?

Imports can be complex, and I believe they are causing some issues, e.g biolink/biolinkml#121

Add reversible to_str meta-property

Use case: lat-long coordinates can be represented in both structured/normalized and flattened/denormalized form. E.g

  geolocation value:
    is_a: attribute value
    description: >-
      A normalized value for a location on the earth's surface
    slots:
      - latitude
      - longitude
      - as string value
    slot_usage:
      as string value:
        to_str: "{latitude} {longitude}"

The semantics of SLOT.to_str would be:

the range of SLOT must be string
the value of SLOT must be equal to to_str, expanding all {x} expressions with SLOT.parent.x

Implementations are not dictated, but for Python one way would be to generate code for _as_str.

This could be used to drive validators and bidirectional converters (e.g. given a flat string, auto-parse into normalized object)
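
A reversible template of this kind can be sketched with `str.format` in one direction and a derived regular expression in the other. The function names are hypothetical, and the parser assumes field values contain no whitespace; parsed values come back as strings.

```python
import re

TEMPLATE = "{latitude} {longitude}"


def to_str(template: str, obj: dict) -> str:
    """Flatten an object into its string form by expanding {x} with obj[x]."""
    return template.format(**obj)


def from_str(template: str, text: str) -> dict:
    """Parse the flat string back into fields by turning {x} into named groups."""
    parts = re.split(r"\{(\w+)\}", template)  # literals at even indexes, names at odd
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)
        else:
            pattern += f"(?P<{part}>\\S+)"  # assumption: values have no whitespace
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"{text!r} does not match template {template!r}")
    return match.groupdict()


flat = to_str(TEMPLATE, {"latitude": "45.5", "longitude": "-122.6"})
parsed = from_str(TEMPLATE, flat)
```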

{
    "type": ["integer", "string"]
}

See https://cswr.github.io/JsonSchema/spec/multiple_types/

Can biolinkml be extended to support this?

Namespacemanager line 169 -- should have a file/line reference.

JSONLD context prefix with endings other than '/' or '#' not resolved correctly

See biolink model issue 301 (odd CHEBI identifier) for the report. This is because JSON-LD is sensitive to how a prefix IRI ends. If it ends with '_', for example in 'CHEBI_', it is not resolved correctly. It can be fixed with an @prefix attribute, but only in JSON-LD 1.1. See http://tinyurl.com/qmd4ggd for an example.

To fix this: Add a parameter to the generator that switches between 1.0 and 1.1. If emitting 1.1, add "@version": 1.1 to the first bit of the context.
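
The proposed switch can be sketched as a small context builder. `build_context` and its `jsonld_version` parameter are hypothetical, not the actual generator interface, and the CHEBI expansion in the example is an assumption.

```python
import json


def build_context(prefixes: dict, jsonld_version: str = "1.0") -> dict:
    """Build a JSON-LD context, marking awkward prefix endings explicitly in 1.1 mode."""
    context = {}
    if jsonld_version == "1.1":
        context["@version"] = 1.1
    for prefix, iri in prefixes.items():
        if jsonld_version == "1.1" and not iri.endswith(("/", "#")):
            # Expanded term definition: declare that this term may act as a prefix.
            context[prefix] = {"@id": iri, "@prefix": True}
        else:
            context[prefix] = iri
    return {"@context": context}


ctx = build_context(
    {
        "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",  # assumed expansion
        "biolink": "https://w3id.org/biolink/vocab/",
    },
    jsonld_version="1.1",
)
ctx_json = json.dumps(ctx, indent=2)
```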

Need to add Any type -- note that you can set it as the default range if needed

Add enums and label->id mappings

We have

  values_from:
    domain: definition
    multivalued: true
    range: uriorcurie
    description: >-
      the identifier of a "value set" -- a set of identifiers that form the possible values for the range of a slot

(aside: this is declared at the definition level, should be a slot property)

We should have the equiv for string values, i.e an enum.

e.g

slots:
  evidence code:
    enum:
      - IEA
      - ISS
...

it would be good to specify mappings for each of these, perhaps:

slots:
  evidence code:
    enum:
      IEA: ECO:nnnn
      ISS: ECO:nnn
...

or perhaps a more expressive:

slots:
  evidence code:
    enum:
      IEA:
        id: ECO:nnnn
        description: ...
      ISS: 
         id: ECO:nnn
         description: ...
...

mapping to a json-ld context should be obvious
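
To make the three proposed spellings comparable, they can be normalized to one shape and then turned into a context-style label-to-id mapping. Both functions are hypothetical sketches, and the ECO:nnnn values are the placeholders from this issue, not real identifiers.

```python
def normalize_enum(enum_spec) -> dict:
    """Normalize list, label->id, and label->{id, description} forms to one structure."""
    if isinstance(enum_spec, list):
        return {label: {"id": None, "description": None} for label in enum_spec}
    normalized = {}
    for label, value in enum_spec.items():
        if isinstance(value, dict):
            normalized[label] = {
                "id": value.get("id"),
                "description": value.get("description"),
            }
        else:
            normalized[label] = {"id": value, "description": None}
    return normalized


def enum_context(enum_spec) -> dict:
    """Label -> id mapping for the entries that carry an id."""
    return {
        label: entry["id"]
        for label, entry in normalize_enum(enum_spec).items()
        if entry["id"]
    }


plain = normalize_enum(["IEA", "ISS"])
mapped = enum_context({"IEA": "ECO:nnnn", "ISS": {"id": "ECO:nnn", "description": "..."}})
```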