linkml / linkml

Linked Open Data Modeling Language

Home Page: https://linkml.io/linkml

License: Other

Makefile 0.07% Shell 0.03% Python 83.98% Jupyter Notebook 15.30% Dockerfile 0.04% Jinja 0.58%
rdf modeling linkml json-schema linkml-schema semantic-web schema data-modeling json-ld-context owl

linkml's Introduction

LinkML - Linked Data Modeling Language

LinkML is a linked data modeling language following object-oriented and ontological principles. LinkML models are typically authored in YAML, and can be converted to other schema representation formats such as JSON or RDF.

This repo holds the tools for generating and working with LinkML. For the LinkML schema (metamodel), please see https://github.com/linkml/linkml-model

The complete documentation for LinkML can be found here: https://linkml.io/linkml

linkml's People

Contributors

actions-user, amc-corey-cox, anjastrunk, bendichter, cmungall, dalito, deepakunni3, gaurav, glass-ships, hrshdhgd, hsolbrig, ialarmedalien, jeff-cohere, joeflack4, julesjacobsen, kervel, kevinschaper, nicholsn, nlharris, noelmcloughlin, pkalita-lbl, plbremer, richardbruskiewich, sierra-moxon, silvanoc, sneakers-the-rat, sujaypatil96, turbomam, wdduncan, yarikoptic

linkml's Issues

meta:class_uri and skos:mappingRelation IRIs incorrect in rdf-gen

Previously meta:class_uri resolved to the class IRI, for example:

<https://w3id.org/biolink/vocab/Gene> meta:class_uri <http://purl.obolibrary.org/obo/SO_0000704> ;

On the biolink-model master branch it looks like this:

<https://w3id.org/biolink/vocab/Gene> meta:class_uri <https://w3id.org/biolink/vocab/Gene> ;

It would also be useful for skos:mappingRelation to point to the proper IRIs, currently they are:

<https://w3id.org/biolink/vocab/Gene>  skos:mappingRelation <https://w3id.org/biolink/vocab/SIO:010035>,
        <https://w3id.org/biolink/vocab/SO:0000704>,
        <wd:Q7187> ;
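
For illustration only, the expected expansion of those mapping CURIEs could be sketched as below. The prefix map is an assumption made for this example (in practice it would come from the schema's prefixes section), and expand_curie is a hypothetical helper, not part of the generator.

```python
# Hypothetical prefix map; the real one would be read from the schema's `prefixes`.
PREFIXES = {
    "SO": "http://purl.obolibrary.org/obo/SO_",
    "SIO": "http://semanticscience.org/resource/SIO_",
    "wd": "http://www.wikidata.org/entity/",
}


def expand_curie(curie, prefixes):
    """Return the full IRI for a CURIE, or None if its prefix is not declared."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        return None
    return prefixes[prefix] + local


print(expand_curie("SO:0000704", PREFIXES))
# -> http://purl.obolibrary.org/obo/SO_0000704
```

With an expansion step like this, skos:mappingRelation would point at the same IRI that meta:class_uri used to carry, rather than at a CURIE appended to the default namespace.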

Remove the requirement of default range

Requiring a default range is a bit heavy-handed. We shouldn't bother with it until we encounter an unspecified range and, even then, we may want to hard-code the basic string range.

Drop "key" from model

The differences between "key" and "identifier" are not totally obvious. It appears that the intent of "key" was to imply some of the behaviors of "identifier" -- unique within a container without the dictionary characteristics. We believe that this behavior will better be met by adding a modifier to the "inlined" property to say whether something should be inlined as a dictionary or a list.

Note that the python is equipped to read both forms at the moment, but will always emit dictionaries. As per issue biolink/biolinkml#186, "key" will be dropped from the model and "inlined_list" will be added, where:

  • class does not have an identifier slot --> inlined must be true and will be forced to true if not specified. Will inline as a list
  • class has an identifier slot = True
    • inlined = false (default) / inlined_list = false (default) -- list of class identifiers
    • inlined = true, inlined_list = false(default) -- Input for python can either be a dict or a list. Yaml and json output will be as a dictionary. JSON Schema will specify dict
    • inlined_list = true, inlined = true (if omitted will be set to true) -- input for python can either be a dict or a list. Yaml and JSON output will be as a list. JSON Schema will specify a list
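
The rules above can be read as a small decision function. This is only a sketch of the proposal as stated in this issue; output_form and its return labels are hypothetical names, not existing biolinkml code.

```python
def output_form(has_identifier, inlined=False, inlined_list=False):
    """Return how a class-ranged slot would be serialized under the proposed rules.

    Possible results: "identifiers" (list of class identifiers),
    "dict" (inlined as a dictionary) or "list" (inlined as a list).
    """
    if not has_identifier:
        # No identifier slot: inlined is forced to true and values are inlined as a list.
        return "list"
    if inlined_list:
        # inlined_list implies inlined, even if inlined was omitted.
        return "list"
    if inlined:
        return "dict"
    # Both flags left at their defaults: only references are emitted.
    return "identifiers"


for args in [(False,), (True,), (True, True), (True, False, True)]:
    print(args, "->", output_form(*args))
```

The function only covers the YAML/JSON output side; on input, the proposal still lets the python loader accept either a dict or a list.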

Treatment of slots with range typeof string, iri, etc - treat as object property?

(low priority)

Currently this will be translated to an OP

types:
  label type:
    typeof: string
    description: >-
      A string that provides a human-readable name for a thing

slots:
  name:
    is_a: node property
    aliases: ['label', 'display name']
    domain: named thing
    range: label type
    description: >-
      A human-readable name for a thing
    in_subset:
      - translator_minimal
    mappings:
      - "rdfs:label"

The generated OP has no range constraint. Additional class slots get translated into subClassOf axioms, e.g.

[image]

Note that in OWL we are forced to choose AP vs DP vs OP. No punning across these. Use of APs for rdfs:label etc is useful for classes since we are talking about the class but not an individual. However, arguably there is less need for APs in OWL2 now that we can pun on class (use an OP for labels).

There are a number of options here with pros and cons

Treat as OPs and put types in domain of discourse

Here label type in the above schema would be treated as a class, and name as an OP.

this is coherent but is obviously different from how people do this at the moment.

We also clearly have an issue here if we use rdfs:label as the IRI here as we do illegal punning between AP and OP.

One possibility is to model it this way, but to have a defined transform literalify/deliteralify, e.g.

?x blmod:name [a blmod:label type ;
                            blmeta:has-value "fred"]
<=>
?x rdfs:label "fred"

Specify OWL type in model

E.g in the definition of name state that this is owl:AP. Similarly for other slots we may want to declare as DPs.

The advantage is that this introduces no illegal punning if we reuse existing property IRIs from RDFS, OBO, etc

The disadvantage is that these properties can't be used in logical axioms.

Use DPs

Again this would cause illegal punning if we reuse IRIs such as rdfs:label.

Question on range of subproperty_of

The range of subproperty_of is currently uriorcurie, meaning that it may or may not reference a slot definition. It appears that some of the biolink-model and markdown code expects it to always reference a slot definition. Should we tighten this definition so that subproperty_of always references a slot?

See: subproperty_of: develops from in biolink-model for example of a non-reference

Document SOP for extending metamodel

Occasionally we want to add new metadata to classes, slots, etc

E.g. adding creator, date_added, data_modified

What is the SOP for adding these?

Could be in a README-developers.md or CONTRIBUTING.md or similar

Integrate biothings explorer generator code

Talking with @newgene

@kevinxin90 wrote some nice code for converting the biolinkml yaml to the CD2H format, biolink-model here:

http://discovery.biothings.io/bts/

This is nice! We could link this from the main biolink-model site

Would be good to explore integration of code. I think Kevin used the yaml directly and didn't use the biolinkml framework. I also think he hardcoded some filtering, e.g associations don't show up.

It looks like Kevin is missing quite a bit, e.g. http://discovery.biothings.io/bts/ProteinIsoform doesn't show inherited mixins - this is why it's good to use the biolinkml python framework rather than trying to figure the yaml semantics for yourself!!

Module failures when running examples notebook

When running the examples.ipynb notebook, it failed to successfully load the as_json_obj and the yamlmagic would not work correctly. Specifically:

import jsonasobj.as_json_obj ## ERROR: I'm getting error  No module named 'jsonasobj.as_json_obj'

and

%%yaml --loader DupCheckYamlLoader yaml

## ERROR: running this cell produces 2 errors: 
## 1. Javascript Error: require is not defined
## 2. File "<ipython-input-3-f6f890ca4330>", line 4
##    id: http://example.org/sample/example1
            ^
## SyntaxError: invalid syntax

In order to work around these errors I had to make a number of changes. I am attaching my notebook examples-debug.ipynb (in a zip file)
to show how I did this. Comments marked ## ERROR: show where I had issues.

examples-debug.ipynb.zip

[low priority] confusing error messages if slot_usage specified incorrectly

id: t

license: https://creativecommons.org/publicdomain/zero/1.0/
version: 0.0.1

prefixes:
  t: http://w3id.org/t
  biolinkml: https://w3id.org/biolink/biolinkml/
  
default_prefix: t
default_range: string

imports:
  - biolinkml:types

classes:
  a:
    slot_usage:
      my slot: "I am mistakenly putting a description here"

this produces a confusing error

gen-py-classes t.yaml 
...
 File "<string>", line 43, in __init__
  File "/Users/cjm/repos/ontology-change-language/venv/lib/python3.7/site-packages/biolinkml/meta.py", line 229, in __post_init__
    self.classes[k] = ClassDefinition(name=k, **({} if v is None else v))
  File "<string>", line 42, in __init__
  File "/Users/cjm/repos/ontology-change-language/venv/lib/python3.7/site-packages/biolinkml/meta.py", line 445, in __post_init__
    self.slot_usage[k] = SlotDefinition(name=k, **({} if v is None else v))
TypeError: type object argument after ** must be a mapping, not extended_str

this can be hard to debug in a large yaml file

this similarly produces a confusing error

...
classes:
  a:
    slot_usage:
      - my slot

ideally both cases would report the class id, to make debugging easier
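
A minimal sketch of the kind of up-front check that would name the offending class is shown below. It assumes the schema has already been loaded into plain dictionaries; check_slot_usage is a hypothetical helper and not part of biolinkml.

```python
def check_slot_usage(classes):
    """Raise ValueError naming the class if slot_usage is not a mapping of mappings."""
    for class_name, class_def in (classes or {}).items():
        slot_usage = (class_def or {}).get("slot_usage")
        if slot_usage is None:
            continue
        if not isinstance(slot_usage, dict):
            raise ValueError(
                f"class '{class_name}': slot_usage must be a mapping, "
                f"got {type(slot_usage).__name__}"
            )
        for slot_name, usage in slot_usage.items():
            if usage is not None and not isinstance(usage, dict):
                raise ValueError(
                    f"class '{class_name}', slot_usage '{slot_name}': "
                    f"expected a mapping, got {type(usage).__name__}"
                )


# The two mistaken schemas from this issue, as they would look after YAML loading.
bad_description = {"a": {"slot_usage": {"my slot": "I am mistakenly putting a description here"}}}
bad_list = {"a": {"slot_usage": ["my slot"]}}

for classes in (bad_description, bad_list):
    try:
        check_slot_usage(classes)
    except ValueError as err:
        print(err)
```

Running a check like this before constructing ClassDefinition objects would replace the TypeError with a message that points at class a.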

create a sparqlgen emitter that generates queries that detect datamodel violations

A standard pattern for QC over ontologies and KGs is to either hand-craft or generate queries that detect datamodel/QC violations. If the query returns zero rows, the check passes. If one or more rows are returned then there is a violation. The queries can be categorized, e.g. ERROR vs WARN. This is how we do things in OBO, see http://robot.obolibrary.org/report

See also the obo-dashboard: http://obo-dashboard-test.ontodev.com/

For more on the general idea see https://github.com/cmungall/dasher

The general idea would be take as input an instance graph (e.g. as JSON-LD or RDF) plus a blml-specified schema, and generate queries:

@hsolbrig - this is potentially redundant with pyshex, but it may also be convenient to have this generate individual sparql queries that could be executed individually. It may make more sense to do this from the shex nevertheless?

Implementing this at the blml level also is a good way of being explicit about semantics

Additionally we can do OWL reasoning e.g with arachne
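
As a rough sketch of the zero-rows-means-pass pattern, one query could be generated per required slot. The function name, the severity comment and the example IRIs are assumptions for illustration; no such emitter exists yet.

```python
def required_slot_query(class_iri, slot_iri, severity="ERROR"):
    """Build a SPARQL query that returns instances of class_iri lacking slot_iri.

    An empty result set means the check passes.
    """
    return (
        f"# {severity}: instance is missing a required slot\n"
        "SELECT ?instance WHERE {\n"
        f"  ?instance a <{class_iri}> .\n"
        f"  FILTER NOT EXISTS {{ ?instance <{slot_iri}> ?value }}\n"
        "}"
    )


# Example IRIs chosen for illustration only.
print(required_slot_query(
    "https://w3id.org/biolink/vocab/Gene",
    "http://www.w3.org/2000/01/rdf-schema#label",
))
```

Keeping each check as its own query makes it possible to run and categorize them individually, which is the convenience mentioned above relative to a single ShEx validation pass.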

Slot `owner` attribute should be removed

Slot owner no longer makes sense. It has been marked as deprecated, but we still need to work through the remainder of the code that references it to get an alternative solution.

Inline properties of classes of custom types

[screenshot: Screen Shot 2020-05-28 at 3 39 01 PM]

When defining custom types, it would be good to have them be folded into the class that references these types.

For example, in the case of class Biosample, instead of having 5 arrows go to other types, would be nice to have them inline:

[screenshot: Screen Shot 2020-05-28 at 3 46 23 PM]

Prefixes section doesn't support CURIES

The prefixes section doesn't support CURIEs. When you supply one, you get an error at:

File "~/.local/share/virtualenvs/tccm-CoY-QjRv/lib/python3.8/site-packages/biolinkml/utils/metamodelcore.py", line 131, in __init__

The error comes without a line number reference into the schema. We should be able to pass identifiers, and the line number generator should be smart enough to know whether it has them or not.

value specifications for biolinkml

In biolinkml, can you define/map literal values to an ontology IRI? For example, let's say we want the value "F" to represent a female organism and "M" to represent a male organism. In json-ld you could do something like this:

{
  "@context": {
    "ex": "http://example.com/",
    "sex": {
      "@id": "ex:sex",
      "@type": "@vocab"
    },
    "M": "http://purl.obolibrary.org/obo/CARO_0000027",
    "F": "http://purl.obolibrary.org/obo/CARO_0000028"
  },
  "@graph": [
    {
      "@id": "ex:host1",
      "sex": "F"
    },
    {
      "@id": "ex:host2",
      "sex": "M"
    }
  ]
}

This translates into RDF:

<http://example.com/host1> <http://example.com/sex> <http://purl.obolibrary.org/obo/CARO_0000028> .
<http://example.com/host2> <http://example.com/sex> <http://purl.obolibrary.org/obo/CARO_0000027> .

The larger point I am trying to get at is that we may get a number of different values from multiple data sources that semantically represent the same kinds of entities (e.g., another data source may use the values "female" and "male"). Can we represent such differing value sets?
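
To make the question concrete, here is a sketch of what mapping several source vocabularies onto the same IRIs might look like outside of any schema language. The table and the normalize_value helper are hypothetical; the two CARO IRIs are the ones from the JSON-LD example above.

```python
FEMALE = "http://purl.obolibrary.org/obo/CARO_0000028"
MALE = "http://purl.obolibrary.org/obo/CARO_0000027"

# Hypothetical value set: several source spellings resolve to one IRI each.
SEX_VALUES = {
    "F": FEMALE,
    "female": FEMALE,
    "M": MALE,
    "male": MALE,
}


def normalize_value(value, value_set):
    """Return the IRI a literal maps to, or None if it is not in the value set."""
    return value_set.get(value)


print(normalize_value("F", SEX_VALUES) == normalize_value("female", SEX_VALUES))
```

A schema-level feature would need to carry the same information: the permitted literals for a slot and the IRI each one stands for.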

Namespaces should be local to python module

Currently, all namespaces are generated for a python extension module. Only the namespaces that are specifically referenced in the extension should be emitted, as the rest can be imported as needed.

Model Versioning

We need versioning support for models generated via biolinkml. The language supports a version identifier but, at the moment, there is no way to reference anything but the latest version of a given model via perma-id.

It shouldn't be difficult to add a version to the w3id path, but we still need to:
a) determine how a version in a path resolves to the equivalent source in github
b) determine how to identify the "latest" version and dynamically map it to the appropriate path

default_namespace must be specified

Currently, if you omit the default namespace, you don't know that it is needed until you get a generator specific message such as:

    context = ContextGenerator(os.path.join(inputdir, 'uriandcurie.yaml')).serialize()
  File "/Users/solbrig/git/biolink/biolinkml/biolinkml/utils/generator.py", line 77, in serialize
    self.visit_schema(**kwargs)
  File "/Users/solbrig/git/biolink/biolinkml/biolinkml/generators/jsonldcontextgen.py", line 48, in visit_schema
    default_uri = self.namespaces[self.default_ns]
KeyError: None

Default namespace should be tested for (or a good default should be found) earlier in the process
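
One way to fail earlier is a check at schema load time, sketched below. It assumes the schema is available as a plain dictionary with default_prefix and prefixes keys, and that the default must be a declared prefix; both the helper name and that second requirement are assumptions of this sketch.

```python
def check_default_prefix(schema):
    """Return the IRI of the default prefix, or raise a descriptive ValueError."""
    schema_id = schema.get("id", "<unknown>")
    default = schema.get("default_prefix")
    prefixes = schema.get("prefixes") or {}
    if default is None:
        raise ValueError(f"schema '{schema_id}': default_prefix must be specified")
    if default not in prefixes:
        raise ValueError(
            f"schema '{schema_id}': default_prefix '{default}' is not declared in prefixes"
        )
    return prefixes[default]


try:
    check_default_prefix({"id": "uriandcurie", "prefixes": {}})
except ValueError as err:
    print(err)
```

Raising here, before any generator runs, would turn the bare KeyError: None into a message that names the schema and the missing setting.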

Define rules for resolving paths for imports and implement consistently

Currently if I have a dir structure:

schema/
   a.yaml
   b.yaml

where a imports b, defined like this:

imports:
   - b

and I run commands from the dir containing schema, most of the gen-X commands successfully resolve the import (to ./schema/b.yaml), presumably by treating the dir of the importing file as base.

However generate_uml.py looks for ./b.yaml and fails

We should define what the rules are, document this, and implement consistently
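
If the rule ends up being "resolve relative to the importing file", which is what most generators appear to do today, it could be sketched as follows. resolve_import is a hypothetical helper, and appending .yaml to the import name is an assumption of this sketch.

```python
from pathlib import Path


def resolve_import(importing_file, import_name):
    """Resolve an import name against the directory of the file that imports it."""
    return Path(importing_file).parent / f"{import_name}.yaml"


# Run from the directory that contains schema/: a.yaml imports b.
print(resolve_import("schema/a.yaml", "b"))
```

Under this rule the result does not depend on the current working directory, which is the difference from the behavior observed in generate_uml.py.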

slot_usage induces new sub-slots, these should be hidden in many exports

blml allows the usage of a slot to be refined (or even defined) on a per-class basis.

E.g The label slot may be generic for any kind of name. For a person class we may add documentation that states this should be a string that is typically first and last concatenated (OK, this would not be a very good rule for many names but you get the idea).

Originally I had conceived of information about a slot always being retrieved via a compound key of (class,slot). If there is a slot usage for a class, use that. Otherwise a superclass. Otherwise generic slot.

In the current implementation rather than compound key, a primary key is synthesized by concatenating class and slot. This is fine as an underlying implementation, but this should be hidden in many cases, as it causes confusion (e.g biolink/biolinkml#228).
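
The compound-key lookup described above (class usage, then superclass usage, then the generic slot) can be sketched without synthesizing any new slot names. The data layout and the slot_info function are assumptions for illustration, using plain dictionaries rather than the biolinkml classes.

```python
def slot_info(classes, slots, class_name, slot_name):
    """Look up slot information by (class, slot), walking up is_a, then falling back."""
    seen = set()
    current = class_name
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental is_a cycles
        class_def = classes.get(current) or {}
        usage = (class_def.get("slot_usage") or {}).get(slot_name)
        if usage is not None:
            return usage
        current = class_def.get("is_a")
    return slots.get(slot_name) or {}


classes = {
    "c1": {"slot_usage": {"s": {"description": "s in c1", "range": "r1"}}},
    "c2": {"is_a": "c1"},
    "d": {},
}
slots = {"s": {"description": "generic s description"}}

print(slot_info(classes, slots, "c2", "s"))
print(slot_info(classes, slots, "d", "s"))
```

In this sketch the first matching usage is returned as-is; a fuller version would probably merge it over the inherited and generic definitions.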

here is a test example:

id: https://github.com/biolink/biolinkml/issues/228
name: test228
title: induced slots

types:
  string:
    base: str
    uri: xsd:string

  

classes:

  r1: {}
  r2:
    is_a: r1
  r3:
    is_a: r2
  
  c1:
    slots:
      - s
    slot_usage:
      s:
        description: s in c1
        range: r1
  c2:
    is_a: c1
    slots:
      - s
    slot_usage:
      s:
        description: s in c2
        range: r2
  c3:
    is_a: c2
    slots:
      - s
    slot_usage:
      s:
        description: s in c3
        range: r3
  d:
    slots:
      - s
    slot_usage:
      s:
        required: true

slots:
  s:
    description: >-
      generic s description     

markdown changes required:

currently this generates pages for c{1,2,3}_s, as well as s. Only pages for s should be generated. c1,2,3 specific usage can be included in the markdown file for s

All usages of the arrow notation in the generated markdown docs should be removed

owlgen changes required:

currently uris are made for c{1,2,3}_s. We should only have a uri for s.

class-specific ranges can be expressed using an owl subclass axiom

python changes:

reported separately by @wdduncan here: biolink/biolinkml#228 -- but I think current behavior may be ok, if a little confusing

json schema changes:

the current json schema generation is almost correct. There are no induced slots created - only s.

Currently gen-json-schema on the above makes:

{
   "$id": "https://github.com/biolink/biolinkml/issues/228",
   "$schema": "http://json-schema.org/draft-07/schema#",
   "definitions": {
      "C1": {
         "additionalProperties": false,
         "description": "",
         "properties": {
            "s": {
               "$ref": "#/definitions/R1",
               "description": "s in c1"
            }
         },
         "required": [],
         "title": "C1",
         "type": "object"
      },
      "C2": {
         "additionalProperties": false,
         "description": "",
         "properties": {
            "s": {
               "description": "generic s description",
               "type": "string"
            }
         },
         "required": [],
         "title": "C2",
         "type": "object"
      },
      "C3": {
         "additionalProperties": false,
         "description": "",
         "properties": {
            "s": {
               "$ref": "#/definitions/R3",
               "description": "s in c3"
            }
         },
         "required": [],
         "title": "C3",
         "type": "object"
      },
      "R1": {
         "additionalProperties": false,
         "description": "",
         "properties": {},
         "required": [],
         "title": "R1",
         "type": "object"
      },
      "R2": {
         "additionalProperties": false,
         "description": "",
         "properties": {},
         "required": [],
         "title": "R2",
         "type": "object"
      },
      "R3": {
         "additionalProperties": false,
         "description": "",
         "properties": {},
         "required": [],
         "title": "R3",
         "type": "object"
      }
   },
   "properties": {},
   "title": "test228",
   "type": "object"
}

this is pretty good. C1 and C3 are perfect. the localized use of s is correct. However, C2 is deferring to the generic slot, which is bizarre

Add the ability to generate LinkML YAML from shex

Currently we support biolink-model.yaml -> ShEx output.
Would be good to have biolinkml generate a YAML from ShEx.

This would be helpful in taking an existing ShEx and creating a biolinkml compliant YAML, which can then further be refined manually.

Make a Docker container

This could simplify things for schema maintainers.

Also for use in a .travis.yml for a schema repo

Improvements to markdown generator

  • Differentiating between Entity, Association, Slot, Relation and Property during export from biolinkml. This could be handled by creating folders for each, and exporting the markdown into these folders.
  • CURIE mappings from YAML should be in markdown export (currently missing).
  • The title in each markdown should be camelcase instead of sentence case. For example, it should be Class: NamedThing instead of Class: named thing.

Clarify behavior of imports for generated artefacts

It seems the intent with some generators is not to generate information for imported elements. However, the behavior is incomplete and underspecified. We would need to add import declarations for imported artefacts

To be able to use imports we should assume a default of merging the import closure

Mismatch between the model and the generated JSON Schema

YAML:

  biosample:
    is_a: named thing
    aliases: ['sample', 'material sample']
    description: >-
      A material sample. It may be environmental (encompassing many organisms) or isolate or tissue.  
      An environmental sample containing genetic material from multiple individuals is commonly referred to as a biosample.  
    slots:
      - annotations
    slot_usage:
      id:
        description: >-
          The primary identifier for the biosample
      name:
        description: >-
          A human readable name or description of the biosample
      alternate identifiers:
        description: >-
          The same biosample may have distinct identifiers in different databases (e.g. GOLD and EMSL)
      annotations:
        range: annotation

Generated JSON Schema:

      "Biosample": {
         "description": "A material sample. It may be environmental (encompassing many organisms) or isolate or tissue.   An environmental sample containing genetic material from multiple individuals is commonly referred to as a biosample.",
         "properties": {
            "alternate_identifiers": {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            "annotations": {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            "id": {
               "type": "string"
            },
            "name": {
               "type": "string"
            }
         },
         "title": "Biosample",
         "type": "object"
      },

In case of annotations property, the items should be of type object instead of string.

As of now, validation via JSON Schema fails because while the data is correct, in terms of modeling, there is a mismatch between the model and the generated JSON schema.

Clarify semantics of imports

Should imports behave like an #include, with everything merged into one space? Or more like owl imports, where each schema is retained in its own space?

It seems the current intent is the latter

But what are the implications for this in generators? E.g. should pythongen iterate through the import closure making different files? Is it up to the client to do this?

Imports can be complex, and I believe they are causing some issues, e.g biolink/biolinkml#121

Add reversible to_str meta-property

Use case: lat-long coordinates can be represented in both structured/normalized and flattened/denormalized form. E.g

  geolocation value:
    is_a: attribute value
    description: >-
      A normalized value for a location on the earth's surface
    slots:
      - latitude
      - longitude
      - as string value
    slot_usage:
      as string value:
        to_str: "{latitude} {longitude}"

The semantics of SLOT.to_str would be:

  • the range of SLOT must be string
  • the value of SLOT must be equal to to_str, expanding all {x} expressions with SLOT.parent.x

Implementations are not dictated, but for python one way would be to generate code for _as_str.

This could be used to drive validators and bidirectional converters (e.g. given a flat string, auto-parse into normalized object)
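
A sketch of the two directions, assuming the template uses simple {name} placeholders and that field values contain no whitespace. The render and parse helpers are hypothetical and only illustrate the proposed semantics.

```python
import re


def render(template, obj):
    """Flatten a structured object into its string form."""
    return template.format(**obj)


def parse(template, text):
    """Recover the structured fields from a flat string, or None if it does not match."""
    pattern = ""
    pos = 0
    for match in re.finditer(r"\{(\w+)\}", template):
        # Literal text between placeholders is matched exactly.
        pattern += re.escape(template[pos:match.start()])
        # Each placeholder becomes a named group of non-whitespace characters.
        pattern += f"(?P<{match.group(1)}>\\S+)"
        pos = match.end()
    pattern += re.escape(template[pos:])
    found = re.fullmatch(pattern, text)
    return found.groupdict() if found else None


template = "{latitude} {longitude}"
flat = render(template, {"latitude": "37.87", "longitude": "-122.27"})
print(flat)
print(parse(template, flat))
```

A validator could use render to check that the stored string equals the expanded template, while a converter could use parse to populate latitude and longitude from a flat value.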

JSONLD context prefix with endings other than '/' or '#' not resolved correctly

See biolink model issue 301 (odd CHEBI identifier) for the report. This is because JSON-LD is sensitive to how a synonym ends. If it ends with '_', for example in 'CHEBI_', it is not resolved correctly. It can be fixed by a @prefix attribute but only in JSON-LD 1.1. See http://tinyurl.com/qmd4ggd for an example.

To fix this: Add a parameter to the generator that switches between 1.0 and 1.1. When emitting 1.1, add "@version": 1.1 to the first bit of the context.
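
A minimal sketch of such a switch, assuming the generator has a prefix-to-IRI mapping in hand. prefix_context and its jsonld_version parameter are hypothetical names; the "@version" and "@prefix" keywords are those of JSON-LD 1.1.

```python
import json


def prefix_context(prefixes, jsonld_version="1.0"):
    """Build the prefix part of a JSON-LD context for the requested version."""
    context = {}
    if jsonld_version == "1.1":
        context["@version"] = 1.1
    for prefix, iri in prefixes.items():
        if jsonld_version == "1.1" and not iri.endswith(("/", "#")):
            # Expanded form so the term is usable as a prefix despite its ending.
            context[prefix] = {"@id": iri, "@prefix": True}
        else:
            context[prefix] = iri
    return context


prefixes = {
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "biolink": "https://w3id.org/biolink/vocab/",
}
print(json.dumps(prefix_context(prefixes, "1.1"), indent=2))
```

With the default of 1.0 the output is unchanged from a plain prefix map, so existing consumers would not be affected unless they opt in.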

Add enums and label->id mappings

We have

  values_from:
    domain: definition
    multivalued: true
    range: uriorcurie
    description: >-
      the identifier of a "value set" -- a set of identifiers that form the possible values for the range of a slot

(aside: this is declared at the definition level, should be a slot property)

We should have the equiv for string values, i.e an enum.

e.g

slots:
  evidence code:
    enum:
      - IEA
      - ISS
...

it would be good to specify mappings for each of these, perhaps:

slots:
  evidence code:
    enum:
      IEA: ECO:nnnn
      ISS: ECO:nnn
...

or perhaps a more expressive:

slots:
  evidence code:
    enum:
      IEA:
        id: ECO:nnnn
        description: ...
      ISS: 
         id: ECO:nnn
         description: ...
...

mapping to a json-ld context should be obvious
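
For instance, either enum form could be turned into context entries roughly as follows. This is a sketch only: enum_context, the ex:evidence_code slot IRI and the underscore slot name are assumptions, and the ECO:nnnn values are the placeholders from this issue, not real identifiers.

```python
def enum_context(slot_name, slot_iri, enum):
    """Build JSON-LD context entries for a slot whose values come from an enum."""
    # "@type": "@vocab" makes string values resolve through the context terms.
    context = {slot_name: {"@id": slot_iri, "@type": "@vocab"}}
    for label, target in enum.items():
        # Accept the short form (label: id) and the expressive form (label: {id: ...}).
        context[label] = target["id"] if isinstance(target, dict) else target
    return context


enum = {
    "IEA": "ECO:nnnn",
    "ISS": {"id": "ECO:nnn", "description": "..."},
}
print(enum_context("evidence_code", "ex:evidence_code", enum))
```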
