
Comments (2)

usalu commented on June 2, 2024

@ashleysommer Thank you for the quick and detailed answer!

> There is something like assert that you described, where validation is performed as normal with graph expansion and SHACL Rules, and if the input graph does not fail validation it returns the expanded graph to stdout. Secondly, an inflate operation: like assert, it runs graph expansion and SHACL Rules by default, but skips Shape Constraints checking, and returns the expanded graph to stdout.

The general idea behind data pipelines was to share mappings between two different SHACL shapes in a reusable, entirely descriptive way. To be more precise, it would be something like a qualified data pipeline, because it does not only pipe one graph in and return a modified graph; the output graph would itself be SHACL-validated. Only this makes the pipeline reusable. It acts like a statically typed function, but instead of static types as the schema, you have a SHACL shape. Only the combination of the two in one API like assert makes it powerful.

Think of SHACL being an Interface Definition Language like protobuf and a transpiler from one IDL definition into another at the same time.
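The "qualified pipeline" idea above can be sketched in plain Python. This is only a conceptual toy: the dict-based "graphs" and predicate-based "shapes" stand in for RDF graphs and SHACL shapes, and every name here is hypothetical.

```python
# Conceptual sketch of a "qualified data pipeline": each stage transforms
# a graph, and the result is checked against the target "shape" before it
# is handed on. All names and data here are invented illustrations.

def qualified_stage(transform, output_shape):
    """Wrap a transform so its output is validated before being returned."""
    def stage(graph):
        result = transform(graph)
        if not output_shape(result):
            raise ValueError("output graph does not conform to the target shape")
        return result
    return stage

# A toy "shape": the output graph must carry an 'area_m2' value.
sim_shape = lambda g: "area_m2" in g

# A toy transform: derive a 2D area from 3D geometry.
to_sim = qualified_stage(
    lambda g: {"area_m2": g["width_m"] * g["depth_m"]},
    sim_shape,
)

print(to_sim({"width_m": 4.0, "depth_m": 5.0}))  # {'area_m2': 20.0}
```

The point of the wrapper is that a stage is only reusable because its output contract travels with it, which is what the SHACL shape provides in the real design.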

> The main challenge with that is deciding exactly what needs to be added to the output graph.

Yes, exactly.

> The consensus seems to be that it should include RDFS/OWL inferencing (if enabled), as well as SHACL Rule entailment, but not the triples from the mix-in ontology file. This will require the use of a second in-memory datagraph, specifically for the purposes of delivering the output, but it can be done.

I would leave all the OWL-RL and OWL-related inferencing out, because in my understanding OWL and SHACL have completely different purposes, even though they technically do the same things (checking a schema, inferring triples, and reasoning about whether the input graph is valid).

For OWL, I see the main value in searching for knowledge inside an arbitrarily large graph which holds more knowledge than I can ever understand. The idea is: here is a complex ontology whose rules I (think I) understand, and here is an arbitrarily complex graph; please give me back everything you know, so that I can find out something new. In other words: open world.

For SHACL, I see the main value in limiting what a graph can look like. Not the entire WWW, only something that I can process. This limit is what creates the freedom to build interoperable behaviour, something like OpenAPI and JSON Schema for microservices. A qualified data pipeline would be like the source code for a microservice, which itself is a graph.

> I read through your linked RDFLib issue, and unless I am misunderstanding your request, I think you have missed that RDFLib already has the ability to register a custom SPARQL function into the SPARQL engine. register_custom_function().

Currently, in my understanding, this is only possible at "compile time". What I was proposing is a way to use a graph like that to create the function and register it at runtime, using the definition of the SPARQLFunction in the graph. If you look at the example:

@prefix ex: <http://example.com/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:multiply
	a sh:SPARQLFunction ;
	rdfs:comment "Multiplies its two arguments $op1 and $op2." ;
	sh:parameter [
		sh:path ex:op1 ;
		sh:datatype xsd:integer ;
		sh:description "The first operand" ;
	] ;
	sh:parameter [
		sh:path ex:op2 ;
		sh:datatype xsd:integer ;
		sh:description "The second operand" ;
	] ;
	sh:returnType xsd:integer ;
	sh:select """
		SELECT ($op1 * $op2 AS ?result)
		WHERE {
		}
		""" .

then you see that SELECT ($op1 * $op2 AS ?result) is not a valid SPARQL query on its own. Of course you can quickly write a function which replaces these arguments with Python arguments (I used the existing initBindings from RDFLib), but you can also use a native f-string approach in Python:

def multiply(graph, op1, op2):
   # substitute the operands into the query text before executing it
   return graph.query(f'SELECT ({op1} * {op2} AS ?result) WHERE {{ }}')

But that would again be at definition time and not at runtime. So the issue is about using metaprogramming to define such functions and register them at runtime.
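That metaprogramming step can be sketched with plain string templating: build a callable from a SHACL-style sh:select template at runtime instead of hard-coding it. This is only the substitution half of the idea; a real implementation would read the sh:select body out of the SHACL graph, execute the resulting query, and register the callable with RDFLib's register_custom_function().

```python
from string import Template

def make_sparql_function(select_template, param_names):
    """Build a callable at runtime from a SHACL-style sh:select template.

    $-prefixed parameters in the template (e.g. $op1, $op2) are substituted
    with the call arguments, yielding a concrete query string. This sketch
    only returns the query text; executing it on a graph is omitted.
    """
    template = Template(select_template)
    def func(*args):
        bindings = dict(zip(param_names, (str(a) for a in args)))
        return template.substitute(bindings)
    return func

# Template taken from the ex:multiply example above.
multiply = make_sparql_function(
    "SELECT ($op1 * $op2 AS ?result) WHERE { }",
    ["op1", "op2"],
)

print(multiply(6, 7))  # SELECT (6 * 7 AS ?result) WHERE { }
```

Because the parameter names and the query body both come from data, nothing here has to be known at "compile time".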

This is necessary because the SHACL graph of a qualified data pipeline has to be reusable. It wouldn't work if you had to pull all the Python implementations and register them manually.


Let me try to give a more detailed explanation:
[image: pipeline diagram]

The COMPANYAGRAPH would be a custom SHACL shape COMPANYASHAPE of Company A. The FIWARE2SIM SHACL shape contains all the mapping behaviour to transform a COMPANYASHAPE graph into a SIM shape graph, which itself contains all the mapping behaviour to translate into SIMREP, which in turn contains all the mapping behaviour to translate into VISREP.

The original shape has geometry (3D), the SIM shape has simulation-related information (2D plus energy characteristics, such as how many people per m²) and calculates the energy demand for individual rooms (simply by multiplying areas with usage intensities). The SIMREP shape is about reporting energy behaviour (e.g. in relative units such as kWh/(m²·year), which divides the energy use by the m², etc.).
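The arithmetic the SIM stage performs can be shown in a few lines. The intensity value and the room data are made-up example numbers; in the real design a SHACL Rule would express the same multiplication declaratively.

```python
# Toy version of the SIM stage's calculation: energy demand per room is
# the room area multiplied by a usage intensity. The 25 kWh/(m^2*year)
# intensity and the room list are invented example values.
rooms = {"office": 30.0, "meeting": 20.0}   # areas in m^2
intensity_kwh_per_m2_year = 25.0

demand = {name: area * intensity_kwh_per_m2_year for name, area in rooms.items()}
total = sum(demand.values())

print(demand)  # {'office': 750.0, 'meeting': 500.0}
print(total)   # 1250.0
```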

You can see it as a general-purpose programming language which accepts one shape and returns another shape.

Here is a totally different application:
[image: pipeline diagram]

It would be a pipeline for computing transdisciplinary connections from an article.

Hopefully these examples at the graph level help to convey the idea. Feel free to tell me if things are still unclear to you!


ashleysommer commented on June 2, 2024

Hi @usalu
Thanks for the very detailed issue write-up. It is clear that you have put a great deal of research into this and have a good understanding of the issue at hand, the requirements of PySHACL, and the current limitations of the software.

> First, I also believe that a validate function shouldn't return an entailed input. For that the concept is very wrong.

Thank you for validating my opinion on that. Most people who raise this issue seem to think that validating should modify their input graph by default and return the modified graph. That not only violates the W3C SHACL spec, it also does not make sense conceptually for a validator.

> Here a practical example that I think depicts a very general problem:

I admit, despite reading it several times, I am having a lot of trouble following and understanding your example. It seems very specific to a particular application case, and is not general at all. It has too much specific detail to be a general example, and not enough detail for me to understand the problem you are attempting to explain.

> As concept I would suggest assert because it can include both validation and entailment.

From the rest of your writeup, I gather that you are asking about two different things:

  1. Ability to run an assert procedure, that combines validation of the input dataset, plus entailment, and returning the entailed graph as the output. (Same as requested in #20, #78, #189, and discussed in #60).
  2. Something about the implementation of SHACL Functions, in the form of SPARQLFunctions and SHACL-JS.

In response to issue 1.:
After reading back through all of the issues related to this request, and re-reading the discussion in #60, I have come to the understanding that there are two different features that need to be implemented here. There is something like assert that you described, where validation is performed as normal with graph expansion and SHACL Rules, and if the input graph does not fail validation it returns the expanded graph to stdout. Secondly, an inflate operation: like assert, it runs graph expansion and SHACL Rules by default, but skips Shape Constraints checking, and returns the expanded graph to stdout. The main challenge with that is deciding exactly what needs to be added to the output graph. The consensus seems to be that it should include RDFS/OWL inferencing (if enabled), as well as SHACL Rule entailment, but not the triples from the mix-in ontology file. This will require the use of a second in-memory datagraph, specifically for the purposes of delivering the output, but it can be done.
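The relationship between the two proposed operations can be sketched in plain Python. The expand and conforms helpers below are hypothetical stand-ins for PySHACL's rule entailment and constraint checking; only the control flow is the point.

```python
# Toy model of the two proposed operations. expand() stands in for
# RDFS/OWL inferencing plus SHACL Rule entailment, and conforms() for
# Shape Constraint checking; both are invented placeholders.

def expand(graph):
    # e.g. a rule that derives 'b' whenever 'a' is present
    return graph | ({"b"} if "a" in graph else set())

def conforms(graph):
    # e.g. a constraint that every graph must contain 'a'
    return "a" in graph

def inflate(graph):
    """Run entailment only, skipping constraint checks."""
    return expand(graph)

def assert_graph(graph):
    """Run entailment, then fail unless the expanded graph validates."""
    expanded = expand(graph)
    if not conforms(expanded):
        raise ValueError("input graph does not conform")
    return expanded

print(sorted(assert_graph({"a"})))  # ['a', 'b']
print(sorted(inflate({"c"})))       # ['c']
```

The only difference is whether the constraint check gates the output, which is why both can share one expansion path over a second in-memory datagraph.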

In response to issue 2.:
I read through your linked RDFLib issue, and unless I am misunderstanding your request, I think you have missed that RDFLib already has the ability to register a custom SPARQL function into the SPARQL engine. register_custom_function().

> Now for this to work properly SHACL Functions and SHACL Javascript are vital.

PySHACL has had full support for SHACL Functions from the SHACL-AF spec for more than two years. Specifically, it implements SPARQLFunction using RDFLib's register_custom_function(), and it implements SHACL-JS JSFunctions using pyduktape2. So what you are describing is already possible (aside from the debugging ability).

