dataoneorg / slinky Goto Github PK

Slinky, the DataONE Graph Store

License: Apache License 2.0

Makefile 0.50% Python 97.80% Dockerfile 1.00% Go 0.16% Shell 0.16% HTML 0.38%

slinky's Issues

rdfsurveyor deployment

I gave a demo of using rsdfsurveyor with Slinky on the Nov 4 2021 salmantics call. To do the demo I created deployment and service files for rsdfsurveyor-I'm wondering if there might be some utility in keeping them around for internal use. It's fairly lightweight: an nginx based container with the rdfsurveyor source in www/html and a service file to expose it.

access datasets from DataONE updates queue

To process all data that comes into DataONE, we need to be able to get access to the metadata files in an efficient manner, and be notified when new updated revisions or changes to SystemMetadata are available. This is the same access problem from k8s for the MetDIG processor and the DataONE index processor, and can likely use the same solution. We have discussed making the metadata documents available from a known location on a read-only Ceph filesystem that is mounted into appropriate containers. A queue system would be notified when PID changes occurred that need to be processed by various subscribers, and then they would be able to access the data directly from the Ceph filesystem without undertaking a REST call to Metacat (and without making a cached copy).

The Related issue in Metacat is NCEAS/metacat#1436 for designing such a system.

Next steps: Slinky as a layer of enhancement on top of static holdings

[I'm filing this not because we're moving on to next steps already but just to file it and let people chime in with ideas]

We can always find ways to improve the metadata we have but most metadata are written once, possibly checked and tweaked by a moderation team, and the left fixed in stone. What if we could extend the ways Slinky already improves metadata (ie co-reference resolution, minting/finding party identifiers) beyond what we're doing now?

I got to thinking about this after one of our recent mobilization calls and a recent example got me here to writing this ticket. Take the metadata record at https://search.dataone.org/view/urn%3Auuid%3A84f4e415-53c3-55e9-bb6d-3ee34419595d. It's a JSON-LD record from NPDC. The abstract starts:

Data from Polarstern cruise PS94 in the Arctic in 2015 with chief scientist Ursula Schauer.

There's a few really key elements to this free text description that we could totally extract into linked data and make for a much richer landing page: (1) Polarstern (2) PS94 (3) Arctic (4) 2015, (5) Ursula Shauer (6) Ursula Shauer as a Chief Scientist and (7) the role of Chief Scientist.

Extracting and linking information like this would be a really nice enhancement for a lot of metadata records, but especially our science-on-schema ones which will tend to be more minimal. We might also think about how we preserve any enhancements in our Data Package exports.

Specific things we could build on top:

Name entity extraction (with semantic linkages)
FIPS codes or gazetteer linking for arbitrary spatial bounding boxes (dataset X has coverage in FIPS codes A, B, and C)

Change Default Account Passwords

We should avoid using passwords for the default virtuoso accounts, for example dba:dba and dav:dav. We should be setting new passwords during deployment; see this documentation link for more information.

Fix bug in ORCID processing code

Saw this while testing earlier today

; slinky get doi:10.18739/A23F4KP3J
Traceback (most recent call last):
  File "/usr/local/bin/slinky", line 33, in <module>
    sys.exit(load_entry_point('d1lod', 'console_scripts', 'slinky')())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/bryce/src/slinky/d1lod/d1lod/cli.py", line 35, in get
    model = client.get_model_for_dataset(id)
  File "/Users/bryce/src/slinky/d1lod/d1lod/client.py", line 77, in get_model_for_dataset
    return processor.process()
  File "/Users/bryce/src/slinky/d1lod/d1lod/processors/eml/eml220_processor.py", line 87, in process
    return super().process()
  File "/Users/bryce/src/slinky/d1lod/d1lod/processors/eml/eml_processor.py", line 70, in process
    self.process_publisher()
  File "/Users/bryce/src/slinky/d1lod/d1lod/processors/eml/eml_processor.py", line 214, in process_publisher
    publisher_subject = self.process_party(publisher)
  File "/Users/bryce/src/slinky/d1lod/d1lod/processors/eml/eml_processor.py", line 129, in process_party
    return self.process_organization(party)
  File "/Users/bryce/src/slinky/d1lod/d1lod/processors/eml/eml_processor.py", line 381, in process_organization
    self.process_user_id(party_subject, user_id)
  File "/Users/bryce/src/slinky/d1lod/d1lod/processors/eml/eml_processor.py", line 228, in process_user_id
    self.process_user_id_as_generic(party_subject, user_id)
  File "/Users/bryce/src/slinky/d1lod/d1lod/processors/eml/eml_processor.py", line 326, in process_user_id_as_generic
    directory = user_id.attrib["directory"]
AttributeError: 'str' object has no attribute 'attrib'

parse and incorporate measurements/variables/features/attributes/etc

Convert to a Helm Chart

Right now deployment is handled through a make file. It would be great if developers could use the same tooling across all kubernetes deployments, like helm.

One hoop to jump through is ordering the deployment process; we want some pods to start after others have started. Helm doesn't have an official way to specify an ordering of deployments, but it looks like there's at least one workaround.

One way is to use chart hooks. Since we have six deployments, we'll have six helm charts. It's possible to run a chart hook before a helm chart is deployed. We're also able to set the hook priority with a hook-weight. Since hooks with high priorities are deployed first, it can be used as an ordering mechanism. This unfortunately doesn't provide a means to tell the scheduler deployment to wait for the redis deployment (it only allows us to deploy redis first, and immediately the scheduler).

I think we might actually be able to attach an initContainer to the scheduler and worker that pings redis until it finally gets a response. Then the container will pass control to the slinky cli. Doing this, we should be able to start all deployments at once without using chart hooks.

Tasks

Create an initial Helm chart
Adapt Helm chart to support persistence, esp. w/in our Ceph PV provider
Secure Redis and Virtuoso
Maybe: Consider using a StatefulSet for non-scaling/persistent pods. Probably don't do this but do make an issue to track the work as a next step

Configure networking via env vars

We have a few deployment cases for slinky; at the moment we can configure these using an environment. In #43 I added a --prod flag to the cli which brings us further away from the pattern using environmental variables for this.

Instead, we should be able to configure the hosts in the kubernetes deployment file and the docker-compose definition.

This should involve

Add environmental variables/config map to the pods using d1lod
Modify d1lod to use networking variables from the environment, with sensible backups (probably local store settings)

expose protected SPARQL endpoint

Merge feature_update_graph_pattern and re-deploy to the dev cluster

The work on https://github.com/DataONEorg/slinky/tree/feature_update_graph_pattern is far along and can be merged to develop whenever. There will definitely be some cleanup to get the deployment back to a working state so we'll have to do that.

Set up test deployment on our test k8s cluster

As of #3, @ThomasThelen has gotten Slinky running under minikube so the next thing to work on is to get a real deployment under k8s going so we can test all the odds and ends of getting that done (ie Docker Hub, ingress, etc).

What would be ideal would be hooking things up enough that I can, as a developer, make changes to services like the worker or scheduler, and see my changes reflected in the test cluster relatively quickly.

Remove web & www directories

With the transition over to Virtuoso, we no longer need Apache + Tomcat to expose sesame and the work bench (since they're no longer used). The Apache+Tomcat container is in the web/ folder, while the web stuff is in www/.

Create schema:Dataset nodes instead of geolink:Dataset

Transform the geolink:Dataset object to, at a high level, conform to the schema,org model (schema:Dataset). For this task, ignore the representation of individuals. This representation should include the following properties, which can then be expanded on.

name
description
url
accessibility
license
identifier
keywords
connected EML document

Some of these aren’t super useful to know through the graph store. Ie how is DataONE going to leverage license information in the graph store? It probably won’t. The reason these are included is to comply with the SOSO recommendation that they be included. They’re also easy fields to include and “shouldn’t take much time”.

support python3

Current processing code depends on python 2.x. Need to upgrade to support 3.x.

Remove legacy code

We have a folder of legacy code and a few unit tests that are tied down to it (test_metadata, test_graph, conftest). We probably want to decouple the legacy code from the tests and remove it from the code base

Virtuoso Should Use ClusterIP

The Virtuoso service should be set as a ClusterIP so that the ingress controller can manage traffic rather than directly exposing the service to the outside world.

Support Different Graphs in Production

Note: This is using language from the codebase in PR #54

With the ongoing effort around developing a solid deployment, this is a use case that can help drive the solution's direction. In production, Slinky currently uses Virtuoso as the backing graph store. We have unit tests that suggest Blazegraph should also work in a production environment; we should include a way to run Blazegraph in production.

Right now, we're hardcoding which deployment we use in production; each time slinky is deployed, it uses the Virtuoso adapter. To let users choose their graph store, we can support a new flag in the cli to specify which graph store is being used. For example, if the scheduler is going to use Blazegraph-it would be started with slinky schedule --store blazegraph. We can use this to instantiate the appropriate connector/store class. This solves the is of how to specify the graph, but we still need to be able to specify where.

We use the GRAPH_HOST environmental variable to specify the network location of virtuoso. I propose that we change this to VIRTUOSO_HOST, matching the style of BLAZEGRAPH_HOST. If we didn't need to specify the location of all graphs while the unit tests run, we could keep a single environmental variable.

If a user wants to use Blazegraph they must also fill out the appropriate environmental variables. In this case of a user deploying with blazegraph, BLAZEGRAPH_HOST and BLAZEGRAPH_PORT must be set. Likewise, if Virtuoso is used the user specifies the store with slinky schedule --store virtuoso and must fill out the VIRTUOSO_HOST & VIRTUOSO_PORT environmental variables.

deploy basic graph visualization tool via SPARQL

Discussion: Slinky applications

We had a good discussion on our last salmantics call [1] about applications. I'm filing our notes here for posterity. Please add any ideas you have to this issue..

Locating datasets that particular individuals are involved with (contacts, creators, etc) whether they have a DataONE account or not
Cross resolution between ORCID and various name variants (Bryce Mecum vs B.Mecum)
Landing pages for the users
Links for a “person” that goes to the portal page for that user
What’s near this? Using S2, geohash, etc
Utilize the annotations further in the user interface (creating browse hierarchies out of them)
Cross linking searches based on semantic annotations
Provide autocomplete for annotations in the search catalog

[1] https://docs.google.com/document/d/1Sc3m526EDkT53s0uJQarV-zAVUVft7ObIii2wazHJrQ/edit

establish Dataset graph pattern

This is to determine how the schema:Dataset, dcat:Dataset, and related classes inter-relate, and how they will be tied to People and other critical entities.

Review current graph pattern and get feedback

On our last weekly salmantics call [1], we agreed we should work towards closing out some of the graph pattern issues and PRs in order to get a solid first pattern down.

We have an open PR: #21 which includes a discussion. The high points include:

Include prov
Review how we're handling various identifier types (pid, sid, etc)
Agree on how we're modeling Dataset/Metadata/Entity/Package

I'm going put together a hackpad with a handful of datasets that'll help us find the gaps and issues.

[1] https://docs.google.com/document/d/1Sc3m526EDkT53s0uJQarV-zAVUVft7ObIii2wazHJrQ/edit

Remove Persistent Volume Definitions

At some point we're going to be using ceph storage which means deploying slinky locally will be very involved. There's also the architecture of the DataONE k8 cluster that enforces restrictions on who can create persistent volumes. Because of this, the persistent volume definitions should be moved out of this repository and into the DataONE k8 cluster repository. The development and production clusters will then have persistent volumes for slinky to make claims on.

Describe people using SOSO guidelines

Transform the representation of people affiliated with datasets to conform to schema,org. This is most likely a large task that involves a few parts...

Refactoring the internal representation of people. Right now they’re represented under the geolink vocabulary. When making the switch to schema,org we should replace the terms with schema,org ones (isomorphic with foaf)
Connecting datasets to them. SOSO connects people to datasets through roles, which we should be able to determine from the science metadata document.

Represent individual files

We need a way to represent individual files. Right now, there isn't a great way to do this with SOSO.

One option is to represent them as individual nodes in the graph and connect the related dataset(s) to them. This would allow us to further annotate the nodes with additional information, such as which variables they describe.

Unknowns:

The rdf:type of the node
The predicate connecting the dataset to the file

Add Logging Capabilities

For a complete stack, we should have some logging resources at hand. The ELK stack is one of the most popular K8 logging frameworks. It was also present in the original d1lod repository.

Within the context of the DataONE Kubernetes stack, we may want to have a central logging system in place that aggregates logs from other namespaces (including slinky).

Pretty up CLI output with namespace bindings

The output of the CLI's get command can be a bit hard to read because it doesn't make effective use of namespace prefixes.

support parsing SOSO documents into triples

support parsing ISO 19115* family metadata into triples

Figure out a way we might front our triplestore

To integrate linked open data into existing web views and services, we likely don't want to expose, say, a SPARQL endpoint. If we did, we might at least want to point it at a read replica.

Another approach would be to put something in front of the SPARQL endpoint to constrain the types of queries that can be constructed by users without direct access to the SPARQL endpoint. There must be some solutions already out there, though I think building something specific to our use case isn't out of scope.

One example is https://github.com/UKGovLD/linked-data-api/blob/wiki/Specification.md

parse and incorporate ORE triples

create a DataONE graph epic

Create a DataONE graph store that:

runs a graph store in kubernetes that can be scaled and relocated across nodes
parses and indexes incoming metadata documents from DataONE into a set of RDF statements
- supports multiple metadata dialects, including EML, schema.org, ISO-19115*, and extensible to others
stores those RDF statements in a graph store like Virtuoso or graphDB
exposes a SPARQL endpoint for local, internal use by the team
exposes a graphical frontend for local, team exploration of the graph store
uses a high-level Dataset graph model that
- resolves how multiple versions of dataset objects are represented in the graph
- includes attribute-level and entity-level semantic measurement types from EML and other annotation sources
- includes provenance trace information from the ProvONE model in DataONE ORE graphs
- is compatible with the schema:Dataset and dcat:Dataset models, and with OAI-ORE
- is extensible to allow additional dataset and entity properties to be added as needed

Use ceph-fs volume

We should refactor our PV & PVC definitions to use the ceph-fs facilities which are now available.

Migrate to Kubernetes

DataONE has a Kubernetes cluster that this stack should work nicely in. Migrate the docker swarm over to a Kubernetes Helm Chart and deploy it on the development cluster.

Once complete, do a fully integrated test to make sure that harvesting and graph population works the same (nothing should change from this). With Kubernetes, we should think about a new logging framework that integrates nicely with the whole stack. For now exclude them from the stack

Essential services to migrate:

Virtuoso
Redis
Scheduler
Worker

Versioning the graph and api

On the Jan 27'th salmantics call we brought up the idea of versioning the graph and creating a set of API endpoints that are also versioned.

Versioned Graph

Virtuoso doesn't support versions the same way that GraphDB does. Instead of creating new repositories, subgraphs are used.

Graph Version URI (subgraph URI)

Each subgraph name should follow a convention that we define here. We should use the full slinky URI followed by the version.

For example,

https://api.test.dataone.org/slinky/v1 --> Version 1
https://api.test.dataone.org/slinky/v2 --> Version 2

Example: Query V1 Graph

The SPAQRL query is sent to https://api.test.dataone.org/slinky/query to ask for triples in the https://api.test.dataone.org/slinky/v1 graph.

SELECT * WHERE {
    GRAPH <https://api.test.dataone.org/slinky/v1/> {
        ?dataset rdf:type schema:Dataset .
        ?dataset schema:name ?name .
        ?dataset schema:creator ?creator .
        OPTIONAL {?dataset schema:description ?description . }
  }
}

Query V2 Graph

The SPAQRL query is sent to https://api.test.dataone.org/slinky/query to ask for triples in the https://api.test.dataone.org/slinky/v2 graph.

SELECT * WHERE {
    GRAPH <https://api.test.dataone.org/slinky/v2/> {
        ?dataset rdf:type schema:Dataset .
        ?dataset schema:name ?name .
        ?dataset schema:creator ?creator .
        OPTIONAL {?dataset schema:description ?description . }
  }
}

Associated Tasks

Extend Slinky to write to named graphs
Update the frontend to match the GRAPH SPARQL pattern

Versioned API Endpoints

On the same Jan 27th call it was decided that we want a number of utility endpoints:

/datasets/<dataset_pid>
/persons/<dataset_pid>&<person_name>&<person_id>
/organizations/<org_name>&<org_id>
/iri/

Since each one of these will involve a SPARQL query that we construct on the backend, we have complete control over the SPARQL query. This allows us to insert the version based on content from the URL.

For example, the following endpoint URLs query v1 of the graph

https://api.test.dataone.org/slinky/v1/datasets
https://api.test.dataone.org/slinky/v1/persons

The version 2 analogues,
https://api.test.dataone.org/slinky/v2/datasets
https://api.test.dataone.org/slinky/v2/persons

Associated Tasks

Determine what each endpoint does
Create the endpoints in slinky
Expose an OpenAPI endpoint page

import funder identifiers and metadata

The dataone graph will hopefully have many references to funders through metadata documents that link to global funder identifers from the CrossRef Funder registry, and from the ROR system. Both are made available as data dumps:

Funder registry (in RDF): https://gitlab.com/crossref/open_funder_registry/-/blob/master/registry.rdf
ROR (in JSON): https://doi.org/10.6084/m9.figshare.c.4596503.v9

I think it would be helpful to:

Load both of these graphs
Where possible, provide sameAs links across the graphs
Use the graph to be able to provide contextual relationships for querying (e.g., summarize all datasets from NSF, or from GEO programs in NSF)

While the ROR graph isn't in RDF, its highly structured and provides very nicely structured cross-references to other identifier types like WikiData IDs, ISNI, etc.

Being able to report based on funders has long been a goal of ours, and getting all of this accessible in a queryable form would be really helpful.

Create & Assign Early Milestones

We should create a few milestones for work in the foreseeable future. From the 03/11/21 call we know that we need at the least, the graph up and running under K8 and that we want a schema,org representation of datasets and people. The work after (measured variable, files, geospatial, etc) is currently unordered.

1.0: Python 3 + Kubernetes Stack
2.0: Graph with schema:Dataset + schema:Person nodes

Use a different dockerhub account

From the November 4th development call we should create a new email account for dockerhub to store the related project images. This account can be used by other devs when working on the project.

Find solution to Virtuoso SPARQL troubles

Years ago, back when we set up d1lod, we decided to handle inserting RDF data into whichever triplestore we used in an agnostic fashion so the triplestore could be swapped out without too much work. So we settled on inserting data via SPARQL INSERT statements.

While revisiting d1lod and repurposingn it for Slinky, I've run into two related issues with this approach:

Virtuoso has some sort of arbitrary and not-documented size limit on SPARQL statements. Their SPARQL engine just pukes when you get over a certain query string length. I don't think we ran into this during the GeoLink work and I only noticed it because a particular dataset got turned into a too-long SPARQL INSERT query
If I choose to split the query up and insert it in batches, we run into another problem: Blank nodes. If a query references a bnode as an object but the definition of that bnode (where it's a subject) ends up in the next query, Virtuoso complains. AFAICT this is a Virtuoso Open Source bug and may not apply to other triplestores
- This makes sense because I don't think bnodes really work across multiple queries. I considered making each bnode a proper HTTP IRI (skolemizing?) but wanted to avoid that because I want our output to still match science-on-schema.

SPARQL may just not be the right thing for this workload. I considered using alternative RDF data loading methods Virtuoso provides but it looks all they have is a system that loads data from a local filesystem via I-SQL commands.

I had been meaning to look at Blazegraph for a few years and I see that it has a nice HTTP bulk data loading REST API where you can just send serialized RDF to an endpoint. We aren't using any special functionality from Virtuoso so this might be a good point to switch.

Feedback or thoughts welcomed. I'll update here with what I figure out.

Investigate integration with OpenAlex (previously Microsoft Academic Graph)

OpenAlex is launching December 2021 and could be a great complement to the DataONE graph. They'll have ORCIDs, RORs/GRIDs and lots more. See this [https://blog.ourresearch.org/openalex-update-june/](blog post) for more info.

Deployment Startup Order

Right now the worker on the dataset queue isn't waiting for virtuoso to start up and will produce errors while trying to communicate with it. Ideally the deployment would wait for Virtuoso to start up before starting the workers. Alternatively the source code of the worker could wait for an available triple store.

With kubectl, I can write a shell script that utilizes kubectl wait to pause the deployment of deployments until a particular deployment it ready. This will require a ReadinessProbe on the Virtuoso deployment; I think an HTTP GET should be enough of a canary to make the claim that Virtuoso is ready.

Refactor Images to pip Install

I feel like there's an identity crisis in this repository that's between the actual d1lod python library, the deployment of it, and the documentation.

For example, the fact that we need to copy the d1lod directory into the docker build directory, which then gets copied into the image, means something's not quite right. Doing this will bring about bugs in the future where the requirements.txt in the d1lod folder aren't installed; I'm assuming this is why some of the scheduler requirements are a subset of the d1lod requirements file (since d1lod is installed in the scheduler, the scheduler's requirements file gives us a way to install the d1lod requirements-without ever touching the d1lod requirements file) (and scheduler doesn't use python-dateutil, but has it defined in the requirements, probably so that d1lod has access to it).

If it were up to me, I'd create a separate repository for the python library, and document it as a standalone library which is standard enough. Then I'd create a second repository that's responsible for DataONE's deployment of services to spin up the graph store. One container in that stack uses the d1lod package and could be installed via a simple git clone & pip install, or just a pip install git+ . This would also allow for better placed documentation. Right now we have the docs/ folder, which is probably more appropriate in the python library's documentation (maybe as readthedocs). The architecture of the Slinky K8 stack would belong in the deployment repository.

At the least we should be able to pip install the d1lod folder. At this moment, attempting to do so raises the following

Any thoughts?

support parsing EML into triples

Document the Deployment Architecture

We should document the Kubernetes structure along with the interactions between the worker, scheduler, and virtuoso.

Switch to using official VOS images

Virutoso now (since 2017?) has official images. We should use those I think: https://hub.docker.com/r/openlink/virtuoso-opensource-7/.

Configure the Slinky Ingress

We nee to configure the Slinky ingress to route traffic from *.dataone.org/slinky to Virtuoso.

One problem that I've come across is that the links in Virtuoso Conductor are relative to *.dataone.org/, not *.dataone.org/slinky. This results in a bunch of broken links and buttons. This can be seen on the test deployment.

Regardless of the broken links, the sparql endpoint works fine,

curl -G https://api.test.dataone.org/slinky/sparql --data-urlencode query='SELECT DISTINCT * WHERE { ?s ?p ?o} LIMIT 1000'

@amoeba suggested that an alternative is hosting the service at slinky.api.test.dataone.org.

Add support for spatial coverage

Science on schema has an explicit way of handling spatial data; if the EML document supports it-we should definitely include it in the graph. This can be a strong case for third party integrations (eg KWG).

Deployment ordering

This is an issue for cleaning up the dependency ordering in the deployment. Right now we're using the makefile that interacts with kubectl to wait for pods to be in the ready state. This works at the moment, but doesn't apply to the docker stack deployment and adds a layer of complexity to the deployment. It also makes things tricky with Helm charts (see #52).

The general idea is to bring the 'waiting' logic into the codebase and remove it from the deployment layer.

Scheduler

The scheduler is deployed in two steps: the first step is initialization of the scheduler (happens in the slinky cli), the second is starting rqschedule (happens on the command line). These steps can be seen in the deployment file.

Both of the steps require an active instance of redis and should be able to be started independently of each other without issue. Since rqscheduler is effectively moving jobs to different queues, it should be fine if the scheduler from the first step hasn't submitted the update_job job since it'll pick it up the next time it checks.

Solution

Making the scheduler portion of startup wait on redis can be achieved by adding a method that checks for redis with a timeout and threshold. This same code can be used with the workers (see below).

Unfortunately, rqschedule doesn't have a retry flag, but we can use the same logic above. I'd like to bring the call to rqschedule inside the Slinky cli. Either in def schedule or as a separate command. This would enable us to use the blocking code from above and would allow us to manage the dependency in the code.

Workers

The workers need to be able to perform database transactions (needs Virtuoso to be online). They also depend on redis.

Solution

Redis is easily tackled by using the blocking call from the scheduler solution.

A similar approach can be taken with Virtuoso, along the same lines as the readinessProbe

Engineering problems left over from d1lod work

@ThomasThelen and I touched base today to go over the existing codebase. I wanted to document some of the issues we talked about here for us and others to see. They're all really leftover technical debt from the initial GeoLink work on this:

Co-reference resolution and IDs: The current system tries to re-use identifiers for people and (I think?) organizations when it has a reasonably high chance they're the same thing. We used random, opaque IDs for these since we didn't already have an identifier. The problem this created was, when we re-generated the graph from scratch (See 2), the opaque ID might change which might be problematic. I can think of a few solutions here but it's a thing to think about. What we do here might interact with our thoughts about other types of co-reference resolution.
Re-generating the graph when we change the code or any triplification patterns: Under the current system, we wipe the entire graph and re-build it when we make changes to the codebase that affect triplifcation patterns. We do have mechanisms in place to use disk-cached metadata records to speed things up but it's still slow. We also don't have a system in place to rebuild the graph while still serving requests to the existing graph. I've been thinking that we might maintain a write-ahead log as a way to quickly re-build the graph.
Search visibility: We danced around this in our first implementation by only triplifying publicly-visible content. This works well because most public content can be expected to stay public and the really sensitive stuff is usually inside the data objects which weren't triplifying. This is pretty reasonable but could be a lot better. We might be able to handle this if we wrap the SPARQL query engine in an HTTP API and handle object access at that level. A part of the problem is that it's hard to know how to map a single object to the triples we inserted into the graph about it. e.g., if Bryce and Tommy both assert the sky is blue and Bryce later decides he wants to recant his statement, what do we do?
Logging/observability system: This was always clunkier than I'd like. The whole thing used up way more resources than the service itself and broke often. As we look to migrate forward (#3), I think we should strip all of this out and find a new approach. I'm sure a lot has changed since we built this and using k8s might mean this isn't really a slinky concern anymore and is really more of a k8s cluster thing.

If you read this and have any questions or additional items to add please do and I'll update this.

Put together a landing page for Slinky with example queries

Exposing a SPARQL endpoint for Slinky would be fun but it's not all that useful to anyone but ourselves. And even then.

It'd be better if we had a concise showcase of what Slinky is and what it can do and this could similar to what we did for GeoLink: http://data.geolink.org/.

Today we put together a quick list of interesting example queries that could go on this page:

Show me all datasets that have measurements of sea-surface temperature
Show me all datasets that have measurements of carbon-dioxide flux from Alaska
Show me all datasets created by Ted Schuur
Show me all datasets that have the term “ecosystem” in their Abstract or Title
Show me all datasets about Salmon that have both weight and length measurements (ideally in a tuple)
Who’s doing research in the arctic? Which organizations?

[I don't think we have an issue for this specifically but let me know if that's wrong]

dataoneorg / slinky Goto Github PK

slinky's Issues

Versioned Graph

Graph Version URI (subgraph URI)

Example: Query V1 Graph

Query V2 Graph

Associated Tasks

Versioned API Endpoints

Associated Tasks

Scheduler

Solution

Workers

Solution

Recommend Projects

Recommend Topics

Recommend Org