
dasch-swiss / dsp-api


DaSCH Service Platform API

Home Page: http://admin.dasch.swiss

License: Apache License 2.0

Topics: ontologies, humanities, rdf, triplestore

dsp-api's Introduction

DSP-API – DaSCH Service Platform API


DSP is a server application for storing, sharing, and working with primary sources and data in the humanities.

It is developed by the Swiss National Data and Service Center for the Humanities at the University of Basel, and is supported by the Swiss Academy of Humanities and Social Sciences and the Swiss National Science Foundation.

DSP-API is free software, released under the Apache License, Version 2.0.

Features

  • Stores humanities data as industry-standard RDF graphs, plus files for binary data such as digitized primary sources.
    • Designed to work with any standards-compliant RDF triplestore. Tested with Jena Fuseki.
  • Based on OWL ontologies that express abstract, cross-disciplinary commonalities in the structure and semantics of research data.
  • Offers a generic HTTP-based API, implemented in Scala, for querying, annotating, and linking together heterogeneous data in a unified way.
    • Handles authentication and authorization.
    • Provides automatic versioning of data.
  • Uses Sipi, a high-performance media server implemented in C++.
  • Designed to be used with DSP-APP, a general-purpose, browser-based virtual research environment, as well as with custom user interfaces.

Requirements

For developing and testing DSP-API

Each developer machine should have the following prerequisites installed:

JDK Temurin 21

Follow the steps described on https://sdkman.io/ to install SDKMAN. Then, follow these steps:

sdk ls java  # choose the latest version of Temurin 21
sdk install java 21.x.y-tem

SDKMAN will take care of the environment variable JAVA_HOME.

For building the documentation

See docs/Readme.md.

Try it out

Run DSP-API

Create a test repository, load some test data into the triplestore, and start DSP-API:

just stack-init-test

Open http://localhost:4200/ in a web browser.

On first installation, errors similar to the following can come up:

error decoding 'Volumes[0]': invalid spec: :/fuseki:delegated: empty section between colons

To solve this, you need to deactivate Docker Compose V2. This can be done in Docker Desktop either by unchecking the "Use Docker Compose V2" flag under "Preferences > General" or by running

docker-compose disable-v2

Shut down DSP-API:

just stack-stop

Run the automated tests

Automated tests are split into two source sets: slow-running integration tests (i.e., tests that do IO or use Testcontainers) and fast-running unit tests.

Run unit tests:

sbt test

Run integration tests:

make integration-test

Run all tests:

make test-all

Release Versioning Convention

The DSP-API release versioning follows the Semantic Versioning convention:

Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,
  • MINOR version when you add functionality in a backwards-compatible manner, and
  • PATCH version when you make backwards-compatible bug fixes.

Additionally, we increment the MAJOR version whenever any change to existing data would be necessary, e.g., changes to the knora-base ontology that are not backwards compatible.
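As a rough illustration of the convention (a sketch only; the Change type and its cases are invented for this example and are not project code):

// Sketch: deciding which version component to bump under this convention.
sealed trait Change
case object IncompatibleApiOrDataChange extends Change // includes non-backwards-compatible knora-base changes
case object BackwardsCompatibleFeature extends Change
case object BackwardsCompatibleBugfix extends Change

final case class Version(major: Int, minor: Int, patch: Int) {
  def bump(change: Change): Version = change match {
    case IncompatibleApiOrDataChange => Version(major + 1, 0, 0)
    case BackwardsCompatibleFeature  => Version(major, minor + 1, 0)
    case BackwardsCompatibleBugfix   => Version(major, minor, patch + 1)
  }
}

For example, Version(2, 3, 1).bump(BackwardsCompatibleFeature) yields Version(2, 4, 0).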

dsp-api's People

Contributors

alexellis, balduinlandolt, daschbot, dependabot[bot], ericsommerhalder, gfoo, jnussbaum, kilchenmann, loicjaouen, lrosenth, lukasstoeckli, mdelez, mpro7, musicenfanthen, nemmk, nora-olivia-ammann, ralfbarkow, ridaayed, samuelboerlin, seakayone, sepidehalassi, siers, snyk-bot, subotic, tobiasschweizer


dsp-api's Issues

Fix inconsistencies in SALSAH export script

The export script produces some inconsistencies in the Incunabula ontology and data, and I haven't figured out how to deal with them.

Cardinalities

In incunabula:book

  • title has owl:cardinality 1, but book http://data.knora.org/e41ab5695c has two titles (so the cardinality could be owl:minCardinality 1; see the sketch after this section).
  • publoc has owl:cardinality 1, but book http://data.knora.org/de6d83112e04 has no publoc (so the cardinality could be owl:maxCardinality 1).
  • publisher has owl:maxCardinality 1, but book http://data.knora.org/9311a421b501 has two publishers (so the cardinality could be owl:minCardinality 0).

In incunabula:page

  • pagenum has owl:cardinality 1, but page http://data.knora.org/f84653d34004 has no pagenum (so the cardinality could be owl:maxCardinality 1).
  • seqnum has owl:cardinality 1, but page http://data.knora.org/f84653d34004 has no seqnum (so the cardinality could be owl:maxCardinality 1).

In incunabula:Sideband

  • description has owl:cardinality 1, but sideband http://data.knora.org/684a7a4ec5d7 has no description (so the cardinality could be owl:maxCardinality 1).
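For the title case above, relaxing the restriction might look as follows (a Turtle sketch, wrapped in a Scala string for illustration; prefixes are abbreviated and the real ontology may differ):

// Sketch of a relaxed cardinality for incunabula:title (illustrative Turtle).
val relaxedTitleCardinality: String =
  """:book rdfs:subClassOf [ rdf:type owl:Restriction ;
    |                        owl:onProperty :title ;
    |                        owl:minCardinality "1"^^xsd:nonNegativeInteger ] .
    |""".stripMargin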

Property subject class constraints

  • incunabula:description has knora-base:subjectClassConstraint :Sideband, but in the exported data, its subject is sometimes a :book or a :page.
  • incunabula:citation has knora-base:subjectClassConstraint :page, but in the exported data, its subject can be either a :book or a :page.

Reference to nonexistent resource

Page http://data.knora.org/9374cc0a4f03 says it has a left sideband http://data.knora.org/bb4a4d4758e722, but that sideband doesn't exist. If I view the page in the old SALSAH (page l8r in [Das] Narrenschiff (dt.)), it's clear that there's no sideband on that page.

Incorrect references to external resources

http://data.knora.org/c30e76e55204 is a knora-base:ExternalResource. It should have a knora-base:hasExtResValue property pointing to a knora-base:ExternalResValue, but instead it has knora-base:extResId and knora-base:extResProvider, each of which points to a knora-base:ExternalResValue. The two instances of knora-base:ExternalResValue have the same contents.

The other external resources have the same problem.
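If I understand the intended model correctly, the expected shape would be something like this (a Turtle sketch wrapped in a Scala string for illustration; the properties on the ExternalResValue are my assumption, and the literal values are placeholders):

// Sketch of the expected structure (illustrative Turtle; values are placeholders).
val expectedExternalResource: String =
  """<http://data.knora.org/c30e76e55204> rdf:type knora-base:ExternalResource ;
    |    knora-base:hasExtResValue [ rdf:type knora-base:ExternalResValue ;
    |                                knora-base:extResId "some-id" ;
    |                                knora-base:extResProvider "some-provider" ] .
    |""".stripMargin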

Regions without isRegionOf

These regions have no isRegionOf (required by the cardinality on Region):

  • http://data.knora.org/c9905d015030
  • http://data.knora.org/8e96ec3b5030
  • http://data.knora.org/d8e282a38b32
  • http://data.knora.org/707f8edf3935
  • http://data.knora.org/d44533274407

It makes no sense to have a region that isn't a region of something, so probably these regions should be excluded from the export.

Page without isPartOf

The page http://data.knora.org/3c94b29fb90f02 has no incunabula:partOf (required by the cardinality on page). It makes no sense to have a page that isn't part of a book, so probably this page should be excluded from the export.

Off by one

In incunabula:seqnum, valueHasString is different from valueHasInteger. For example, in page http://data.knora.org/9374cc0a4f03, the seqnum is http://data.knora.org/9374cc0a4f03/values/ac5f87e52f0d. Its valueHasInteger is 177, but its valueHasString is 176.

Implement automated performance tests

  • Test the performance and scalability of the Knora API server with the fake triplestore
  • Test performance with different real triplestores
  • Have the performance tests run automatically on the test server

FOAF Properties Naming

In OntologyConstants we have the following:

object Foaf {
    val FirstName = "http://xmlns.com/foaf/0.1/firstName"
    val LastName = "http://xmlns.com/foaf/0.1/lastName"
}

In knora-base.ttl we have the following definition:

###  http://www.knora.org/ontology/knora-base#User

:User rdf:type owl:Class ;

      rdfs:subClassOf <http://xmlns.com/foaf/0.1/Person> ,
                      [ rdf:type owl:Restriction ;
                        owl:onProperty <http://xmlns.com/foaf/0.1/familyName> ;
                        owl:cardinality "1"^^xsd:nonNegativeInteger
                      ] ,
                      ...
                      [ rdf:type owl:Restriction ;
                        owl:onProperty <http://xmlns.com/foaf/0.1/givenName> ;
                        owl:cardinality "1"^^xsd:nonNegativeInteger
                      ], 
                      ...

      rdfs:comment "Represents a Knora user"@en .

According to the FOAF spec, we have these options: foaf:givenName, which is used alongside foaf:familyName, and foaf:firstName, which is used alongside foaf:lastName. The spec prefers the givenName/familyName combination.

I suppose that we should then stick to foaf:givenName / foaf:familyName?

Also, there is a mix of both usages in the test/demo data, which I would like to clean up as soon as it is clear what we will use.

@lrosenth
@benjamingeer
@tobiasschweizer

Ensure that WHERE clauses in updates don't return too many rows

If a WHERE clause in a SPARQL update returns more rows than expected, this can cause an INSERT clause to insert duplicate data. For example, in addValueVersion.scala.txt, the WHERE clause should return at most one row. However, this depends on certain assumptions about the data in the triplestore, e.g. that a resource will have at most one knora-base:lastModificationDate. If the resource being updated has two lastModificationDate triples, the WHERE clause will return two rows, one for each, and the INSERT clause will be executed twice. If the value being inserted contains standoff markup, which is represented as blank nodes, duplicate standoff nodes will be inserted.

My first thought was that this could be prevented by adding LIMIT 1 to the end of the WHERE clause, but it seems that SPARQL Update doesn't allow this. I think it should be possible to use LIMIT by putting the contents of the WHERE clause in a subquery. I made a brief attempt to get this to work, but it wasn't successful; the WHERE clause just returned no results. It would be worth trying it again.
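The subquery idea might look roughly like this (a sketch, with the SPARQL held in a Scala string for illustration; whether the triplestore honours the LIMIT here is exactly what needs testing):

// Sketch: wrapping the WHERE contents in a subquery so that LIMIT can be applied (illustrative SPARQL).
val limitedUpdate: String =
  """PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
    |DELETE { ?resource knora-base:lastModificationDate ?current . }
    |INSERT { ?resource knora-base:lastModificationDate ?new . }
    |WHERE {
    |  {
    |    SELECT ?resource ?current
    |    WHERE { ?resource knora-base:lastModificationDate ?current . }
    |    LIMIT 1
    |  }
    |  BIND(NOW() AS ?new)
    |}
    |""".stripMargin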

Change in CoreBooted prevented server from starting, needs explanation

One of the changes in pull request #35 prevented re-start from starting the API server:

-    implicit lazy val system = ActorSystem("webapi")
+    implicit def system = ActorSystem("webapi")

The symptom was that the server would appear to start (the console messages looked normal), but any HTTP request from a browser would time out. The user would just get the message "The server was not able to produce a timely response to your request" in the browser, and there would be no additional output on the console.

I changed it back to lazy val, and now re-start works again. I don't understand why this needs to be a lazy val, or why it was changed to a def. This needs a comment in the code to explain it, so the same problem doesn't happen again.

Also, the Scaladoc comments for Core and CoreBooted are not very helpful. They say that these classes are part of the cake pattern, but it would be better to explain what the cake pattern is accomplishing here. Is the intention that we could make another class like KnoraService, but with a fake ActorSystem? If so, why would we want to do that?

add delete file route to Sipi

This route is meant to delete a file after it has already been successfully created by Sipi, because when looking at its type we find out that it is not what Knora expects (e.g., we expect an image, but Sipi created an audio file).

So we have to delete the file from Sipi because its file value will not be created in Knora.

Search responder queries are not deterministic

Two problems:

  1. In a full-text search (e.g. search for 'Orationes' in Incunabula), if a resource has a matching label and a matching value, Fuseki and GraphDB return different results: Fuseki returns the match for the resource label, while GraphDB returns the one for the text value. Both appear to be correct: we are using SAMPLE, so each triplestore is returning a different random result. Try to find another approach so that they return the same result (see the sketch after this list).
  2. GraphDB returns ontology entity labels in the wrong language (again, this seems to be because we have erroneously requested a random one, rather than the one in the user's preferred language as intended). A good solution would probably be to remove all ontology-related things from the search queries, and instead to have the search responder ask the ontology responder for that information.
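One possible direction for the first problem (a sketch, with the SPARQL held in a Scala string for illustration; replacing SAMPLE with MIN makes the representative match deterministic, at the cost of always picking the lexicographically smallest one):

// Sketch: a deterministic aggregate instead of SAMPLE (illustrative SPARQL).
val deterministicMatch: String =
  """PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
    |SELECT ?matchingSubject (MIN(?literal) AS ?match)
    |WHERE {
    |  ?matchingSubject knora-base:valueHasString ?literal .
    |  # full-text match condition on ?literal goes here
    |}
    |GROUP BY ?matchingSubject
    |""".stripMargin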

Ontology responder caches entity info filtered by language

So if someone requests information about a resource class in German, the ontology responder will cache that, and then if someone requests information about the same resource class in French, they'll get the cached German version.

The ontology responder should either:

  • Cache the raw SPARQL query results, and filter them by language on each request (sketched below), or
  • Cache them separately by language.
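The first option could look roughly like this (a sketch; EntityInfo and queryEntityInfo are invented names, not actual DSP-API code):

import scala.collection.concurrent.TrieMap

// Sketch: cache unfiltered query results per entity, filter by language per request.
final case class EntityInfo(labels: Map[String, String]) // language code -> label

object OntologyCacheSketch {
  private val cache = TrieMap.empty[String, EntityInfo] // keyed by entity IRI only, not by language

  // Placeholder for the raw SPARQL query returning labels in all languages.
  def queryEntityInfo(entityIri: String): EntityInfo = ???

  def getLabel(entityIri: String, preferredLang: String, fallbackLang: String): Option[String] = {
    val info = cache.getOrElseUpdate(entityIri, queryEntityInfo(entityIri))
    info.labels.get(preferredLang).orElse(info.labels.get(fallbackLang))
  }
}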

Userdata includes the user's password hash

When authenticator.getUserProfileV1 returns a UserProfileV1, that object includes a UserDataV1 containing a hash of the user's password. This hash can easily end up being returned to the client in an API response. Can we remove it from UserDataV1?
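One option would be to keep the hash out of UserDataV1 entirely and carry it only on the profile used during authentication, stripping it before the profile can reach a response (a sketch; the field lists are invented for illustration):

// Sketch: keep the password hash off the object that gets serialized to clients.
case class UserDataV1(userIri: String, username: String) // no password hash here

case class UserProfileV1(userData: UserDataV1, passwordHash: Option[String]) {
  // Drop credentials before the profile is included in an API response.
  def withoutCredentials: UserProfileV1 = copy(passwordHash = None)
}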

Creating file information for Sipi

In Sipi responder's method getFileInfoForSipiV1, a SipiFileInfoGetResponseV1 is created and returned containing the permissions the user has on the file and the path/name of the file.

So this allows Sipi to send a fileValueIri to Knora and get the user's permissions on the file and also the actual file name (so Sipi can find it on the file system and read it).

This happens after the client has sent an IIIF URL naming a fileValueIri to the Sipi server. Why would the Sipi responder return another URL for accessing the file, when the client already used such a URL to make the request?

fileValueV1: FileValueV1 => valueUtilV1.makeSipiFileGetUrlFromFileValueV1(fileValueV1)

...


    /**
      * Creates a URL for accessing a file via Sipi. // TODO: implement this correctly.
      *
      * @param fileValueV1 the file value that the URL will point to.
      * @return a Sipi URL.
      */
    // TODO: if this is a StillImageFileValue, create a IIIF URL
    def makeSipiFileGetUrlFromFileValueV1(fileValueV1: FileValueV1): String = {
        makeSipiFileGetUrlFromFilename(fileValueV1.internalFilename)
    }

    /**
      * Creates a URL for accessing a file via Sipi. // TODO: implement this correctly.
      *
      * @param filename the name of the file that the URL will point to.
      * @return a Sipi URL.
      */
    def makeSipiFileGetUrlFromFilename(filename: String): String = {
        s"${settings.sipiUrl}/$filename"
    }

I think we have two different cases here:

  1. Knora creates an IIIF URL that the client can use to request an image from Sipi.
  2. Sipi has to ask Knora about the user's permissions on the file and about the internal filename, in case it got a fileValueIri in the IIIF URL (possibly this isn't even necessary, as Knora could directly put the file's name in the URL, also making use of the prefix to declare the project).

Don't use rdfs:range to check data consistency

Currently we use rdfs:range for consistency checks when creating values: if a property P has an rdfs:range of class C, we check that the value submitted is a C. In other words, we use rdfs:range to mean that if a value is not a C, it's invalid to make it an object of P.

Ontotext pointed out that this isn't actually what rdfs:range means. As its definition says, it really means that if something is an object of P, it is a C. In other words, rdfs:range is for inferring the type of the value, not for imposing constraints. They suggest we use another predicate to avoid confusion.

In fact, as far as I can tell, there's no standard predicate for this, so perhaps we should add one to knora-base.
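Such a predicate could be declared along these lines (a sketch; the name objectClassConstraint simply mirrors the knora-base:subjectClassConstraint mentioned elsewhere in this tracker, and the Turtle is wrapped in a Scala string for illustration):

// Sketch of a constraint predicate for knora-base (illustrative Turtle, invented name).
val objectClassConstraintDecl: String =
  """knora-base:objectClassConstraint rdf:type rdf:Property ;
    |    rdfs:comment "Requires the object of a property to belong to the given class"@en .
    |""".stripMargin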

Consider refactoring the ontology cache

Currently each combination of entity IRI and preferred language is queried and cached separately. Consider instead caching the unfiltered query results as discussed in pull request #26.

How to create an ActorRef outside an actor?

Hi

What would be the best way to create an ActorRef outside an Actor (outside a class that extends Actor)?

On my branch wip/setup_fake_responders, I created a TestResponderManagerV1 that extends ResponderManager and has access to all its definitions of the live routes.

However, each of these routes can be overridden in case someone passes a mock or fake responder:

override val sipiRouter = actor("mocksipi")(new Act {
    become {
        case sipiResponderConversionFileRequest: SipiResponderConversionFileRequestV1 =>
            future2Message(sender(), imageConversionResponse(sipiResponderConversionFileRequest), log)
        case sipiResponderConversionPathRequest: SipiResponderConversionPathRequestV1 =>
            future2Message(sender(), imageConversionResponse(sipiResponderConversionPathRequest), log)
    }
})

What would be the best way to create an actor in a test like SipiV1E2ESpec to pass it to TestResponderManagerV1 then?

The problem is that extending Actor and ActorLogging provides a lot of things that you otherwise have to deal with explicitly, like the logger or the ActorContext (an implicit ActorRefFactory is required: outside of an Actor you need an implicit ActorSystem, while inside an actor this is the implicit ActorContext).

So somehow I am breaking the design by creating those actors outside of an Actor.
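For what it's worth, outside an actor the ActorSystem itself can serve as the ActorRefFactory, and in tests Akka TestKit's TestProbe hands you a ready-made ActorRef (a sketch, assuming classic Akka actors; MockSipiResponder is invented for illustration):

import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.testkit.TestProbe

object OutsideActorSketch {
  implicit val system: ActorSystem = ActorSystem("test")

  // 1. The ActorSystem is an ActorRefFactory, so actorOf works outside any actor.
  class MockSipiResponder extends Actor {
    def receive: Receive = { case msg => sender() ! msg } // echo, just for illustration
  }
  val mockSipi: ActorRef = system.actorOf(Props(new MockSipiResponder), "mocksipi")

  // 2. In a test, a TestProbe provides an ActorRef plus expectations, with no Actor subclass at all.
  val probe = TestProbe()
  val probeRef: ActorRef = probe.ref
}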

CC: @subotic @benjamingeer

On resource creation, lastModificationDate is created more than once for the resource

When creating a resource, knora-base:lastModificationDate is created several times instead of just once.

How to reproduce this:

  • go to the branch wipi/sipi_integration
  • run Knora
  • in another terminal, call ./create_page_with_binaries.py in webapi/_test_data/test_route
  • copy the IRI of the new resource returned by Knora and search for it in the triplestore.

You will see that the property exists several times:

prop value
:lastModificationDate 2016-02-11T14:40:38.256+01:00
:lastModificationDate 2016-02-11T14:40:38.257+01:00
:lastModificationDate 2016-02-11T14:40:38.262+01:00
:lastModificationDate 2016-02-11T14:40:38.271+01:00
:lastModificationDate 2016-02-11T14:40:38.303+01:00

Now why is that? We both had a look at the SPARQL template createValue.scala.txt. There is a statement for lastModificationDate in the DELETE, INSERT, and WHERE clauses.

Store information about projects in the triplestore, not in application.conf

Currently application.conf contains a list of projects, along with the named graphs used by each project. It should not be necessary to change application.conf to create a project, because this requires restarting the server, and involves a Catch-22: you can't get the project IRI until the project is created, but you can't create it until you put its IRI in application.conf.

Instead, we could create a named graph called something like http://www.knora.org/data/config where a list of projects would be stored, along with the named graphs used for each project. Probably users should be stored in the same named graph, too, rather than with each project, since a user can belong to multiple projects.
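The proposed graph might contain entries along these lines (a Turtle sketch wrapped in a Scala string; the class and property names are invented for illustration):

// Sketch of the proposed config named graph (illustrative Turtle, invented vocabulary).
val configGraphSketch: String =
  """<http://data.knora.org/projects/incunabula> rdf:type knora-base:knoraProject ;
    |    knora-base:projectNamedGraph <http://www.knora.org/data/incunabula> .
    |""".stripMargin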

Password checking does not seem to work (authentication)

When I submit credentials (username and password), the password does not seem to be checked correctly. Only the existence of the username seems to be checked.

This happens both with HTTP header authentication and when the parameters are set in the URL (GET params).

Check the triplestore for inconsistencies

This could be either:

  • A program that we can run periodically to find inconsistencies.
  • A consistency checker that runs each time an update is performed. But this would pose a problem. To create a resource, we first create an empty resource, and then we add each value, one by one. The constraints won't be met until all the values are added.

Ideally, it would use the existing OWL constraints that are defined in the ontologies in the triplestore.

GraphDB can do consistency checks, but the documentation is not clear to me. I've emailed them to ask for clarification.

Implement monitoring

Steps:

  • measure response times (done in PR #1347)
  • expose metrics to Prometheus
  • provide a simple GUI to visualize the metrics

Running tests with graphdb

I have installed GraphDB SE 6.62 on my machine (Mac).

I ran the following scripts:

  • webapi/scripts/graphdb-se-load-test-data.sh (test)
  • webapi/scripts/graphdb-free-ci-prepare.sh (test-unit)

Now, using GraphDB with Knora (the test repo) works fine (set in application.conf), but the tests do not work with GraphDB:

Load test data *** FAILED *** (1 second, 828 milliseconds)

The test data cannot be loaded into test-unit. But the test-unit repository exists (I can access it over the web interface). I remember that we had this problem before, but I thought that running the script would solve it.

What am I doing wrong?

Make changing image servers easier

I would like to propose a change for application.conf to make exchanging image server settings easier. application.conf could look something like this:

app {
    ...
    imageserver {
        // type = webapi
        type = sipi
        webapi {
            url = "http://localhost"
            path = "/v1/assets"
            port = 3333
        }
        sipi {
            url = "http://localhost"
            path = ""
            port = 1024
            path-conversion-route = "convert_path"
            file-conversion-route = "convert_file"
            image-mime-types = ["image/tiff", "image/jpeg", "image/png", "image/jp2"]
            movie-mime-types = []
            sound-mime-types = []
        }
    }
    ...
}

Default permissions for knora-base:hasStillImageFileValue are missing in the ontology

For knora-base:hasStillImageFileValue, no default permissions are set.

This is because in incunabula-onto.ttl, incunabula:page is made a subclass of knora-base:StillImageRepresentation and inherits its property knora-base:hasStillImageFileValue (defined in knora-base.ttl), but without any default permission.

I suggest defining those default permissions in incunabula-onto.ttl, because this should be project-specific.

CC @benjamingeer

Handle valueHasComment in Search

In addition to valueHasString and rdfs:label, valueHasComment was added to the full-text index for the triplestore(s).

Do we have to adapt the fulltext search templates?

For Jena/Fuseki:

@if(triplestore == "embedded-jena-tdb" || triplestore == "fuseki") {
            ?matchingSubject knora-base:valueHasString ?literal .
            #?matchingSubject ?p ?literal .
            #FILTER(?p = knora-base:valueHasString || ?matchingProperty = rdfs:label)
        }

The literal could also be a comment!
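A minimal adaptation might follow the commented-out variant in the fragment above, matching on the predicate and filtering (a sketch, untested; rdfs:label matching may need the same treatment):

@if(triplestore == "embedded-jena-tdb" || triplestore == "fuseki") {
    ?matchingSubject ?p ?literal .
    FILTER(?p = knora-base:valueHasString || ?p = knora-base:valueHasComment)
}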

how to set up a fake sipi responder for testing?

I am now implementing the integration of Sipi into Knora.

In order to test this, it would be good to have a fake sipi responder that simulates the HTTP conversion request sent to Sipi. Otherwise, we could not test it without having Sipi running.

Could you help me with that?

Missing persons in images demo data

In images-demo-data, the consistency checker is finding a lot of references to missing resources of type images:person, which are the objects of properties like images:urheber. There are too many of these references to change them all manually. (In the original source data, there are hundreds of different :person resources.) We need to fix this somehow before consistency checking can be merged.

Sesame/GraphDB bug affecting the import of carriage returns

When we import RDF data containing a triple-quoted string that ends with a carriage return, the carriage return is stripped, apparently by Sesame:

https://openrdf.atlassian.net/browse/SES-425?jql=text%20~%20%22carriage%20return%22

https://openrdf.atlassian.net/browse/SES-1736?focusedCommentId=14423&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14423

It may be fixed in Sesame 2.7.9, in which case the fix should appear in GraphDB 7.0, which will use Sesame 2.8:

http://graphdb.ontotext.com/documentation/standard/roadmap.html

Problems when Adding Default Permissions to hasStillImageFileValue

When adding default permissions to hasStillImageFileValue, the ontology responder does not work properly anymore. The tests fail.

For example, if you want to create a page, this error occurs:

{"status":4,"error":"org.knora.webapi.InconsistentTriplestoreDataException: Resource class
 http://www.knora.org/ontology/incunabula#page has cardinalities for one or more link properties 
without corresponding link value properties. The missing link value property or properties: 
http://www.knora.org/ontology/incunabula#hasLeftSidebandValue, http://www.knora.org/ontology
/incunabula#hasRightSidebandValue"}

Steps to reproduce: add default permissions to hasStillImageFileValue in knora-base.ttl:

knora-base:hasDefaultRestrictedViewPermission knora-base:UnknownUser ;
knora-base:hasDefaultViewPermission knora-base:KnownUser ;
knora-base:hasDefaultModifyPermission knora-base:ProjectMember ,
                                      knora-base:Owner .

file values have no valueHasOrder

In the WHERE clause of addValueVersion.scala.txt, the pattern ?currentValue knora-base:valueHasOrder ?order had to be made optional, because otherwise file values would never be updated: they do not have a valueHasOrder.
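That is, the pattern in the template presumably now reads something like this (a sketch):

OPTIONAL { ?currentValue knora-base:valueHasOrder ?order . }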
