kbss-cvut / termit Goto Github PK

An advanced SKOS terminology manager linking concepts to their definitions in documents

License: GNU General Public License v3.0

Java 95.17% Shell 0.19% HTML 4.46% Dockerfile 0.04% Ruby 0.13%

java skos rdf ufo rest

termit's Introduction

TermIt

TermIt is a SKOS compliant terminology management tool based on Semantic Web technologies. It allows managing vocabularies consisting of thesauri and ontologies. It can also manage documents whose content can be used to seed the vocabularies (e.g., normative documents with definition of domain terminology). In addition, documents can also be analyzed to discover occurrences of the vocabulary terms.

Terminology

Asset

An asset is an object of one of the main domain types managed by the system - Resource, Term or Vocabulary.

Required Technologies

JDK 17 or newer
Apache Maven 3.6.x or newer

System Architecture

The system is split into two projects, TermIt is the backend, TermIt UI represents the frontend. Both projects are built separately and can run separately.

See the docs folder for additional information on implementation, setup, configuration and the architectural decisions record.

Technologies

This section briefly lists the main technologies and principles used (or planned to be used) in the application.

Spring Boot 3, Spring Framework 6, Spring Security, Spring Data (paging, filtering)
Jackson 2.13
JB4JSON-LD - Java - JSON-LD (de)serialization library
JOPA - persistence library for the Semantic Web
JUnit 5 (RT used 4), Mockito 4 (RT used 1), Hamcrest 2 (RT used 1)
Servlet API 4 (RT used 3.0.1)
JSON Web Tokens (CSRF protection not necessary for JWT)
SLF4J + Logback
CORS (for separate frontend)
Java bean validation (JSR 380)

Ontology

The ontology on which TermIt is based can be found in the ontology folder. For proper inference functionality, termit-model.ttl, the popis-dat ontology model (http://onto.fel.cvut.cz/ontologies/slovnik/agendovy/popis-dat/model) and the SKOS vocabulary model (http://www.w3.org/TR/skos-reference/skos.rdf) need to be loaded into the repository used by TermIt (see doc/setup.md) for details.

Monitoring

We use JavaMelody for monitoring the application and its usage. The data are available on the /monitoring endpoint and are secured using basic authentication. Credentials are configured using the javamelody.init-parameters.authorized-users parameter in application.yml (see the JavaMelody Spring Boot Starter docs).

Documentation

TermIt REST API is available for each instance via Swagger UI. It is accessible at http://SERVER_URL/PATH/swagger-ui/index.html, where SERVER_URL is the URL of the server at which TermIt backend is running and PATH is the context path. A link to the API documentation is also available in the footer of the TermIt UI.

Build configuration and deployment is described in setup.md.

Docker

The Docker image of TermIt backend alone can be built by docker build -t termit-server .

Then, TermIt can be run and exposed at the port 8080 as sudo docker run -e REPOSITORY_URL=<GRAPHDB_REPOSITORY_URL> -p 8080:8080 termit-server

An optional argument is <GRAPHDB_REPOSITORY_URL> pointing to the RDF4J/GraphDB repository.

TermIt Docker images are also published to DockerHub.

Links

TermIt UI - repository with TermIt frontend source code
TermIt Docker - repository with Docker configuration of the whole TermIt system (including the text analysis service and data repository)
TermIt Web - contains some additional information and tutorials
TermIt: A Practical Semantic Vocabulary Manager - a conference paper we wrote about TermIt
- Cite as Ledvinka M., Křemen P., Saeeda L. and Blaško M. (2020). TermIt: A Practical Semantic Vocabulary Manager.In Proceedings of the 22nd International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-423-7, pages 759-766. DOI: 10.5220/0009563707590766

License

Licensed under GPL v3.0.

termit's People

Contributors

Stargazers

Watchers

Forkers

datagov-cz michalmed aahmadai psiotwo alanbuzek holecekm filip-kopecky mighantos

termit's Issues

Support working with repository containing multiple copies of the same vocabulary

Follow-up to #163 and #164 - a repository may contain several copies of the same vocabulary, one is canonical, the other ones are working copies. Each user may open open of the working copies for editing.
TermIt has to be able to determine the correct context of the vocabulary and any other related vocabularies (vocabularies containing terms SKOS-related to the terms from the edited vocabulary).

Possibly problematic areas:

SKOS export (inferred skos:exactMatch and skos:relatedMatch - do not know if they are inferred based on statements in someone else's workspace)
- Solution: SKOS export in instance with possible workspaces will only contain asserted statements about skos:exactMatch and skos:relatedMatch

TODOs:

Harmonize code with current development head
Optimize retrieval of vocabulary repository contexts

Snapshoty pojmů vrácené TermIt API neobsahují některé atributy

Endpoint se snapshotem pojmu by měl vracet v elementu properties tyto atributy:
http://onto.fel.cvut.cz/ontologies/slovnik/slovník-datového-modelu-dtm/pojem/je-reálným-objektem
http://purl.org/dc/terms/references
http://www.w3.org/2004/02/skos/core#notation
http://www.w3.org/2004/02/skos/core#example

Tady příklad pojmu v aktuálních datech a ve verzi, kdy datově by měly být totožné:

Term removal does not delete `hasTopConcept` statements

If a term is a root term of a vocabulary and is removed, the skos:hasTopConcept referencing the term from the glossary remains in the repository.

Allow configuring types language

As a TermIt administrator, I want to be able to specify a file containing the definition of types users can use to classify terms.

Currently the types (based on UFO ontology) are loaded from a file that is packed into the application archive. This cause any changes to the types language to require rebuilding the project. Instead, it should be at least possible to specify the path to the language file as a parameter on startup, with the built-in file used as a default when no custom one is provided.

This is motivated by attempts to incorporate TermIt into the SGoV assembly line, which uses a different language to stereotype terms.

Document update fails due to JSON-LD deserialization exception

The following exception is thrown when attempting to update a document:

cz.cvut.kbss.jsonld.exception.AmbiguousTargetTypeException: Object with types [http://onto.fel.cvut.cz/ontologies/slovník/agendový/popis-dat/pojem/zdroj, http://onto.fel.cvut.cz/ontologies/slovník/agendový/popis-dat/pojem/dokument] matches multiple equivalent target classes: [class cz.cvut.kbss.termit.dto.listing.DocumentDto, class cz.cvut.kbss.termit.model.resource.Document]
	at cz.cvut.kbss.jsonld.deserialization.util.TargetClassResolver.ambiguousTargetType(TargetClassResolver.java:133)
	at cz.cvut.kbss.jsonld.deserialization.util.TargetClassResolver.selectFinalTargetClass(TargetClassResolver.java:105)
	at cz.cvut.kbss.jsonld.deserialization.util.TargetClassResolver.getTargetClass(TargetClassResolver.java:82)
	at cz.cvut.kbss.jsonld.deserialization.expanded.Deserializer.resolveTargetClass(Deserializer.java:51)
	at cz.cvut.kbss.jsonld.deserialization.expanded.ObjectDeserializer.openObject(ObjectDeserializer.java:79)
	at cz.cvut.kbss.jsonld.deserialization.expanded.ObjectDeserializer.processValue(ObjectDeserializer.java:60)
	at cz.cvut.kbss.jsonld.deserialization.expanded.ExpandedJsonLdDeserializer.deserialize(ExpandedJsonLdDeserializer.java:61)
	at cz.cvut.kbss.jsonld.jackson.deserialization.JacksonJsonLdDeserializer.deserialize(JacksonJsonLdDeserializer.java:85)
	at cz.cvut.kbss.jsonld.jackson.deserialization.JacksonJsonLdDeserializer.deserializeWithType(JacksonJsonLdDeserializer.java:120)

Endpoint: rest/resources/document

Ensure TermIt ontology is in a separate context in the repository

As a developer, I want to keep the TermIt ontology in a separate context (RDF graph) in the repository, so that it can be updated automatically (#227).
Currently, some of the existing deployments have the ontology in the default context, which makes the automated updates difficult (additions are fine, removals would be hard). If the ontology were in a dedicated context, we could just replace the context completely.

Automatic update of ontology in repository

As a developer, I sometimes make changes to the TermIt ontology (occasionally, even changes to the popis dat (data description) ontology happen). These changes may influence the inference results or behavior of the application. As installations of TermIt are created that are not managed by the development team, there needs to be a mechanism of automatically updating these ontologies in the main application repository, so that when a new version of TermIt is deployed, the ontologies in the repository are up-to-date.

Replace aspects with Spring application events

Following migration to JOPA 2.0.0(-SNAPSHOT), AspectJ is no longer required to work with the object model. However, we are currently using Aspects to notify certain components of selected events. This prevents the removal of AspectJ Maven plugin from the build configuration.

We should replace the Spring aspects with application events and remove AspectJ altogether.

Integration with Keycloak

In order to facilitate compatibility with the SGoV assembly line, TermIt has to be able to use Keycloak as an authorization service.

However, to retain backwards compatibility, it also has to be able to run without it, using its internal authentication mechanisms for secure access to the application.

Note that this issue involves backend as well as frontend of TermIt.

Repeated annotation of large files is slow

When text analysis is invoked on an already annotated larger file (cca 1MB) containing many term occurrences, processing of its results can take minutes to finish. This makes it practically unusable, as the user is unsure whether it is normal that the application shows Please wait... for several minutes and may leave/attempt to refresh.

Analysis of repeated annotation of the metropolitan plan shows the following times:

Invocation of text analysis: 8.5s
Resolution of occurrences in the file: 47s
Saving occurrences: 5min 31s

The goal should be to get at least under a minute altogether, preferably even better.

Return datetime values as ISO string in JSON

When using plain JSON, datetime values using Java 8 datetime API (Instant in particular) are serialized as decimal numbers by Jackson. Instead, they should be serialized as ISO 8601 strings. This will ensure, among other things, consistency with the representation in JSON-LD.

Import of vocabulary does not generate default document

When a vocabulary is imported, TermIt fails to generate a document for such vocabulary. In contrast, when a vocabulary is created (not through import, but through create vocabulary form), a document is generated for it.

Allow opening a set of vocabularies for editing

To facilitate collaborative creation and maintenance of multiple vocabularies, TermIt must be able to open only a selected set of vocabularies for editing and treating any other vocabularies as read-only. This should be session-based, so that multiple requests from the same user can work with the same set of vocabularies.

All vocabulary contexts are available for editing by default (this will ensure compatibility with the current behavior).

An API for opening a set of vocabularies (or rather a set of vocabulary contexts) has to be added.
Information about which contexts are open for editing by a user is stored in a session (server-side or client-side (token)).
List of vocabularies contains only the vocabularies open for editing.
References to vocabularies outside of the specified set (e.g., when a term from another vocabulary is referenced via a SKOS relationship) are read-only. I.e., they are accessible, but editing such vocabularies (the terms they contain) is forbidden.
termit-ui must be able to parse this set of vocabularies from a URL and set-up the working context accordingly.

Configure Docker Compose to preserve logs

Currently, the Docker Compose setup does logs only to system out, so the output is lost on restart. As a system admin, I need to be able to examine logs from before last restart.

Rewrite vocabulary history of content retrieval

The current implementation of a vocabulary content history retrieval is extremely inefficient, as it retrieves all change records related to the repository. There can be thousands of those, so the loading takes minutes and there are megabytes of data sent to the client which then only needs the grouped changes per day (added/edited every day).
This should be rewritten so that the backend immediately returns the aggregated changes.

Provide REST API documentation per instance

Currently, the REST API documentation is maintained manually at SwaggerHub. However, this is quite inefficient for two reasons:

Manual maintenance in a separate place than the source code makes it often outdated,
Testing the API is difficult because different instances would require different versions on SwaggerHub.

Instead, the documentation should be a part of each deployment of TermIt so that it can be directly tested. Moreover, the documentation of the endpoints would be specified directly in code. Springdoc OpenAPI could be used for this purpose.