
opencitations / oc_ocdm


Object mapping library for manipulating RDF graphs that are compliant with the OpenCitations datamodel.

Home Page: https://opencitations.net/

License: ISC License

Languages: Python 98.04%, Jupyter Notebook 1.96%
Topics: graphs, object-mapping, rdf, sparql

oc_ocdm's People

Contributors: arcangelo7, iosonopersia

oc_ocdm's Issues

Differences between documentation and code: import_entities_from_graph

The documentation of the import_entities_from_graph method of the Reader class is out of date: the method takes three mandatory arguments, not two (the GraphSet, the Graph and the responsible agent), and the responsible agent is not mentioned in the documentation. Furthermore, import_entities_from_graph is also invoked by the sync_with_triplestore method of the graph_set module, but without the responsible agent being specified, which prevents its use.

Implement runtime warnings for functional property setters

Setter methods that correspond to a functional property (e.g. hasTitle for a BibliographicResource) should either

  • generate a runtime warning when an already existing value gets overwritten,
  • or accept a parameter flag telling the method not to update the value if one is already set.

Similar functionality should also be discussed and implemented for "remover" methods that correspond to non-functional properties (e.g. remove_format for BibliographicResource): when the user doesn't specify a value to remove, the current implementation deletes all existing values. Instead, it should warn the user at runtime that they may have inadvertently forgotten to specify the value to be deleted.
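A minimal sketch of the first option, reusing the names from the existing has_title implementation shown further down this page; the overwrite flag and the warning text are illustrative, not part of the current API:

    import warnings

    @accepts_only('literal')
    def has_title(self, string: str, overwrite: bool = True) -> None:
        existing = self.g.value(self.res, GraphEntity.iri_title)
        if existing is not None:
            if not overwrite:
                return  # keep the current value, as requested by the caller
            warnings.warn(f"Overwriting title {existing!r} of {self.res}",
                          RuntimeWarning)
        self.remove_title()
        self._create_literal(GraphEntity.iri_title, string)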

Reader class

In the Reader class, the method graph_validation only validates "graph" entities: it should also validate "prov" and "metadata" entities.

It is also currently impossible to import "prov" and "metadata" entities (as opposed to "graph" entities) into a ProvSet and a MetadataSet respectively.

Finally, the Reader class should contain methods to import entities directly from a triplestore or an RDF file, instead of requiring an rdflib.Graph from the user; a sketch of the triplestore case follows.
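A minimal sketch of what importing directly from a triplestore could look like, using SPARQLWrapper; the helper name and signature are hypothetical:

    from rdflib import Graph
    from SPARQLWrapper import SPARQLWrapper, RDFXML

    def import_entity_from_triplestore(endpoint_url: str, entity_uri: str) -> Graph:
        # Fetch all outgoing triples of the entity with a CONSTRUCT query;
        # the resulting rdflib.Graph could then feed import_entities_from_graph
        sparql = SPARQLWrapper(endpoint_url)
        sparql.setQuery(f"CONSTRUCT {{ <{entity_uri}> ?p ?o }} "
                        f"WHERE {{ <{entity_uri}> ?p ?o }}")
        sparql.setReturnFormat(RDFXML)
        return sparql.queryAndConvert()  # an rdflib.Graph instance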

Inverse properties

As suggested by @GabrielePisciotta:

It might be interesting to have an inverse property in OCDM (or at least in the oc_ocdm library) to get all the entities that have a certain identifier object, starting from the latter (similarly to what we already have for ARs with get_is_held_by()). The use case is precisely that of going backwards, perhaps even starting from the literal value (e.g. a DOI string), retrieving all the identifier objects carrying that literal and then all the entities connected to those identifiers (e.g. a journal article).
Right now you can get the information of interest with a simple SPARQL query, but oc_ocdm cannot run SPARQL queries, and doing the reverse lookup manually is a bit convoluted; something like ar.get_is_held_by(), but for identifiers, would help.
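A minimal sketch of such a reverse lookup with plain rdflib, assuming the OCDM predicates datacite:hasIdentifier and literal:hasLiteralValue; the helper itself is hypothetical:

    from rdflib import Graph, Literal, Namespace

    DATACITE = Namespace("http://purl.org/spar/datacite/")
    LITERAL = Namespace("http://www.essepuntato.it/2010/06/literalreification/")

    def entities_with_identifier_value(g: Graph, value: str):
        # From the literal value (e.g. a DOI string) to the Identifier entities...
        for id_res in g.subjects(LITERAL.hasLiteralValue, Literal(value)):
            # ...and from each Identifier back to the entities holding it
            yield from g.subjects(DATACITE.hasIdentifier, id_res)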

Remove method create_discourse_element

We should have a look at the create_discourse_element method of the DiscourseElement class.
Given the existence of the other create_* methods, this method appears to be redundant.

External usage should be investigated before removing it.

Some Storer and Reader functions could be parallelized

Functions from Storer and Reader that execute SPARQL queries (SELECT/UPDATE) or that access RDF files (read/write) could be parallelized in some way to achieve better performance.

Generally speaking, applications that make use of oc_ocdm often need to import/store/upload large quantities of data. This usually takes a lot of time, often the biggest share of the total execution time.

An initial optimization can be found in Storer.upload_all(...), where the full set of UPDATE queries to be performed is grouped into batches of configurable maximum size (see the batch_size parameter). A possible next step is sketched below.
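A minimal sketch of sending independent batches concurrently; run_update stands for whatever function executes a single SPARQL UPDATE and is assumed here:

    from concurrent.futures import ThreadPoolExecutor

    def upload_batches_in_parallel(batches, run_update, max_workers=4):
        # SPARQL UPDATEs are I/O-bound, so a thread pool helps despite the GIL;
        # this is only safe if the batches are independent of each other
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(run_update, batches))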

Simplify code by using `Graph.set` and `Graph.value` methods

I just discovered a method of the rdflib.Graph class (which is used extensively by this library) that is a perfect fit for all the entities' setter methods that concern functional properties (e.g. br.has_title).

Graph.set is a convenience method that ensures no more than a single value is set for a specific subject-predicate pair, removing any preexisting values before adding the new one.

Additionally, Graph.value is a convenience method that returns no more than a single value for a specific subject-predicate pair (when called with any=False, it raises an error if more than one value exists).

  • Documentation for the set method.
  • Documentation for the value method.
  • Convenience methods for functional properties.

Example of a possible substitution:

    @accepts_only('literal')
    def has_title(self, string: str) -> None:
        self.remove_title()
        self._create_literal(GraphEntity.iri_title, string)

would become something similar to

    @accepts_only('literal')
    def has_title(self, string: str) -> None:
        self.g.set((self.res, GraphEntity.iri_title, Literal(string)))
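
For reference, a small self-contained demonstration of the Graph.set semantics described above:

    from rdflib import Graph, URIRef, Literal

    g = Graph()
    s = URIRef("http://example.org/br/1")
    p = URIRef("http://purl.org/dc/terms/title")
    g.set((s, p, Literal("First title")))
    g.set((s, p, Literal("Second title")))  # replaces, does not accumulate
    assert g.value(s, p) == Literal("Second title")
    assert len(g) == 1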

Proposal to add time zone to snapshot generation time

oc_ocdm saves snapshot generation times in the "%Y-%m-%dT%H:%M:%S" format, that is, without a time zone. In a federated scenario, where several endpoints manage different entities that reference each other in a way compliant with the OpenCitations Data Model, the absence of time zone information would be a problem. Fixing it would mean modifying not only the implementation but also the OCDM itself, in addition to the fact that none of the data generated so far carries a time zone indication. It's probably not a short-term problem, but I'm writing this issue to keep track of it.
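For illustration, producing a timezone-aware timestamp is a small change on the implementation side:

    from datetime import datetime, timezone

    # Current behaviour: naive timestamp with no offset information
    naive = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

    # Timezone-aware alternative, e.g. "2021-05-10T14:30:00+00:00"
    aware = datetime.now(timezone.utc).isoformat(timespec="seconds")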

Thread-safety

oc_ocdm is not thread-safe!

For now, it's better to make sure that this library is used within only one execution context at a time.

This is caused by the way in which the IDs of new entities are managed (storing and reading counters from text files or in-memory dictionaries).

It could be worth having a look at the CounterHandler subclasses.
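
Until a proper fix lands, a lock-based wrapper is one conceivable workaround; the sketch below assumes a handler exposing get_counter/set_counter methods and is not part of the current API:

    import threading

    class ThreadSafeCounterHandler:
        """Hypothetical wrapper that serializes access to a CounterHandler."""

        def __init__(self, inner):
            self._inner = inner
            self._lock = threading.Lock()

        def get_counter(self, *args, **kwargs):
            with self._lock:
                return self._inner.get_counter(*args, **kwargs)

        def set_counter(self, *args, **kwargs):
            with self._lock:
                return self._inner.set_counter(*args, **kwargs)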

Wider usage of support function is_empty_string

The support function is_empty_string from support.py could be used more widely across the entire project.

This function replaces this very common pattern when dealing with strings:

# The old:
if string is None or string == '':
    pass

# becomes:
if is_empty_string(string):
    pass

This makes the semantics of the conditional statement much clearer: the developer's intention is immediately understandable.
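
For reference, the helper itself boils down to a check like this (a sketch consistent with the pattern above):

    def is_empty_string(string: str) -> bool:
        # True when the string is missing or has no characters
        return string is None or string == ''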

Fixing the merge semantics

Should the "Merge" operation preserve non-OCDM-compliant statements from the removed entity?
Actually, in A.merge(B), the non-OCDM-compliant statements from A are kept intact, while the ones from B are not moved into A (and so they are lost forever). Should we change this behaviour?
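
If we decide to change it, a minimal sketch of the preserving variant could look like this (direct manipulation of the entities' .g graphs; the helper is hypothetical):

    def move_statements(source, target):
        # Re-attach B's triples to A before B is discarded, rewriting the
        # subject so that statements about B become statements about A
        for s, p, o in source.g:
            subj = target.res if s == source.res else s
            target.g.add((subj, p, o))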

Reuse methods from the wcw project

Some functions related to external identifiers that can be found in the Pusher subproject of wcw could be reused inside the Identifier class of oc_ocdm.

They can be found here.

The key idea here is to provide a "validity check" functionality when the user tries to call methods such as create_isbn, create_issn and create_orcid.
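
For instance, a minimal ISSN checksum validator, a sketch of the kind of check that create_issn could perform (written here from the ISSN spec, not taken from wcw):

    def is_valid_issn(issn: str) -> bool:
        # ISSN format: NNNN-NNNC, where C is a mod-11 check digit ('X' means 10)
        digits = issn.replace("-", "")
        if len(digits) != 8 or not digits[:7].isdigit():
            return False
        total = sum((8 - i) * int(c) for i, c in enumerate(digits[:7]))
        expected = (11 - total % 11) % 11
        return digits[7] == ("X" if expected == 10 else str(expected))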

Proposal to implement a helper method that returns AgentRoles in the correct order

I would like to make an implementation proposal. Currently, the AgentRoles linked to a certain BibliographicResource via the pro:isDocumentContextFor predicate are not sorted, even though an order is specified via the oco:hasNext predicate. At the global level this makes sense: when more than one type of AgentRole is involved, such as the authors and the publisher, a single overall order is not significant. Within a single type, however, the order of the AgentRoles is relevant, and the user may want to get, for example, all the authors in the right order. At the moment, this operation must be done manually by following the oco:hasNext predicate. It might therefore be meaningful to implement a support method that performs this operation automatically, without the user having to do it every time; a sketch follows.
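
A minimal sketch of such a helper, walking the oco:hasNext chain from the first AgentRole; the get_next() getter is assumed to exist on AgentRole:

    def get_ordered_agent_roles(first_ar):
        # Follow the oco:hasNext chain, collecting AgentRoles in order
        ordered, seen = [], set()
        current = first_ar
        while current is not None and current.res not in seen:
            ordered.append(current)
            seen.add(current.res)          # guards against accidental cycles
            current = current.get_next()   # assumed getter for oco:hasNext
        return ordered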

New method transitive_closure()

A new method named transitive_closure() could be added to class AbstractSet.

This new method should loop over every AbstractEntity contained in the set, looking for referenced entities that are not contained in that same set. Missing entities should then be imported from the right RDF file (by using find_paths from support.py) or from a triplestore. This process should be repeated until no further entity is missing, failing if a transitive closure of the set cannot be computed.

Particular attention should be paid to avoiding infinite loops.

An additional method, named is_transitively_closed(), could then be added to establish whether the set has any missing entities.
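
A minimal sketch of the intended fixed-point loop; referenced_entities and import_entity are hypothetical callbacks, and res_to_entity is assumed to be the set's URI-to-entity mapping:

    def transitive_closure(abstract_set, referenced_entities, import_entity,
                           max_rounds=100):
        # referenced_entities(entity) -> URIs that the entity points to;
        # import_entity(abstract_set, uri) -> loads from RDF file/triplestore
        for _ in range(max_rounds):  # hard bound guards against infinite loops
            present = set(abstract_set.res_to_entity)
            missing = {ref for entity in abstract_set.res_to_entity.values()
                       for ref in referenced_entities(entity)
                       if ref not in present}
            if not missing:
                return  # the set is now transitively closed
            for res in missing:
                import_entity(abstract_set, res)
        raise RuntimeError("transitive closure did not converge")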

Find a solution for method get_derives_from of EntitySnapshot

Method get_derives_from of class EntitySnapshot has a big problem which must be addressed in future releases of oc_ocdm.

When references to other EntitySnapshots are found, either an EntitySnapshot instance already contained in the ProvSet is returned, or a new one must be created and added to the ProvSet. In the latter case there's a problem: the constructor of a ProvEntity (the superclass of EntitySnapshot) requires the prov_subject parameter, which the method is currently incapable of retrieving.

Should the prov_subject parameter be removed from the constructor's signature of ProvSet (also adapting the rest of the code to the new assumption that a prov_subject could sometimes be unknown)? Or is there a viable way to retrieve that information?

  • If the parent snapshot(s) are stored in the same ProvSet instance, then there's no problem: add_se(None, uri) will not have to create any new entity and will simply return the one with that particular uri.
  • If the parent snapshot(s) are NOT stored in the same ProvSet instance, we currently have no way to find out what their prov_subject is (calling add_se(None, uri) will create a new snapshot entity with None as prov_subject).

A TODO comment is put in place here:

# TODO: what is the prov_subject of these snapshots?

Enforcing the functional constraint even on entities without the preexisting_graph

When adding an already existing entity to a set without importing it, it is currently possible to add triples that will be inserted into the persistent graph without the single-value constraint on functional properties being enforced (since the preexisting_graph would be empty). This should be fixed.

Example:

br = graph_set.add_br(..., res=URIRef("<existing_br_uri>"), ...)
br.has_title("<title>")

In the example, the existing BibliographicResource could already have a title defined in the persistent graph. Synchronising the above GraphSet would add a second title to the persistent entity (unless the two literals are exactly the same), thus breaking the functional constraint.

sync_with_triplestore and sync_with_rdf_file

A sync_with_triplestore method is contained inside the GraphSet class, but it's missing from MetadataSet.
Moreover, a sync_with_rdf_file variant should be added to both classes, enabling the user to synchronize either a triplestore or an RDF file with the user's GraphSet/MetadataSet: these methods should be able to import the entities that reference an entity which is going to be deleted, and should also remove those references.

Maybe these methods should be called something like propagate_deletions_to_triplestore and propagate_deletions_to_rdf_file.

These methods should not be present inside ProvSet, since provenance entities cannot be deleted!

Update CounterHandler interface

The CounterHandler interface could be simplified considerably just by allowing the user to directly pass the entity instance.

For example, instead of:

handler.set_counter(15, 'br', 'se', 28)

it could become:

handler.set_counter(15, snapshotEntityOfBibliographicResource28)

All required information can be easily extracted from the entity instance!
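
A sketch of how the simplified signature could recover the old arguments from the entity; every attribute and method name here is illustrative, not verified against the current API:

    def set_counter(self, new_value: int, entity) -> None:
        # Hypothetical: derive the old explicit arguments from the entity.
        # For a snapshot, short_name would be 'se' and prov_subject the br.
        subject = entity.prov_subject
        subject_count = int(str(subject.res).rsplit('/', 1)[-1])  # e.g. 28
        self._store(new_value, subject.short_name, entity.short_name,
                    subject_count)  # _store is an illustrative internal helper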

Add missing documentation comments

When I wrote documentation comments for this project, I did not have enough time to cover every module. The following files need to be enriched with comments on their classes and functions:

  • graph_entity.py
  • graph_set.py
  • metadata_entity.py
  • metadata_set.py
  • prov_entity.py
  • prov_set.py
  • reader.py
  • storer.py

Additionally, comments are missing from the following files, which do not contain publicly accessible functions:

  • query_utils.py
  • reporter.py
  • support.py

Lazy Loading strategy

In the future, the library could evolve by implementing a lazy loading strategy that automatically imports, from any source, only the entities that need to be modified.

OCDM extension proposal to also include the "posted-content"

The OpenCitations Data Model includes numerous types of bibliographic resources, such as book, book chapter, book part, book section, etc. However, among the various types returned by Crossref there is also "posted-content", i.e. content published on the web, which at the moment is not covered by the data model. I therefore propose extending the model to include this type of bibliographic resource as well. For example, it could be mapped via http://purl.org/spar/fabio/WebContent.

Function find_paths only considers files with extension '.json' and '.ttl'

Function find_paths only considers files with extension '.json' and '.ttl'. See the code for reference:

if is_json:
    format_string: str = ".json"
else:
    format_string: str = ".ttl"

This approach is not robust, since the Storer is able to export RDF files with a variety of different file extensions; those files cannot be found through the current implementation of find_paths. A noteworthy example is the '.nquads' extension of the provenance files.

Either another approach should be adopted or the current one should be extended; one option is sketched below.
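
One possible direction is to derive the extension from the configured serialization format instead of hard-coding it; the mapping below is a sketch and would need to cover every format the Storer actually supports:

    # Hypothetical mapping from output format to file extension
    EXTENSION_BY_FORMAT = {
        "json-ld": ".json",
        "ttl": ".ttl",
        "nt": ".nt",
        "nquads": ".nquads",  # used by the provenance files
    }

    def format_to_extension(output_format: str) -> str:
        try:
            return EXTENSION_BY_FORMAT[output_format]
        except KeyError:
            raise ValueError(f"Unsupported output format: {output_format}")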

Fixing generate_provenance

generate_provenance should check whether the .g graph of an entity was manually emptied by the user and, in such a case, automatically mark the entity as to be deleted before proceeding with the generation of the snapshots; a sketch of the check follows.
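
A minimal sketch of the check; res_to_entity, to_be_deleted and mark_as_to_be_deleted() are assumed names for the set's mapping, the deletion flag and the marker method:

    def mark_emptied_entities(graph_set) -> None:
        # An entity whose working graph was manually emptied is treated
        # as a deletion request (assumed attribute/method names)
        for entity in graph_set.res_to_entity.values():
            if len(entity.g) == 0 and not entity.to_be_deleted:
                entity.mark_as_to_be_deleted()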

query_utils.py and get_update_query

In order to avoid problems related to possible circular dependencies, a "hack" was put in place in the get_update_query function of the support/query_utils.py module. We need to know what type of entity has been passed to the function as an argument (GraphEntity, ProvEntity or MetadataEntity), but we cannot use the built-in function isinstance, since that would require importing GraphEntity, ProvEntity and MetadataEntity, leading to circular dependencies. Hence, whoever needs to use this function must perform the isinstance check themselves and pass get_update_query a string with one of these values: "graph", "prov" or "metadata".

This is surely not a good design choice and it should be fixed. It should also be investigated whether the circular dependency appears only when testing with "poetry run test" (which in turn uses unittest, which tries to load every module, including the __init__.py files that cause the problem). All internal import statements should always use the full path to the required module, since this bypasses the __init__.py files and should avoid circular dependencies at runtime.
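
For completeness, one standard way to keep the isinstance check inside get_update_query itself is a deferred (function-local) import, sketched below with assumed module paths:

    def get_update_query(entity) -> str:
        # Importing here defers the import to call time, when every module
        # is fully initialised, which breaks the circular dependency
        from oc_ocdm.graph.graph_entity import GraphEntity
        from oc_ocdm.metadata.metadata_entity import MetadataEntity
        from oc_ocdm.prov.prov_entity import ProvEntity

        if isinstance(entity, ProvEntity):
            entity_type = "prov"
        elif isinstance(entity, MetadataEntity):
            entity_type = "metadata"
        elif isinstance(entity, GraphEntity):
            entity_type = "graph"
        else:
            raise TypeError(f"Unsupported entity: {type(entity)}")
        # ...build and return the SPARQL UPDATE query for entity_type...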
