Comunica Experiments

A collection of small experiments with Comunica. The experiments are not ready for actual use and are mostly kept in case they turn out to be useful later. There is nothing of interest here for most people, and nothing is guaranteed to work.

Generation of test data

The SolidBench benchmark can be used to generate social network test data and example queries, and serve them using Community Solid Server. For example, to generate data and then serve it using the default configuration:

yarn run generate
yarn run serve

Please check the SolidBench documentation for more details, especially the dependencies, as it requires Docker for the data generation part. The generated data will end up in the out-fragments folder in the workspace root.

Development setup

The project uses Yarn as its package manager. To install the dependencies and build the components after cloning, run the commands below. The optional --ignore-engines flag can be added to the install command if some packages complain about an up-to-date Node version:

yarn install
yarn run build

Generation of VOID descriptions for test data

The dataset description generator tool allows generating a dataset description for all the pods generated by SolidBench, using the VOID vocabulary. The description will contain information on the total count of triples in the pod, distinct subjects and objects, unique property count and the cardinalities of various properties. By default, the file gets placed in profile/voiddescription.nq for each pod, and is linked to from the WebID. For example:

@prefix void: <http://rdfs.org/ns/void#>.
@prefix ldbcv: <http://localhost:3000/www.ldbc.eu/ldbc_socialnet/1.0/vocabulary/>.
...

<http://localhost:3000/pods/00000000000000000065> a void:Dataset;
    void:triples 3410;
    void:distinctSubjects 471;
    void:distinctObjects 111;
    void:properties 36;
    void:propertyPartition [ a 332 ];
    void:propertyPartition [ ldbcv:id 313 ];
    void:propertyPartition [ ldbcv:creationDate 314 ];
    void:propertyPartition [ ldbcv:locationIP 313 ];
    ... etc.
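
As a rough illustration of the statistics listed above, the following TypeScript sketch counts total triples, distinct subjects and objects, distinct properties and per-property cardinalities. It is not the generator tool itself: the Triple and VoidStatistics shapes, the computeVoidStatistics function and the example values are all hypothetical, and the triples are assumed to be already parsed into plain strings.

```typescript
// Hypothetical minimal triple shape, assumed for this sketch only.
interface Triple {
  subject: string;
  predicate: string;
  object: string;
}

// Hypothetical container for the statistics described in the text.
interface VoidStatistics {
  triples: number;
  distinctSubjects: number;
  distinctObjects: number;
  properties: number;
  propertyCardinalities: Map<string, number>;
}

function computeVoidStatistics(triples: Triple[]): VoidStatistics {
  const subjects = new Set<string>();
  const objects = new Set<string>();
  const propertyCardinalities = new Map<string, number>();
  for (const triple of triples) {
    subjects.add(triple.subject);
    objects.add(triple.object);
    // Count how many triples use each predicate.
    const current = propertyCardinalities.get(triple.predicate) || 0;
    propertyCardinalities.set(triple.predicate, current + 1);
  }
  return {
    triples: triples.length,
    distinctSubjects: subjects.size,
    distinctObjects: objects.size,
    properties: propertyCardinalities.size,
    propertyCardinalities,
  };
}

// Made-up example data, not taken from any generated pod.
const exampleStatistics = computeVoidStatistics([
  { subject: 'ex:post1', predicate: 'ldbcv:id', object: '"1"' },
  { subject: 'ex:post1', predicate: 'ldbcv:locationIP', object: '"192.0.2.1"' },
  { subject: 'ex:post2', predicate: 'ldbcv:id', object: '"2"' },
]);
console.log(exampleStatistics.triples, exampleStatistics.distinctSubjects, exampleStatistics.properties);
```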

The generator tool is a fairly naive script that makes many assumptions about paths and other details, and is not yet reusable. It should get built together with the rest of the workspaces. To run it from the repository root:

yarn run index

Additionally, some but not all of the pods appear to have Solid type indexes, apparently generated by the SolidBench tool.

Using VOID description metadata for querying

The ActorRdfMetadataExtractVoidDescription actor in packages/ extracts any VOID metadata it finds and places it in the metadata object, keeping track of the metadata at the dataset level. The actor primarily focuses on providing predicate cardinalities for query operations. For running queries, the query runner tool in tools/query-runner includes the new actor by creating an instance of the query engine with the custom configuration in templates/config-query-cardinalities.json. Running the query tool should work with:

yarn run query
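
To illustrate what dataset-level predicate cardinalities could be used for, the sketch below estimates how many triples match a pattern with a given predicate by summing the counts over the known datasets. This is an assumption about one possible use, not the actor's actual logic or Comunica's API: DatasetDescription and estimateCardinality are hypothetical names, and the second pod and its numbers are made up.

```typescript
// Hypothetical per-dataset metadata, mirroring the VOID description fields.
interface DatasetDescription {
  triples: number;
  propertyCardinalities: Record<string, number>;
}

// Sum the matching counts over all datasets. Without a predicate, fall back to the
// total triple count; an unknown predicate contributes zero.
function estimateCardinality(datasets: Record<string, DatasetDescription>, predicate?: string): number {
  let total = 0;
  for (const uri of Object.keys(datasets)) {
    const description = datasets[uri];
    total += predicate === undefined
      ? description.triples
      : (description.propertyCardinalities[predicate] || 0);
  }
  return total;
}

const exampleDatasets: Record<string, DatasetDescription> = {
  // Values copied from the example description earlier in this document.
  'http://localhost:3000/pods/00000000000000000065': {
    triples: 3410,
    propertyCardinalities: { 'ldbcv:id': 313, 'ldbcv:creationDate': 314 },
  },
  // Made-up second pod, for illustration only.
  'http://localhost:3000/pods/hypothetical': {
    triples: 100,
    propertyCardinalities: { 'ldbcv:id': 10 },
  },
};

console.log(estimateCardinality(exampleDatasets, 'ldbcv:id'));
```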

The query runner tool attempts some simple approximate timing of queries, just to get a rough estimate of the time it takes to execute one. The tool uses the configuration in templates/config-runner.json by default, and runs each query on each config the number of times specified by repeat. The query durations are then averaged and serialised in templates/results.csv for each combination. This is not necessarily a valid way to benchmark anything; it is only there to give some idea of how the configurations affect the query durations.
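
The repeat-and-average approach described above might look roughly like the following sketch. It is not the query runner's code: TimingResult, timeQueries, toCsv, the CSV column names and the execute callback are all hypothetical, and query and config labels are assumed to contain no commas.

```typescript
// Hypothetical result row: one averaged duration per query and config combination.
interface TimingResult {
  query: string;
  config: string;
  averageMs: number;
}

// Run every query on every config `repeat` times and average the wall-clock durations.
// The `execute` callback stands in for whatever actually runs the query.
async function timeQueries(
  queries: string[],
  configs: string[],
  repeat: number,
  execute: (query: string, config: string) => Promise<void>,
): Promise<TimingResult[]> {
  if (repeat < 1) {
    throw new Error('repeat must be at least 1');
  }
  const results: TimingResult[] = [];
  for (const config of configs) {
    for (const query of queries) {
      let totalMs = 0;
      for (let i = 0; i < repeat; i++) {
        const start = Date.now();
        await execute(query, config);
        totalMs += Date.now() - start;
      }
      results.push({ query, config, averageMs: totalMs / repeat });
    }
  }
  return results;
}

// Serialise the averaged results as simple CSV text (no escaping is attempted).
function toCsv(results: TimingResult[]): string {
  const rows = results.map(r => `${r.config},${r.query},${r.averageMs.toFixed(2)}`);
  return ['config,query,averageMs', ...rows].join('\n');
}

// Usage with a dummy executor that does nothing.
timeQueries(['query-a'], ['config-a'], 3, async () => {})
  .then(results => console.log(toCsv(results)));
```

Millisecond resolution from Date.now() is coarse, which fits the stated goal of a rough estimate rather than a benchmark.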
