chat-wane / sage-jena

Sage ⨯ Jena: Sage on top of Jena.

License: Apache License 2.0

Java 100.00%

sage-jena's Issues

Continuous integration that runs tests

Implement `progress()`

Taking as argument the logical plan and cardinality of each iterator.

Assume two iterators (?s <p> ?o . ?s <p'> ?o'): the first has processed 3 of its 10 elements, and the second has processed 5 of its 100 elements. The progress is then 3/10 + 1/10 * 5/100 = 0.305.

Of course, this is a best-effort strategy since (i) cardinalities are not 100% accurate, and (ii) nothing indicates the cardinalities of all 10 iterators implied by the first iterator. Each of the 7 remaining elements of the first iterator could have 1M results, so execution could take much longer than first estimated.
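The formula above can be sketched as follows. This is only an illustration under assumptions: `ProgressSketch`, `ScanState`, and the flat list standing in for the logical plan are hypothetical names, not part of sage-jena, and the pipeline is assumed to be a left-deep chain of nested-loop joins.

```java
import java.util.List;

/** Hypothetical sketch: names and structure are illustrative, not the sage-jena API. */
final class ProgressSketch {

    /** Elements already processed by one scan iterator, and its (estimated) cardinality. */
    record ScanState(long processed, long cardinality) {}

    /**
     * Best-effort progress in [0, 1]. Each iterator adds its processed fraction,
     * scaled by the share that one element of every iterator above it represents.
     */
    static double progress(List<ScanState> pipeline) {
        double progress = 0.0;
        double weight = 1.0; // share of the whole query covered by the current level
        for (ScanState s : pipeline) {
            if (s.cardinality() <= 0) {
                break; // empty or unknown cardinality: stop refining the estimate
            }
            progress += weight * ((double) s.processed() / s.cardinality());
            weight /= s.cardinality(); // one element of this level
        }
        return Math.min(progress, 1.0);
    }

    public static void main(String[] args) {
        // Example from the issue: 3 out of 10, then 5 out of 100.
        System.out.println(progress(List.of(new ScanState(3, 10), new ScanState(5, 100))));
    }
}
```

The estimate inherits both caveats listed above: it is only as good as the cardinalities it is given, and it assumes every element of an outer iterator costs the same.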

`LIMIT` operator > `SageInput.limit` makes the query create all results

Hence, every LIMIT iterator should have a saved state that contains the number of results seen until pause.
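A minimal sketch of such a LIMIT iterator, assuming the saved state is just a counter. `ResumableLimit` and `savedState()` are hypothetical names chosen for illustration; the real iterator would plug into Jena's query iterators and Sage's state export.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical LIMIT iterator whose counter survives a pause/resume cycle. */
final class ResumableLimit<T> implements Iterator<T> {
    private final Iterator<T> input;
    private final long limit;
    private long seen; // results produced so far, including those before the pause

    /** @param alreadySeen counter saved at the previous pause (0 on the first run) */
    ResumableLimit(Iterator<T> input, long limit, long alreadySeen) {
        this.input = input;
        this.limit = limit;
        this.seen = alreadySeen;
    }

    @Override
    public boolean hasNext() {
        return seen < limit && input.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        seen++;
        return input.next();
    }

    /** Value to store in the saved state when the query is paused. */
    long savedState() {
        return seen;
    }
}

final class ResumableLimitDemo {
    public static void main(String[] args) {
        Iterator<Integer> source = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).iterator();
        ResumableLimit<Integer> first = new ResumableLimit<>(source, 5, 0);
        first.next();
        first.next();
        // Pause here, then resume on the same input with the saved counter.
        ResumableLimit<Integer> resumed = new ResumableLimit<>(source, 5, first.savedState());
        while (resumed.hasNext()) {
            System.out.println(resumed.next());
        }
    }
}
```

Without the saved counter, each resumed execution would start counting from zero again and the query would keep producing results past the requested limit.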

`ExtensibleRowSetWriter` to add series of outputs to the standard one

For instance, Sage needs to return the state of iterators to enable resuming query execution later on.

But we could envision further exports of metadata. For instance, statistics on already visited scans.

Logical time testing

One major issue with preemptive queries is that they can stop at any time, and more specifically in the middle of the iterator model.

It would be great to have an easy means of changing the logical time in the middle of the iterator model, and therefore enable testing on timeout.
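One possible shape for this, sketched under assumptions: the iterators read time through an injectable clock instead of the system clock, so a test can move time forward at a precise point. `LogicalClock`, `ManualClock`, and `Deadline` are hypothetical names, not existing sage-jena classes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical abstraction: iterators ask this clock instead of the system clock. */
interface LogicalClock {
    long now();
}

/** Clock for tests: time only moves when the test says so. */
final class ManualClock implements LogicalClock {
    private final AtomicLong time = new AtomicLong();

    @Override
    public long now() {
        return time.get();
    }

    void advance(long delta) {
        time.addAndGet(delta);
    }
}

/** What a preemptable iterator would consult before producing its next result. */
final class Deadline {
    private final LogicalClock clock;
    private final long deadline;

    Deadline(LogicalClock clock, long timeout) {
        this.clock = clock;
        this.deadline = clock.now() + timeout;
    }

    boolean shouldPause() {
        return clock.now() >= deadline;
    }
}

final class LogicalClockDemo {
    public static void main(String[] args) {
        ManualClock clock = new ManualClock();
        Deadline deadline = new Deadline(clock, 10);
        System.out.println(deadline.shouldPause()); // time has not moved yet
        clock.advance(10); // the test decides exactly when the timeout fires
        System.out.println(deadline.shouldPause());
    }
}
```

In production the same interface could be backed by `System.currentTimeMillis()`, while a test advances the manual clock between two `hasNext()` calls to force a pause at a chosen iterator.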

Fix the basic `SELECT * WHERE {?s ?p ?o}`

For now, it fails because part of the code was removed.

It's useless most of the time, but should be done anyway.

Benchmarking `Sage` vs `TDB` on `Watdiv`

run all configurations easily and save
commit the .csv files of every experiment along with a description of the machine it ran on
write a data visualization program, maybe a Jupyter notebook?

Performance issue `Sage` vs `TDB` on volcano.

Running the benchmark on a simple query (query_10084) with watdiv.10M for the first time highlights performance issues with the current Sage implementation. In the measured iteration below, the query takes roughly 1.7x longer with Sage than with TDB2 (0,156 s/op versus 0,094 s/op).

SELECT ?v5 ?v3 ?v0 ?v1 ?v2 WHERE { 
    ?v0 <http://db.uwaterloo.ca/~galuc/wsdbm/gender> <http://db.uwaterloo.ca/~galuc/wsdbm/Gender0>.
    ?v0 <http://xmlns.com/foaf/familyName> ?v1.
    ?v0 <http://xmlns.com/foaf/givenName> ?v2.
    ?v0 <http://schema.org/email> ?v3. 
    ?v0 <http://db.uwaterloo.ca/~galuc/wsdbm/userId> ?v5. 
}

With TDB2:

# Warmup Iteration   1: 0,282 s/op
# Warmup Iteration   2: 0,131 s/op
# Warmup Iteration   3: 0,102 s/op
Iteration   1: 0,094 s/op

With Sage:

# Warmup Iteration   1: 0,766 s/op
# Warmup Iteration   2: 0,198 s/op
# Warmup Iteration   3: 0,156 s/op
Iteration   1: 0,156 s/op

Create our own `Pair<L,R>`

instead of using the one from Jena… to get rid of this dependency in sage-commons
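A Java record is probably enough here. The sketch below is a suggestion only; the accessor names `left()` and `right()` and the `of` factory are assumptions, and call sites would need to be adapted accordingly.

```java
/** Hypothetical dependency-free pair for sage-commons. */
record Pair<L, R>(L left, R right) {

    /** Convenience factory, so call sites do not need the diamond operator. */
    static <L, R> Pair<L, R> of(L left, R right) {
        return new Pair<>(left, right);
    }
}

final class PairDemo {
    public static void main(String[] args) {
        Pair<String, Integer> p = Pair.of("scan", 42);
        System.out.println(p.left() + " -> " + p.right());
    }
}
```

Records provide `equals`, `hashCode`, and `toString` for free, which covers the usual needs of a pair used as a map key or in logs.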

Implement `Values` Operator

Not sure whether it is part of core SPARQL, but it should not be too complicated…

Preempt `query_10078` fails on timeout 1ms

The first run (slower) does not report the same number of results as the second (faster). The true number of results is 3932428.

There is a mistake in preemptive queries that needs to be investigated. Possibly an issue with a fully bound triple pattern as the last pattern of the query.

SELECT ?v7 ?v1 ?v0 ?v4 ?v2 ?v6 ?v3 ?v8 WHERE {
	?v1 <http://schema.org/priceValidUntil> ?v8.
	?v1 <http://purl.org/goodrelations/validFrom> ?v2.
	?v1 <http://purl.org/goodrelations/validThrough> ?v3.
	?v1 <http://schema.org/eligibleQuantity> ?v6.
	?v0 <http://purl.org/goodrelations/offers> ?v1.
	?v1 <http://schema.org/eligibleRegion> ?v7.
	?v4 <http://schema.org/nationality> ?v7.
	?v4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://db.uwaterloo.ca/~galuc/wsdbm/Role1>.
}

[fr.gdd.sage.WatdivBenchmark.execute-jmh-worker-1] DEBUG fr.gdd.sage.WatdivBenchmark - Got 3932428 results for this query in 7526 pause/resume.
17,227 s/op
# Warmup Iteration   2: <failure>

java.lang.Exception: /!\ not the same number of results on sage-jena-benchmarks/queries/watdiv_with_sage_plan/query_10078.sparql: 3932428 vs 3932328.
	at fr.gdd.sage.WatdivBenchmark.execute(WatdivBenchmark.java:75)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_ss_jmhStub(WatdivBenchmark_execute_jmhTest.java:433)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_SingleShotTime(WatdivBenchmark_execute_jmhTest.java:385)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:475)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:458)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Preemptable ID iterator for Jena

Maybe keep an iterator on Record internally and use the mapper ourselves at the end.

With the Record we can get a path, hence maybe continue where we left the computation.
There is also an idx on the bytebuffer to take into account.

Cleaner way to save preempted states with `union` and `bgp`

For now, it works because:

sage-jena/sage-jena-volcano/src/main/java/fr/gdd/sage/arq/VolcanoIteratorTupleId.java

Lines 47 to 52 in e872afd

    
    if (this.output.getState().containsKey(id)) {
        return false;
    }
    boolean shouldSaveCurrent = Objects.isNull(this.output.getState()) ||
            this.output.getState().keySet().stream().filter(k -> k < 1000)
                    .collect(Collectors.toUnmodifiableList()).isEmpty();

which effectively says that an iterator cannot overwrite an already saved state (a preempted union finishes enumerating its members, so we forbid overwriting). This also means that PatternMatchSage should be reworked to provide more consistent identifiers.

The condition also states that identifiers of 1000 and above are ignored because they belong to unions (see VolcanoIteratorFactory). Unions call hasNextBinding before iterators call hasNext, i.e., they save first.
Again, this means that identifiers should be reworked.

Suggestion: create a graph that reflects the execution and its preemptive iterators, whether they are unions, scans, or something else.

Generate DOI with Zenodo or alternative

So that this repository can be easily and consistently referred to using this DOI:

The repository must have a licence.
The repository must be public first though.
Add the DOI to the main README.md

Implement `cardinality()` of scan iterators

By exploring the balanced tree.

The result should be a close approximation of the cardinality value.
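One way to read "exploring the balanced tree" is sketched below on a toy in-memory tree: follow the two boundary paths of the scanned range exactly, and estimate every subtree lying strictly between them from its leftmost path alone. `Node`, `estimateSize`, and `estimateRange` are hypothetical names; this is not Jena's B+tree code, and it assumes fan-out and leaf fill are roughly uniform.

```java
import java.util.List;

/** Hypothetical sketch on a toy balanced tree; not Jena's B+tree classes. */
final class CardinalitySketch {

    /** Leaf: keys are the stored sorted keys. Internal node: keys are separators. */
    record Node(int[] keys, List<Node> children) {
        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    /** Estimates a subtree size from its leftmost path only. */
    static long estimateSize(Node n) {
        long product = 1;
        while (!n.isLeaf()) {
            product *= n.children().size();
            n = n.children().get(0);
        }
        return product * n.keys().length;
    }

    /** Estimates the number of keys in [lo, hi). */
    static long estimateRange(Node n, int lo, int hi) {
        if (lo >= hi) {
            return 0;
        }
        if (n.isLeaf()) {
            long count = 0;
            for (int k : n.keys()) {
                if (k >= lo && k < hi) {
                    count++;
                }
            }
            return count;
        }
        int first = childIndex(n, lo);
        int last = childIndex(n, hi - 1);
        if (first == last) {
            return estimateRange(n.children().get(first), lo, hi);
        }
        // Boundary children are explored; children in between are only estimated.
        long total = estimateRange(n.children().get(first), lo, hi)
                + estimateRange(n.children().get(last), lo, hi);
        for (int i = first + 1; i < last; i++) {
            total += estimateSize(n.children().get(i));
        }
        return total;
    }

    /** Index of the child covering key: the number of separators <= key. */
    static int childIndex(Node n, int key) {
        int i = 0;
        while (i < n.keys().length && n.keys()[i] <= key) {
            i++;
        }
        return i;
    }
}
```

Only two root-to-leaf paths are read in full, so the cost grows with the tree height rather than with the number of matching records. The price is an approximation whenever nodes are unevenly filled.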

Backjump feature

Note: This is not directly related to Sage. Only keeping this for the record.

Let us consider the following query:

SELECT * WHERE {
  ?s p1 ?o .
  ?o p2 ?x .
  ?s p3 ?y }

The join order is that of the query. Sometimes, ?s p3 ?y may fail, yet the engine enumerates all bindings of ?o p2 ?x. It would be better to backjump [1] directly to the first triple pattern since the variable ?s is a probable cause of failure.

In terms of implementation, this means:

Throwing an exception in has_next() that carries the incriminated variable, and letting the iterator that actually sets this variable catch it, so it can call next() accordingly.
For the sake of genericity, we must wrap each scan iterator in a backjump iterator that can throw, and make sure that every iterator properly forwards the exception.

This demonstrated remarkable performance improvements in compiled versions of SPARQL queries; does this hold for the iterator/volcano model?
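The mechanism can be sketched as follows. This is a simplification under assumptions: it uses a recursive nested-loop join over in-memory maps rather than real volcano iterators, it assumes object variables are always fresh, and `BackjumpSketch`, `Pattern`, and `probes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of backjumping in a nested-loop join; not sage-jena code. */
final class BackjumpSketch {

    /** Carries the blamed variable; no stack trace since it is plain control flow. */
    static final class BackjumpException extends RuntimeException {
        final String variable;

        BackjumpException(String variable) {
            super(null, null, false, false);
            this.variable = variable;
        }
    }

    /** Pattern "?subjectVar <p> ?objectVar" over data indexed by subject. */
    record Pattern(String subjectVar, String objectVar, Map<String, List<String>> data) {}

    static int probes = 0; // number of scans opened, to observe the saving

    static void join(List<Pattern> plan, int depth, Map<String, String> env,
                     List<Map<String, String>> out) {
        if (depth == plan.size()) {
            out.add(new HashMap<>(env));
            return;
        }
        Pattern p = plan.get(depth);
        probes++;
        String s = env.get(p.subjectVar());
        if (s != null && p.data().getOrDefault(s, List.of()).isEmpty()) {
            // No match for an already bound subject: blame that variable.
            throw new BackjumpException(p.subjectVar());
        }
        boolean bindsSubject = (s == null);
        List<String> subjects = bindsSubject ? new ArrayList<>(p.data().keySet()) : List.of(s);
        for (String subject : subjects) {
            for (String object : p.data().get(subject)) {
                if (bindsSubject) {
                    env.put(p.subjectVar(), subject);
                }
                env.put(p.objectVar(), object);
                try {
                    join(plan, depth + 1, env, out);
                } catch (BackjumpException e) {
                    boolean ownsObject = e.variable.equals(p.objectVar());
                    boolean ownsSubject = bindsSubject && e.variable.equals(p.subjectVar());
                    if (!ownsObject && !ownsSubject) {
                        throw e; // variable was not set here: forward the exception
                    }
                    if (ownsSubject) {
                        break; // every object of this subject would fail the same way
                    }
                } finally {
                    env.remove(p.objectVar());
                    if (bindsSubject) {
                        env.remove(p.subjectVar());
                    }
                }
            }
        }
    }
}
```

With the query above, a failing `?s p3 ?y` aborts the enumeration of `?o p2 ?x` after its first binding and jumps straight back to the level that bound `?s`.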

[1] R. J. Bayardo Jr., and D. P. Miranker. Processing Queries for First-Few Answers. In Proceedings of the fifth international conference on Information and knowledge management (1996).

`PreemptJenaIterators` without `tMap` fail

Comes from a Watdiv query in force order mode, with a Sage timeout of 1s:

SELECT ?v6 ?v8 ?v0 ?v7 ?v3 ?v4 ?v1 ?v2 WHERE {
	?v1 <http://schema.org/priceValidUntil> ?v8.
	?v1 <http://purl.org/goodrelations/validFrom> ?v2.
	?v1 <http://purl.org/goodrelations/validThrough> ?v3.
	?v1 <http://schema.org/eligibleQuantity> ?v6.
	?v0 <http://purl.org/goodrelations/offers> ?v1.
	?v1 <http://schema.org/eligibleRegion> ?v7.
	?v4 <http://schema.org/nationality> ?v7.
	?v4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://db.uwaterloo.ca/~galuc/wsdbm/Role0>.
}

Possibly: The issue comes from the creation of a Singleton or a Null iterator which only wraps the iterator into a preemptable one. The mapper is therefore null and fails when requesting .current() or .previous().

java.lang.NullPointerException: Cannot invoke "org.apache.jena.atlas.lib.tuple.TupleMap.getSlotIdx(int)" because "tMap" is null
	at org.apache.jena.tdb2.lib.TupleLib.record(TupleLib.java:141)
	at org.apache.jena.dboe.trans.bplustree.PreemptJenaIterator.current(PreemptJenaIterator.java:126)
	at org.apache.jena.dboe.trans.bplustree.PreemptJenaIterator.current(PreemptJenaIterator.java:26)
	at fr.gdd.sage.arq.VolcanoIteratorTupleId.hasNext(VolcanoIteratorTupleId.java:46)
	at org.apache.jena.atlas.iterator.Iter$IterMap.hasNext(Iter.java:412)
	at org.apache.jena.atlas.iterator.Iter.hasNext(Iter.java:1104)
	at org.apache.jena.atlas.iterator.Iter$IterFiltered.hasNext(Iter.java:272)
	at org.apache.jena.atlas.iterator.Iter.hasNext(Iter.java:1104)
	at org.apache.jena.atlas.iterator.IteratorFlatMap.hasNext(IteratorFlatMap.java:46)
	at org.apache.jena.sparql.engine.iterator.IterAbortable.hasNext(IterAbortable.java:59)
	at org.apache.jena.atlas.iterator.Iter$IterMap.hasNext(Iter.java:412)
	at org.apache.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:59)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:58)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.PreemptCounterIter.hasNextBinding(PreemptCounterIter.java:33)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:38)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:38)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.exec.RowSetStream.hasNext(RowSetStream.java:47)
	at org.apache.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:81)
	at fr.gdd.sage.ExecuteUtils.executeQueryTillTheEnd(ExecuteUtils.java:66)
	at fr.gdd.sage.ExecuteUtils.executeTillTheEnd(ExecuteUtils.java:84)
	at fr.gdd.sage.SetupBenchmark.execute(SetupBenchmark.java:124)
	at fr.gdd.sage.WatdivBenchmark.execute(WatdivBenchmark.java:65)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_ss_jmhStub(WatdivBenchmark_execute_jmhTest.java:433)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_SingleShotTime(WatdivBenchmark_execute_jmhTest.java:385)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:475)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:458)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Add `WDBench` as benchmarking target

This would enable testing OPTIONAL, as well as additional benchmarking on BGPs.

Additionally, this would prepare the ground for preemptive property path queries.

All changes to Jena in one place

For now, changes are spread between sage-query-generator and this project.

Move everything into this repository to make one fully self-contained repository.

Implement a simple join ordering optimizer

Now that we have cardinalities, we can work on the join ordering of graph patterns.

A simple heuristic: the lowest cardinality comes first; candidates are chosen among those with set variables, or among all if there are none.
Then filter push-down: as soon as all variables in the filter are set, the filter comes into play.

Should groups be removed? (Unclear, maybe an option.)
GRAPH clauses inherently create groups; these should be removed so we can reorder quads.
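The ordering heuristic can be sketched as a greedy loop. This is one reading of the heuristic, not a definitive implementation: `JoinOrderSketch` and `TriplePattern` are hypothetical names, "candidates with set variables" is interpreted as patterns sharing at least one variable with those already ordered, and filter push-down and groups are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a greedy, cardinality-based join ordering. */
final class JoinOrderSketch {

    record TriplePattern(String label, Set<String> variables, long cardinality) {}

    static List<TriplePattern> order(List<TriplePattern> patterns) {
        List<TriplePattern> remaining = new ArrayList<>(patterns);
        List<TriplePattern> ordered = new ArrayList<>();
        Set<String> bound = new HashSet<>(); // variables set by patterns ordered so far
        while (!remaining.isEmpty()) {
            // Prefer patterns connected to already set variables.
            List<TriplePattern> candidates = remaining.stream()
                    .filter(p -> !Collections.disjoint(p.variables(), bound))
                    .toList();
            if (candidates.isEmpty()) {
                candidates = remaining; // none connected: consider all of them
            }
            TriplePattern best = Collections.min(candidates,
                    Comparator.comparingLong(TriplePattern::cardinality));
            ordered.add(best);
            remaining.remove(best);
            bound.addAll(best.variables());
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<TriplePattern> plan = order(List.of(
                new TriplePattern("?s p1 ?o", Set.of("s", "o"), 100),
                new TriplePattern("?o p2 ?x", Set.of("o", "x"), 5),
                new TriplePattern("?s p3 ?y", Set.of("s", "y"), 50)));
        plan.forEach(p -> System.out.println(p.label()));
    }
}
```

Restricting candidates to connected patterns avoids Cartesian products whenever the graph pattern allows it, even if a disconnected pattern has a lower cardinality.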

A range slider 🎚️ for timeout or limit. When active, it enables preemptive queries by sending the values to the remote Sage x Jena server.
A checkbox ✔️ to keep running the query automatically as responses arrive, as long as the query is not over. It should be possible to stop such automatic running at any time, either by unchecking the box or by clicking the run button again ⏸️.
A state field 📝 allows users to copy/paste Sage metadata in order to resume their query execution. The field is automatically filled with the response's state if the query did not change between its sending and the receipt of its results. If it changed, the result should still be available but with a warning (⚠️ the query seems to have changed, the state might not be relevant anymore).

Add `BSBM` as benchmarking target

BSBM templates for queries.

However, they contain DISTINCT and ORDER BY, which cannot be made preemptive without a client-side process.
