
sage-jena's People

Contributors

chat-wane, juliendavat, momo54

Forkers

lorenzbuehmann

sage-jena's Issues

Implement `progress()`

It takes as arguments the logical plan and the cardinality of each iterator.

Assume two iterators (?s <p> ?o . ?s <p'> ?o'): the first has processed 3 of its 10 elements, and the second has processed 5 of its 100 elements. The progress is then 3/10 + 1/10 * 5/100 = 0.305.

Of course, this is a best-effort strategy since (i) cardinalities are not 100% accurate, and (ii) nothing indicates the cardinalities of all 10 iterators implied by the first iterator. The 7 elements remaining in the first iterator could have 1M results, so the query could take a lot longer than first estimated.
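
A minimal sketch of the formula above, assuming a left-deep plan where iterators are listed outermost first. The class and method names (`ProgressSketch`, `progress`) are hypothetical and not part of sage-jena; the inputs are the per-iterator counters and estimated cardinalities described in this issue.

```java
final class ProgressSketch {
    /**
     * Estimates progress in [0, 1] for nested iterators, outermost first.
     * processed[i] is the number of elements iterator i has consumed,
     * cardinality[i] is its estimated cardinality (an estimate, not a guarantee).
     */
    static double progress(long[] processed, long[] cardinality) {
        if (processed.length != cardinality.length) {
            throw new IllegalArgumentException("one cardinality per iterator is required");
        }
        double progress = 0.0;
        double weight = 1.0; // share of the total work covered by one element range of iterator i
        for (int i = 0; i < processed.length; i++) {
            if (cardinality[i] <= 0) {
                break; // empty or unknown cardinality: deeper iterators contribute nothing
            }
            progress += weight * processed[i] / cardinality[i];
            weight /= cardinality[i];
        }
        // Estimates can be wrong, so clamp instead of reporting more than 100%.
        return Math.min(progress, 1.0);
    }
}
```

With the example from this issue, `ProgressSketch.progress(new long[]{3, 5}, new long[]{10, 100})` evaluates 3/10 + 1/10 * 5/100.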

Logical time testing

One major issue about preemptive queries lies in the fact that they can stop at any time, and more specifically in the middle of the iterator model.

It would be great to have an easy means to change the logical time in the middle of the iterator model, and thereby enable testing on timeout.
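
One possible way to do this, sketched under assumptions: the iterators read time from an injected clock rather than from the system clock, so a test can move time forward between two calls. `ManualClock` and `DeadlineIterator` are hypothetical names used for illustration only; they are not existing sage-jena or Jena classes.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.LongSupplier;

/** A logical clock that only moves when the test says so. */
final class ManualClock implements LongSupplier {
    private long now = 0;

    @Override
    public long getAsLong() {
        return now;
    }

    void advance(long ticks) {
        now += ticks;
    }
}

/** Wraps an iterator and stops producing elements once the deadline is reached. */
final class DeadlineIterator<T> implements Iterator<T> {
    private final Iterator<T> inner;
    private final LongSupplier clock;
    private final long deadline;

    DeadlineIterator(Iterator<T> inner, LongSupplier clock, long deadline) {
        this.inner = inner;
        this.clock = clock;
        this.deadline = deadline;
    }

    @Override
    public boolean hasNext() {
        // The timeout is checked against the injected clock, never the wall clock.
        return clock.getAsLong() < deadline && inner.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException("exhausted or preempted");
        }
        return inner.next();
    }
}
```

In production code the same wrapper could receive `System::currentTimeMillis` as its `LongSupplier`, while tests pass a `ManualClock` and call `advance` at the exact point where preemption should occur.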

Benchmarking `Sage` vs `TDB` on `Watdiv`

  • run all configurations easily and save the results
  • commit the .csv files of every experiment along with a description of the machine it ran on
  • write a data visualization program, maybe a Jupyter notebook?

Performance issue `Sage` vs `TDB` on volcano.

Running the benchmark for the first time on a simple query (query_10084) with watdiv.10M highlights performance issues with the current Sage implementation. Executing a query with Sage takes roughly 2x longer…

SELECT ?v5 ?v3 ?v0 ?v1 ?v2 WHERE { 
    ?v0 <http://db.uwaterloo.ca/~galuc/wsdbm/gender> <http://db.uwaterloo.ca/~galuc/wsdbm/Gender0>.
    ?v0 <http://xmlns.com/foaf/familyName> ?v1.
    ?v0 <http://xmlns.com/foaf/givenName> ?v2.
    ?v0 <http://schema.org/email> ?v3. 
    ?v0 <http://db.uwaterloo.ca/~galuc/wsdbm/userId> ?v5. 
} 

With TDB2:

# Warmup Iteration   1: 0,282 s/op
# Warmup Iteration   2: 0,131 s/op
# Warmup Iteration   3: 0,102 s/op
Iteration   1: 0,094 s/op

With Sage:

# Warmup Iteration   1: 0,766 s/op
# Warmup Iteration   2: 0,198 s/op
# Warmup Iteration   3: 0,156 s/op
Iteration   1: 0,156 s/op

Preempt `query_10078` fails on timeout 1ms

The first run (slower) does not report the same number of results as the second (faster). The true number of results is 3932428.

There is a mistake in preemptive queries that needs to be investigated. Possibly an issue with a fully bound triple pattern as the last BGP of the query.

SELECT ?v7 ?v1 ?v0 ?v4 ?v2 ?v6 ?v3 ?v8 WHERE {
	?v1 <http://schema.org/priceValidUntil> ?v8.
	?v1 <http://purl.org/goodrelations/validFrom> ?v2.
	?v1 <http://purl.org/goodrelations/validThrough> ?v3.
	?v1 <http://schema.org/eligibleQuantity> ?v6.
	?v0 <http://purl.org/goodrelations/offers> ?v1.
	?v1 <http://schema.org/eligibleRegion> ?v7.
	?v4 <http://schema.org/nationality> ?v7.
	?v4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://db.uwaterloo.ca/~galuc/wsdbm/Role1>.
}
[fr.gdd.sage.WatdivBenchmark.execute-jmh-worker-1] DEBUG fr.gdd.sage.WatdivBenchmark - Got 3932428 results for this query in 7526 pause/resume.
17,227 s/op
# Warmup Iteration   2: <failure>

java.lang.Exception: /!\ not the same number of results on sage-jena-benchmarks/queries/watdiv_with_sage_plan/query_10078.sparql: 3932428 vs 3932328.
	at fr.gdd.sage.WatdivBenchmark.execute(WatdivBenchmark.java:75)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_ss_jmhStub(WatdivBenchmark_execute_jmhTest.java:433)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_SingleShotTime(WatdivBenchmark_execute_jmhTest.java:385)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:475)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:458)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Preemptable ID iterator for Jena

Maybe keep an iterator on Record internally and use the mapper ourselves at the end.

With the Record we can get a path, hence maybe continue where we left the computation.
There is also an idx on the bytebuffer to take into account.

Cleaner way to save preempted states with `union` and `bgp`

For now, it works because:

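// Never overwrite a state that was already saved under this identifier.
if (this.output.getState().containsKey(id)) {
    return false;
}
// Save only when no state with an identifier below 1000 has been saved yet.
boolean shouldSaveCurrent = Objects.isNull(this.output.getState()) ||
        this.output.getState().keySet().stream().filter(k -> k < 1000)
                .collect(Collectors.toUnmodifiableList()).isEmpty();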

which essentially says that an iterator cannot erase already saved states (the preemptive union finishes enumerating the members of unions, so we forbid erasing). This also means that PatternMatchSage should be reworked to provide more consistent identifiers.

The condition also states that identifiers of 1000 and above are ignored because they belong to unions (see VolcanoIteratorFactory). Unions call hasNextBinding before iterators call hasNext, i.e., they save first.
Again, this means that identifiers should be reworked.

Suggestion: create a graph that reflects the execution and its preemptive iterators, be they unions, scans, or anything else.

Generate DOI with Zenodo or alternative

So that this repository can be easily and consistently referred to using a DOI:

  • The repository must have a licence.
  • The repository must be public first though.
  • Add the DOI to the main README.md

Backjump feature

Note: This is not directly related to Sage. Only keeping this for the record.

Let us consider the following query:

SELECT * WHERE {
  ?s p1 ?o .
  ?o p2 ?x .
  ?s p3 ?y }

The join order is that of the query. Sometimes, ?s p3 ?y may fail, yet the engine enumerates all bindings of ?o p2 ?x. It would be better to backjump [1] directly to the first triple pattern since the variable ?s is a probable cause of failure.

In terms of implementation, this means:

  • Throwing an exception in has_next() that carries the incriminated variable, and letting the iterator that actually sets this variable catch it, so it can call next() to move to its next binding.
  • For the sake of genericity, we must wrap each scan iterator in a backjump iterator that can throw, and make sure that every iterator properly forwards the exception.

This demonstrated remarkable performance improvements in compiled versions of SPARQL queries. Does this hold for the iterator/volcano model?
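
As an illustration only, the exception-based mechanism could be sketched as follows on the three-pattern query above, using plain nested loops over in-memory triples rather than Jena iterators. Everything here is an assumption for the sake of the example: `BackjumpException`, `BackjumpSketch`, the `{s, p, o}` string-array triples, and the `lookups` counter (which counts scans of the third pattern to make the saving visible).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical exception naming the variable whose binding caused the failure. */
final class BackjumpException extends RuntimeException {
    final String variable;

    BackjumpException(String variable) {
        // Used for control flow only, so suppression and stack trace are disabled.
        super("backjump to ?" + variable, null, false, false);
        this.variable = variable;
    }
}

final class BackjumpSketch {
    final List<String> results = new ArrayList<>();
    int lookups = 0; // number of scans performed for ?s p3 ?y

    /** Triples are {s, p, o}; a null subject means "unbound". */
    static List<String[]> scan(List<String[]> data, String subject, String predicate) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : data) {
            if (t[1].equals(predicate) && (subject == null || t[0].equals(subject))) {
                out.add(t);
            }
        }
        return out;
    }

    /** Evaluates ?s p1 ?o . ?o p2 ?x . ?s p3 ?y in the order of the query. */
    void evaluate(List<String[]> data, boolean backjump) {
        for (String[] t1 : scan(data, null, "p1")) {              // sets ?s and ?o
            try {
                for (String[] t2 : scan(data, t1[2], "p2")) {     // sets ?x
                    lookups++;
                    List<String[]> t3s = scan(data, t1[0], "p3"); // sets ?y
                    if (t3s.isEmpty() && backjump) {
                        throw new BackjumpException("s");         // ?s is the culprit
                    }
                    for (String[] t3 : t3s) {
                        results.add(t1[0] + " " + t1[2] + " " + t2[2] + " " + t3[2]);
                    }
                }
            } catch (BackjumpException e) {
                if (!e.variable.equals("s")) {
                    throw e; // this loop does not set the variable: forward the exception
                }
                // This loop sets ?s: skip the remaining ?x bindings and try the next ?s.
            }
        }
    }
}
```

Both modes return the same results; with backjumping enabled, the remaining bindings of ?o p2 ?x are skipped as soon as ?s p3 ?y fails for the current ?s.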


[1] R. J. Bayardo Jr., and D. P. Miranker. Processing Queries for First-Few Answers. In Proceedings of the fifth international conference on Information and knowledge management (1996).

`PreemptJenaIterators` without `tMap` fail

Comes from watdiv query in force order mode; sage timeout 1s:

SELECT ?v6 ?v8 ?v0 ?v7 ?v3 ?v4 ?v1 ?v2 WHERE {
	?v1 <http://schema.org/priceValidUntil> ?v8.
	?v1 <http://purl.org/goodrelations/validFrom> ?v2.
	?v1 <http://purl.org/goodrelations/validThrough> ?v3.
	?v1 <http://schema.org/eligibleQuantity> ?v6.
	?v0 <http://purl.org/goodrelations/offers> ?v1.
	?v1 <http://schema.org/eligibleRegion> ?v7.
	?v4 <http://schema.org/nationality> ?v7.
	?v4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://db.uwaterloo.ca/~galuc/wsdbm/Role0>.
}

Possibly: The issue comes from the creation of a Singleton or a Null iterator which only wraps the iterator into a preemptable one. The mapper is therefore null and fails when requesting .current() or .previous().

java.lang.NullPointerException: Cannot invoke "org.apache.jena.atlas.lib.tuple.TupleMap.getSlotIdx(int)" because "tMap" is null
	at org.apache.jena.tdb2.lib.TupleLib.record(TupleLib.java:141)
	at org.apache.jena.dboe.trans.bplustree.PreemptJenaIterator.current(PreemptJenaIterator.java:126)
	at org.apache.jena.dboe.trans.bplustree.PreemptJenaIterator.current(PreemptJenaIterator.java:26)
	at fr.gdd.sage.arq.VolcanoIteratorTupleId.hasNext(VolcanoIteratorTupleId.java:46)
	at org.apache.jena.atlas.iterator.Iter$IterMap.hasNext(Iter.java:412)
	at org.apache.jena.atlas.iterator.Iter.hasNext(Iter.java:1104)
	at org.apache.jena.atlas.iterator.Iter$IterFiltered.hasNext(Iter.java:272)
	at org.apache.jena.atlas.iterator.Iter.hasNext(Iter.java:1104)
	at org.apache.jena.atlas.iterator.IteratorFlatMap.hasNext(IteratorFlatMap.java:46)
	at org.apache.jena.sparql.engine.iterator.IterAbortable.hasNext(IterAbortable.java:59)
	at org.apache.jena.atlas.iterator.Iter$IterMap.hasNext(Iter.java:412)
	at org.apache.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:59)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:58)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.PreemptCounterIter.hasNextBinding(PreemptCounterIter.java:33)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:38)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:38)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:116)
	at org.apache.jena.sparql.exec.RowSetStream.hasNext(RowSetStream.java:47)
	at org.apache.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:81)
	at fr.gdd.sage.ExecuteUtils.executeQueryTillTheEnd(ExecuteUtils.java:66)
	at fr.gdd.sage.ExecuteUtils.executeTillTheEnd(ExecuteUtils.java:84)
	at fr.gdd.sage.SetupBenchmark.execute(SetupBenchmark.java:124)
	at fr.gdd.sage.WatdivBenchmark.execute(WatdivBenchmark.java:65)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_ss_jmhStub(WatdivBenchmark_execute_jmhTest.java:433)
	at fr.gdd.sage.jmh_generated.WatdivBenchmark_execute_jmhTest.execute_SingleShotTime(WatdivBenchmark_execute_jmhTest.java:385)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:475)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:458)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Every change to Jena in one place

For now, changes are spread between sage-query-generator and this project.

Move everything into this repository to make one fully self-contained repository.

Implement a simple join ordering optimizer

Now that we have cardinalities, we can work on the join ordering of graph patterns.

A simple heuristic: the pattern with the lowest cardinality goes first (topmost); candidates are chosen among the patterns that have an already set variable, or among all patterns if there are none.
Then filter push-down: as soon as all variables in the filter are set, the filter comes into play.
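
The ordering heuristic could look like the following greedy sketch. It is only an illustration under assumptions: `JoinOrderSketch` and `Pattern` are hypothetical names, a pattern is reduced to its variables and its estimated cardinality, and filter push-down is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class JoinOrderSketch {
    /** A triple or quad pattern reduced to what the heuristic needs. */
    static final class Pattern {
        final String name;
        final Set<String> vars;
        final long cardinality;

        Pattern(String name, Set<String> vars, long cardinality) {
            this.name = name;
            this.vars = vars;
            this.cardinality = cardinality;
        }
    }

    /** Greedy ordering: lowest cardinality first, preferring patterns with a set variable. */
    static List<Pattern> order(List<Pattern> patterns) {
        List<Pattern> remaining = new ArrayList<>(patterns);
        List<Pattern> ordered = new ArrayList<>();
        Set<String> bound = new HashSet<>(); // variables set by the patterns chosen so far
        while (!remaining.isEmpty()) {
            List<Pattern> candidates = new ArrayList<>();
            for (Pattern p : remaining) {
                if (!Collections.disjoint(p.vars, bound)) {
                    candidates.add(p);
                }
            }
            if (candidates.isEmpty()) {
                candidates = remaining; // no pattern has a set variable: consider all of them
            }
            Pattern best = candidates.get(0);
            for (Pattern p : candidates) {
                if (p.cardinality < best.cardinality) {
                    best = p;
                }
            }
            remaining.remove(best);
            ordered.add(best);
            bound.addAll(best.vars);
        }
        return ordered;
    }
}
```

Restricting candidates to patterns that share a set variable avoids Cartesian products whenever a connected pattern is available, at the cost of sometimes skipping a pattern with a lower cardinality.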

Should groups be removed? (Unclear; maybe an option.)
GRAPH clauses inherently create groups; these should be removed so we can reorder quads.

Jena UI fixed for preemptive queries

We don't want to rewrite the whole UI. Jena already does a good job, so we only want to add a few functionalities to their UI.

Somewhere close to the run button ⏯️, there should be a gear ⚙️ that enables configuring the query execution.

  • A range slider 🎚️ for timeout or limit. When active, it enables preemptive queries by sending the values to the remote Sage x Jena server.
  • A checkbox ✔️ to rerun the query automatically as responses arrive, as long as the query is not over. It should be possible to stop such automatic running at any time, either by unchecking the box or by clicking the run button again ⏸️.
  • A state field 📝 that allows users to copy/paste Sage metadata in order to resume their query execution. The field is automatically filled with the response's state if the query did not change between being sent and its results being received. If it changed, the result should still be available but with a warning (⚠️ the query seems to have changed, the state might not be relevant anymore).
