Query processor with proven optimizations, ready to use for your JSON store to query semi-structured data with JSONiq. Can also be used as an ad-hoc in-memory query processor.

Home Page: http://brackit.io

License: Other

Java 99.35% XQuery 0.59% XSLT 0.06%
xquery json xdm java json-data statement-syntax clauses xpath hacktoberfest jsoniq

brackit's Introduction

Brackit - a retargetable JSONiq query engine

Brackit is a flexible JSONiq query processor developed during Dr. Sebastian Bächle's time as a PhD student at the TU Kaiserslautern, in the context of his research in the field of query processing for semi-structured data. The system features a fast runtime and a flexible compiler backend, which is, e.g., able to rewrite queries for optimized join processing and efficient aggregation operations. It can be used either as an in-memory ad-hoc query engine or as the query engine of a data store. The data store itself can add sophisticated optimizations at different stages of the query processor. Thus, Brackit already bundles common optimizations, and a data store can add further ones, for instance for index matching.

Lately, Johannes Lichtenberger has added many optional temporal enhancements for temporal data stores such as SirixDB. Furthermore, JSON is now a first-class citizen. Brackit supports a slightly different syntax from JSONiq but the same data model, as well as all update primitives described in the JSONiq specification. Brackit also supports Python-like array slices, and anonymous functions and closures were added recently.

Main features

  • Retargetable, thus sharing optimizations, which are common for different data stores (physical optimizations and index rewrite rules can simply be added in further stages).
  • JSONiq, a language which especially targets querying JSON, supporting user defined functions, easy tree traversals, FLWOR expressions to iterate, filter, sort and project item sequences.
  • Set-oriented processing, meaning pipelined execution of FLWOR clauses through operators, which operate on arrays of tuples and thus support known optimizations from relational database querying for implicit joins and aggregates.
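
To make the last point more concrete, here is a minimal Java sketch of a hash join over tuples, the kind of relational-style optimization that set-oriented processing makes possible for an implicit join between two for-clauses. It is plain Java and not Brackit's operator code; the names `HashJoinSketch`, `Customer`, `Order` and `hashJoin` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashJoinSketch {
    // Hypothetical tuple types standing in for bound FLWOR variables.
    record Customer(int id, String name) {}
    record Order(int customerId, String item) {}

    // Build a hash table on one input, then probe it once per tuple of the other
    // input, instead of comparing every order with every customer.
    static List<String> hashJoin(List<Customer> customers, List<Order> orders) {
        Map<Integer, List<Customer>> table = new HashMap<>();
        for (Customer c : customers) {
            table.computeIfAbsent(c.id(), k -> new ArrayList<>()).add(c);
        }
        List<String> result = new ArrayList<>();
        for (Order o : orders) {
            for (Customer c : table.getOrDefault(o.customerId(), List.of())) {
                result.add(c.name() + ":" + o.item());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(new Customer(1, "Ann"), new Customer(2, "Bob"));
        List<Order> orders = List.of(new Order(2, "pen"), new Order(1, "ink"), new Order(3, "cap"));
        System.out.println(hashJoin(customers, orders)); // prints [Bob:pen, Ann:ink]
    }
}
```

The order without a matching customer (id 3) simply produces no output tuple, as in an inner join.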

We're currently working on a Jupyter Notebook / Tutorial.

Here's a more detailed document about the vision and overall mission of Brackit.

Syntax differences in relation to JSONiq

  • array indexes start at position 0
  • object projections via a special syntax ($object{field1,field2,field3} instead of a function)
  • Python-like array slices

Community

We have a Discord server, where we'd welcome everyone who's interested in the project.

Publications

As the project started at a university (TU Kaiserslautern, under the supervision of Dr. Dr. Theo Härder), we'd be happy if it were used as a research project again, as there's a wide field of topics for future research and improvements.

Getting started

If you simply want to use Brackit as a standalone query processor, use the JAR provided with the release.

Otherwise, for contributing:

Download ZIP or Git Clone

git clone https://github.com/sirixdb/brackit.git

Alternatively, use the following dependencies in your Maven or Gradle project if you want to run queries from your Java or Kotlin projects, or if you want to implement some interfaces and add custom rewrite rules to be able to query your data store.

Brackit uses Java 17, thus you need an up-to-date Gradle (if you want to work on Brackit) and an IDE (for instance IntelliJ or Eclipse).

Maven / Gradle

At this stage of development, you should use the latest SNAPSHOT artifacts from the OSS snapshot repository to get the most recent changes. You should use the most recent Maven/Gradle versions, as we'll update to the newest Java versions.

Just add the following repository section to your POM or build.gradle file.

Maven (pom.xml):

<repository>
  <id>sonatype-nexus-snapshots</id>
  <name>Sonatype Nexus Snapshots</name>
  <url>https://oss.sonatype.org/content/repositories/snapshots</url>
  <releases>
    <enabled>false</enabled>
  </releases>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>

Gradle (build.gradle):

repositories {
    maven {
        url "https://oss.sonatype.org/content/repositories/snapshots/"
        mavenContent {
            snapshotsOnly()
        }
    }
}

Then add the dependency.

Maven (pom.xml):

<dependency>
  <groupId>io.sirix</groupId>
  <artifactId>brackit</artifactId>
  <version>0.5-SNAPSHOT</version>
</dependency>

Gradle (build.gradle):

implementation group: 'io.sirix', name: 'brackit', version: '0.5-SNAPSHOT'

What's Brackit?

Brackit is a query engine that different storage/database backends can use, while common optimizations, such as set-oriented processing and hash joins of FLWOR clauses, are shared. Furthermore, in-memory stores for processing both XML and JSON are supported. Thus, Brackit can be used as an in-memory query processor for ad-hoc analysis.

Brackit implements JSONiq to query JSON, supporting all the update statements of JSONiq. Furthermore, array index slices, as in Python, are supported. Another extension allows you to use a special statement syntax for writing query programs in a script-like style.

Jupyter Notebook / Tutorial

We're currently working on a tutorial, where you can execute interactive queries on Brackit's in-memory store.

Installation

Compiling from source

To build and package change into the root directory of the project and run Maven:

mvn package

To skip running the unit tests, execute instead:

mvn -DskipTests package

That's all. You'll find the ready-to-use JAR file(s) in the subdirectory ./target.

Dependency

If you want to use Brackit in your other Maven- or Gradle-based projects, please see the "Maven / Gradle" section.

First Steps

Running from the command line

Brackit ships with a rudimentary command line interface to run ad-hoc queries. Invoke it with

java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar

Where x.y.z is the version number of brackit.

Simple queries

The simplest way to run a query is by passing it via STDIN:

echo "1+1" | java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar

=> 2

If the query is stored in a separate file, let's say test.xq, type:

java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar -qf test.xq

or use the file redirection of your shell:

java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar < test.xq

You can also use an interactive shell and enter a bunch of queries terminated with an "END" on the last line:

java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar -iq

Querying documents

Querying documents is as simple as running any other query.

For XML, the default "storage" module resolves any documents referenced via the XQuery functions fn:doc() and fn:collection() at query runtime.

To query a document in your local filesystem simply use the path to this document in the fn:doc() function:

java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar -q "doc('products.xml')//product[@prodno = '4711']"

For JSON there's the function json-doc(). Let's assume we have the following simple JSON structure:

{
  "products": [
    { "productno": 4711, "product": "Product number 4711" },
    { "productno": 5982, "product": "Product number 5982" }
  ]
}

We can query this by first dereferencing the "products" object field with ., then unboxing the array value via [], and then adding a filter in which $$ denotes the current context item. Finally, {fieldName} projects the resulting object into a new object, which is returned.

java -jar brackit.jar -q "json-doc('products.json').products[][$$.productno eq 4711]{product}"

Query result
{"product":"Product number 4711"}
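
The four steps of this query (deref, unbox, filter, project) can be mirrored on plain Java collections. The following is only an illustrative sketch of what the expression computes, not of how Brackit evaluates it; the class `ProductQuerySketch` and its `query` method are hypothetical, and JSON objects and arrays are modelled as `Map` and `List`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ProductQuerySketch {
    // Mirrors .products[][$$.productno eq 4711]{product} step by step.
    static List<Map<String, Object>> query(Map<String, Object> doc) {
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> products =
            (List<Map<String, Object>>) doc.get("products");                 // deref: .products
        return products.stream()                                             // unbox: []
            .filter(p -> Integer.valueOf(4711).equals(p.get("productno")))   // filter on $$
            .map(p -> {                                                      // projection: {product}
                Map<String, Object> out = new LinkedHashMap<>();
                out.put("product", p.get("product"));
                return out;
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Object> p1 = Map.of("productno", 4711, "product", "Product number 4711");
        Map<String, Object> p2 = Map.of("productno", 5982, "product", "Product number 5982");
        Map<String, Object> doc = Map.of("products", List.of(p1, p2));
        System.out.println(query(doc)); // prints [{product=Product number 4711}]
    }
}
```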

Of course, you can also directly query documents via http(s), or ftp. For example:

java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar -q "count(doc('http://example.org/foo.xml')//bar)"

or

java -jar brackit-x.y.z-SNAPSHOT-with-dependencies.jar -q "count(jn:doc('http://example.org/foo.json').bar[])"

Coding with Brackit

Running a query embedded in a Java program requires only a few lines of code:

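String queryString = """
    for $i in (1 to 4)
    let $d := {$i}
    return $d
    """;

// initialize a query context
// (if QueryContext is abstract in your Brackit version, instantiate a concrete implementation instead)
QueryContext ctx = new QueryContext();

// compile the query (the query text needs its own variable name, distinct from the Query object)
Query query = new Query(queryString);

// enable formatted output
query.setPrettyPrint(true);

// run the query and write the result to System.out
query.serialize(ctx, System.out);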

JSON

You can easily mix arbitrary XML and JSON data in a single query or simply use brackit to convert data from one format into the other. This allows you to get the most out of your data.

The language extension allows you to construct and operate JSON data directly; additional utility functions help you to perform typical tasks.

Everything is designed to simplify the joint processing of XML and JSON and to maximize the freedom of developers. It's up to you to decide what you want your data to look like!

Arrays

Arrays can be created using an extended version of the standard JSON array syntax:

(: statically create an array with 3 elements of different types: 1, 2.0, "3" :)
[ 1, 2.0, "3" ]

(: by default, Brackit parses the tokens 'true' and 'false' to the XDM boolean values and 'null' to the new type js:null :)
[ true, false, null ]

(: this is different from :)
[ (./true), (./false), (./null) ]
(: where each field is initialized as the result of a path expression
   starting from the current context item, e.g., './true'.
:)

(: dynamically create an array by evaluating some expressions: :)
[ 1+1, substring("banana", 3, 5), () ] (: yields the array [ 2, "nana", () ] :)

(: arrays can be nested and fields can be arbitrary sequences :)
[ (1 to 5) ] (: yields an array of length 1: [(1,2,3,4,5)] :)
[ some text ] (: yields an array of length 1 with an XML fragment as field value :)
[ 'x', [ 'y' ], 'z' ] (: yields an array of length 3: [ 'x' , ['y'], 'z' ] :)

(: a preceding '=' distributes the items of a sequence to individual array positions :)
[ =(1 to 5) ] (: yields an array of length 5: [ 1, 2, 3, 4, 5 ] :)

(: array fields can be accessed by the '[ ]' postfix operator: :)
let $a := [ "Jim", "John", "Joe" ] return $a[1] (: yields the string "John" :)

(: the function bit:len() returns the length of an array :)
bit:len([ 1, 2 ]) (: yields 2 :)

(: array slices are supported, as in Python :)
let $a := ["Jim", "John", "Joe" ] return $a[0:2] (: yields ["Jim", "John"] :)

(: array slices with a step operator :)
let $a := ["Jim", "John", "Joe" ] return $a[0:2:-1] (: yields ["John", "Jim"] :)

let $a := [{"foo": 0}, "bar", {"baz":true}] return $a[::2] (: yields [{"foo":0},{"baz":true}] :)

(: array unboxing :)
let $a := ["Jim", "John", "Joe"] return $a[] (: yields the sequence "Jim" "John" "Joe" :)

(: the unboxing is made implicitly in for-loops :)
let $a := ["Jim", "John", "Joe"]
for $value in $a
return $value (: yields the same as above :)

(: negative array index :)
let $a := ["Jim", "John", "Joe"] return $a[-1] (: yields "Joe" :)
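
For readers who don't know Python's slicing rules, the following Java sketch models zero-based indexing, negative indexes and slices with a positive step. It follows Python's conventions and is not taken from Brackit's implementation; negative steps are deliberately not modelled, because the `$a[0:2:-1]` example above does not follow Python's result. The names `SliceSketch`, `at` and `slice` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SliceSketch {
    // A negative index counts from the end: -1 is the last item.
    static <T> T at(List<T> items, int index) {
        return items.get(normalize(index, items.size()));
    }

    // Slice [from:to:step] with an exclusive upper bound; null bounds mean "open".
    static <T> List<T> slice(List<T> items, Integer from, Integer to, int step) {
        if (step <= 0) {
            throw new IllegalArgumentException("only positive steps are modelled in this sketch");
        }
        int n = items.size();
        int start = clamp(from == null ? 0 : normalize(from, n), n);
        int end = clamp(to == null ? n : normalize(to, n), n);
        List<T> out = new ArrayList<>();
        for (int i = start; i < end; i += step) {
            out.add(items.get(i));
        }
        return out;
    }

    private static int normalize(int i, int n) { return i < 0 ? i + n : i; }

    private static int clamp(int i, int n) { return Math.max(0, Math.min(i, n)); }

    public static void main(String[] args) {
        List<String> a = List.of("Jim", "John", "Joe");
        System.out.println(slice(a, 0, 2, 1));       // prints [Jim, John]
        System.out.println(slice(a, null, null, 2)); // prints [Jim, Joe]
        System.out.println(at(a, -1));               // prints Joe
    }
}
```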

Objects

Objects provide an alternative to XML for representing structured data. As with arrays, we support an extended version of the standard JSON object syntax:

(: statically create a record with three fields named 'a', 'b' and 'c' :)
{ "a": 1, "b" : 2, "c" : 3 }

(: 'null' is a new atomic type and jn:null() creates this type; true and false are translated into the XDM values xs:boolean('true') and xs:boolean('false').
:)
{ "a": true(), "b" : false(), "c" : jn:null()}

or simply

{ "a": true, "b": false, "c": null}

(: field values may be arbitrary expressions:)
{ "a" : concat('f', 'oo') , "b" : 1+1, "c" : [1,2,3] } (: yields {"a":"foo","b":2,"c":[1,2,3]} :)

(: field values are defined by key-value pairs or by an expression
   that evaluates to an object
:)
let $r := { "x":1, "y":2 } return { $r, "z":3} (: yields {"x":1,"y":2,"z":3} :)

(: fields may be selectively projected into a new object :)
{"x": 1, "y": 2, "z": 3}{z,y} (: yields {"z":3,"y":2} :)

(: values of object fields can be accessed using the deref operator '.' :)
{ "a": "hello", "b": "world" }.b (: yields the string "world" :)

(: the deref operator can be used to navigate into deeply nested object structures :)
let $n := <x><y>yval</y></x> let $r := {"e" : {"m":'mvalue', "n":$n}} return $r.e.n/y (: yields the XML fragment <y>yval</y> :)

(: the deref operator can be used to navigate into deeply nested object structures in combination with the array unboxing operator, for instance :)
(: note that here the expression "[]" unboxes the array, and a sequence of items is evaluated for the next deref operator :)
(: the deref operator thus gets either a sequence or an object as the left operand :)
let $r := {"e": {"m": [{"n":"o"}, true, null, {"n": "bar"}] }, "n":"m"} return $r.e.m[].n (: yields "o" "bar" :)

(: to only retrieve the first item/value in the array you can use an index :)
let $r := {"e": {"m": [{"n":"o"}, true, null, {"n": "bar"}] }, "n":"m"} return $r.e.m[0].n (: yields "o" :)

(: the function bit:fields() returns the field names of an object :)
let $r := {"x": 1, "y": 2, "z": 3} return bit:fields($r) (: yields the xs:QName array [x,y,z ] :)

(: the function bit:values() returns the field values of an object :)
let $r := {"x": 1, "y": 2, "z": (3, 4) } return bit:values($r) (: yields the array [1,2,(3,4)] :)
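
The `$r.e.m[].n` example above applies a deref to a whole sequence and only produces values for the items that are objects. A minimal Java sketch of that behaviour, assuming non-object items (and objects without the field) simply contribute nothing, as the example output suggests; the names `DerefSketch` and `deref` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class DerefSketch {
    // Looks up the field in every item of the sequence; items that are not
    // objects, or that lack the field, are skipped.
    static List<Object> deref(List<?> sequence, String field) {
        List<Object> out = new ArrayList<>();
        for (Object item : sequence) {
            if (item instanceof Map<?, ?> object && object.containsKey(field)) {
                out.add(object.get(field));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The unboxed array from the example: [{"n":"o"}, true, null, {"n":"bar"}]
        List<Object> m = Arrays.asList(Map.of("n", "o"), true, null, Map.of("n", "bar"));
        System.out.println(deref(m, "n")); // prints [o, bar]
    }
}
```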

JSONiq update expressions

Brackit supports all update statements defined in the JSONiq specification. It makes sense to implement these in a data store backend, for instance in SirixDB.

(: rename a field in an object :)
let $object := {"foo": 0}
return rename json $object.foo as "bar"  (: renames the field foo of the object to bar :)

(: append values into an array :)
append json (1, 2, 3) into ["foo", true, false, null]  (: appends the sequence (1,2,3) into the array (["foo",true,false,null,[1,2,3]]) :)

(: insert at a specific position :)
insert json (1, 2, 3) into ["foo", true, false, null] at position 2  (: inserts the sequence (1,2,3) at index 2, i.e. as the third item of the array (["foo",true,[1,2,3],false,null]) :)

(: insert a json object and merge the field/values into an existing object :)
insert json {"foo": not(true), "baz": null} into {"bar": false}   (: inserts/appends the two field/value pairs into the object ({"bar":false,"foo":false,"baz":null}) :)

(: delete a field/value from an object :)
delete json {"foo": not(true), "baz": null}.foo    (: removes the field "foo" from the object :)

(: delete an array item at position 1 in the array :)
delete json ["foo", 0, 1][1]  (: removes the 0 (["foo",1]) :)

(: replace a JSON value of a field with another value :)
replace json value of {"foo": not(true), "baz": null}.foo with 1     (: thus, the object is adapted to {"foo":1,"baz":null} :)

(: replace an item in an array at index 2 (that is, the third item) :)
replace json value of ["foo", 0, 1][2] with "bar"   (: thus, the array is adapted to ["foo",0,"bar"] :)
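
To double-check the results given in the comments above, here is a small Java sketch that applies the same array updates to plain lists. It only reproduces the resulting values, not pending update lists or persistence in a data store; the class `UpdateSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UpdateSketch {
    // append json (1, 2, 3) into [...]: the sequence becomes one nested array item.
    static List<Object> append(List<Object> array, List<Object> values) {
        List<Object> out = new ArrayList<>(array);
        out.add(new ArrayList<>(values));
        return out;
    }

    // insert json (...) into [...] at position p: zero-based position.
    static List<Object> insertAt(List<Object> array, int position, List<Object> values) {
        List<Object> out = new ArrayList<>(array);
        out.add(position, new ArrayList<>(values));
        return out;
    }

    // delete json [...][index]
    static List<Object> deleteAt(List<Object> array, int index) {
        List<Object> out = new ArrayList<>(array);
        out.remove(index); // removes by position, not by value
        return out;
    }

    // replace json value of [...][index] with value
    static List<Object> replaceAt(List<Object> array, int index, Object value) {
        List<Object> out = new ArrayList<>(array);
        out.set(index, value);
        return out;
    }

    public static void main(String[] args) {
        List<Object> base = Arrays.asList("foo", true, false, null);
        System.out.println(append(base, List.of(1, 2, 3)));      // prints [foo, true, false, null, [1, 2, 3]]
        System.out.println(insertAt(base, 2, List.of(1, 2, 3))); // prints [foo, true, [1, 2, 3], false, null]
        System.out.println(deleteAt(Arrays.asList("foo", 0, 1), 1));         // prints [foo, 1]
        System.out.println(replaceAt(Arrays.asList("foo", 0, 1), 2, "bar")); // prints [foo, 0, bar]
    }
}
```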

Parsing JSON

(: the utility function json:parse() can be used to parse JSON data dynamically
   from a given xs:string
:)
let $s := io:read('/data/sample.json') return json:parse($s)

Statement Syntax Extension (Beta)

IMPORTANT NOTE:

** This extension is only a syntax extension to simplify the programmer's life when writing JSONiq. It is neither a subset of nor an equivalent to the XQuery Scripting Extension 1.0. **

Almost any non-trivial data processing task consists of a series of consecutive steps. Unfortunately, the functional style of XQuery makes it a bit cumbersome to write code in a convenient, script-like fashion. Instead, the standard way to express a linear multi-step process (with access to intermediate results) is to write a FLWOR expression with a series of let-clauses.

As a shorthand, Brackit allows you to write such processes as a sequence of ';'-terminated statements, which most developers are familiar with:

(: declare external input :)
declare variable $file external;

(: read input data :)
$events := fn:collection('events');

(: join the two inputs :)
$incidents := for $e in $events
              where $e/@severity = 'critical'
              let $ip := $e/system/@ip
              group by $ip
              order by count($e)
              return {$ip} count($e) ;

(: store report to file :)
$report := {$incidents};
$output := bit:serialize($report);
io:write($file, $output);

(: return a short message as result :)
Generated '{count($incidents)}' incident entries to report '{$file}'

Internally, the compiler treats this as a FLWOR expression with let-bindings. The result, i.e., the return expression, is the result of the last statement. Accordingly, the previous example is equivalent to:

(: declare external input :)
declare variable $file external;

(: read input data :)
let $events := fn:collection('events')

(: join the two inputs :)
let $incidents := for $e in $events
                  where $e/@severity = 'critical'
                  let $ip := $e/system/@ip
                  group by $ip
                  order by count($e)
                  return {$ip} count($e)

(: store report to file :)
let $report := {$incidents}
let $output := bit:serialize($report)
let $written := io:write($file, $output)

(: return a short message as result :)
return Generated '{count($incidents)}' incident entries to report '{$file}'
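
The rewriting from statements to let-bindings can be sketched as a simple text transformation. The Java code below is only an illustration of the idea, not Brackit's compiler; the names `StatementDesugarSketch`, `Statement` and `toFlwor` are hypothetical, and generated variable names such as `$_0` play the role of `$written` in the example above.

```java
import java.util.List;

final class StatementDesugarSketch {
    // A statement is either a binding ($var := expr) or a bare expression (var == null).
    record Statement(String var, String expr) {}

    // Every statement but the last becomes a let-clause; the last one becomes
    // the return expression.
    static String toFlwor(List<Statement> statements) {
        if (statements.isEmpty()) {
            throw new IllegalArgumentException("at least one statement is required");
        }
        StringBuilder sb = new StringBuilder();
        int fresh = 0;
        for (int i = 0; i < statements.size() - 1; i++) {
            Statement s = statements.get(i);
            // Bare expressions still need a variable to bind their result to.
            String var = s.var() != null ? s.var() : "$_" + (fresh++);
            sb.append("let ").append(var).append(" := ").append(s.expr()).append('\n');
        }
        Statement last = statements.get(statements.size() - 1);
        sb.append("return ").append(last.expr());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFlwor(List.of(
            new Statement("$events", "fn:collection('events')"),
            new Statement(null, "io:write($file, $events)"),
            new Statement(null, "count($events)"))));
    }
}
```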

The statement syntax is especially helpful to improve readability of user-defined functions.

The following example shows an - admittedly rather slow - implementation of the quicksort algorithm:

declare function local:qsort($values) {
    $len := count($values);
    if ($len <= 1) then (
        $values
    ) else (
        $pivot := $values[$len idiv 2];
        $less := $values[. < $pivot];
        $greater := $values[. > $pivot];
        (local:qsort($less), $pivot, local:qsort($greater))
    )
};

local:qsort((7,8,4,5,6,9,3,2,0,1))

brackit's Issues

Simple CLI

Brackit currently offers a "one-shot" query through params or STDIN, but we should provide a simple CLI tool, which can execute several queries in a row.

Examples attempt to instantiate QueryContext

In Simple.java, for example this line is used:

    QueryContext ctx = new QueryContext();

However, when compiling the file with

javac -cp target/brackit-0.1.11-SNAPSHOT-jar-with-dependencies.jar src/examples/java/org/brackit/examples/Simple.java

we get the following error:

src/examples/java/org/brackit/examples/Simple.java:57: error: QueryContext is abstract; cannot be instantiated
    QueryContext ctx = new QueryContext();
                       ^

Implement the DerefDescendantExpr

The implementation of DerefDescendantExpr is currently a copy of an old version of DerefExpr, which drilled down into arrays and looked up the right operand for the objects in the array. The semantics for ->>foobar should instead be that it also drills down into objects and finds the values of all objects that have a foobar field, in a preorder traversal.
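
One possible reading of the intended semantics, as a Java sketch over plain `Map` and `List` values: visit objects and arrays recursively and record the value of every matching field before descending into it (preorder). This is an interpretation of the issue text, not the eventual implementation; the names `DescendantDerefSketch` and `descendants` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DescendantDerefSketch {
    // Collects the values of all fields with the given name, at any depth.
    static List<Object> descendants(Object node, String field) {
        List<Object> out = new ArrayList<>();
        collect(node, field, out);
        return out;
    }

    private static void collect(Object node, String field, List<Object> out) {
        if (node instanceof Map<?, ?> object) {
            for (Map.Entry<?, ?> e : object.entrySet()) {
                if (field.equals(e.getKey())) {
                    out.add(e.getValue());         // record the match first (preorder) ...
                }
                collect(e.getValue(), field, out); // ... then descend into the value
            }
        } else if (node instanceof List<?> array) {
            for (Object item : array) {
                collect(item, field, out);         // drill down into arrays as well
            }
        }
        // atomic values have no descendants
    }

    public static void main(String[] args) {
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("foobar", 1);
        root.put("nested", List.of(Map.of("foobar", 2)));
        System.out.println(descendants(root, "foobar")); // prints [1, 2]
    }
}
```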

Function overloading in user code not working correctly

When running the following code:

declare function local:dummy($test) {
    $test
};

declare function local:dummy() {
    local:dummy("test")
};

local:dummy("test")

I get the expected output (test).

However, when changing the last line to

local:dummy()

I get an error: Error: err:XPST0017: Unknown function: local:dummy().

Expected behavior is that brackit should call the second declaration of local:dummy().
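
In XQuery, a function is identified by its name together with its arity, so both declarations should be able to coexist. The following Java sketch shows a function table keyed that way; it is an illustration of the expected lookup, not Brackit's resolver, and the names `FunctionRegistrySketch`, `Signature`, `declare` and `call` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class FunctionRegistrySketch {
    // Name plus number of parameters identifies a function.
    record Signature(String name, int arity) {}

    private final Map<Signature, Function<List<Object>, Object>> functions = new HashMap<>();

    void declare(String name, int arity, Function<List<Object>, Object> body) {
        functions.put(new Signature(name, arity), body);
    }

    Object call(String name, List<Object> args) {
        Function<List<Object>, Object> f = functions.get(new Signature(name, args.size()));
        if (f == null) {
            throw new IllegalStateException("err:XPST0017: Unknown function: " + name + "#" + args.size());
        }
        return f.apply(args);
    }

    public static void main(String[] args) {
        FunctionRegistrySketch registry = new FunctionRegistrySketch();
        registry.declare("local:dummy", 1, a -> a.get(0));
        registry.declare("local:dummy", 0, a -> registry.call("local:dummy", List.of("test")));
        System.out.println(registry.call("local:dummy", List.of())); // prints test
    }
}
```

Keying only on the name would let the second declaration overwrite or hide the first, which is one way the reported error could arise; whether that is the actual cause in Brackit is not established here.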

Git hook to autoformat source code

We need a pre commit hook or something like that to auto-format code according to our IntelliJ Java Formatter or the Kotlin standard code formatting rules.

Integrate DIAMetrics principles into a benchmarking suite for brackit

Background on DIAMetrics

DIAMetrics is an end-to-end benchmarking and performance framework for query engines developed by Google.

Components

Note that there are more details than mentioned here. This is only an overview; if we need to add details about more parts, we can do that further down the line.

Workload Extractor:

According to the paper, this component extracts a "representative workload" from a live production workload. "DIAMetrics employs a workload extractor and summarizer, which is a feature-based way to ‘mine’ the query logs of a customer and extract a subset of queries that adequately represent the workload of the customer."
For our current purposes, I feel like the best way we can utilize a component like this is to pinpoint a set of heavy workloads that we can keep a list of and then just run those workloads for the time being. To this end, I am working on a PR that will hopefully bring more XQuery files for us to run against from this repository. I will update this issue with a PR number so that we can keep track of everything.

Data and Query Scrambler

This component aims to help protect sensitive data and create variations of the representative sets to prevent sensitive data leakage. The paper lists off a few ways that they achieve this, but for the time being, we can put less emphasis on this part since we will use this internally for the moment.

Workload Runner

According to the paper, this component "allows users to specify various combinations of workloads and systems to be benchmarked. For instance, we may want to run TPC-H on various query engines over various storage formats to see which storage format is the best option for which engine." The runner can either schedule runs of specific engines or spin up and manage (including cleanup and shutdown) entire engine instances for the runs.

Monitoring

There are two parts to this:

  1. Visualization Framework - which brings up dashboards
  2. Alerting Framework - which compares workload performance to historical data and alerts when there are concerns
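
As a starting point for the alerting part, a comparison against historical data could look like the Java sketch below. The rule used here (alert when the latest runtime exceeds the historical mean by a configurable factor) is an assumption made for illustration and is not taken from the DIAMetrics paper; the names `RegressionAlertSketch` and `shouldAlert` are hypothetical.

```java
import java.util.List;

final class RegressionAlertSketch {
    // Flags a run whose runtime exceeds the historical mean by more than the given factor.
    static boolean shouldAlert(List<Double> historyMillis, double latestMillis, double factor) {
        if (historyMillis.isEmpty()) {
            return false; // nothing to compare against yet
        }
        double mean = historyMillis.stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElseThrow();
        return latestMillis > mean * factor;
    }

    public static void main(String[] args) {
        List<Double> history = List.of(100.0, 110.0, 90.0);
        System.out.println(shouldAlert(history, 160.0, 1.5)); // prints true
        System.out.println(shouldAlert(history, 140.0, 1.5)); // prints false
    }
}
```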

TODO (more to come as we get further along)

  • Merge in more XQuery files from xquerl
  • Figure out workloads that do not perform well and add them to brackit
  • Extract representative workloads somehow

And expressions evaluated in reverse order

When querying the following JSON (using the sirix rest-api):

[{"key": "hey"}, {"key": 0}]

with the following query:

for $i in bit:array-values(.)
where $i=>key instance of xs:integer and $i=>key eq 0
return $i

I get the following error:

Cannot compare 'xs:string' with 'xs:integer'

However, the AndExpr should short-circuit after evaluating key to not be an instance of xs:integer, rather than throwing an error.

It should be noted that the following query:

for $i in bit:array-values(.)
where $i=>key eq 0 and $i=>key instance of xs:integer
return $i

returns:

{'key': 0}

so it appears that the order of the AndExpr is being reversed.
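
The behaviour this issue expects, evaluating the left operand first and skipping the right one when the left is false, can be sketched in Java with lazily evaluated operands. This is an illustration of the expected evaluation order, not Brackit's AndExpr; the names `AndExprSketch`, `and` and `isIntegerZero` are hypothetical.

```java
import java.util.function.Supplier;

final class AndExprSketch {
    // Evaluates the left operand first and touches the right one only if needed.
    static boolean and(Supplier<Boolean> left, Supplier<Boolean> right) {
        return left.get() && right.get();
    }

    // Mirrors: $i=>key instance of xs:integer and $i=>key eq 0
    static boolean isIntegerZero(Object key) {
        return and(
            () -> key instanceof Integer,
            () -> {
                // Comparing a non-integer with 0 fails, like the type error in the issue.
                if (!(key instanceof Integer i)) {
                    throw new IllegalStateException("Cannot compare non-integer with 'xs:integer'");
                }
                return i == 0;
            });
    }

    public static void main(String[] args) {
        System.out.println(isIntegerZero("hey")); // prints false, the comparison is never evaluated
        System.out.println(isIntegerZero(0));     // prints true
    }
}
```

If the operands were evaluated right to left, the call with "hey" would throw instead of returning false, which matches the error reported above.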

Add switch to enable/disable XQuery syntax parsing

We have to add a switch to enable/disable XQuery syntax parsing and to improve the syntax of JSON queries by default. For instance, I'd like to get rid of [[ ]] for array indexes and replace it with the common [ ] syntax. XPath predicates must therefore get a different syntax.

Furthermore, we might think about switching the object field deref expression from -> to . and switching the current context item from . to $$, as stated in the JSONiq specification. Literals such as true and false should then be parsed as boolean literals instead of as XPath axis steps. It would also be handy if object field names didn't have to be quoted in this case: {"foo":"bar"} and {foo:"bar"} should both be possible.
