mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.

Home Page: https://mozilla.github.io/mentat/

License: Apache License 2.0

mentat's Introduction

UNMAINTAINED Project Mentat

Project Mentat is no longer being developed or actively maintained by Mozilla. This repository will be marked read-only in the near future. You are, of course, welcome to fork the repository and use the existing code.

Project Mentat is a persistent, embedded knowledge base. It draws heavily on DataScript and Datomic.

Mentat is implemented in Rust.

The first version of Project Mentat, named Datomish, was written in ClojureScript, targeting both Node (on top of promise_sqlite) and Firefox (on top of Sqlite.jsm). It also worked in pure Clojure on the JVM on top of jdbc-sqlite. The name was changed to avoid confusion with Datomic.

The Rust implementation gives us a smaller compiled output, better performance, more type safety, better tooling, and easier deployment into Firefox and mobile platforms.

Documentation

https://mozilla.github.io/mentat/

Motivation

Mentat is intended to be a flexible relational (not key-value, not document-oriented) store that makes it easy to describe, grow, and reuse your domain schema.

By abstracting away the storage schema, and by exposing change listeners outside the database (not via triggers), we hope to make domain schemas stable, and allow both the data store itself and embedding applications to use better architectures, meeting performance goals in a way that allows future evolution.

Data storage is hard

We've observed that data storage is a particular area of difficulty for software development teams:

  • It's hard to define storage schemas well. A developer must:

    • Model their domain entities and relationships.
    • Encode that model efficiently and correctly using the features available in the database.
    • Plan for future extensions and performance tuning.

    In a SQL database, the same schema definition defines everything from high-level domain relationships through to numeric field sizes in the same smear of keywords. It's difficult for someone unfamiliar with the domain to determine from such a schema what's a domain fact and what's an implementation concession — are all part numbers always 16 characters long, or are we trying to save space? — or, indeed, whether a missing constraint is deliberate or a bug.

    The developer must think about foreign key constraints, compound uniqueness, and nullability. They must consider indexing, synchronizing, and stable identifiers. Most developers simply don't do enough work in SQL to get all of these things right. Storage thus becomes the specialty of a few individuals.

    Which one of these is correct?

    {:db/id          :person/email
     :db/valueType   :db.type/string
     :db/cardinality :db.cardinality/many     ; People can have multiple email addresses.
     :db/unique      :db.unique/identity      ; For our purposes, each email identifies one person.
     :db/index       true}                    ; We want fast lookups by email.
    {:db/id          :person/friend
     :db/valueType   :db.type/ref
     :db/cardinality :db.cardinality/many}    ; People can have many friends.
    CREATE TABLE people (
      id INTEGER PRIMARY KEY,  -- Bug: because of the primary key, each person can have no more than 1 email.
      email VARCHAR(64),       -- Bug?: no NOT NULL, so a person can have no email.
                               -- Bug: nobody will ever have a long email address, right?
    );
    CREATE TABLE friendships (
      FOREIGN KEY person REFERENCES people(id),  -- Bug?: no indexing, so lookups by friend or person will be slow.
      FOREIGN KEY friend REFERENCES people(id),  -- Bug: no compound uniqueness constraint, so we can have dupe friendships.
    );

    They both have limitations — the Mentat schema allows only for an open world (it's possible to declare friendships with people whose email isn't known), and requires validation code to enforce email string correctness — but we think that even such a tiny SQL example is harder to understand and obscures important domain decisions.

  • Queries are intimately tied to structural storage choices. That not only hides the declarative domain-level meaning of the query — it's hard to tell what a query is trying to do when it's a 100-line mess of subqueries and LEFT OUTER JOINs — but it also means a simple structural schema change requires auditing every query for correctness.

  • Developers often capture less event-shaped data than they perhaps should, simply because their initial requirements don't warrant it. It's quite common to later want to know when a fact was recorded, or in which order two facts were recorded (particularly for migrations), or on which device an event took place… or even that a fact was ever recorded and then deleted.

  • Common queries are hard. Storing values only once, upserts, complicated joins, and group-wise maxima are all difficult for non-expert developers to get right.

  • It's hard to evolve storage schemas. Writing a robust SQL schema migration is hard, particularly if a bad migration has ever escaped into the wild! Teams learn to fear and avoid schema changes, and eventually they ship a table called metadata, with three TEXT columns, so they never have to write a migration again. That decision pushes storage complexity into application code. (Or they start storing unversioned JSON blobs in the database…)

  • It's hard to share storage with another component, let alone share data with another component. Conway's Law applies: your software system will often grow to have one database per team.

  • It's hard to build efficient storage and querying architectures. Materialized views require knowledge of triggers, or the implementation of bottleneck APIs. Ad hoc caches are often wrong, are almost never formally designed (do you want a write-back, write-through, or write-around cache? Do you know the difference?), and often aren't reusable. The average developer, faced with a SQL database, has little choice but to build a simple table that tries to meet every need.

Comparison to DataScript

DataScript asks the question: "What if creating a database were as cheap as creating a Hashmap?"

Mentat is not interested in that. Instead, it's strongly interested in persistence and performance, with very little interest in immutable databases/databases as values or throwaway use.

One might say that Mentat's question is: "What if an SQLite database could store arbitrary relations, for arbitrary consumers, without them having to coordinate an up-front storage-level schema?"

(Note that domain-level schemas are very valuable.)

Another possible question would be: "What if we could bake some of the concepts of CQRS and event sourcing into a persistent relational store, such that the transaction log itself were of value to queries?"

Some thought has been given to how databases as values — long-term references to a snapshot of the store at an instant in time — could work in this model. It's not impossible; it simply has different performance characteristics.

Just like DataScript, Mentat speaks Datalog for querying and takes additions and retractions as input to a transaction.

Unlike DataScript, Mentat exposes free-text indexing, thanks to SQLite.

Comparison to Datomic

Datomic is a server-side, enterprise-grade data storage system. Datomic has a beautiful conceptual model. It's intended to be backed by a storage cluster, in which it keeps index chunks forever. Index chunks are replicated to peers, allowing it to run queries at the edges. Writes are serialized through a transactor.

Many of these design decisions are inapplicable to deployed desktop software; indeed, the use of multiple JVM processes makes Datomic's use in a small desktop app, or a mobile device, prohibitive.

Mentat was designed for embedding, initially in an experimental Electron app (Tofino). It is less concerned with exposing consistent database states outside transaction boundaries, because that's less important here, and dropping some of these requirements allows us to leverage SQLite itself.

Comparison to SQLite

SQLite is a traditional SQL database in most respects: schemas conflate semantic, structural, and datatype concerns, as described above; the main interface with the database is human-first textual queries; sparse and graph-structured data are 'unnatural', if not always inefficient; experimenting with and evolving data models are error-prone and complicated activities; and so on.

Mentat aims to offer many of the advantages of SQLite — single-file use, embeddability, and good performance — while building a more relaxed, reusable, and expressive data model on top.


Contributing

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

See CONTRIBUTING.md for further notes.

This project is very new, so we'll probably revise these guidelines. Please comment on an issue before putting significant effort in if you'd like to contribute.


Building

First, clone the project. We use Cargo to build and test it.

To build all of the crates in the project use:

cargo build

To run tests use:

# Run tests for everything.
cargo test --all

# Run tests for just the query-algebrizer folder (specify the crate, not the folder),
# printing debug output.
cargo test -p mentat_query_algebrizer -- --nocapture

For most cargo commands you can pass the -p argument to run the command just on that package. So, cargo build -p mentat_query_algebrizer will build just the "query-algebrizer" folder.

What are all of these crates?

We use multiple sub-crates for Mentat for four reasons:

  1. To improve incremental build times.
  2. To encourage encapsulation; writing extern crate feels worse than just use mod.
  3. To simplify the creation of targets that don't use certain features: e.g., a build with no syncing, or with no query system.
  4. To allow for reuse (e.g., the EDN parser is essentially a separate library).

So what are they?

Building blocks

edn

Our EDN parser. It uses rust-peg to parse EDN, which is Clojure/Datomic's richer alternative to JSON. edn's dependencies are all either for representing rich values (chrono, uuid, ordered-float) or for parsing (serde, peg).

In addition, this crate turns a stream of EDN values into a representation suitable to be transacted.

mentat_core

This is the lowest-level Mentat crate. It collects together the following things:

  • Fundamental domain-specific data structures like ValueType and TypedValue.
  • Fundamental SQL-related linkages like SQLValueType. These encode the mapping between Mentat's types and values and their representation in our SQLite format.
  • Conversion to and from EDN types (e.g., edn::Keyword to TypedValue::Keyword).
  • Common utilities (some in the util module, and others that should be moved there or broken out) like Either, InternSet, and RcCounter.
  • Reusable lazy namespaced keywords (e.g., DB_TYPE_DOUBLE) that are used by mentat_db and EDN serialization of core structs.

Types

mentat_query

This crate defines the structs and enums that are the output of the query parser and used by the translator and algebrizer. SrcVar, NonIntegerConstant, FnArg… these all live here.

mentat_query_sql

Similarly, this crate defines an abstract representation of a SQL query as understood by Mentat. This bridges between Mentat's types (e.g., TypedValue) and SQL concepts (ColumnOrExpression, GroupBy). It's produced by the algebrizer and consumed by the translator.

Query processing

mentat_query_algebrizer

This is the biggest piece of the query engine. It takes a parsed query, which at this point is independent of a database, and combines it with the current state of the schema and data. This involves translating keywords into attributes, abstract values into concrete values with a known type, and producing an AlgebraicQuery, which is a representation of how a query's Datalog semantics can be satisfied as SQL table joins and constraints over Mentat's SQL schema. An algebrized query is tightly coupled with both the disk schema and the vocabulary present in the store when the work is done.

mentat_query_projector

A Datalog query projects some of the variables in the query into data structures in the output. This crate takes an algebrized query and a projection list and figures out how to get values out of the running SQL query and into the right format for the consumer.

mentat_query_translator

This crate works with all of the above to turn the output of the algebrizer and projector into the data structures defined in mentat_query_sql.

mentat_sql

This simple crate turns those data structures into SQL text and bindings that can later be executed by rusqlite.
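
For illustration only (this is not Mentat's actual internals), here is a minimal sketch in plain Rust with rusqlite of the hand-off this crate enables: SQL text plus an ordered list of bindings, executed together. The table and column names are toy stand-ins.

use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch("CREATE TABLE datoms (e INTEGER, a INTEGER, v);")?;

    // Upstream crates produce SQL text plus an ordered list of bindings;
    // executing them through rusqlite is where this crate's job ends.
    let sql = "SELECT e FROM datoms WHERE a = ?1 AND v = ?2";
    let mut stmt = conn.prepare(sql)?;
    let entities: Vec<i64> = stmt
        .query_map(params![65538, "Alice"], |row| row.get(0))?
        .collect::<Result<_>>()?;
    println!("matching entities: {:?}", entities);
    Ok(())
}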

The data layer: mentat_db

This is a big one: it implements the core storage logic on top of SQLite. This crate is responsible for bootstrapping new databases, transacting new data, maintaining the attribute cache, and building and updating in-memory representations of the storage schema.

The main crate

The top-level main crate of Mentat assembles these component crates into something useful. It wraps up a connection to a database file and the associated metadata into a Store, and encapsulates an in-progress transaction (InProgress). It provides modules for programmatically writing (entity_builder.rs) and managing vocabulary (vocabulary.rs).

Syncing

Sync code lives, for referential reasons, in a crate named tolstoy. This code is a work in progress: the current state is a proof-of-concept implementation that largely relies on the internal transactor to make progress and comes with basic support for timelines. See Tolstoy's documentation for details.

The command-line interface

This is under tools/cli. It's essentially an external consumer of the main mentat crate. This code is ugly, but it mostly works.


SQLite dependencies

Mentat uses partial indices, which are available in SQLite 3.8.0 and higher. It relies on correlation between aggregate and non-aggregate columns in the output, which was added in SQLite 3.7.11.

It also uses FTS4, which is a compile-time option.
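
A hedged sketch of how one might verify these requirements at runtime through rusqlite (the checks and option names are ours, not something Mentat does):

use rusqlite::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    // Partial indices need SQLite 3.8.0 or newer.
    let version: String = conn.query_row("SELECT sqlite_version()", [], |row| row.get(0))?;
    println!("SQLite version: {}", version);

    // FTS4 is part of the FTS3 module, so it shows up in the compile options
    // when the library was built with full-text search enabled.
    let mut stmt = conn.prepare("PRAGMA compile_options")?;
    let options: Vec<String> = stmt.query_map([], |row| row.get(0))?.collect::<Result<_>>()?;
    let has_fts = options.iter().any(|opt| opt.contains("ENABLE_FTS"));
    println!("full-text search compiled in: {}", has_fts);
    Ok(())
}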

By default, Mentat specifies the "bundled" feature for rusqlite, which uses a relatively recent version of SQLite. If you want to link against the system version of SQLite, omit "bundled_sqlite3" from Mentat's features.

[dependencies.mentat]
version = "0.6"
# System sqlite is known to be new.
default-features = false

License

Project Mentat is currently licensed under the Apache License v2.0. See the LICENSE file for details.

mentat's People

Contributors

bgrins, cdbfoster, eoger, ferjm, grigoryk, joewalker, jsantell, maweki, ncalexan, rnewman, sc13-bioinf, thomcc, victorporof

mentat's Issues

Support interned keywords

:db/valueType :db.type/keyword is used to denote a restricted set of values. But we store these as strings. We should consider a schema flag to signal that a keyword value should be automatically interned (as an ident or otherwise) to avoid storing duplicate strings in the database.

Consider using schema value types to generate broad query constraints

Take, for example:

[:find ?age :where
 [?person :foo/age ?age]]

where :foo/age is declared in its schema as being numeric.

A naïve query for this might be:

SELECT datoms123.v AS age FROM datoms datoms123 WHERE datoms123.a = ?
bindings: [(intern :foo/age)]

In some more complicated queries, particularly if ?age contributes its type constraint to another variable, it might be valuable to produce additional constraints derived from the schema:

WHERE typeof(?age) IS 'int'

or

WHERE 0 <= ?age AND ?age < 9999

Don't target `all_datoms` when an unbound attribute is known to be non-textual

Consider this query:

(datomish.query/find->sql-string
  (datomish.query.context/->Context (datomish.query.source/datoms-source nil) nil nil)
  (datomish.query/parse
   '[:find ?e ?a ?v
     :in $
     :where [?e ?a ?v]
            [(> ?v 1000)]])
  {})

Currently this produces SQL:

["SELECT DISTINCT all_datoms11581.e AS e, all_datoms11581.a AS a, all_datoms11581.v AS v FROM all_datoms all_datoms11581 WHERE (all_datoms11581.v > ?) " 1000]

Two situations should change this:

  • There are no fulltext attributes in the schema.
  • A value is constrained to be non-textual (e.g., a clause like [(< ?v 5)] is present).

In these situations, rather than targeting all_datoms (which is a view that includes the join between fulltext_datoms and fulltext_values), we should target only datoms.

Make it easy to log inputs, outputs, and performance metrics

It would be really great to make it really easy to get detailed information about the functioning of the knowledge store. We could add a flag to log inputs, outputs, and performance metrics, for example; or we could annotate transactions with those metrics; or we could expose them transiently to JS consumers; ...

Deal with space usage after large transactions

Importing my places library leaves a 446MB SQLite file — about 1KB per datom.

Vacuuming drops that to 172MB, about 2.5 times the size of the original places.sqlite, which is much more reasonable.

Most of this is due to the massive churn through tx-lookup.

We should consider options here:

  • In-memory databases?
  • A transient on-disk database?
  • Vacuuming after large transactions? (Better auto-vacuum settings, too.)

We should also be actively looking at reducing our on-disk space.
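
As a rough sketch of the last option (the threshold and pragmas here are illustrative, not a recommendation):

use rusqlite::{Connection, Result};

// Illustrative only: vacuum after a large transaction once the file has grown
// past some arbitrary threshold (64 MB here).
fn maybe_vacuum(conn: &Connection) -> Result<()> {
    let page_count: i64 = conn.query_row("PRAGMA page_count", [], |row| row.get(0))?;
    let page_size: i64 = conn.query_row("PRAGMA page_size", [], |row| row.get(0))?;
    if page_count * page_size > 64 * 1024 * 1024 {
        conn.execute_batch("VACUUM;")?;
    }
    Ok(())
}

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    maybe_vacuum(&conn)
}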

lein deps failure: honeysql

Tried to lein deps, as suggested in the README, but it failed with this error:

| mozilla-datomish @ 🚐   lein deps
Could not find artifact honeysql:honeysql:jar:0.7.1-SNAPSHOT in clojars (https://clojars.org/repo/)
This could be due to a typo in :dependencies or network issues.
If you are behind a proxy, try setting the 'http_proxy' environment variable.

What might I be missing?

[meta] Support a JSON value type

It would be convenient for consumers if we supported the JS equivalent of a blob — a JSON object. We would serialize and deserialize automatically, and they wouldn't collide with strings in the store.

Backlog

Here's a list of things to do.

Repo:

  • Set this up as a basic ClojureScript repo.
  • Take a dependency on node-sqlite3.
  • Import the promise wrapper for SQLite from Tofino.
  • Take a dependency on datascript for parsing.

Storage (Nick's current focus):

  • Define a SQLite disk schema for EAVT. This can mostly be grabbed from Nick's persist branch.
  • Implement the usual data structures for taking a DB and getting to a starting point.
  • Decide on value indexing. Opt-in, like Datomic? Partial indices (non-strings; use FTS for that)?
  • Versioned storage schema.
  • Pluggability: Sqlite.jsm. #48.
  • SQLite connection options: page size, for example. See Sqlite.jsm.
  • Expose vacuuming and other storage-level necessities.

Schemas:

  • Consider how we'll persist KB schemas/vocabulary fragments.

Querying (Richard's current focus):

  • Translate from rudimentary parsed Datalog to SQL.
  • Translate from SQL rows to bindings.
    • Find specifications (#38) :
      • Relation (:find ?a ?b)
      • Collection (:find [?a ...])
      • Tuple (:find [?a ?b])
      • Scalar (:find ?a .)
  • Translate from simple pull (attributes only) to SQL.
  • Schema: value mapping between ClojureScript, SQLite, and schema valueType.
  • Future: Pull: implement backward references and paths.
  • Future: allow querying of schema.
  • Rudimentary representation of queries, rudimentary planner. (E.g., correct clause ordering.)
  • Query executor, result decorator.
  • Interning of attributes for both disk size and query speed. Intern…
    • In queries prior to execution.
    • Bound variables prior to execution.
    • Attributes in assertions and retractions at insertion time.
    • … and do a reverse lookup when iterating over query results.
  • Pre-parsing, pre-compilation of queries. (Some preparation steps will tie a query to a particular database instance.)
  • Correct DISTINCT behavior for set-based output
  • with to adjust behavior of DISTINCT
  • Inputs:
    • Scalar
    • Tuple
    • Collection
    • Relation
  • Joins:
    • not
    • not-join (with control over unification)
    • or
    • or-and
  • Source specification for joins
  • Predicates
  • Functions (some)

Insertion etc.:

  • Translate add/retract expressions to SQL operations on transaction log and materialized EAVT.
  • Retraction
  • Excision, schema-aware, transaction log walking.

FTS:

  • Schema definition.
  • fulltext expression.
  • Snippets.

Tooling and docs:

  • GraphQL schema generation (e.g., to support GraphiQL).
  • Schema export, documentation, migration….
  • Backup and restore.
  • Query performance tools.
  • Document and write tests for each stage of the execution pipeline.

Frontend:

  • JS API. #49.
    • JS API tests.

Consider special casing transact for 1, 2, 3, ..., k datoms

I expect most transactions to be small -- star this page, note this visit with its current title, etc. We might find considerable performance wins by special-casing these "small" transactions and querying with hand-written SQL in the hot paths. Worth considering as we gain experience with the disk performance.
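
A sketch of what such a hot path could look like, against a deliberately simplified datoms table (Mentat's real disk schema has more columns, flags, and constraints than shown here; the type tag 10 is the string tag from the value-type-tag-map further down):

use rusqlite::{params, Connection, Result};

// Hand-written hot path for a single-datom transaction. The schema below is a
// toy stand-in, not Mentat's real disk schema.
fn transact_one(conn: &Connection, e: i64, a: i64, v: &str, tx: i64) -> Result<()> {
    conn.execute(
        "INSERT INTO datoms (e, a, v, tx, value_type_tag) VALUES (?1, ?2, ?3, ?4, 10)",
        params![e, a, v, tx],
    )?;
    Ok(())
}

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE datoms (e INTEGER, a INTEGER, v, tx INTEGER, value_type_tag INTEGER);",
    )?;
    transact_one(&conn, 65536, 65540, "Example title", 268435456)
}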

Store and use quick-hashed values for reverse lookups

A common pattern is to refer to records by some unique property:

[:find ?title :where
 [[:page/url ?] :page/title ?title]]
; Binding: "http://foo.com/"

That involves matching string values of :page/url to find a page, then fanning back out to title, or matching all titles and pages and narrowing by URL.

Rather than doing string matching directly, it's often more efficient to store a known hash of a value, and query with a pattern like:

["… WHERE datoms.vh = ? AND datoms.v = ?" (quick-hash val) val]

We should optionally store these in the DB:

vh INTEGER       -- Can be NULL.

With a partial index:

CREATE INDEX idx_vh ON datoms (vh, v, p) WHERE vh IS NOT NULL;

and then add flags to opt in or out on a per-attribute level. We might choose to do this automatically for string-valued unique attributes like :page/url.
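
A minimal sketch of the read side, assuming the hypothetical vh column above and using Rust's std hasher purely as a stand-in (a real quick-hash would have to be stable across processes and releases, which DefaultHasher is not):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

use rusqlite::{params, Connection, OptionalExtension, Result};

// Stand-in for the quick-hash; not stable across runs, illustration only.
fn quick_hash(v: &str) -> i64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish() as i64
}

fn lookup_by_value(conn: &Connection, v: &str) -> Result<Option<i64>> {
    conn.query_row(
        "SELECT e FROM datoms WHERE vh = ?1 AND v = ?2",
        params![quick_hash(v), v],
        |row| row.get(0),
    )
    .optional()
}

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch("CREATE TABLE datoms (e INTEGER, a INTEGER, v, vh INTEGER);")?;
    let url = "http://foo.com/";
    conn.execute(
        "INSERT INTO datoms (e, a, v, vh) VALUES (1, 65539, ?1, ?2)",
        params![url, quick_hash(url)],
    )?;
    println!("{:?}", lookup_by_value(&conn, url)?);
    Ok(())
}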

Remove restriction that a fulltext match must have a constant attribute

We will find it convenient to be able to write patterns like:

[:find ?page ?url :in $ :where
 [(fulltext $ _ "apricot") [[?save]]]
 [?save :save/page ?page]
 [?page :page/url ?url]]

— that is, "find me pages for which a saved version mentions the word "apricot" in title, excerpt, or full content".

SQLite makes this easy. In the future we can optionally even bind the attribute alongside the value in the output relation.

Use prepared statements

Wherever possible we should avoid parsing SQL.

We should also consider, after getting this working in the simple cases, whether it's better to loop over a prepared statement 300 times rather than to compose a 999-var SQL string and parse it.

Note that prepared statements will need to be closed at the appropriate time, and should ideally be lazily created as needed to avoid impacting time-to-open.
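
A sketch of the loop-over-one-statement approach with rusqlite (toy table, arbitrary entids), preparing once and binding repeatedly inside a single transaction:

use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    let mut conn = Connection::open_in_memory()?;
    conn.execute_batch("CREATE TABLE datoms (e INTEGER, a INTEGER, v);")?;

    let values = ["http://a.example/", "http://b.example/", "http://c.example/"];

    // Parse the SQL once, then bind and step repeatedly, rather than composing
    // and re-parsing a large multi-value INSERT.
    let tx = conn.transaction()?;
    {
        let mut stmt = tx.prepare("INSERT INTO datoms (e, a, v) VALUES (?1, ?2, ?3)")?;
        for (i, v) in values.iter().enumerate() {
            stmt.execute(params![i as i64 + 65536, 65539, v])?;
        }
    } // The statement is finalized when it drops here.
    tx.commit()?;
    Ok(())
}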

Efficiently ensure that an open store's schema is a superset of a provided schema

A typical Datomish session looks like this:

  • Open the store.
  • Transact the entire schema.
  • Do some work.

Transacting the entire schema through the generic transact layer is expensive: it's an exclusive write transaction, and it does a bunch of database-level work that's redundant, because everything about the schema in the DB was just loaded into memory during opening!

Instead of <transact!, then, we'd like something more like <ensure-schema!, which will immediately return if the provided schema form is already present in the known schema, and otherwise will transact as usual.
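
A conceptual sketch in plain Rust (no Mentat API; the attribute/definition shape is a toy stand-in) of deciding what, if anything, still needs to go through the generic transact path:

use std::collections::HashMap;

// Toy stand-in: attribute ident mapped to (value type, cardinality-many?).
type Schema = HashMap<String, (String, bool)>;

// Attributes in `wanted` that the open store doesn't already know about; only
// these need the (expensive) generic transact. Conflicting redefinitions would
// need real error handling, elided here.
fn missing_schema(known: &Schema, wanted: &Schema) -> Schema {
    wanted
        .iter()
        .filter(|(ident, _)| !known.contains_key(*ident))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}

fn main() {
    let mut known = Schema::new();
    known.insert(":person/email".into(), (":db.type/string".into(), true));

    let mut wanted = known.clone();
    wanted.insert(":person/friend".into(), (":db.type/ref".into(), true));

    let to_transact = missing_schema(&known, &wanted);
    println!("still to transact: {:?}", to_transact.keys());
}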

Queries hang if they fail

<?all-rows only fills and closes its channel if it gets results. Otherwise, the channel blocks forever. Oops.

Lookup refs are slow

Lookup refs are resolved prior to insertion, by running a SELECT query.

That's fine for 'unit transacts', but for bulk insertions it's cripplingly slow: 10-50msec per query multiplied by the number of lookup refs.

It would literally be quicker to read the entire database into memory, build a lookup table, and resolve in one pass. As a general policy, we should not run a linear number of queries on insert.

The best solution is probably to run a single query with WHERE (a = ? AND v = ?) OR (a = ? AND v = ?) — or just WHERE a IN (?, ?) for large inputs — and build a lookup table.
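
A sketch of the batched approach with rusqlite. The datoms table, the (attribute, value) shape, and the OR-clause construction are illustrative; a real implementation would chunk inputs and handle the empty case.

use std::collections::HashMap;

use rusqlite::{Connection, Result, ToSql};

// Resolve many (attribute, value) lookup refs with a single query, building a
// lookup table, instead of issuing one SELECT per ref.
fn resolve_lookup_refs(
    conn: &Connection,
    refs: &[(i64, String)],
) -> Result<HashMap<(i64, String), i64>> {
    let clauses: Vec<String> = (0..refs.len())
        .map(|i| format!("(a = ?{} AND v = ?{})", 2 * i + 1, 2 * i + 2))
        .collect();
    let sql = format!("SELECT a, v, e FROM datoms WHERE {}", clauses.join(" OR "));

    let mut bindings: Vec<&dyn ToSql> = Vec::new();
    for (a, v) in refs {
        bindings.push(a);
        bindings.push(v);
    }

    let mut stmt = conn.prepare(&sql)?;
    let rows = stmt.query_map(bindings.as_slice(), |row| {
        Ok(((row.get::<_, i64>(0)?, row.get::<_, String>(1)?), row.get::<_, i64>(2)?))
    })?;

    let mut table = HashMap::new();
    for row in rows {
        let (key, e) = row?;
        table.insert(key, e);
    }
    Ok(table)
}

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE datoms (e INTEGER, a INTEGER, v);
         INSERT INTO datoms (e, a, v) VALUES (1, 65539, 'http://foo.com/');",
    )?;
    let refs = vec![(65539, "http://foo.com/".to_string())];
    println!("{:?}", resolve_lookup_refs(&conn, &refs)?);
    Ok(())
}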

Process `order-by` in JS API

We need to be able to spit out something like:

{:limit limit
 :order-by [[:_max_time :desc]]
 :inputs {:since since}}

Aggregates

Implement support for aggregates in projection, starting with max.

Aggregation implies grouping, so we will also need to support :with and be careful about the number-of-heads bug.

Rewrite query tests to use a real database

Query generation relies on the schema, including entity IDs — schemas are driven by entity IDs, not keywords.

Now that I've fixed schema-driven query production, the tests fail because they expect to be able to avoid entids, and the constraints are wrong because the schema lookups fail.

Rather than continuing to hack these facilities into the mock source, we should just rewrite the tests to use real storage.

Translate values from disk representation to Clojure/JavaScript datatypes during projection

We store value tags in the database, because many values end up as numeric — different kinds of numbers, but also idents and others.

See sqlite_schema.cljc:

(def value-type-tag-map
  {:db.type/ref     0
   :db.type/boolean 1
   :db.type/instant 4
   :db.type/long    5 ;; SQLite distinguishes integral from decimal types, allowing long and double to share a tag.
   :db.type/double  5 ;; SQLite distinguishes integral from decimal types, allowing long and double to share a tag.
   :db.type/string  10
   :db.type/uuid    11
   :db.type/uri     12
   :db.type/keyword 13})

During projection we should (lazily or eagerly) translate (value, tag) pairs into appropriate types.

Consider also an entid abstraction to turn entities without ident keywords into something more descriptive than a plain integer.
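
A minimal sketch of the tag dispatch in Rust (the output enum is ad hoc, the instant representation is an assumption, and uuid/uri handling is elided; the tag numbers are the ones from value-type-tag-map above):

use rusqlite::types::Value;

// Ad-hoc output type for this sketch only.
#[derive(Debug)]
enum Projected {
    Ref(i64),
    Boolean(bool),
    Instant(i64), // whatever integral representation instants use on disk
    Long(i64),
    Double(f64),
    Text(String), // strings and keywords both arrive as text
}

// Translate a raw (value, tag) pair read from SQLite into a typed value.
fn project(value: Value, tag: i64) -> Option<Projected> {
    match (tag, value) {
        (0, Value::Integer(e)) => Some(Projected::Ref(e)),
        (1, Value::Integer(b)) => Some(Projected::Boolean(b != 0)),
        (4, Value::Integer(t)) => Some(Projected::Instant(t)),
        (5, Value::Integer(i)) => Some(Projected::Long(i)),
        (5, Value::Real(f)) => Some(Projected::Double(f)),
        (10, Value::Text(s)) | (13, Value::Text(s)) => Some(Projected::Text(s)),
        _ => None, // uuid, uri, and malformed pairs are out of scope here
    }
}

fn main() {
    println!("{:?}", project(Value::Integer(1), 1));
    println!("{:?}", project(Value::Text("hello".into()), 10));
}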

Allow to search set of fields using `fulltext`

Right now, fulltext supports a single field (:for/example) or the special symbol :all to search all fields. This ticket tracks generalizing to allow a set of fields, like #{:for/example :and/another}.

Test long string round-tripping in Node

I was just reminded that node-sqlite3 silently truncates strings longer than 32K in bindings, presumably in both queries and inserts. We should make sure that this doesn't impact Datomish.

Support querying historical state

datoms is effectively a roll-up of the transaction log: all assertions, in order, minus subsequent retractions.

It's sometimes useful to be able to query the history itself, either for testing or to answer questions like "was this ever asserted?" or "when was the earliest mention of X?".

We should:

  • Make the Context aware of its source(s).
  • Make pattern handling a function of source.

That is: a source that queries a range of the transaction log might take a simple pattern like

[?x :foo/bar ?y]

and produce

:from [[:transactions :t123]]
:where [[:= :t123.p (intern :foo/bar)] [:> :t123.tx 456]]

Schema-aware query analysis

When the domain and range of an attribute are known (and in our case the domain is always an entity), we can do tiers of analysis on queries.

The first tier is trivial value-space comparison. A pattern like:

[?foo :bar/baz 5.0]

where :bar/baz is defined as valueType ref is obviously going to fail to return any bindings (with the obvious exception of any collision in our design between integer ref IDs and integer values). That's very likely a programming error, and we should warn as such, as well as optimizing the query.

The second tier is to do this across multiple patterns:

[?foo :bar/age ?age]
[?age :bar/name ?name]

expects ?age to be both a ref and a number, and thus cannot succeed.

Sqlite.jsm support

For deployment in Firefox we'd like to support Sqlite.jsm, not node.js stuff.

Additional find specifications

We currently support relational find specifications:

:find ?foo ?bar

There are three others:

  • Collection: :find [?foo ...]
  • Single tuple: :find [?foo ?bar]
  • Single value: :find ?title .

These should be relatively straightforward to implement.

Future: support for hardening parts of a schema into a tabular representation

We present a tradeoff:

  • Put all of your data in Datomish, getting expressivity but losing out on compactness and query performance for some kinds of regular data.
  • Put all of your data in a SQLite database, getting some significant performance improvements at a cost of expressivity.

It's not all that feasible to have a consumer manually do both. Information is messy and interconnected, which is one of the motivations for doing this work in the first place!

We ourselves split the difference in a couple of areas: for example, we store value tags and values next to each other, rather than being fully normalized.

Some consumers can guarantee that some of their data will be of a particular non-sparse shape. We should consider supporting specialized storage for subparts of a schema which puts values into columns, generating query clauses and processing transactions accordingly.

Restrictions might be:

  • Attributes must all be cardinality: one.
  • All attribute values must be present in the same transaction? Not strictly necessary…
  • Tables cannot be modified after creation.

For example, we might denote a page visit like this:

{:db/ident :page/visitTime
 :db/valueType :db.type/instant
 :db/cardinality :db.cardinality/one
 :db/index true}
{:db/ident :page/visitDevice
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one}
{:db/ident :page/visitType
 :db/valueType :db.type/keyword
 :db/cardinality :db.cardinality/one}

{:table "visits"
 :columns [:page/visitTime :page/visitDevice :page/visitType]}

and we expect that to produce a single table and index pair:

CREATE TABLE datom_table_visits 
(e INTEGER NOT NULL,
 pageVisitTime INTEGER,    -- No tag needed!
 pageVisitDevice TEXT,
 pageVisitType TEXT);

CREATE INDEX datom_table_visits_pageVisitTime ON datom_table_visits(pageVisitTime);
CREATE INDEX datom_table_visits_e ON datom_table_visits(e);

and then queries like

[:find ?page ?type :in $ :where
 [?page :page/visitTime ?time]
 [(> ?time 1234567890)]
 [?page :page/visitDevice "abcdefg"]
 [?page :page/visitType ?type]]

would turn into

SELECT d123.e AS page, d123.pageVisitType AS type
FROM datom_table_visits AS d123
WHERE d123.pageVisitTime > 1234567890 AND
      d123.pageVisitDevice = "abcdefg" AND
      d123.pageVisitType IS NOT NULL;        -- NOT NULL no longer implied for these columns.

and churn through Datomish's existing projection pipeline.

Consumers can still make additional non-tabular references to visits, both in transacts and queries, but get the performance benefits of a SQL-like table structure where it makes sense.

Transparently fail for unknown attributes

We know at query translation time whether an attribute is present in the store: one can't transact a datom that refers to an attribute not present in the schema, so if the query system sees :foo/bar and can't turn it into an entity ID, that clause cannot return results.

At this point we know that:

  • A not clause can be eliminated.
  • An or branch can be pruned.
  • A regular conjoining clause will always return no results.

One might think of this as producing ⊤ and ⊥ clauses from ordinary clauses.

We can optionally warn in this case: perhaps you intended to refer to a different attribute. We should not throw, because your query might be intended to work over a heterogeneous store, and so part of the query might adapt to different schema.

Extend query syntax to support limits

The conceptual model for the query system is one of sets and relations, but often these are then sorted and sliced up programmatically — "10 most recent sites".

This becomes increasingly expensive as the size of the result set grows: it is impractical to fetch 50,000 sites just to find the most recent ten.

As our data sizes grow, it will be increasingly useful to allow limiting of query sequences within, not outside, the query engine itself.

(The Datomic/DataScript approach to this is to use index chunks directly. This makes queries hard to read, hard to send directly from a client, and much more verbose. There is no need for this with our storage model.)

Naturally, ordering and limiting are related, so a solution here needs to address both.

Generate query value type constraints from bound values

This is the flip side of #30.

Values used in bindings should generate additional SQL clauses based on their type tag.

For example, a query like:

[:find ?e ?a :in $ ?v :where
 [?e ?a ?v]]
{:v true}

should generate SQL:

SELECT datoms123.e AS e, datoms123.a AS a
FROM datoms AS datoms123
WHERE datoms123.v = 1 AND datoms123.value_type_tag = 1   -- boolean

to avoid matching against entities, integers, etc.

Note that schema awareness can make some of these checks unnecessary. Given a schema:

{:db/ident :foo/truth
 :db/valueType :db.type/boolean}

a query like

[:find ?e :in $ ?v :where
 [?e :foo/truth ?v]]
{:v true}

can simply result in

WHERE datoms123.a = 65538 AND datoms123.v = 1

because the type of ?v can be determined from the known attribute and its valueType.

Include index flag(s) to enable non-standard index walks in queries

We have a few non-standard (opt-in) indexes defined: AVET and VAET for looking up :db/index true and :db.type/ref attributes, respectively. The indexes are defined with flags, like:

;; Opt-in index: only if a has :db/index true.
"CREATE UNIQUE INDEX idx_datoms_avet ON datoms (a, value_type_tag, v, e)
 WHERE index_avet IS NOT 0"

that means we need to include index_avet in the query (in some way) to enable SQLite to walk that index. Further, we may need to tweak the index definition, since SQLite will more or less only use a partial index when the query's WHERE clause "directly matches" the index's WHERE clause: https://www.sqlite.org/partialindex.html.

Further work on aggregates

After #39, the following items remain:

  • Inclusion of type tags. Check that aggregation behaves correctly (that is: reject the query!) for values of differing types.
  • Type handling for aggregates: e.g., max on instants should yield an instant, not a plain number.
  • Constant expressions (#585)
  • Limits (double-check) (#316)

Serialize and de-serialize non-keyword schema values correctly

This is mopping up a set of TODOs around the schema materialized views. For expedience, we only handled keyword values in schemas, like :db/valueType :db.type/ref. This ticket tracks supporting non-keyword values, like :db/index true. This requires generalizing the schema to include a value type tag and writing and reading said tag.

This would be easier if we used the query implementation, but it's difficult to do so while bootstrapping the DB, so we'll handle it manually.

Coalesce simple parts of a complex 'or' expression

We implement simple or via alternation:

[?page :page/title ?foo]
[?page :page/meta ?foo]

becomes

(:or
  [:= :datoms123.a :page/title]
  [:= :datoms123.a :page/meta])

We don't currently recognize the simple subparts of a non-simple or, so we generate distinct UNION arms instead.

Implementing this optimization will be a nice performance win, and shouldn't be too difficult.

Don't touch SQL store when allocating entids

Right now, we query and insert entids into the SQL store each time we allocate a tempid. This can be avoided by keeping temporary data in the TxReport and updating the SQL store just once per transaction.
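
A conceptual sketch of the in-memory side (the TxReport field names are illustrative): allocate tempids from a counter held in the report and only reconcile with the SQL store once, at commit time.

use std::collections::HashMap;

// Illustrative only: tempids resolve against an in-memory counter, so the SQL
// store is untouched until the transaction is committed.
#[derive(Debug, Default)]
struct TxReport {
    next_entid: i64,
    tempids: HashMap<String, i64>,
}

impl TxReport {
    fn allocate(&mut self, tempid: &str) -> i64 {
        if let Some(&e) = self.tempids.get(tempid) {
            return e;
        }
        let e = self.next_entid;
        self.next_entid += 1;
        self.tempids.insert(tempid.to_string(), e);
        e
    }
}

fn main() {
    let mut report = TxReport { next_entid: 65536, ..Default::default() };
    let first = report.allocate("tempid-a");
    let again = report.allocate("tempid-a"); // same tempid, same entid
    println!("{} {} {:?}", first, again, report.tempids);
}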

Support a JSON format for schema

This is necessary for JSON consumers, for whom EDN and the full capability of our schema language is unnecessary. They want to name attributes and define their type, uniqueness, docs, and cardinality.

Prior art: https://github.com/sullerandras/datomic-json-rest/blob/master/features/post-schema.feature

{"name": "admin",
 "attributes": [{"name": "name",
                 "type": "string",
                 "cardinality": "one"},
                {"name": "phone",
                 "type": "string",
                 "cardinality": "many"}]}

Presumably the enclosing object is a hook for naming collections of attributes. This is a good place to also version these schema parts, allowing for migrations.

Consider an alternative disk representation for datom flags

We have four fields in datoms: index_vaet, index_avet, index_fulltext and unique_value.

These fields appear in datoms and in the indices idx_datoms_eavt and idx_datoms_aevt.

Some rows also appear in idx_datoms_avet, idx_datoms_vaet, idx_datoms_fulltext, and idx_datoms_unique_value.

That is: each datom is responsible for anywhere from 12 to 28 TINYINTs in the DB. Sometimes these won't contribute to space (if they're zero, I've read that sqlite should pack them down to nothing), but assuming one byte per field, half a million datoms will add up to 14MB to the DB just for these flags.

These flags are entirely derived from the schema: a datom's attribute is the sole determinant (AFAICS) of whether these flags are 1 or 0 for a row.

Now, partial indexes cannot refer to other tables or call functions, so we can't simply join against the schema in the CREATE INDEX … WHERE clause. But we could use a more complex operator-driven expression, including bitmasks, to compress these flags.

We could also consider approaches to simply removing columns:

  • index_vaet and index_fulltext are mutually exclusive, so they could be two integer values in a single field.
  • index_vaet corresponds to :db/valueType :db.type/ref, which is already implicitly represented by a value_type_tag of 0, so we can filter on that instead.

Finally, we could consider direct schema creation as an approach: rather than having idx_datoms_fulltext, for example, we could create per-attribute indices when we register a schema fragment that includes :fulltext true:

CREATE INDEX idx_datoms_fulltext_page_title ON datoms (value_type_tag, v, a, e) WHERE a = 65538 -- The interned attribute.

I don't know if sqlite will happily use a collection of such indices (perhaps it'd be quicker for real-world queries!), but it's worth exploring.

Support connection listeners

Datomic and DataScript show a simple API for listening to transactions. We'll want something similar, in particular to allow the UAS to listen for changes and propagate diffs out to clients.

Full-text search

In order to support FTS, we need:

  • A virtual FTS4 table with appropriate tokenizing and collation.
  • A column on datoms indicating whether the associated value is a rowid.
  • A view that transparently joins, something like:
CREATE VIEW view_datoms_with_fts (e, a, v, tx) AS
SELECT e, a, v, tx FROM datoms WHERE fts IS 0
UNION ALL
SELECT e, a, datoms_fts.text AS v, tx FROM datoms, datoms_fts 
WHERE datoms.fts IS 1 AND datoms.v = datoms_fts.rowid
  • Adjustments to the query processor to query from datoms if a is known to be non-free-text, or from view_datoms_with_fts if it might be.
  • Binding predicates in the query processor.
  • A fulltext predicate.
  • Schema support.
  • Transact support: inserting values into the FTS table and storing the rowid in datoms.
  • Transaction log support. Efficiency here is a concern, and might be best achieved by treating the FTS table as an until-excision permanent store — don't store the text twice.
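
A sketch of the first and transact-support items using plain rusqlite, assuming FTS4 is compiled in (see "SQLite dependencies" above); the table shapes and attribute numbers are toy stand-ins:

use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;
    // Simplified stand-ins: an FTS4 virtual table holding the text, and a
    // datoms table whose v column holds the FTS rowid when fts = 1.
    // Tokenizer and collation choices are elided.
    conn.execute_batch(
        "CREATE VIRTUAL TABLE datoms_fts USING fts4(text);
         CREATE TABLE datoms (e INTEGER, a INTEGER, v, tx INTEGER, fts INTEGER DEFAULT 0);",
    )?;

    // Transact side: insert the text into the FTS table, store its rowid in datoms.
    conn.execute(
        "INSERT INTO datoms_fts (text) VALUES (?1)",
        params!["A page about apricot jam"],
    )?;
    let rowid = conn.last_insert_rowid();
    conn.execute(
        "INSERT INTO datoms (e, a, v, tx, fts) VALUES (?1, ?2, ?3, ?4, 1)",
        params![65536, 65541, rowid, 268435456],
    )?;

    // Query side: MATCH against the FTS table, join back to datoms.
    let e: i64 = conn.query_row(
        "SELECT datoms.e FROM datoms, datoms_fts
         WHERE datoms.fts = 1 AND datoms.v = datoms_fts.rowid AND datoms_fts.text MATCH ?1",
        params!["apricot"],
        |row| row.get(0),
    )?;
    println!("matched entity: {}", e);
    Ok(())
}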
