This survey aims to build a modest bridge between the areas of databases and sparse scientific computing. At the core of both fields are two key ingredients: (1) a representation of sparse data - data for which storing all possible worlds is suboptimal or intractable, and (2) a mathematical formalism for computing over sparse data. Clever representations in turn enable nontrivial asymptotic performance gains when computing over sparse data. We conclude with some thoughts on where to go next.
Sparsity in scientific data appears most commonly in the form of sparse matrices. These matrices come from various sources, including:
- physical systems with relatively few pairwise interactions between components, such as particles with local forces,
- linear systems with many variables and constraints, but with few co-appearing variables in each constraint, and
- graphs with $m$ edges on $n$ nodes, where $m \ll n^2$.
The last example--sparse graphs--is particularly fundamental, in the sense that the earlier examples can be viewed through the lens of interaction graphs between the variables/components of the underlying system. We can think of an unweighted sparse graph (and its boolean adjacency matrix) as the structural skeleton of a sparse system. The nodes and edges of these graphs are then augmented with domain-specific data.
Figure 1a. A sparse graph obtained from a boolean satisfiability problem by connecting variables with clauses that they appear in.
Figure 1b. A sparse matrix derived from a finite difference computation on a 3D mesh. From https://sparse.tamu.edu/Norris/torso1.
We can easily generalize sparse graphs to sparse hypergraphs, and consequently sparse matrices to sparse tensors (in the multidimensional array sense). For example, let each clause of a boolean formula be a hyperedge over the variables it contains; Figure 1a depicts the corresponding incidence structure.
The presence of sparsity in databases is much more immediate. Consider a table
name | salary |
---|---|
bob | 80000 |
alice | 85000 |
Now what would a dense representation look like? We would need to represent all possible entries of the table, and indicate (by storing a boolean) which entries are actually present. That would look something like this:
name | 0 | 1 | ... | 80000 | ... | 85000 | ... | 4294967295 |
---|---|---|---|---|---|---|---|---|
a | 0 | 0 | ... | 0 | ... | 0 | ... | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
alice | 0 | 0 | ... | 0 | ... | 1 | ... | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
bob | 0 | 0 | ... | 1 | ... | 0 | ... | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
z...z | 0 | 0 | ... | 0 | ... | 0 | ... | 0 |
This is not a good idea! It is in this sense that tabular data is "naturally" sparse.
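To see the gap concretely, here is a back-of-the-envelope sketch; the name and salary domains (5-letter lowercase names, 32-bit salaries) are invented for illustration:

```python
# Sparse: store only the tuples that are present.
table = {"bob": 80000, "alice": 85000}
sparse_cells = len(table)

# Dense: one boolean per possible (name, salary) cell. Even for this tiny
# domain, the grid dwarfs the actual data.
num_names = 26 ** 5      # all 5-letter lowercase names
num_salaries = 2 ** 32   # all 32-bit salary values
dense_cells = num_names * num_salaries

print(sparse_cells, dense_cells)  # 2 vs roughly 5.1e16
```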
Clearly there is a fundamental connection between sparse matrices, tensors, and tables. From a data perspective, they are more or less the same thing. But we are interested in actually doing something with the data. We want to run computations and queries over the data. Thankfully, database researchers have thought about computing over sparse data for a long time. We will use graph traversal as our running example for showing how tensor algebra and relational algebra can be unified.
Suppose we have a graph stored as a table of edges, and we want to find all pairs of nodes reachable by paths of length exactly 3 in the graph. Relationally, we can consider 3 copies of the graph with appropriately relabeled attributes,
over which we compute the natural join and project out our source and destination nodes:
If instead we had the adjacency matrix $A$ of the graph, we could first compute the number of paths of length 2 between each pair of nodes:

$$
Q_{ij} = (A^2)_{ij} = \sum_k A_{ik}A_{kj} = \text{number of paths of length 2 from } i \text{ to } j,
$$
and then extend these by one more edge, as the product

$$
Q_{i\ell} = (A^3)_{i\ell} = \sum_j (A^2)_{ij} A_{j\ell} = \text{number of paths of length 3 from } i \text{ to } \ell.
$$
The pairs we want are exactly the indices $(i,\ell)$ for which $Q_{i\ell} \neq 0$.
The similarity to the relational algebra query is of course not a coincidence. Note that joining on a shared attribute plays the role of matching a shared index, and projecting an attribute away plays the role of summing that index out.
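To make the correspondence concrete, here is a small Python sketch (the graph is invented for illustration) that counts length-3 paths both ways: relationally via a triple self-join, and algebraically via matrix products with + and *:

```python
from collections import defaultdict

# Tiny directed graph, invented for illustration.
edges = [(0, 1), (1, 2), (2, 3), (1, 3), (0, 2)]

# Relational view: triple self-join on the edge table, projecting
# (source, dest) and keeping a count per output pair.
paths3 = defaultdict(int)
for (a, b) in edges:
    for (b2, c) in edges:
        if b2 != b:
            continue
        for (c2, d) in edges:
            if c2 != c:
                continue
            paths3[(a, d)] += 1
paths3 = dict(paths3)

# Linear-algebraic view: A^3, with the matrix stored sparsely as a dict
# of nonzero entries.
A = {(i, j): 1 for (i, j) in edges}

def matmul(X, Y):
    Z = defaultdict(int)
    for (i, k), x in X.items():
        for (k2, j), y in Y.items():
            if k2 == k:
                Z[(i, j)] += x * y
    return dict(Z)

A3 = matmul(matmul(A, A), A)
print(paths3, A3)  # both {(0, 3): 1}: the only length-3 path is 0→1→2→3
```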
Minimizing distance. Now what if, instead of detecting such paths, we wanted to find the (minimum) distance between pairs of nodes when taking paths of length 3? In the relational approach, we would first need to augment the table with a weight associated to each edge:
The result could then be computed by joining the three relations as before, grouping by the source and destination attributes, and taking the minimum over the summed weights of the three edges in each group.
On the other hand, if we interpret the entries of $A$ as edge weights, we can compute the same result by replacing sum with min and product with addition in the matrix products above.
This mirrors our previous expression for counting the number of paths, where we first computed the results for paths of length 2. Curiously, addition distributes over min, so we can just write

$$
Q_{i\ell} = \min_{j,k}\,(A_{ik} + A_{kj} + A_{j\ell}).
$$
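The same sketch carries over to minimizing distance by swapping the operators (the edge weights below are invented for illustration):

```python
INF = float("inf")

# Same graph shape as before, now with invented edge weights.
W = {(0, 1): 4.0, (1, 2): 1.0, (2, 3): 2.0, (1, 3): 10.0, (0, 2): 6.0}

def minplus_matmul(X, Y):
    # sum -> min, product -> +
    Z = {}
    for (i, k), x in X.items():
        for (k2, j), y in Y.items():
            if k2 == k:
                Z[(i, j)] = min(Z.get((i, j), INF), x + y)
    return Z

W3 = minplus_matmul(minplus_matmul(W, W), W)
print(W3)  # {(0, 3): 7.0}: the cheapest length-3 path is 0→1→2→3 (4+1+2)
```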
We are now ready for the punchline.
What we have just seen are three instances of semirings. We will give a formal definition below.
Definition 1. A semiring is a tuple $(S, \oplus, \otimes, 0, 1)$ where:
- $\oplus:S\times S\to S$ is associative and commutative with identity $0$,
- $\otimes:S\times S\to S$ is associative with identity $1$ and annihilator $0$, and
- $\otimes$ distributes over $\oplus$ on the left and right, i.e. $a\otimes(b\oplus c) = a\otimes b\oplus a\otimes c$ and $(b\oplus c)\otimes a = b\otimes a\oplus c\otimes a$.
We will assume that $\otimes$ is also commutative, unless stated otherwise.
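This definition can be captured by a single small interface. Below is a minimal sketch (the naming is our own, not from any particular library), with a spot-check of distributivity:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]   # ⊕: associative, commutative, identity `zero`
    mul: Callable[[Any, Any], Any]   # ⊗: associative, identity `one`, annihilator `zero`
    zero: Any
    one: Any

boolean  = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
counting = Semiring(lambda a, b: a + b,  lambda a, b: a * b,   0,     1)
tropical = Semiring(min,                 lambda a, b: a + b,   float("inf"), 0.0)

# Spot-check distributivity in the min-tropical instance:
# a ⊗ (b ⊕ c) == (a ⊗ b) ⊕ (a ⊗ c)
a, b, c = 2.0, 5.0, 3.0
assert tropical.mul(a, tropical.add(b, c)) == \
       tropical.add(tropical.mul(a, b), tropical.mul(a, c))
```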
When we counted the number of paths between nodes by calculating $A^3$, we were computing over the counting semiring $(\mathbb{N}, +, \times, 0, 1)$; minimizing distance used the min-tropical semiring $(\mathbb{R}\cup\{\infty\}, \min, +, \infty, 0)$; and plain reachability corresponds to the boolean semiring $(\{\mathrm{false}, \mathrm{true}\}, \lor, \land, \mathrm{false}, \mathrm{true})$.
An $S$-relation makes this precise: it is a relation whose tuples are annotated with values from a semiring $S$.
Definition 2. A schema is just a cartesian product of sets $\mathbf{A} = A_1\times\cdots\times A_n$, one factor per attribute.
Definition 3. Given a schema $\mathbf{A}$ and a semiring $S$, an $S$-relation over $\mathbf{A}$ is a function $R:\mathbf{A}\to S$ that is nonzero on finitely many tuples.
Example 1. A tensor is an $S$-relation whose schema is a product of index sets $\{1,\dots,n_1\}\times\cdots\times\{1,\dots,n_d\}$. A standard relation is an $S$-relation over the boolean semiring: a tuple is present exactly when its annotation is true.
We have already seen how to translate joins and projections over standard relations into sums and products over matrices. The same translation works for arbitrary $S$-relations, as follows.
Natural join. Given $S$-relations $R_1$ over $\mathbf{A}_1\times\mathbf{A}$ and $R_2$ over $\mathbf{A}\times\mathbf{A}_2$, where $\mathbf{A}$ holds the shared attributes, the natural join is defined by

$$ [R_1\bowtie R_2](\mathbf{a}_1,\mathbf{a},\mathbf{a}_2) = R_1(\mathbf{a}_1,\mathbf{a})\otimes R_2(\mathbf{a},\mathbf{a}_2). $$
In other words, the semiring value of a join output tuple is the product of the values of the input tuples that matched on the shared attributes.
Projection. Projection is a sum over the unprojected attributes: given $R$ over $\mathbf{A}\times\mathbf{A}'$, projecting onto $\mathbf{A}'$ gives

$$ \left[\prod\nolimits_{\mathbf{A}'} R\right](\mathbf{a}') = \bigoplus_{\mathbf{a} \in \mathbf{A}} R(\mathbf{a}, \mathbf{a}'). $$

Note that the sum ranges across all values $\mathbf{a}\in\mathbf{A}$, but since absent tuples are annotated with $0$, only the finitely many present tuples contribute.
Union. Union of relations with the same schema is elementwise addition:
$$ [R_1\cup R_2](\mathbf{a}) = R_1(\mathbf{a})\oplus R_2(\mathbf{a}). $$
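A minimal sketch of these three operators over dict-backed relations, instantiated for the counting semiring; the attribute handling is hard-coded to binary/ternary tuples to keep the sketch short, and the helper names are our own:

```python
def join(R1, R2):
    """Natural join of R1(x, y) and R2(y, z): output value is v1 ⊗ v2."""
    out = {}
    for (x, y), v1 in R1.items():
        for (y2, z), v2 in R2.items():
            if y2 == y:
                out[(x, y, z)] = v1 * v2  # ⊗ is * in the counting semiring
    return out

def project_xz(R):
    """Project away y: ⊕-sum the values over the eliminated attribute."""
    out = {}
    for (x, y, z), v in R.items():
        out[(x, z)] = out.get((x, z), 0) + v  # ⊕ is +, with identity 0
    return out

def union(R1, R2):
    """Elementwise ⊕ of two relations with the same schema."""
    out = dict(R1)
    for t, v in R2.items():
        out[t] = out.get(t, 0) + v
    return out

# Length-2 paths in a tiny edge relation (data invented for illustration).
E = {(0, 1): 1, (1, 2): 1, (1, 3): 1}
paths2 = project_xz(join(E, E))
print(paths2)  # {(0, 2): 1, (0, 3): 1}
```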
Selection. We can actually support selection in the same algebra, as a pointwise product with a boolean indicator: $[\sigma_\theta R](\mathbf{a}) = R(\mathbf{a})\otimes[\theta(\mathbf{a})]$, where $[\theta(\mathbf{a})]$ is $1$ if the predicate $\theta$ holds on $\mathbf{a}$ and $0$ otherwise.
Aggregation. Aggregation is replaced by sum and product. The "group-by" attributes are the free variables in the sum, and the aggregation function is induced by the choices of semiring and which column(s) store the semiring values. Note that you can interleave different summation operators, as long as you are careful about maintaining the semiring properties (but dropping commutativity between the different sums). In general, SQL aggregations may be more expressive, since the operators aren't required to obey the semiring properties. For example, AVG is not associative.
Negation. Negation is much more subtle, and not generally well defined for arbitrary semirings. Indeed, the "semi" qualifier of "semiring" explicitly drops the existence of additive inverses. One option is to support negation as an uninterpreted function on the underlying set of the semiring. For example, the relational difference $R_1 \setminus R_2$ can then be written using such an uninterpreted "minus", whose meaning is supplied separately for each concrete semiring.
We will see later how to support negation for a special type of semiring, in the context of seminaive evaluation.
While the above algebra captures a good chunk of relational and linear algebraic workloads, it misses a key feature of many computations: recursion. A classic result from logic is that first-order logic (of which relational algebra is a subset) cannot express the transitive closure of a graph. This motivated the development of datalog, which builds upon relational algebra to express fixed point computations. Quickly summarized, datalog programs consist of a set of queries together with a set of input tuples called the extensional database (EDB). The queries are then executed on the inputs to form the intensional database (IDB), in a loop, until the results no longer change. All datalog programs terminate in polynomial time with respect to the size of the EDB.
This iteration is often described using the immediate consequence operator $T_P$, which applies every rule once to the facts derived so far. Since $T_P$ is monotone (it can only add tuples) and there are finitely many derivable tuples, repeatedly applying it reaches a least fixed point.
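A minimal sketch of this loop for transitive closure over the boolean semiring (plain set semantics), with an invented EDB:

```python
# Program: path(x, y) :- edge(x, y).
#          path(x, z) :- path(x, y), edge(y, z).
# EDB (invented for illustration): a chain 0 -> 1 -> 2 -> 3.
edge = {(0, 1), (1, 2), (2, 3)}

path = set()
while True:
    # One application of the immediate consequence operator.
    new = set(edge)  # first rule
    for (x, y) in path:
        for (y2, z) in edge:
            if y2 == y:
                new.add((x, z))  # second rule
    if new == path:  # fixed point reached: nothing new was derived
        break
    path = new

print(sorted(path))  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```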
If we want to generalize datalog to arbitrary semirings, we need to think about these properties. What partial order should the semiring carry, and under what conditions is the immediate consequence operator monotone and guaranteed to converge under it?
The original provenance semirings paper suggests using formal power series over the EDB annotations [1], working in $\omega$-continuous semirings so that the infinite sums introduced by recursion are well defined.
The main idea with datalogo [4] is to define a separate partial order on the carrier set of the semiring, decoupled from $\oplus$, and to analyze the monotonicity and convergence of the immediate consequence operator with respect to that order.
The authors also generalize semi-naive evaluation to work over distributive dioids, which are naturally ordered by $a \sqsubseteq b \iff a \oplus b = b$.
The authors note that the min-tropical semiring admits a difference operator: $b \ominus a$ equals $b$ if $b < a$, and $\infty$ (the additive identity) otherwise.
Semi-naive using this subtraction operator recovers the common shortest-paths optimization.
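A sketch of semi-naive shortest paths in this style: the delta holds only tuples whose distance strictly improved, mirroring the subtraction operator just described (the graph and its weights are invented for illustration):

```python
INF = float("inf")

# Weighted edges, invented for illustration.
edges = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 5.0, (2, 3): 1.0}

dist = dict(edges)   # dist[(x, y)] = best known weight of any path x -> y
delta = dict(edges)  # tuples whose value changed in the last round

while delta:
    new_delta = {}
    for (x, y), d in delta.items():
        for (y2, z), w in edges.items():
            if y2 != y:
                continue
            cand = d + w  # ⊗ is + in the min-tropical semiring
            # b ⊖ a: keep the candidate only if it strictly improves.
            if cand < dist.get((x, z), INF):
                dist[(x, z)] = cand
                new_delta[(x, z)] = cand
    delta = new_delta

print(dist[(0, 3)])  # 4.0, via 0 -> 1 -> 2 -> 3
```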
Naturally, we want to optimize these semiring programs. In this section, we will show how thinking about joins from a sparse tensor perspective naturally leads to a worst-case optimal join algorithm, which was a breakthrough result from database researchers in the last decade. This optimization is rooted in a factorization of the underlying data, a technique that is ubiquitous in sparse computing but much rarer in databases until recently.
The most familiar representation of a database table is as a list of tuples. This list can be stored either row-wise or column-wise with different tradeoffs, but in general all tuple elements are explicitly stored (modulo certain compression optimizations). On the other hand, this "listings" format is quite rare in high-performance sparse computing. For example, graph algorithms rarely use a list of edges, but instead use an adjacency list structure, where all neighbors of a node are packed and accessed together. Analogously, a sparse matrix is often stored as a compressed sparse row (CSR) matrix, which is effectively a compact adjacency list data structure. In both cases, the tuples are stored in a factorized form - the tuple elements have a distinguished order with which they must be accessed, and duplicated values are elided according to this order.
To make things concrete, consider the table $R$ with attributes $x$, $y$, and $z$:
x | y | z |
---|---|---|
1 | 2 | 3 |
1 | 3 | 4 |
1 | 3 | 5 |
2 | 1 | 1 |
Alternatively, we may factorize $R$ as a trie with attribute order $x \to y \to z$: each distinct $x$ value is stored once, pointing to the $y$ values it co-occurs with, each of which in turn points to its $z$ values.
Mathematically, this representation is factorized in the sense that the set of tuples is expressed as a union of cartesian products: $\{1\}\times\big(\{2\}\times\{3\} \cup \{3\}\times\{4,5\}\big) \cup \{2\}\times\{1\}\times\{1\}$.
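Building this trie as nested dicts takes a few lines (the innermost level stores the $z$ values as a set):

```python
# The table R(x, y, z) from above, as a list of tuples.
R = [(1, 2, 3), (1, 3, 4), (1, 3, 5), (2, 1, 1)]

# Trie with attribute order x -> y -> z: shared prefixes are stored once.
trie = {}
for (x, y, z) in R:
    trie.setdefault(x, {}).setdefault(y, set()).add(z)

print(trie)  # {1: {2: {3}, 3: {4, 5}}, 2: {1: {1}}}
```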
An effect of factorization is that different attribute orderings no longer result in isomorphic storage representations. In the listings format, all permutations of attributes result in the same memory footprint. With a factorized representation, different orderings lead to different values being stored. Accessing attributes out of order is also highly inefficient. So why would we want a factorized representation?
Sparse representations avoid storing unnecessary information. But space (obviously) isn't the whole story. We need a way to perform computations over sparse data efficiently: we don't want to spend any more time than the space we use. To be more precise, we'd like our runtime to be somehow proportional to the size of the output, asymptotically. The key to achieving such guarantees, in the context of joins, is efficiently computing intersections - after all, a join can be defined as the cartesian product of all input attributes filtered by their intersection on shared attributes. We'll see how to be smart with intersections by working through the infamous triangle query, from a sparse tensor algebra perspective.
Let's modify the query we saw earlier. Instead of all paths of length 3, we are now only interested in paths of length 3 that start and end at the same vertex - triangles. Let's start by listing all the triangles. For notational clarity, we'll switch to a relational calculus with explicit variables in place of implicit attributes:
How might we compute this? Mainstream database systems would emit a binary join plan, which amounts to computing one join first, materializing the result, then joining the intermediate with the remaining relation. If each relation has $m$ tuples, that intermediate can contain up to $m^2$ tuples, even though the number of triangles is at most $O(m^{3/2})$.
We can also do something similar using linear algebraic operations:

$$ \text{count} = \sum_{i,j} \Big(\sum_k A_{ik} B_{kj}\Big) M_{ij}, $$

where $A$, $B$, and $M$ are all copies of the adjacency matrix, named separately to mirror the three relations in the query.
From dense to sparse loops. Consider the following code that computes the triangle count
count = 0
for i in range(I):          # iterate over rows of A and M
    for k in range(K):      # iterate over columns of A, rows of B
        for j in range(J):  # iterate over columns of B, columns of M
            count += A[i,k] * B[k,j] * M[i,j]
This is more or less a direct translation of the tensor algebra expression. Now suppose the inputs are actually sparse; what would we need to change to make this work? The first thing we should do is the following: for each loop, only run the body when a value is present for all inputs. Intuitively, given a specific binding of values `(i, k, j)`, the product in the body evaluates to zero if any of the matrices lack a value at the indices being looked up. Adding zero does just as much as skipping the loop entirely. In other words, each loop variable should iterate over the intersection of the inputs' values on that dimension:
count = 0
for i in A.i ∩ M.i:
    for k in A.k ∩ B.k:
        for j in B.j ∩ M.j:
            count += A[i,k] * B[k,j] * M[i,j]
We haven't actually described how to compute this intersection. An important subtlety here is that the useful values of `k` depend on `i`; more precisely, the set `A.k` of `k` values paired with a given `i` depends on which `i` values survive the intersection above. We don't want to iterate over `k`s for which there isn't even a relevant `i` in `A` and `M`! This dependency is due to the variable ordering we have chosen. Luckily, using factorized representations for our inputs solves this:
count = 0
# A: i -> k -> value
# B: k -> j -> value
# M: i -> j -> value
for i in A:
    if i not in M:
        continue
    for k in A[i]:
        if k not in B:
            continue
        for j in B[k]:
            if j not in M[i]:
                continue
            count += A[i][k] * B[k][j] * M[i][j]
In this code, we iterate over one of the sets in the intersection and probe the other sets for hits. As it turns out, if we always choose the smallest set to iterate over, we recover Generic Join - a worst case optimal join algorithm [5]. In other words, the count is computed in time bounded by the largest possible output size of the query, which for the triangle query is $O(m^{3/2})$.
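The "iterate over the smallest, probe the others" step can be sketched as a small helper (the naming is our own):

```python
def intersect(*sets):
    # Iterate over the smallest set; probe the rest with O(1) lookups,
    # so the cost is proportional to the smallest input.
    smallest = min(sets, key=len)
    others = [s for s in sets if s is not smallest]
    return {x for x in smallest if all(x in s for s in others)}

print(intersect({1, 2, 3, 4}, {2, 4, 6}, {4, 5}))  # {4}
```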
Sparse tensor compilers. This flavor of sparse loop iteration was studied deeply by the TACO (Tensor Algebra Compiler) project [6]. They developed a compiler that emits efficient code for coiterating factorized tensors (i.e., iterating over intersections) and performing computations. For optimizations, TACO follows the Halide philosophy of separating the compute definition from its low-level implementation (the schedule). TACO's optimizations are mainly loop-centric, including loop tiling, untiling (sometimes called fusion), and reordering. Although originally intended for numerical applications, subsequent work incorporated user-defined functions including support for arbitrary semirings [7].
Factorization in databases. The idea of factorizing sparse data for efficient computation has been extensively explored by the Factorized Databases (FDB) research group led by Dan Olteanu [8]. This line of research has resulted in implementations of scalable in-database machine learning, which would otherwise be infeasible using traditional machine learning frameworks [9]. On the theory side, the group has also shown how factorization is deeply connected to further worst-case optimality results, such as the work on functional aggregate (sum-product) queries [10].
This survey has hopefully shown how sparse computations and relational query processing can be seen as branches of a shared underlying formalism. As a core example, thinking in terms of sparse loops enables joins to be computed asymptotically faster. Of course, the fields of sparse computing and databases have diverse motivations, constraints, and technologies, so it would be foolish to conflate them. However, bringing them closer together can only help useful ideas proliferate. In the remainder of this section, we'll free associate on some next steps.
Worst-case optimal join is robust to variable ordering, asymptotically. But practically, the trie from earlier shows how a "bad" ordering can lead to unnecessary duplication. Additionally, constructing and maintaining a trie for every table is potentially a lot of work. Free Join [11] addresses these problems by using a traditional database optimizer to generate a variable ordering, and materializes the trie lazily from columnar data to reduce preprocessing cost. This two-staged approach works surprisingly well, but the idea of a factorized-native optimizer remains tantalizing. Interesting questions include:
- what kind of factorized statistics could we compute and utilize, beyond traditional column-level statistics like histograms?
- how do we accurately estimate and compare the cost of building tries, table scans, index lookups, and more?
- how do we parallelize factorized joins? do we partition the relations like Gamma [12], or can we exploit fine-grained loop-level parallelism? how do we achieve load balance in the presence of skew?
A common feature of tensor programs that was not covered here is the presence of affine arithmetic over index variables. For example, in image processing pipelines, an output pixel at index $i$ often depends on input pixels at offsets such as $i-1$, $i$, and $i+1$. Joins over such affine conditions go beyond the equality joins considered here, and have been studied in the contexts of functional aggregate queries with additive inequalities [2] and conjunctive queries with comparisons [3].
[1] Green, T. J., Karvounarakis, G., and Tannen, V. Provenance semirings. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (New York, NY, USA, 2007), PODS ’07, Association for Computing Machinery, p. 31–40
[2] Khamis, M. A., Curtin, R. R., Moseley, B., Ngo, H. Q., Nguyen, X., Olteanu, D., and Schleich, M. Functional aggregate queries with additive inequalities, 2020.
[3] Wang, Q., and Yi, K. Conjunctive queries with comparisons. SIGMOD Rec. 52, 1 (jun 2023), 54–62.
[4] Abo Khamis, M., Ngo, H. Q., Pichler, R., Suciu, D., and Wang, Y. R. Convergence of datalog over (pre-) semirings. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (New York, NY, USA, 2022), PODS ’22, Association for Computing Machinery, p. 105–117.
[5] Ngo, H. Q., Re, C., and Rudra, A. Skew strikes back: New developments in the theory of join algorithms, 2013.
[6] Kjolstad, F. Sparse Tensor Algebra Compilation. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, Feb 2020.
[7] Henry, R., Hsu, O., Yadav, R., Chou, S., Olukotun, K., Amarasinghe, S., and Kjolstad, F. Compilation of sparse array programming models. Proc. ACM Program. Lang. 5, OOPSLA (oct 2021).
[8] Olteanu, D., and Závodný, J. Factorised representations of query results: size bounds and readability. In Proceedings of the 15th International Conference on Database Theory (New York, NY, USA, 2012), ICDT ’12, Association for Computing Machinery, p. 285–298.
[9] Schleich, M., Olteanu, D., Khamis, M. A., Ngo, H. Q., and Nguyen, X. A layered aggregate engine for analytics workloads, 2019.
[10] Khamis, M. A., Curtin, R. R., Moseley, B., Ngo, H. Q., Nguyen, X., Olteanu, D., and Schleich, M. Functional aggregate queries with additive inequalities, 2020.
[11] Wang, Y. R., Willsey, M., and Suciu, D. Free join: Unifying worst-case optimal and traditional joins. Proc. ACM Manag. Data 1, 2 (jun 2023).
[12] Dewitt, D. J., Ghandeharizadeh, S., Schneider, D. A., Bricker, A., Hsiao, H. I., and Rasmussen, R. The gamma database machine project. IEEE Trans. on Knowl. and Data Eng. 2, 1 (mar 1990), 44–62.
Let $G$ be a graph with $m$ edges. For a node $i$, let $N(i)$ denote its neighbors, and let $A(i) \subseteq N(i)$ be the neighbors that come after $i$ when nodes are ordered by degree (with ties broken consistently). We claim that $|A(i) \cap A(j)| = O(\sqrt{m})$ for every pair of nodes $i, j$; this is the intersection bound underlying the $O(m^{3/2})$ running time for the triangle query.
Proof. First, define a node to be heavy if its degree is at least $\sqrt{m}$, and light otherwise. Since the degrees sum to $2m$, there are at most $2\sqrt{m}$ heavy nodes. Now consider any pair of nodes $i, j$:
- if $i$ is light, then $|A(i)| \leq |N(i)| < \sqrt{m}$, so $|A(i) \cap A(j)| < \sqrt{m}$;
- otherwise, if $i$ is heavy, then $|A(i)| \leq 2\sqrt{m}$, since every $v\in A(i)$ must also be heavy, so $|A(i) \cap A(j)| \leq 2\sqrt{m}$.
Together, it follows that $|A(i) \cap A(j)| \leq 2\sqrt{m}$ in all cases; summing over the at most $m$ edges $(i,j)$ for which the intersection is computed gives $O(m^{3/2})$ total work.