I am proposing that we deprecate the use of crossfilter and adopt a more SQL-oriented data layer.
The current `mapd-crossfilter` library falls short in three key areas:
- expressivity
- development velocity
- modularity
By adopting a more expressive, relational data layer, the full potential of the mapd-core backend can finally be utilized. In addition, it opens the door to new possibilities such as leveraging the Vega runtime.
Issues With Crossfilter
Expressivity
The only way to increase crossfilter's expressivity is to decrease its maintainability.
Crossfilter works for building very simple OLAP queries, but once you start building slightly more complicated projections and aggregations, the API falls apart.
For instance, when one projects (creates a "dimension") on a time field, this field is implicitly cast but is then "uncast" with regex when queries are constructed. We see the same pattern of using an awkward projection setter and then extracting field information with regex when creating our raster chart projections.
Since the `dimension` constructor is meant for simple projections, one has to essentially hack the library to express more complicated relations such as subqueries. Even using SQL expressions such as `COALESCE` and `CASE` in a clean, maintainable way is effectively off the table.
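To make the problem concrete, here is a minimal sketch of the pattern described above. The names and API shape are illustrative only, not the actual mapd-crossfilter interface: the point is that a constructor built for plain column projections forces richer SQL through an opaque string escape hatch.

```javascript
// Hypothetical sketch (illustrative names, not the real mapd-crossfilter
// API). A dimension constructor that only understands simple column
// projections cannot represent a CASE expression structurally:
function dimension(expr) {
  // the library treats the expression as an opaque string and can
  // neither validate nor decompose it later (hence the regex hacks)
  return { projectOn: () => "SELECT " + expr + " FROM flights" };
}

// A CASE expression has to be smuggled in as raw SQL text:
const dim = dimension(
  "CASE WHEN depdelay > 30 THEN 'late' ELSE 'on time' END"
);
console.log(dim.projectOn());
```

Because the expression is just a string, any later logic that needs to inspect or rewrite the projection is reduced to pattern matching on SQL text.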
Development Velocity
Developing the crossfilter library should be a simple matter of understanding SQL and a minimal API to build relations. It is not.
In order to actually make changes or fixes to the library, one has to first understand the peculiarities of the codebase itself.
For example, one must first absorb idiosyncratic implementation details of the codebase before one can reason about the SQL it ultimately generates.
Modularity
The coupling of concerns makes the library hard to use, understand, and develop. Separate concerns are not adequately modularized, which hinders testability and readability.
One example of this coupling is in the `group` query methods `group.all()`, `group.top()`, and `group.bottom()`, where query writing and query execution are bound together in a single method. Despite this coupling, we see that developers naturally want to decouple these pieces.
Another example is the library itself as an intermixing of concerns such as query writing, querying, caching, data processing, and result formatting.
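The decoupling developers reach for can be sketched as follows. The function names and connector interface here are hypothetical, not part of any existing library; the sketch only shows the separation of query construction (a pure, unit-testable function) from query execution (the one place that touches the backend).

```javascript
// Hypothetical sketch: query writing separated from querying, in
// contrast to group.top(), which does both in one method.

// Pure function: builds a SQL string, performs no I/O, trivially testable.
function writeTopQuery(table, measure, k) {
  return "SELECT " + measure + " FROM " + table +
         " ORDER BY " + measure + " DESC LIMIT " + k;
}

// The only function that actually talks to the backend; `connector`
// is an assumed object exposing a query(sql) method.
function runQuery(connector, sql) {
  return connector.query(sql);
}

// The writer can be exercised without a live connection:
const sql = writeTopQuery("flights", "avg_delay", 10);
console.log(sql);
```

With this split, caching, result formatting, and data processing can each wrap the execution step independently instead of being baked into every query method.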
Advantages of a SQL Data Layer
To replace crossfilter, I am proposing a data layer centered around SQL and relational operations. Two good starting points from which to model an API are the Vega Data Transform API and the Calcite Algebra API.
The key idea here is that since crossfilter is basically a way to build up a stack of relational transformations, one might as well orient the data-layer API to this abstraction level.
Expressivity
By using a data layer geared toward building pipelines of relational transformations, we can make full use of SQL, and therefore of `mapd-core`. All SQL expressions -- even subqueries -- come right out of the box.
Furthermore, if building these transformation stacks directly ever becomes cumbersome, the data layer would be expressive and flexible enough that one could easily build a higher-level abstraction (syntactic sugar) such as crossfilter on top of it.
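A minimal sketch of the idea, with an API shape that is purely hypothetical (loosely modeled on Vega data transforms): a pipeline is just an array of relational transform nodes, and compiling it to SQL is a small, mechanical function.

```javascript
// Hypothetical transform-pipeline-to-SQL compiler (illustrative only).
// Supports two node types for the sketch: "filter" and "aggregate".
function toSQL(source, transforms) {
  let filters = [];
  let groups = [];
  let projections = ["*"];
  for (const t of transforms) {
    if (t.type === "filter") {
      filters.push(t.expr);
    } else if (t.type === "aggregate") {
      groups = t.groupby;
      projections = t.groupby.concat(
        t.ops.map((op, i) => op + "(" + t.fields[i] + ") AS " + t.as[i])
      );
    }
  }
  let sql = "SELECT " + projections.join(", ") + " FROM " + source;
  if (filters.length) sql += " WHERE " + filters.join(" AND ");
  if (groups.length) sql += " GROUP BY " + groups.join(", ");
  return sql;
}

const sql = toSQL("flights", [
  { type: "filter", expr: "dest = 'SFO'" },
  { type: "aggregate", groupby: ["carrier"],
    ops: ["avg"], fields: ["depdelay"], as: ["avg_delay"] }
]);
console.log(sql);
```

Because the pipeline is plain data, a crossfilter-style convenience layer would only need to emit these nodes; the compilation and execution machinery underneath stays untouched.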
Development Speed
Freed from codebase-specific knowledge, developers would only need to understand how the transformation pipelines are represented and how that representation translates to SQL.
Cleaner Charting Code
With a cleaner, properly abstracted implementation, data-layer state setting and getting would be easier to reason about. As a result, the confusing charting-level code that currently works around these complications could be removed.
New Possibilities: Vega
A declarative, Vega-inspired data layer would make interoperability with the Vega runtime much more natural. One can imagine a Vega-oriented API where the user specifies SQL data transformations the same way visual encodings are specified: as declarative JSON specifications.
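Such a specification might look like the following. This is a hypothetical, Vega-flavored sketch, not a real Vega or mapd spec: the field names, transform types, and encoding keys are all assumptions, shown only to illustrate data transforms and visual encodings living in one declarative document.

```javascript
// Hypothetical Vega-flavored spec (illustrative only): SQL data
// transformations declared alongside visual encodings, rather than
// built up imperatively through crossfilter calls.
const spec = {
  data: [{
    name: "delays",
    source: "flights",  // assumed to name a mapd-core table
    transform: [
      { type: "filter", expr: "depdelay IS NOT NULL" },
      { type: "aggregate", groupby: ["carrier"],
        ops: ["avg"], fields: ["depdelay"], as: ["avg_delay"] }
    ]
  }],
  marks: [{
    type: "bar",
    from: { data: "delays" },
    encoding: { x: "carrier", y: "avg_delay" }
  }]
};

// The whole pipeline is inert JSON until a runtime interprets it:
console.log(JSON.stringify(spec.data[0].transform.map(t => t.type)));
```

Because the spec is plain data, it can be serialized, diffed, validated, and shipped between client and server, none of which is practical with crossfilter's closure-based state.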
While it would not be impossible to "refactor" crossfilter to be more SQL oriented, I would actually argue that, in the time it would take to do that, one could have already built a properly abstracted data-layer, extended it, and leveraged it to enhance the capabilities of the charting library.
In the end, it's a question of how to spend a finite amount of development time: on a sunk cost, or on a sustainable, forward-thinking solution.