mavoweb / treecle Goto Github PK

View Code? Open in Web Editor NEW

6.0 2.0 0.0 183 KB

WIP. A toolbox for hierarchical JS objects.

Home Page: https://treecle.mavo.io

License: MIT License

JavaScript 92.94% Nunjucks 4.14% CSS 0.22% HTML 2.70%

treecle's Introduction

🌳 Treecle

Need to handle tree-like objects? We’ve got you covered! Work in progress, try at your own risk.

Features:

Tree-shakable
No dependencies
TypeScript support (WIP)

Utility functions for:

Traversal
Transformations
Search

Installation

npm install treecle

import * as treecle from "https://treecle.mavo.io/dist/treecle.js";

Or import functions individually from src.

Usage

For details on what functions exist and their signatures, see the API docs.

This page exposes a global treecle object for experimentation. Open the console and try things out!

Data shape

Treecle is designed to work with tree-like objects, where nodes are objects and edges are properties. Arrays are used to represent multiple children of a node. Arrays of arrays have no meaning in such a structure.

However, these constraints are not enforced, and whenever it would not be costly in terms of performance, treecle does try to handle data gracefully.

Parent pointers

Certain methods like closest() and children.replace() depend on parent pointers, i.e. being able to get from a node to its parent. When Treecle traverses an object, it also stores a path from the object to its parent. To avoid mutating the object, this is stored in a private WeakMap, but you can access it via parents.path(node). To ensure every node in a (sub)tree has a parent pointer, use parents.update(root).

Getting a node’s children

By default, Treecle assumes that every property that points to an object is a parent-child relationship. You can customize this by importing the config object and setting config.getChildProperties to a function that returns the child properties of a node as an array of strings.

You can also override config.isNode(node) to be more specific about what should be considered a node. By default it considers all plain objects (i.e. not instances of a class other than Object) are cobsidered nodes.

treecle's People

Contributors

Stargazers

Watchers

treecle's Issues

Use cases

Edit: We are now maintaining use cases in this doc

Here we'll collect cases where libraries are writing their own helpers to do the kinds of things Treecle helps them do.

hTest’s object helpers https://github.com/LeaVerou/htest/blob/main/src/objects.js
- children(): https://github.com/LeaVerou/htest/blob/main/src/objects.js#L8
- walk(): https://github.com/LeaVerou/htest/blob/main/src/objects.js#L54
- reduce(): https://github.com/LeaVerou/htest/blob/main/src/objects.js#L93
- clone(): https://github.com/LeaVerou/htest/blob/main/src/objects.js#L108
- join(): https://github.com/LeaVerou/htest/blob/main/src/objects.js#L163
- stringify(): https://github.com/LeaVerou/htest/blob/main/src/objects.js#L209
- check.deep(): https://github.com/LeaVerou/htest/blob/main/src/check.js#L43
- check.equals(): https://github.com/LeaVerou/htest/blob/main/src/check.js#L91
Vastly
Mavo's Mavo.Node traversal methods
- walk(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L164
- walkUp(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L188
- propagate(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L240
- getClosestItem(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L387
- getPath(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L395
- pathFrom(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L400
- getDescendant(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L409
- getCousin(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L413
- contains(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L462
- getClosest(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L545
- getClosestItem(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L555
- children(): https://github.com/mavoweb/mavo/blob/main/src/node.js#L568
jiff (JSON patch implementation)
- equals(): https://github.com/cujojs/jiff/blob/master/lib/deepEquals.js#L9
- clone(): https://github.com/cujojs/jiff/blob/master/lib/clone.js#L12
JSON PATCH (JSON patch implementation)
- getPath(): https://github.com/Starcounter-Jack/JSON-Patch/blob/master/src/helpers.ts#L97
Bliss
- extend(): https://github.com/LeaVerou/bliss/blob/v2/src/extend.js (Like Object.assign() but copies accessors instead of executing them)
- each(): https://github.com/LeaVerou/bliss/blob/v2/src/each.js (Like our walk())

How to allow different configurations to co-exist?

Currently, which properties are child properties and what values are nodes are functions that live in src/config.js. The workflow being that authors would import that and override its properties.

The issue with this workflow is that currently you can only have one configuration. Given how generic Treecle is, it makes sense to allow different configurations to co-exist.

There are several solutions to this, each with its own pros and cons:

A Treecle class. This is the most obvious option for facilitating different configurations, however it will be a pain to facilitate tree-shaking with it (we'd need to support both the separate functions and the class, and write the function code in such a way that it works for both (e.g. let myConfig = this.config ?? config).
Add an optional config argument or option to functions that allows overriding the global config (either entirely or for specific properties). The downside is inconsistency, since functions have different signatures and we cannot have this in the same place for every single one.
A combination of 1 and 2, where people can use an object with a config property as the context of a function to provide a custom config. I.e.

let c = { context: {...}};
walk.call(c, tree, callback);

This allows us to essentially accept it as an argument without it affecting the function signatures.
And bonus, if we decide to do 1, the code is already there.
The downside is verbosity (4 extra characters per call) and …weirdness.

typedoc improvements

Following the revelation in #21 that we can actually use a generic Node type in our docs, for example, @param {Node}, we should go through the codebase and change any instances of object to Node.

In addition, we can get rid of the any return type by specifying the return type of the function in the @returns, e.g. @returns {Node}.

Can we accommodate trees defined in a "bottom-up" way? (nodes pointing to parents)

I recently came across some use cases of trees where familial relationships were defined in the inverse way of what we assume: instead of nodes pointing to their children and assuming the parents can be derived, the nodes were pointing to their parents, and the children could be derived. I cannot remember what they were, but will update this comment when I do.

How could we accommodate such use cases?

At the very least, we should be able to accommodate use cases where the parent is part of the data, and does not need to be hidden away in a weakmap.

`children` does not check whether returned nodes are nodes

The children function needs to use the isNode function from the context to check whether nodes are actually nodes

Supporting different schemas around getting child nodes

Problem

In #16 we converted our hardcoded { node, property, index } parent pointer structure to a more flexible { node, path[] } schema. However, we did not change our internals much to really support arbitrary paths. We still assume children are found by:

If no getChildProperties() is specified, we follow all properties on a node, and call isNode() on them to see if they are child nodes. This only goes 1-2 levels deep: values that are obtained by following one property, or array values if that one property is an array.
If getChildProperties() is specified, we follow these properties and do not call isNode(), we just filter by existence.

Currently, this assumes a specific structure that may be overfit to Treecle’s AST beginnings.
I suspect in the wild there are two common ways to represent tree data structures:

Follow specific properties that always point to children (the AST case)
Follow a property that always points to children (children for Mavo Nodes, childNodes for DOM Nodes)

Currently, the API only supports providing a function that takes a node and returns a list of properties. As an example, this is how this setting is specified in vastly:

export const properties = {
	CallExpression: ["arguments", "callee"],
	BinaryExpression: ["left", "right"],
	UnaryExpression: ["argument"],
	ArrayExpression: ["elements"],
	ConditionalExpression: ["test", "consequent", "alternate"],
	MemberExpression: ["object", "property"],
	Compound: ["body"],
};

defaults.getChildProperties = (node) => {
	return properties[node.type] ?? [];
};

This appears to be overfit to 1. Yet, I suspect 2 may even be more common.
How do we allow both to be specified without making either more complicated due to the existence of the other?

Ideas

`getChildProperties()` to `getChildPaths()`, handle both `string[][]` and `string[]`?

We don't want to complicate 1 to cater to 2, but what if we could do both? If the function returns an array of strings, they are single properties. If it returns an array of arrays, they are paths.

The problem is, we don’t necessarily have specific child properties in 2, often once you get from the node to its children, everything in that data structure is a child.

Wildcards? JSON Paths?

Basically, we want a way to say children/* for these cases. What if we handle / and * in properties specially?

But then we’re basically creating a path microsyntax, and restricting the potential syntax of actual real properties accordingly.
OTOH, that's basically JSON Path syntax, which is quite well established.

The advantage of something like this is that we can still handle properties like Vastly’s in exactly the same way.

Not a huge fan of any of these ideas, so I’ll continue brainstorming.

Config option for how to get parents

There are many tree implementations that are already doing their own parent pointers. For example, in DOM nodes, you can always get to the parent via node.parentNode. We should have a config option for this so we don't end up doing duplicate work and introducing unnecessary error conditions.

This could be a config option (which also becomes a Treecle instance property). For treecle to properly use this, it needs to know both how to read parents, and set parents.

Strawman: A parent property whose value is either a string (read/write property name) or a { get: function(node), set: function(node, otherNode) } for more elaborate schemas.

For example, for the DOM, this could look like this:

parent: {
	get: n => n.parentNode,
	set: (node, newParent) => newParent.append(node);
}

I'm not convinced this is a good design: what if we need to specify where exactly the node gets appended? Not to mention this affects other properties too.

Perhaps it should be possible to just say parent: "parentNode" and just error if we try to use any properties that require writing parents. Or even have an option to turn that off and fail silently (since the tree may automatically handle that).

People should be able to use treecle for read-only stuff like walk() or closest() or getting complex familial relationships without having to also use its parent mgmt system (or have to implement complex parent setting helpers they don't need just to provide a value to parent.set).

Get and set object values by path

Background

This came up in #18 as an implementation detail, but I think it's incredibly useful in its own right.

Sample use cases and prior art:

Mavo.subset()
Lodash’s set() and unset() functions (and their various permutations)
Underscore’s get()
Bliss’s $.value()

I pushed two util functions a few days ago (getByPath() and setByPath() but I think we should clean up the code, make it more flexible, and expose at the top level.

Some design decisions, requirements, questions below.

Signature

Function names

I’m leaning towards the simple get() and set() that are already established in prior work. Is there anything else we may want to save the names get() and set() for?

Arguments

Strawman:

get(obj, path [, options])
set(obj, path, value [, options])

Might be worth to later have overloads that allow specifying value, path as part of the options object, but don't see compelling motivation to include that in the MVP.

Should there be a way to set value via a function that takes the path and current value as parameters? Or is it encroaching too much into transform() territory at this point?

Path structure

Data type

Paths should be provided as arrays, we don’t want to deal with string parsing and trying to distinguish paths from property names. Strings/numbers should be accepted as well, but they’re just a path of length 1.

We may want to also support objects to provide additional metadata (see below).

Predicates

It seems obvious that entirely literal paths will not suffice (at the very least we need wildcards).
Should we just use JSON Path? Hell no! First it's overkill for these use cases, and second once you go beyond literal property names + wildcards, the syntax becomes cryptic AF. And despite its complexity, there are some pretty common use cases like case insensitivity it doesn’t seem to support.

So since we can’t just use JSON Path, what do we use? What predicates do we want to support? Examples:

Wildcards (any property at this level)
Case-insensitive property names?
Alternatives? (e.g. "foo or bar")
Ranges of numbers? (e.g. "top 3 items")
Property queries (e.g. "get items with id=foo") — essentially the path version of CSS :has(), so we'd probably want to frame it that way, i.e. "children that match this path", so I’ll call them child queries from now on
Property names that start/end with a given string?
Property name regex?

We generally want to keep the MVP simple until use cases emerge, but it helps to take these things into account at the design stage so that the API has room to expand.

As mentioned above, wildcards are certainly needed.
Case-insensitive matching might be worth to include in the MVP, since at least the Mavo use cases need it.
The rest we can probably ship without and add as needed.

Syntax for predicates

So that begs the question, how do we express these predicates?

Special syntax. This works decently for some of them:

Wildcards: *
Alternatives: foo|bar
Number ranges: 0-3 or 0 .. 3
Property queries: id=foo

However, but there is no obvious fit for any of the others. Also, inventing a new microsyntax has several drawbacks:

The larger the syntax space for special syntax, the more challenging to distinguish it from literal property names. How do you do the escaping? Backslashes? How do you distinguish a literal "\*" property then? More backslashes? It's backslashes all the way down!
Devs would need to use string concatenation when the criteria is variable, which is awkward. E.g. Mavo paths support property queries like id=foo and I now think that's a terrible idea and we dropped that kind of support from get() (it's now only supported in mv-path, which being an HTML attribute it only takes strings so it can't take anything more structured).
It forces you to come up with syntax for things where there is no obvious syntax to use, resulting in a cryptic language.

So instead, I think we should go with an approach of strings for literals + wildcards as the only exception, since these are very common and have a very obvious syntax. Anything else would require making that part of the path an object literal.

This means even if we only ship wildcard as the only predicate, we need to support object literals at least to escape that and specify that something is a literal property name. If we have that escape hatch, we could in the future explore more options to add syntax for certain things where a readable syntactic option is obvious, as a shortcut (e.g. "foo|bar" for alternatives)

Predicate schema

Strawman for all of the above predicates (even though we don't plan to implement them all):

Path: string | (string | PathSegment)[]
PathSegment: Object with keys (all optional):
- name: Literal property name (string) but maybe could also be a RegExp?
- ignoreCase (boolean)
- range: Numerical range (number[2] or {from, to} or even {gt, gte, lt, lte}?)
- or: Alternatives ((string | PathSegment[]))
- has: Return only children for which this would be non-empty (Path)
- startsWith
- endsWith
- regexp

Notes:

ignoreCase is special. All other criteria are independent, but ignoreCase affects how other criteria work, i.e. is a modifier rather than a predicate:
- name: from strict equality to equality after .toLowerCase()
- regexp: Adds the i flag if not present
- startsWith/endsWith: applies .toLowerCase() before matching
- or and has: inherits to any path segments that don't have their own ignoreCase
- Are there any other modifiers that we may conceivably want to support in the future (so we can take them into account in the design)?
Multiple independent criteria can be specified, and the result is the intersection. This way, since we already have or complex logical criteria can be created by just nesting these. 😁
Should we also handle arrays as sugar for {or: array}?

How do predicates work with `set()`?

Setting is only an issue for the last part of the path — until then it's still a getting task.

So if the last part of the path is a…

Wildcard: Set every property that exists?
Alternatives: Set every property among the alternatives or only those that already exist on the object?
Numerical ranges: set every number in the range?
Child queries: 🤷🏽‍♀️ Replace these objects with the value?
Regexps or Starts/ends with: 🤷🏽‍♀️🤷🏽‍♀️🤷🏽‍♀️

Return value

One ore more values? For static paths, there can only be a single return value. However, when predicates are involved, there could be multiple, and it's impossible to tell whether one or more values are expected.
Array or object subset?: when returning multiple values, how much of the original object structure do we want to preserve? There are use cases for getting a completely flat array, and use cases for subsetting the object, i.e. using the path as an allowlist of properties while preserving the original object structure.

Following the design principle that function return values should not vary wildly based on the options passed, perhaps we actually need more than just a single get() function:

get(): Array of all values
first(): First value only
subset(): Subset of object

Or perhaps get() for one value and getAll() for multiple?

Options for the whole path

These will be passed to the functions as part of the options dictionary.

Case insensitive matching (when we want it for the whole path)
set() only: What object to create when part of the path doesn’t exist? {} by default. Might be useful to take a function to customize based on the path.
We definitely don't want to throw if the path doesn't exist, since avoiding that is one of the primary reason to use such a helper. Is there value in having an opt-in to stricter handling?

`walk` bug

Currently, there is a bug in walk where it won't return a value unless it's at the root. See here; it will only traverse the descendants but never return anything

Node type flexibility & TypeScript

One important thing we need to work out before converting this library to TS is how flexible node types are. If they're too flexible, TypeScript becomes useless. For example, we could choose to enforce that nodes have to be objects, or we could allow them to be custom classes. If we allow them to be custom classes and objects, TS isn't going to be much help.

The main question is: if a developer has specific-enough needs to want to build their own Node class, would they even be using Treecle? Are there any common use cases where someone with a non-object type of node would want to use Treecle?

improve documentation

Improve documentation by:

adding more detail to all specifications in all files in Treecle
Extend README (and therefore the homepage of https://treecle.mavo.io) to include examples and tutorials

Function for getting arbitrary familial relationships

Currently we have methods for getting parents, children, ancestors (walkUp(), closest()), descendants (find(), walk()).
While the use cases for other relationships are more niche, they do exist (e.g. in Mavo we have things like getCousin() because we need it for finding the same note across repetitions of its parent item).

We definitely don't want to add functions like getCousin()! But I do wonder if there is something more general that might make sense here.

Make all functions top level?

I was thinking, it may make for a more predictable API to avoid namespaces and just have every function be top-level, e.g. setParent() instead of parents.set(). That works better when you do import * from treecle . What do you think @adamjanicki2?

walk up function

Another useful function is one that walks up a tree to the root and executes a callback on each node it passes over

Optionally binding `Treecle` instances to a specific root node

While in #1 we envisioned the Treecle class as a way to reuse a Treecle config, I think it may be useful to allow optionally binding it to a specific tree too. There are use cases like Vastly's where Treecle will be reused with the same settings across multiple trees, but also use cases where Treecle will be used on the same tree repeatedly (e.g. an app that operates a single data tree).

This has several ergonomics benefits over simply passing nodes of the same tree to a Treecle instance repeatedly:

Being able to omit arguments when the node to be passed is the tree root (e.g. for walk(), map(), transform(), find())
For all methods that require parent pointers, we can simply add them to the tree if they don't exist, rather than having to produce an error.

So I’m thinking we ideally want to allow the following:

new Treecle(root, config);
new Treecle(root, otherTreecle); // copy config from another treecle and use it on a new tree
new Treecle(otherTreecle); // pass in both root and config from another treecle, essentially clone
new Treecle(root); // use default config

The latter introduces a bit of a catch-22: how can we distinguish between new Treecle(config) and new Treecle(root)?
It's not impossible, just adds some complication. E.g. I can think of two ways to do it:

only copy any properties from the argument that are present in defaults instead of indiscriminately copying them all via Object.assign() (which is probably good practice in general). The downside is that in a situation where nodes can have properties with the same name as config settings, this will break.
Another way would be to call defaults.isNode() on the argument, and edit the default isNode() to never return true for config objects. Can't think of any way to reliably tell between nodes and config objects in the general case without making the same assumption as above though (that there are no name conflicts).

I'm inclined to simply postpone new Treecle(root) since there's no robust way to implement it, and it adds complexity for a rather small benefit to authors.

Branching by type in a generic way

In the initial Treecle port, I removed all the functionality that supports type => something object literals and instead it only supports functions for these things. I instead added code to Vastly to transform object literals into functions.

But it just dawned on me: this would be very useful in Treecle too, it's just that we can't depend on node.type returning a useful type. But we could have a config.getType() method, which by default returns the object [[ Class ]], but can be overridden (e.g. Vastly would do config.getType = n => n.type).

Any objections @adamjanicki2 ?

Following edges in a generic way

One issue in the current code is the fact that we are assuming in the ported code that properties can be accessed as edges to lead to children (i.e. parents module is no longer correct because it assumes paths are properties and indices which isn't necessarily true anymore), but what we need is something in the config to define how to follow edges. If an author has their own custom node classes, then using the property/index parent paths that we used in vastly are no longer correct.

We need some type of config that authors set that tell us how to follow an edge from a parent to a child. By default, it should be some way of accessing properties, i.e.

function followEdge(node, ...what else goes here?) {
    return node[prop];
}

but an author could make a custom one like

function followEdge(node, ...what else goes here?) {
    return node.children[prop]; // or however their node class stores children
}

It's pretty tricky; I can't figure out immediately what the function needs to look like exactly, but this infrastructure will be needed before we do anything else in the library. What do you think @LeaVerou?

Context is not passed around

Just realized — the context is not actually passed around on function calls our functions are making, which means the Treecle class won't work properly.

General `findPath` function

One of our use cases is a general findPath function, so I wanted to make an issue to discuss design before diving into implementing it. My thoughts are it should be a function like follows

function findPath(start, end) { ... }

which returns a path of type Array<string | number> to get from start to end. The function should attempt to use parent pointers to climb from end to start, but do a BFS/DFS if no parents are set. Before implementing this, we will need to resolve #26 because in a situation like this where parents aren't necessary but helpful, it would be nice to do a console.warn about it (see end of #26)

what to do when parent pointers aren't set

There are several functions that require parent pointers to be set to function properly. As of now, we handle cases where parent pointers are not set in an ad-hoc way, particularly, either by failing silently or by throwing an error, but we need to standardize the procedure for a parent not being set across the library.

There are a few options of how to handle cases like these:

Throw an error: the benefits of throwing errors is two-fold: it let's the author know quickly that they are not properly using Treecle, and it also lowers our codebase complexity as we don't have to try and consider special cases where parent pointers are not set.
Fail silently: another option is to have a function that requires parent pointers to just return and not do anything, which I see as harmful to authors since it will be harder to debug.
Console warning: something we've also discussed in the past is making a custom warn function which runs console.warn in dev environments, which could warn authors that parent pointers are not set without throwing an error.
Try to set parent pointers ourself: this option is interesting because on one hand, it would be ideal since the authors won't have to worry about error handling or forgetting to set parent pointers. The only issue with this is we do not possess the ability to set relevant parent pointers from a given function if we do not have the root of the tree, which means this wouldn't be a standard across the whole library.

My personal preference would be either option 1 or 3, and more specifically, building an internal function called checkParentPointers(node) which would standardize what would happen when no parent pointers are set. Any thoughts here @LeaVerou ?

Paths as arrays

I think it would be a good idea to convert our notion of paths to arrays. Currently, we implement an object containing a property and index, however, it would be a good idea to convert our codebase to using arrays, as this would support a more general use case of arbitrary long paths. For example, we might want to implement a general path function, which finds a path of arbitrary length, in which case, we would need an array.