syntax-tree / unist

Universal Syntax Tree used by @unifiedjs

unist's Issues

Unist vs Loyc trees

How would you say the purpose of Unist is different from Loyc trees?

The primary use of Loyc trees is to represent the core elements of source code - identifiers, literals, and "calls" - and from looking at the readme, it doesn't look like Unist has the same goals, but I'm not quite sure what its goals are.

What is a `column` or `offset`, exactly?

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Problem

…or are they supposed to be vague and should we reflect that?

For ASCII characters, such info is pretty clear.
But once you get to other unicode characters, taking emoji as a well-known
example, it gets complex.
And, unist is designed for other programming languages too, practically
now with Rust, we may need to choose how to represent this across languages.

There are two main ways that positional info like this is used:

to access the source string: if there is something appearing from
1000 to 1002, a user will want to do doc.slice(1000, 1002)
to access that thing.
In Rust, 1000 and 1002, in an example of &doc[1000..1002],
will yield a different result
to point a code editor to such a thing, for warnings and such, so that
the squiggly lines, or “jump to”, are correct

To make the first easy and fast, it makes a lot of sense to use the positions
that are based on how the host language stores strings.
But that means markdown-rs and micromark will yield different results.
Quickly checking VS Code, injecting 👨‍👩‍👧‍👦 into the document, seems to increment
by 7, which equals [...'👨‍👩‍👧‍👦'].length.
So that’s different from what markdown-rs and micromark use.
We can also be vague about this here in the spec, and replace the section added
in 49032b9 to reflect that.
Or we could adhere to that?
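To make the difference concrete, here is a small sketch that counts the same string in three units. The string is built from escapes so the zero-width joiners are explicit, and the variable names are illustrative only.

```javascript
// The family emoji: four emoji code points joined by three zero-width joiners.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'

// UTF-16 code units: what JavaScript strings index by.
const utf16Units = family.length

// Unicode code points: what the spread operator iterates over.
const codePoints = [...family].length

// UTF-8 bytes: what a Rust string slice such as `&doc[a..b]` indexes by.
const utf8Bytes = new TextEncoder().encode(family).length

console.log({utf16Units, codePoints, utf8Bytes})
// { utf16Units: 11, codePoints: 7, utf8Bytes: 25 }
```

So the same character moves a position forward by 11, 7, or 25 depending on the unit chosen.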

Solution

a) Make vague
b) make everything consistent

Alternatives

Is it possible to insert or remove a node in an AST?

Subject of the feature

Insert or remove a node in an AST.

Problem

I could not find a method on Node to modify its text or its position.

Expected behaviour

Add an API for this to Node.

Alternatives

jQuery can modify the positions and attributes of DOM elements.
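Since a tree here is made of plain objects, one way to insert or remove a node without any dedicated API is to edit the parent's children array directly. A minimal sketch, where the tree and the helper names insertAt and removeAt are made up for illustration:

```javascript
// A tiny hand-written tree; the node types are illustrative.
const tree = {
  type: 'root',
  children: [
    {type: 'text', value: 'a'},
    {type: 'text', value: 'c'}
  ]
}

// Insert `node` into `parent.children` at `index`.
function insertAt(parent, index, node) {
  parent.children.splice(index, 0, node)
}

// Remove and return the child at `index`.
function removeAt(parent, index) {
  return parent.children.splice(index, 1)[0]
}

insertAt(tree, 1, {type: 'text', value: 'b'})
const removed = removeAt(tree, 0)

console.log(tree.children.map((child) => child.value)) // [ 'b', 'c' ]
console.log(removed.value) // a
```

Note that this sketch does not touch positional info: any position fields on the surrounding nodes still describe the original source.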

Use proper naming for position/location

Currently, location and position are used interchangeably, although they differ.
This confusion derives from node.position, which holds a location, and a location has start and end set to a position.

I propose:

position for node.position (aka “location”, “positional info”)
point for node.position.start, node.position.end (as it refers to a point in a file)
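With the proposed names, a node could be pictured like this. The values are made up for a six-character text node on the first line of a file.

```javascript
// A point: one place in a file.
const start = {line: 1, column: 1, offset: 0}
const end = {line: 1, column: 7, offset: 6}

// A position: the span between two points, stored on `node.position`.
const node = {
  type: 'text',
  value: 'Hello!',
  position: {start, end}
}

console.log(node.position.end.offset - node.position.start.offset) // 6
```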

Stringifiability of other Node properties

Do they have to be JSON-stringifiable as well? (I think they should.)

If they should be, is there a way to specify this?

`line`, `column`, and `offset` in `point` underspecified

According to docs:

The end field of Position represents the place of the first character after the parsed source region.

If the last parsed character is a newline, does end have a column of 0 and a line of current line + 1? If we are at the end of the source, does the end position represent an imaginary character after the end of the document?
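One possible reading can be sketched in code. The helper below is hypothetical and assumes `\n` line endings, 1-indexed lines and columns, and a 0-indexed offset counted in UTF-16 code units. Under those assumptions, the point after a trailing newline lands on column 1 of the next line rather than column 0, and the end of the source is a point one past the last character. This is an interpretation, not something the quoted sentence settles.

```javascript
// Compute a point for an offset by counting the newlines before it.
function pointAt(source, offset) {
  let line = 1
  let lineStart = 0
  for (let index = 0; index < offset; index++) {
    if (source[index] === '\n') {
      line++
      lineStart = index + 1
    }
  }
  return {line, column: offset - lineStart + 1, offset}
}

const source = 'ab\n'
console.log(pointAt(source, source.length))
// { line: 2, column: 1, offset: 3 }
```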

More strict notion of JSON-stringifiability for all properties

Currently the only requirement for the data property is stringifiability, which is defined as follows:

Its only limitation being that each property should by stringifyable: not throw when passed to JSON.stringify().

I think this is too broad, in particular because JSON.stringify won't usually throw:

> JSON.stringify({ foo: function () { console.log('foo') }})
'{}'
> JSON.stringify({ foo: undefined })
'{}'

I was worried about this when hacking on unist-builder-blueprint: there is no reliable way to compile functions to source code (with closures and stuff like heap references), so a more strict guarantee like "data and JSON.parse(JSON.stringify(data)) should be equivalent and interchangeable" or even deepEqual(JSON.parse(JSON.stringify(data)), data) would be helpful.
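The stricter guarantee could be checked along these lines. This is only a sketch: jsonEqual and survivesRoundTrip are hypothetical names, and the comparison covers plain objects, arrays, and primitives only.

```javascript
// Structural equality for plain JSON-like data.
function jsonEqual(a, b) {
  if (a === b) return true
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false
  const keysA = Object.keys(a)
  const keysB = Object.keys(b)
  if (keysA.length !== keysB.length) return false
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && jsonEqual(a[key], b[key])
  )
}

// True when `data` survives a JSON round trip unchanged.
function survivesRoundTrip(data) {
  try {
    return jsonEqual(JSON.parse(JSON.stringify(data)), data)
  } catch {
    return false // for example circular references or a top-level `undefined`
  }
}

console.log(survivesRoundTrip({id: 1, tags: ['a', 'b']})) // true
console.log(survivesRoundTrip({foo: undefined})) // false
console.log(survivesRoundTrip({foo() {}})) // false
```

Both cases from the session above are rejected here, even though JSON.stringify does not throw on either.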

Error at C:/web/node_modules/@types/unist/index.d.ts:92:58: ';' expected.

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

@types/unist & 2.0.6

Link to runnable example

No response

Steps to reproduce

angular - 4.3.6
node - 10.15.0
react-markdown - 4.0.3
With the versions above, building the Angular app fails with the error quoted in the title.

Expected behavior

It should build the app.

Actual behavior

Because of the @types/unist package issue, we can't build the Angular app.

Runtime

Other (please specify in steps to reproduce)

Package manager

yarn v1

OS

Windows, Linux

Build and bundle tools

Webpack

A swiss army knife for Unist-util-*.

Just like lodash, I need a single object that has every unist-util-* method bound to it.
It might be bad practice for production code, but for test code I think it would be fine.

I'd also need the same for hast-util-* and mdast-util-*...

What do you think?

adast - asciidoc syntax tree

Came across unist & remark via mdast, and I was wondering what would be involved in creating an entirely new flavour of syntax tree within this family?

More concretely, I'm interested in creating adast, an asciidoc syntax tree. While the input format is rather different from markdown, a huge chunk of the implementation could potentially be the same as, or similar to, that of mdast, as the output is going to be quite similar.

Thoughts?

Unist Test Suite

If a test suite existed, it would help implementors write parsers.

Is there any chance of creating a test suite for unist?

What I have in mind:

var unist = require("unist-test");
var AST = { ... };
unist.runTests(AST); // if invalid, throw error.
unist.isUnist(AST);// true or false
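As a rough idea of what the isUnist half might look like, here is a minimal sketch. It assumes the only rules are that a node is an object with a non-empty string type and that children, when present, is an array of nodes. The unist-test package above is imagined, and so is this function.

```javascript
// Check the basic shape of a node and, recursively, of its children.
function isUnist(node) {
  if (node === null || typeof node !== 'object' || Array.isArray(node)) return false
  if (typeof node.type !== 'string' || node.type === '') return false
  if ('children' in node) {
    return Array.isArray(node.children) && node.children.every((child) => isUnist(child))
  }
  return true
}

console.log(isUnist({type: 'root', children: [{type: 'text', value: 'x'}]})) // true
console.log(isUnist({children: []})) // false
console.log(isUnist({type: 'root', children: [null]})) // false
```

A runTests counterpart could run the same checks but throw with a message naming the offending node instead of returning false.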

Add tree traversal methods to glossary

Subject of the feature

There are several ways to traverse a tree, typically preorder, postorder, but also inorder and breadth-first / level-order. See also Wikipedia.
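The orders can be compared on a small tree. The sketch below uses made-up single-letter node types so the visiting order is easy to read. Inorder is left out because it is defined in terms of left and right children, which a node with an arbitrary number of children does not have.

```javascript
const tree = {
  type: 'A',
  children: [
    {type: 'B', children: [{type: 'D'}, {type: 'E'}]},
    {type: 'C'}
  ]
}

// Preorder: visit a node before its children.
function preorder(node, visit) {
  visit(node)
  for (const child of node.children || []) preorder(child, visit)
}

// Postorder: visit a node after its children.
function postorder(node, visit) {
  for (const child of node.children || []) postorder(child, visit)
  visit(node)
}

// Breadth-first (level-order): visit level by level using a queue.
function breadthFirst(root, visit) {
  const queue = [root]
  while (queue.length > 0) {
    const node = queue.shift()
    visit(node)
    queue.push(...(node.children || []))
  }
}

// Collect the node types in the order a traversal visits them.
function order(traverse) {
  const types = []
  traverse(tree, (node) => types.push(node.type))
  return types.join('')
}

console.log(order(preorder)) // ABDEC
console.log(order(postorder)) // DEBCA
console.log(order(breadthFirst)) // ABCDE
```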

Problem

Projects working with unist typically do this, but either a) don’t document how they do it, or b) document this themselves. Both lead to lacking, incorrect, or incomplete docs.

Expected behaviour

These (most common) types should be documented in the glossary. Maybe with a diagram.

Alternatives

The alternatives (status quo) are shown in problem above.

Would unist make a good programming language AST format?

Hi again @wooorm (and other AST enthusiasts),

I develop an experimental programming language called eslisp, which is basically a JavaScript syntax optimised for code-modifying macros that let users add language features. It might be helpful to think of it as a programming language processing tool.

Eslisp's current AST representation contains exactly the same information as Unist, right down to location data, but currently organised differently. I was writing my own tools for reading and modifying it, then realised I'm basically duplicating Unist utilities.

I thought I'd open a dialogue before I start "hammering on screws" and making it a dependency. Have you considered programming languages as a Unist use-case? Is this a sane thing to be doing, long-term?

Add `enter` and `exit` terms to glossary

Subject of the feature

Other than the different ways of traversing a tree, as raised in GH-22 and resolved in GH-23, the terms enter and exit are also often used when discussing tree traversal. These should be added to the glossary as well.

Problem

Some form of state is often mutated when entering or exiting a node by unist utilities. Describing these terms here means they can be linked to from other docs to clarify them in a single place.
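As an illustration of the kind of state involved, here is a minimal sketch in which a depth counter is increased on enter and decreased on exit. The walk function and its handlers object are hypothetical and do not mirror any particular utility.

```javascript
// Call `enter` before a node's children are walked and `exit` after.
function walk(node, handlers) {
  handlers.enter(node)
  for (const child of node.children || []) walk(child, handlers)
  handlers.exit(node)
}

const tree = {
  type: 'root',
  children: [{type: 'paragraph', children: [{type: 'text', value: 'hi'}]}]
}

let depth = 0
const lines = []

walk(tree, {
  enter(node) {
    lines.push('  '.repeat(depth) + node.type)
    depth++ // one level deeper while inside this node
  },
  exit() {
    depth-- // back to the parent's level
  }
})

console.log(lines.join('\n'))
```

This prints the node types indented by depth, and depth is back at 0 once the root is exited.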

Expected behaviour

Both should be added to the glossary.

Alternatives

The alternatives (status quo) are shown in problem above.

Using hast instead of mdast to describe markdown documents

I like the Unist ecosystem and its AST-oriented approach a lot.

However, I do not understand the motivation behind creating two widely different ASTs for markdown and HTML documents. I would expect HAST to include MDAST (as shown below) and therefore an HTML parser to produce an AST that a markdown compiler would understand, without requiring a transformation step (e.g., with rehype-remark). Conversely, a markdown parser should be able to produce an AST that an HTML compiler would understand.

In other words, would it make sense to build a rehype-stringify-markdown and a rehype-parse-markdown and ignore the MDAST? Is there something that would prevent that?

For example, this MDAST node:

{
  type: "paragraph",
  children: [{
    type: "text",
    value: "Hello!"
  }]
}

…contains at most as much information as this HAST node:

{
  type: "element",
  tagName: "p",
  properties: {},
  children: [{
    type: "text",
    value: "Hello!"
  }]
}
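The correspondence between these two nodes can be written as a small function. This is a hypothetical helper covering only the paragraph case shown here. It is not how rehype-remark or any existing package works.

```javascript
// Hypothetical: map an mdast-style paragraph to a hast-style `p` element.
function paragraphToElement(paragraph) {
  return {
    type: 'element',
    tagName: 'p',
    properties: {},
    // Text nodes have the same shape in both examples, so copy them as-is.
    children: paragraph.children.map((child) => ({...child}))
  }
}

const paragraph = {
  type: 'paragraph',
  children: [{type: 'text', value: 'Hello!'}]
}

console.log(JSON.stringify(paragraphToElement(paragraph)))
// {"type":"element","tagName":"p","properties":{},"children":[{"type":"text","value":"Hello!"}]}
```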

What's the unit of character in Point

The Point section says:

The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.

What's the unit of 'character' and 'column'? Is it a UTF-16 code unit (as used in JavaScript) or a Unicode code point? See Wikipedia:

[UTF-16] encoding is variable-length, as code points are encoded with one or two 16-bit code units

I tried using remark to parse this markdown piece:

a𠮷b

Here, 𠮷 is one Unicode code point that cannot be encoded in a single UTF-16 code unit. Because JavaScript's String uses UTF-16:

'a𠮷b'.length
//=> 4

But in other languages like Python:

len('a𠮷b')
#=> 3

As for remark, the above markdown piece is parsed into:

{
  "type": "text",
  "value": "a𠮷b",
  "position": {
    "start": {
      "line": 1,
      "column": 1,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 5,
      "offset": 4
    },
    "indent": []
  }
}

The column of end is 5 and the offset of end is 4, which means remark treats this text as four 'chars' long, measured in UTF-16 code units.

So what's the unit of character? It's confusing.
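If offsets are in UTF-16 code units, as the output above suggests, a consumer that needs code point offsets could convert them. A minimal sketch, where toCodePointOffset is a hypothetical helper and the character is written as an escape:

```javascript
// Convert a UTF-16 code unit offset into a code point offset.
function toCodePointOffset(source, utf16Offset) {
  // Spreading a string iterates by code point, so a surrogate pair counts once.
  return [...source.slice(0, utf16Offset)].length
}

const text = 'a\u{20BB7}b' // the same three characters as in the example above

console.log(text.length) // 4
console.log([...text].length) // 3
console.log(toCodePointOffset(text, 4)) // 3
console.log(toCodePointOffset(text, 1)) // 1
```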

Empty children arrays

This is a question and a suggestion regarding this part of Unist readme (emphasis mine):

Unist nodes:

may have either a value property set to a string or a children property set to an array of one or more Unist nodes;

I read it as “Unist nodes may have a children property, in which case it is guaranteed that its length is ≥1”. If this is correct, then it follows that retext, mdast, and hast all violate this specification by producing trees with empty children arrays:

> retext.parse('')
{ type: 'RootNode', children: [] }
> mdast.parse('')
{ type: 'root',
  children: [],
  position: { start: { line: 1, column: 1 }, end: { line: 1, column: 1 } } }
> mdast.parse('#')
{ type: 'root',
  children: [ { type: 'heading', depth: 1, children: [], position: [Object] } ],
  position: { start: { line: 1, column: 1 }, end: { line: 1, column: 2 } } }
> hast.parse('')
{ type: 'root', children: [] }

If I haven't missed anything then I guess this requirement should be relaxed to include empty children (or removed if it doesn't require anything) or, alternatively, retext, mdast, and hast should be fixed to never output nodes with empty children arrays. The latter seems more problematic (the obvious workaround is returning null on empty input but I feel that it's better for parsers to always output a valid syntax tree) so I opened the issue here.

Specify `indent`

node.position.indent isn’t used a lot, but it could be specced better.

Currently, it’s a list of integers, present if a node spans multiple lines, where each value refers to the column of a line (node.position.start.line + index).
This only supports the start of a line (which is useful in markdown). But not the end of a line.
It’s also awkward to access, as there’s no explicit line access.

It could make sense for indent to be an Array.<{start: Point, end: Point}>.

Rename abstract `text` interface to `literal`

The abstract text interface (nodes with a value) interferes with the type: "text" node provided by hast and mdast (and the type: "TextNode" provided by nlcst).

Another downside is that text implies (and specifies) string values on the value field.
Say unist was used for programming values, the value of value could be specified as number, for example.

I’m open to other names, but I’m searching for something close to “raw”, “leaf”, and whatnot.

/CC @ChristianMurphy What do you think?

No utility for creating a selector from a root and a leaf node.

@wooorm is there a utility to build a selector if you pass it a root and a leaf node?

e.g. buildSelector(root, node); // returns 'html > body > div:nth-child(1) > h1'

I'm creating a tree component that displays the AST, and when you click on a node, it selects the matching nodes of the AST. I'm wondering how to create the selector needed for the select util.
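One way such a helper could work is to find the chain of nodes from the root to the target and join the tag names. The sketch below is hypothetical: it assumes a hast-like tree where the target and all its ancestors below the root are elements with a tagName, and it only adds :nth-child() when an element has element siblings.

```javascript
// Find the chain of nodes from `root` down to `target`, or null if absent.
function findPath(root, target) {
  if (root === target) return [root]
  for (const child of root.children || []) {
    const rest = findPath(child, target)
    if (rest) return [root, ...rest]
  }
  return null
}

// Join tag names with ` > `, skipping the root itself.
function buildSelector(root, target) {
  const path = findPath(root, target)
  if (!path) return null
  const parts = []
  for (let index = 1; index < path.length; index++) {
    const node = path[index]
    // `:nth-child()` counts element siblings only, so filter to elements.
    const elements = path[index - 1].children.filter((child) => child.type === 'element')
    const nth = elements.indexOf(node) + 1
    parts.push(elements.length > 1 ? `${node.tagName}:nth-child(${nth})` : node.tagName)
  }
  return parts.join(' > ')
}

const h1 = {type: 'element', tagName: 'h1', properties: {}, children: []}
const tree = {
  type: 'root',
  children: [{
    type: 'element', tagName: 'html', properties: {}, children: [{
      type: 'element', tagName: 'body', properties: {}, children: [
        {type: 'element', tagName: 'div', properties: {}, children: [h1]},
        {type: 'element', tagName: 'div', properties: {}, children: []}
      ]
    }]
  }]
}

console.log(buildSelector(tree, h1)) // html > body > div:nth-child(1) > h1
```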
