jviereck / regjsparser

Parsing the JavaScript's RegExp in JavaScript.

Home Page: http://www.julianviereck.de/regjsparser/

License: BSD 2-Clause "Simplified" License

JavaScript 94.72% HTML 1.33% TypeScript 3.95%

regjsparser's Introduction

RegJSParser

Parsing the JavaScript's RegExp in JavaScript.

Installation

npm install regjsparser

Usage

var parse = require('regjsparser').parse;

var parseTree = parse('^a'); // /^a/
console.log(parseTree);

// Toggle on/off additional features:
var parseTree = parse('^a', '', {
  // SEE: https://github.com/jviereck/regjsparser/pull/78
  unicodePropertyEscape: true,

  // SEE: https://github.com/jviereck/regjsparser/pull/83
  namedGroups: true,

  // SEE: https://github.com/jviereck/regjsparser/pull/89
  lookbehind: true
});
console.log(parseTree);

Testing

To run the tests, run the following command:

npm test

To create a new reference file, execute…

node test/update-fixtures.js

…from the repo top directory.

regjsparser's People

termi mathiasbynens kpdecker inno-v addaleax nicolo-ribaudo bnjmnt4n longjohncoder jlhwung pygy silicon-beach-labs sanchezzzhak crtn32002 tjenkinson mpadev0103 liuxingbaoyu fisker stulov

regjsparser's Issues

Add support for Unicode code point escape sequences `\u{1D306}`

See RegExpUnicodeEscapeSequence in https://people.mozilla.org/~jorendorff/es6-draft.html.

Parsing null char literal \0 reports incorrect range

version: 0.6.8

The null character literal \0 reports its range start incorrectly.

const parse = require('regjsparser').parse;
console.log(parse('\\0').range);

Expectation:

prints [0, 2]

What I got:

prints [-1, 2]

Add an AST traverser to the library

While planning to rewrite the RegExp.JS library for the new AST form, I am wondering whether it makes sense to include a traverser in the regjsparser library or whether it should be a separate project. The traverser should expose roughly the same functionality as the estraverse package (which traverses the Esprima-generated JS AST).

Here is an example from estraverse:

estraverse.traverse(ast, {
    enter: function (node, parent) {
        if (node.type == 'FunctionExpression' || node.type == 'FunctionDeclaration')
            return estraverse.VisitorOption.Skip;
    },
    leave: function (node, parent) {
        if (node.type == 'VariableDeclarator')
          console.log(node.id.name);
    }
});

My feeling is, it should go into a separate project — every npm package should do one thing. How do you feel about this?
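To make the proposal concrete, here is a minimal sketch of what such a traverser could look like. The `traverse` function and its visitor shape are hypothetical and not part of regjsparser; the sketch assumes container nodes keep their children in a `body` array and ignores nodes that store children elsewhere (for example the `min`/`max` of a characterClassRange).

```javascript
// Hypothetical helper, not part of regjsparser: a depth-first walk over an
// AST whose container nodes keep their children in a `body` array.
function traverse(node, visitor, parent) {
  if (visitor.enter) visitor.enter(node, parent || null);
  const children = Array.isArray(node.body) ? node.body : [];
  for (const child of children) traverse(child, visitor, node);
  if (visitor.leave) visitor.leave(node, parent || null);
}

// Hand-written AST for /a+/ in the shape used elsewhere in these issues.
const sampleAst = {
  type: 'quantifier',
  min: 1,
  max: null,
  greedy: true,
  body: [
    { type: 'value', kind: 'symbol', codePoint: 97, range: [0, 1], raw: 'a' }
  ],
  range: [0, 2],
  raw: 'a+'
};

// Record the order in which nodes are entered and left.
const visited = [];
traverse(sampleAst, {
  enter(node) { visited.push('enter:' + node.type); },
  leave(node) { visited.push('leave:' + node.type); }
});
console.log(visited.join(' '));
```

Unlike estraverse, this sketch has no way to skip a subtree or stop early; a real traverser would probably want both.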

Lone quantifier bracket should not throw when unicode flag is not enabled

REPL: https://runkit.com/embed/etooxrk5pmq3

ExtendedAtom can be an ExtendedPatternCharacter, per the spec:

ExtendedPatternCharacter ::
  SourceCharacter but not one of ^ $ \ . * + ? ( ) [ |

Merge characterClassEscape and dot type?

This came to my mind when preparing the presentation for Amsterdam.JS:

Currently there is a special type="dot" for things like /./. Per se there is nothing wrong with this, but the type feels very similar to type="characterClassEscape". How do you feel about merging the types characterClassEscape and dot? Maybe into specialCharacterClass?

Or, an alternative idea: similar to how different types got merged into type=value, merge dot, characterClassEscape and the existing characterClass into characterClass and add a new kind entry? I like this, as it does away not only with the type dot, but also with the type characterClassEscape, which sounds similar to characterClass but is completely different. Like:

{
  type: "characterClass",
  kind: "range",
  body: [ { type: "characterClassRange", ...} ]
}

{
  type: "characterClass",
  kind: "singleChar",
  char: "d"
  // The body is not needed here
  // body: [ ]   
}

This looks interesting to me, but I dislike the inconsistency of using body in one case and char in the other to encode the "meaning" of the characterClass. In the case of value, all the different kinds have a codePoint entry. A possible way to achieve similar consistency here could be to store, on the body of the type: "characterClass" node with kind: "singleChar", the actual ranges that are matched. E.g. in the case of /\d/:

{
  type: "characterClass",
  kind: "singleChar",
  body: [ {type: "characterClassRange", from: 48, to: 57} ],   
  raw: "\\d"
}

Looks nice, but encoding /\s/ this way will result in a very large body :/ Here are the two functions used in RegExp.JS to test for a /\s/ string:

function isWhiteSpace(ch) {
    return (ch === 32) ||  // space
        (ch === 9) ||      // tab
        (ch === 0xB) ||
        (ch === 0xC) ||
        (ch === 0xA0) ||
        (ch >= 0x1680 && '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\uFEFF'.indexOf(String.fromCharCode(ch)) >= 0);
}

// 7.3 Line Terminators

function isLineTerminator(ch) {
    return (ch === 10) || (ch === 13) || (ch === 0x2028) || (ch === 0x2029);
}

Personally, I am not sure if the consistency is worth the larger AST output here.

So, maybe go with specialCharacterClass and characterClass? Any thoughts? Or do you think merging dot into a different type is not worth the effort and this issue should be closed right away ;)?

Kill the `type: empty` AST node

The type: empty AST node is used internally, as parseAtom is expected to return something. But I doubt there is any sense in having it in the resulting AST.

bug: optional forward lookahead

In JavaScript, the following regex is valid:

/(?=a)?/

However, it is rejected by regjsparser (SyntaxError: Expected atom at position 5).

I think an optional forward lookahead is a little pointless, but it seems to be used. For example, this appears in the wild in ua-parser-js: https://github.com/faisalman/ua-parser-js/blob/3c3c03ceeba9920437533dcf1d72276acac091a3/src/ua-parser.js#L464

Use type: "character" in createCharacter for unicode surrogate pair?

The current code for createCharacter(...) looks as shown below. If hasUnicodeFlag is set and a unicode surrogate pair is detected, an AST entry of type escape is created. Otherwise the type is character. This looks strange/wrong to me.

I guess before PR #24 it was necessary to use escape here, but now that both character and escape have a codePoint entry, all AST entries constructed by this function should have type: 'character'.

Any opinions?

    function createCharacter(matches) {
      var _char = matches[0];
      var first = _char.charCodeAt(0);
      if (hasUnicodeFlag) {
        var second;
        if (_char.length === 1 && first >= 0xD800 && first <= 0xDBFF) {
          second = lookahead().charCodeAt(0);
          if (second >= 0xDC00 && second <= 0xDFFF) {
            // Unicode surrogate pair
            pos++;
            return addRaw({
              type: 'escape',
              name: 'codePoint',
              codePoint: (first - 0xD800) * 0x400 + second - 0xDC00 + 0x10000,
              from: pos - 2,
              to: pos
            });
          }
        }
      }
      return addRaw({
        type: 'character',
        codePoint: first,
        from: pos - 1,
        to: pos
      });
    }
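The codePoint arithmetic in createCharacter can be checked in isolation. The sketch below pulls the formula into a standalone helper (`combineSurrogates` is a name made up for this example) and compares it against the built-in String.prototype.codePointAt.

```javascript
// Same formula as in createCharacter above, extracted for illustration.
function combineSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// U+1D306 is encoded in UTF-16 as the surrogate pair D834 DF06.
const symbol = '\uD834\uDF06';
const computed = combineSurrogates(symbol.charCodeAt(0), symbol.charCodeAt(1));

// The built-in decoder should agree with the manual computation.
console.log(computed === symbol.codePointAt(0));
console.log(computed.toString(16).toUpperCase());
```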

Does not support astral ID_Continue characters in capture group names

/(?<$𐒤>a)/u should parse according to test262.

Regular expression syntax error

Hi, this regular expression causes a parsing error:

const exp = /\s*(?<typedef>typedef)?\s*(?<struct>struct)\s*(?<structName>\w+)\s*{(?<props>[^}]*)}(\s*(?<aliasName1>\w+)?\s*,\s*(?<aliasName2>\*\w+)?\s*;\s*)?/gi;

babel/babel-loader#888

REPL

SyntaxError: Expected atom at position XX

When running test262, a test case involving regular expressions gets a syntax error, although the test is designed not to raise one. For example:

match = /[\c_]/.exec('\x1e\x1f\x20');
assert.sameValue(match[0], '\x1f', '\\c_ within CharacterClass');

The following syntax error is reported:

SyntaxError: classEscape at position 2
[\c_]
^

annexB\language\literals\regexp\class-escape.js
annexB\language\literals\regexp\quantifiable-assertion-followed-by.js
annexB\language\literals\regexp\quantifiable-assertion-not-followed-by.js

Can you help me troubleshoot? @pygy @mathiasbynens @adrianheine @jviereck @kpdecker

Discussion: Create Github RegJS organisation?

\to @D10 @mathiasbynens @termi

Hi there,
while brainstorming about my talk in Amsterdam about RegExp.JS, I have been thinking more and more about the "family" of JS-RegExp tools we have built together so far:

regjsgen: https://github.com/d10/regjsgen
regexpu: https://github.com/mathiasbynens/regexpu
regjsparser: https://github.com/jviereck/regjsparser
regexpjs: https://github.com/jviereck/regexp.js

Soon, there will also be a

regjstraverse

Given this family, how do you feel about creating a central GitHub organisation to host these projects? This should make it easier for people to find them, and I also think it's good to have the repos in one place, as they depend on each other heavily.

Thoughts?

Rename assertion to anchor

Reading a chapter about RegExp, it turned out things like ^$\b etc are called "anchors" in the literature. See also 1. Therefore, I think it makes sense to move away from the type:assertion and replace it with type:anchor.

AST: match Esprima’s `range` output?

Our AST has from and to properties. Since these properties are probably consumed together anyway, and to match Esprima’s AST format for source code ranges, should we change this to a range property that contains [from, to] instead?

Before:

…
"from": 0,
"to": 5,
…

After:

…
"range": [ 0, 5 ],
…
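For consumers holding an AST in the old shape, the change is mechanical. The following sketch converts from/to pairs into a range array recursively; `toRangeFormat` is a hypothetical helper written for this example, and it assumes every numeric from/to pair on a node is a source position.

```javascript
// Hypothetical converter: rewrites { from, to } into { range: [from, to] }
// on every object in the tree, leaving all other properties untouched.
function toRangeFormat(node) {
  if (Array.isArray(node)) return node.map(toRangeFormat);
  if (node === null || typeof node !== 'object') return node;

  const result = {};
  for (const key of Object.keys(node)) {
    if (key === 'from' || key === 'to') continue; // replaced by `range` below
    result[key] = toRangeFormat(node[key]);
  }
  if (typeof node.from === 'number' && typeof node.to === 'number') {
    result.range = [node.from, node.to];
  }
  return result;
}

// Old-style node, copied from the type: "escape" example in these issues.
const oldNode = { type: 'escape', name: 'unicode', value: '0020', from: 2, to: 8, raw: '\\u0020' };
const newNode = toRangeFormat(oldNode);
console.log(JSON.stringify(newNode));
```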

Add support for the `/y` flag (`RegExp.prototype.sticky`)

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-get-regexp.prototype.sticky

This doesn’t affect parsing, so it seems fairly easy to do.

The `parse()` API

Currently the API is parse(string), meaning parse accepts a single string representing a regular expression.

It should be possible to specify which flags apply to the regular expression, since they might influence parsing (as is the case for the ES6 /u flag). Any ideas on the best way to extend the API, moving forward? parse(string, flags) where flags is an object such as { g: true, i: false, m: false, u: true }?

Or would we at some point need to pass more than just flags to parse, e.g. options? In that case we’d be better off using parse(string, options) where options.flags represents the flags.
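As one way to picture the object form of the flags, here is a small sketch that turns a flags string into such an object. `parseFlags` is a hypothetical helper, not an existing regjsparser function, and the set of known flags (g, i, m, u, y) is an assumption made for the example.

```javascript
// Hypothetical helper: converts a flags string such as "gu" into an object
// like { g: true, i: false, m: false, u: true, y: false }.
function parseFlags(flags) {
  const known = 'gimuy'; // assumed flag set for this sketch
  const result = { g: false, i: false, m: false, u: false, y: false };
  for (const flag of flags) {
    if (!known.includes(flag)) throw new SyntaxError('Unknown flag: ' + flag);
    if (result[flag]) throw new SyntaxError('Duplicate flag: ' + flag);
    result[flag] = true;
  }
  return result;
}

const parsedFlags = parseFlags('gu');
console.log(parsedFlags);
```

Accepting the plain string in the public API and converting internally would keep call sites close to how flags appear in a regular expression literal.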

Invalid error "unescaped or unmatched closing brace"

The latest Babel version (7.8.7) is using this package in the latest version (0.6.3), which now produces an error (with Babel 7.8.4 this error did not occur).

This regular expression is an absolutely valid JavaScript expression:

/\{([^}]+)\}/gu

But regjsparser throws the following error:
{code}
Module build failed (from ../node_modules/babel-loader/lib/index.js):
SyntaxError: ...\src\ts\context\i18n\index.tsx: unescaped or unmatched closing brace at position 6
{([^}]+)}
^
at bail (...\node_modules\regjsparser\parser.js:1129:13)
at createCharacter (...\node_modules\regjsparser\parser.js:237:11)
at parseClassAtomNoDash (...\node_modules\regjsparser\parser.js:1105:16)
at parseClassAtom (...\node_modules\regjsparser\parser.js:1094:16)
at parseNonemptyClassRanges (...\node_modules\regjsparser\parser.js:1052:18)
at parseClassRanges (...\node_modules\regjsparser\parser.js:1008:15)
at parseCharacterClass (...\node_modules\regjsparser\parser.js:986:15)
at parseAtom (...\node_modules\regjsparser\parser.js:649:22)
at parseTerm (...\node_modules\regjsparser\parser.js:490:18)
at parseAlternative (...\node_modules\regjsparser\parser.js:463:21)
at parseDisjunction (...\node_modules\regjsparser\parser.js:443:16)
at finishGroup (...\node_modules\regjsparser\parser.js:520:18)
at parseGroup (...\node_modules\regjsparser\parser.js:516:14)
at parseAtom (...\node_modules\regjsparser\parser.js:665:16)
at parseTerm (...\node_modules\regjsparser\parser.js:490:18)
at parseAlternative (...\node_modules\regjsparser\parser.js:463:21)
{code}

If I escape the }, the error is gone:

/\{([^\}]+)\}/gu

but here, the ESLint rule "no-useless-escape", correctly complains about the useless escaping.

Fix range for non-standard confirm /\91/

Current AST output for \9:

{
  "type": "value",
  "kind": "symbol",
  "codePoint": 57,
  "range": [
    1,
    2
  ],
  "raw": "9"
}

The raw bit should span the backslash here as well.

Document AST format

After all the PRs are done we should look at providing documentation to the AST format.

Cannot parse valid regexp

Thanks for your package! It appears that a Babel plugin uses it to convert regexps.

I found an issue with it. [\]}{]+ is a valid JS regexp, but when you try to parse it using regjsparser you get an error.

const parse = require('regjsparser').parse;

const regexp = /[\]}{]+/
parse('[\]}{]+') // throws SyntaxError: Expected atom at position 3

The workaround is to escape { and }, but this regexp is not in my package, which is why I think it needs to be fixed here instead.

Wanted: a code generator to complement regjsparser (AST → source code)

regjsparser converts source code into an AST. (cfr. Esprima)

We need a tool that does the opposite. (cfr. Escodegen)

Rename `escapeChar` to `characterEscape`

This way it follows the naming of the ecma standard:

http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.10

The standard in addition has the notion of CharacterClassEscape, which at the moment get merged into escapeChar as well. For simplicity, I think it's okay to merge CharacterClassEscape and CharacterEscape into the same AST type: characterEscape.

Replace `escapeChar` with `escaped:controlLetter`

Looking at this line:

regjsparser/parser.js

Line 690 in d2ed3e8

return createEscapedChar(res[0]);

I think it's inconsistent to have createEscapedChar and createEscaped at the same time. Here is my proposal:

Remove createEscapedChar(...)
Replace it with createEscaped("char", ..)

There is also createEscaped('identifier'...), but I don't remember off the top of my head what this was doing. Will look at it later.

Remove `firstMatchIdx` and `lastMatchIdx` from the AST

Groups currently have additional properties like firstMatchIdx and lastMatchIdx:

➜  regjsparser git:(cleanup) ./bin/parser '()'
{
    "type": "group",
    "behavior": "normal",
    "disjunction": {
        "type": "alternative",
        "body": [],
        "range": [
            1,
            1
        ],
        "raw": ""
    },
    "range": [
        0,
        2
    ],
    "raw": "()",
    "matchIdx": 1,
    "lastMatchIdx": 1  // <<< LOOK HERE!
}

These properties are handy when writing a RegExp matcher, but I doubt this kind of information should be stored on the AST. It's easy to compute them when walking the AST.

Therefore, I propose to remove these properties.
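To illustrate how little work the computation is during a walk, here is a sketch that assigns both indices in one pass. `numberGroups` is a hypothetical helper; it assumes children live in a `body` array (the format proposed elsewhere in these issues) and that only groups with behavior "normal" capture.

```javascript
// Hypothetical helper: annotates capturing groups with matchIdx and
// lastMatchIdx in a single depth-first pass over the AST.
function numberGroups(node, counter = { next: 1 }) {
  const capturing = node.type === 'group' && node.behavior === 'normal';
  if (capturing) node.matchIdx = counter.next++;
  for (const child of node.body || []) numberGroups(child, counter);
  // After visiting all descendants, the last index handed out belongs to
  // the last capturing group contained in this one (or to the group itself).
  if (capturing) node.lastMatchIdx = counter.next - 1;
  return node;
}

// Hand-written AST skeleton for /(()())/ with ranges and raw omitted.
const makeGroup = (...body) => ({ type: 'group', behavior: 'normal', body });
const nested = numberGroups(makeGroup(makeGroup(), makeGroup()));
console.log(nested.matchIdx, nested.lastMatchIdx);
```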

Fix range on type "quantifier"

See the comment here by @mathiasbynens: #59 (comment)

  "\\u{1234}": {
    "type": "quantifier",
    "min": 1234,
    "max": 1234,
    "greedy": true,
    "body": [
      {
        "type": "value",
        "kind": "identifier",
        "codePoint": 117,
        "range": [0, 2],
        "raw": "\\u"
      }
    ],
    "range": [2, 8],
    "raw": "{1234}"
  },

The range and raw bit on the outer quantifier is wrong.

Remove `value` property from type: "escape" entries?

Currently, objects with type: "escape" all have a value property, e.g.

{
  "type": "escape",
  "name": "unicode",
  "value": "0020",
  "from": 2,
  "to": 8,
  "raw": "\\u0020"
}

What’s the point of this property/value in this format? I think we can remove it, and add something like codePoint instead (#24).

Agree on indentation level

The parse.js file and other files use 4-space indentation. The binary in PR #2 uses 2-space indentation. I think 2-space indentation is better.

@mathiasbynens, any objections to changing the entire code base to 2-space indentation? Happy to make the changes myself.

Unicode code point escapes are only valid when the `u` flag is set

Unicode code point escapes are only valid when the u flag is set. Currently, regjsparser parses them even if the u flag is not set:

require('regjsparser').parse('\\u{1D306}', '');
/*
{ type: 'value',
  kind: 'unicodeCodePointEscape',
  codePoint: 119558,
  range: [ 0, 9 ],
  raw: '\\u{1D306}' }
*/
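The engine behavior that the parser should mirror can be observed with the built-in RegExp constructor. This sketch does not call regjsparser; it only shows how a JavaScript engine reads the same kind of source text with and without the u flag.

```javascript
// Without the u flag, \u{3} is the identity escape \u (a literal "u")
// followed by the quantifier {3}.
const withoutUFlag = new RegExp('^\\u{3}$');
console.log(withoutUFlag.test('uuu'));

// With the u flag, \u{1D306} is a single code point escape.
const withUFlag = new RegExp('^\\u{1D306}$', 'u');
console.log(withUFlag.test('\u{1D306}'));
```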

classEscape SyntaxError

Module build failed (from ./node_modules/babel-loader/lib/index.js):
SyntaxError: /home/vsts/work/1/s/lib/linter/config-comment-parser.js: classEscape at position 12
    a-zA-Z0-9\-/]+):
              ^
    at bail (/home/vsts/work/1/s/node_modules/regjsparser/parser.js:1118:13)

ref eslint/eslint#12660

Use `body` for group AST entries as well

Have a look at this AST:

➜  regjsparser git:(cleanup) ./bin/parser '()'
{
    "type": "group",
    "behavior": "normal",
    "disjunction": {
        "type": "alternative",
        "body": [],
        "range": [
            1,
            1
        ],
        "raw": ""
    },
    "range": [
        0,
        2
    ],
    "raw": "()",
    "matchIdx": 1,
    "lastMatchIdx": 1
}

All the other "container" AST nodes use body to represent their child AST elements. I think it would be great to do this here as well. Also, the body should be an array to have the same structure as the other AST entries like alternative etc.

No tests for `type: 'ref'`

There are currently no tests/examples for atoms with type: 'ref' (see createRef()).

cannot parse [\-]

version: 0.6.7

I expected the regexp [\-] to parse as a characterClass with one identifier value, -. Instead, I saw:

SyntaxError: classAtom at position 4
    [\-]
        ^
    at bail (/home/robert/Source/test/node_modules/regjsparser/parser.js:1179:13)
    at parseNonemptyClassRangesNoDash (/home/robert/Source/test/node_modules/regjsparser/parser.js:1125:9)
    at parseHelperClassRanges (/home/robert/Source/test/node_modules/regjsparser/parser.js:1088:13)
    at parseNonemptyClassRanges (/home/robert/Source/test/node_modules/regjsparser/parser.js:1114:14)
    at parseClassRanges (/home/robert/Source/test/node_modules/regjsparser/parser.js:1033:15)
    at parseCharacterClass (/home/robert/Source/test/node_modules/regjsparser/parser.js:1015:15)
    at parseAtomAndExtendedAtom (/home/robert/Source/test/node_modules/regjsparser/parser.js:670:22)
    at parseTerm (/home/robert/Source/test/node_modules/regjsparser/parser.js:484:18)
    at parseAlternative (/home/robert/Source/test/node_modules/regjsparser/parser.js:457:21)
    at parseDisjunction (/home/robert/Source/test/node_modules/regjsparser/parser.js:437:16)

To repro:

const parse = require('regjsparser').parse;
parse('[\\-]')

/[\w-e]/ and /[e-\w]/

First, hello and thank you for your work.
The spec says it would be a SyntaxError if the calculated charset for either the min or max of a range encompasses more than a single character. The implication would be that ranges like [e-\w] or [\w-e], among others, are invalid syntactically.
Annex B in the 2015 version of the spec made a (possibly missed) effort to have things like above be treated as valid syntax, as I have detailed in this issue.
In regjsparser, for example, /[\w-e]/ parses as a regex comprising a CharacterClass consisting of a single ClassRange with rangeStart = ClassEscape('w') and rangeEnd = SourceCharacter('e'); in Firefox and Google Chrome, /[\w-e]/ parses as a regex comprising a CharacterClass consisting of ClassEscape('w'), SourceCharacter('-'), and SourceCharacter('e').
I'm not sure which one is the correct behavior.
To be honest, neither could be the correct way to parse /[\w-e]/. At least, I was unable to come up with a list of derivations, starting from the Pattern goal, that ends in [\w-e], even with the amendments made by Annex B.
I would be very grateful if you could help me figure out whether it is a misunderstanding on my side, or else kindly resolve it in regjsparser. Thanks!

More non-standard tests

Things like \a are treated as a by engines as per Annex B. When the u flag is enabled, this is no longer allowed. https://bugs.ecmascript.org/show_bug.cgi?id=3157#c9

TODO:

add some tests for e.g. /\a/ to see if it’s parsed correctly (as if it was /a/)
make sure we throw an error in such a case when the u flag is set, e.g. for /\a/u
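Both TODO items describe behavior that engines already exhibit, so the built-in RegExp constructor can serve as a reference when writing the tests. The sketch below uses only the native constructor; `compiles` is a helper name invented for this example.

```javascript
// Returns true if the engine accepts the pattern, false on a SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Annex B: without the u flag, \a is an identity escape and matches "a".
console.log(new RegExp('^\\a$').test('a'));

// With the u flag, the same escape is a syntax error.
console.log(compiles('\\a', 'u'));
```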

Remove redundant `raw` properties

Here’s the current AST for [a-z]:

{
    "type": "characterClass",
    "body": [
        {
            "type": "characterClassRange",
            "min": {
                "type": "value",
                "kind": "symbol",
                "codePoint": 97,
                "range": [
                    1,
                    2
                ],
                "raw": "a"
            },
            "max": {
                "type": "value",
                "kind": "symbol",
                "codePoint": 122,
                "range": [
                    3,
                    4
                ],
                "raw": "z"
            },
            "range": [
                1,
                4
            ],
            "raw": "a-z"
        }
    ],
    "negative": false,
    "range": [
        0,
        5
    ],
    "raw": "[a-z]"
}

All the raw properties here can be removed except "raw": "a" and "raw": "z". Should they be removed?
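One argument for removing them is that raw can be recovered from the original pattern and the node's range. A minimal sketch, where `rawOf` is a hypothetical helper and the ranges are taken from the AST above:

```javascript
// Hypothetical helper: recomputes a node's raw text from the source pattern.
function rawOf(source, node) {
  return source.slice(node.range[0], node.range[1]);
}

const pattern = '[a-z]';
const classNode = { type: 'characterClass', range: [0, 5] };
const rangeNode = { type: 'characterClassRange', range: [1, 4] };

console.log(rawOf(pattern, classNode));
console.log(rawOf(pattern, rangeNode));
```

The trade-off is that consumers then need to keep the source string around alongside the AST.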

Rename `classRanges`

The current AST for simple characterClass looks like this:

➜  regjsparser git:(ast-cleanup) node bin/parser '[a]'
{
    "type": "alternative",
    "terms": [
        {
            "type": "characterClass",
            "classRanges": [
                {
                    "type": "value",
                    "kind": "character",
                    "codePoint": 97,
                    "range": [
                        1,
                        2
                    ],
                    "raw": "a"
                }
            ],
            "negative": false,
            "range": [
                0,
                3
            ],
            "raw": "[a]"
        }
    ],
    "range": [
        0,
        3
    ],
    "raw": "[a]",
    "lastMatchIdx": 0
}

The problem is, that

            "type": "characterClass",
            "classRanges": [

but the classRanges just contains a single entry - and the entry is not even a range, but a single value.

Should we adopt the naming from esprima and replace classRanges with body?

Actually, body is a really cool naming - how about using it instead of terms for type: alternative etc as well?

Manpage for regjsparser

I started working on a package for Debian, and wrote a manpage for regjsparser; if you're interested, it's under BSD (see debian/copyright).

AST suggestion: add `symbol`/`codePoint` to `characterClassRange`’s `min`/`max`

For the characterClassRange construct in the AST, I believe it would be useful to add either a symbol (i.e. a string containing the Unicode symbol) or a codePoint (i.e. a number indicating the Unicode code point) property to its min and max properties. The type of these min/max objects is either 'character' or 'escape' and currently, any tools consuming the AST need to figure out the code point or the symbol based on that. Adding this data to the AST would avoid this friction.
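If only codePoint were stored, the symbol would still be one call away, which may help when choosing between the two. A small sketch using the built-in String.fromCodePoint and String.prototype.codePointAt (the variable names are made up for the example):

```javascript
// From a code point to the symbol and back again.
const minCodePoint = 97;      // "a"
const astralCodePoint = 0x1D306; // outside the Basic Multilingual Plane

const minSymbol = String.fromCodePoint(minCodePoint);
const astralSymbol = String.fromCodePoint(astralCodePoint);

console.log(minSymbol, minSymbol.codePointAt(0));
// An astral symbol occupies two UTF-16 code units in a JavaScript string.
console.log(astralSymbol.length, astralSymbol.codePointAt(0) === astralCodePoint);
```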

Support `\p{…}` and `\P{…}`

While working on a spec proposal for \p{…} and \P{…} I’d like to start prototyping an implementation in regexpu.

For now I’d just want regjsparser to parse \p{foo} and \P{bar} in u regexps and expose the foo and bar values as properties in the AST. Perhaps something like this:

> regjsparser.parse('\\p{foo}', 'u')
{
  type: 'unicodePropertyEscape',
  negative: false,
  value: 'foo',
  range: [ 0, 7 ],
  raw: '\\p{foo}'
}

> regjsparser.parse('\\P{foo}', 'u')
{
  type: 'unicodePropertyEscape',
  negative: true,
  value: 'foo',
  range: [ 0, 7 ],
  raw: '\\P{foo}'
}

Any thoughts/strong preferences on what the AST should look like?

Flatten the top of the AST if there is only one alternative

The AST for a single character looks like this:

➜  regjsparser git:(ast-cleanup) node bin/parser 'a'
{
    "type": "alternative",
    "terms": [
        {
            "type": "value",
            "kind": "character",
            "codePoint": 97,
            "range": [
                0,
                1
            ],
            "raw": "a"
        }
    ],
    "range": [
        0,
        1
    ],
    "raw": "a",
    "lastMatchIdx": 0
}

While not wrong, the top alternative is not necessary. Let's kill the type: alternative node if it has exactly one entry.

Add a binary

Would you mind if I added a binary to the project that accepts a string representing a regular expression, and that simply prints out the pretty-formatted parse tree?

Add support for the `/u` flag (`RegExp.prototype.unicode`)

https://people.mozilla.org/~jorendorff/es6-draft.html#sec-get-regexp.prototype.unicode

Contrary to #8, this does affect parsing; based on whether or not the u flag was set, different parsing logic applies.

Add demo for browsers

~~Since regjsgen can be used on browsers, it probably should be tested on browsers as well. I can send a PR adding support for this using qunit and qunit-extras.~~

Updated issue to suggest a demo for browsers instead.

Regression: `(a)\1` throws when parsed with the `u` flag

parse("(a)\\1", "u")

throws, but it's a valid regex.

This regression was probably introduced by #115; I'm working on a fix.

Reformat tests

At the moment, the regular expressions to be tested are located in one file in a big array, and the expected results are in another file in another array.

parse_input.json:

[
  "^a+b|c$"
]

parse_output.json:

[
  { … } /* the AST for that regex */
]

It would be more useful to use a single file with this format instead:

{
  "^a+b|c$": { … } /* the AST for that regex */
}

This would make reviewing diffs like https://github.com/jviereck/regjsparser/pull/32/files#diff-1 much more clear.

(Not that it matters, but this is also how Esprima does it: https://github.com/ariya/esprima/blob/master/test/test.js)
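Migrating the existing fixtures to the single-file format could be scripted. In this sketch, `mergeFixtures` is a hypothetical helper and the two arrays stand in for the parsed contents of parse_input.json and parse_output.json; the placeholder AST is not real parser output.

```javascript
// Hypothetical helper: pairs each input pattern with its expected AST.
function mergeFixtures(inputs, outputs) {
  if (inputs.length !== outputs.length) {
    throw new Error('Input and output fixtures differ in length');
  }
  const merged = {};
  inputs.forEach((pattern, index) => {
    merged[pattern] = outputs[index];
  });
  return merged;
}

// Stand-ins for the two JSON files.
const fixtureInputs = ['^a+b|c$'];
const fixtureOutputs = [{ type: 'placeholder' }];

const mergedFixtures = mergeFixtures(fixtureInputs, fixtureOutputs);
console.log(JSON.stringify(mergedFixtures, null, 2));
```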

Quantifiable assertions should be disallowed with `u` flag

// Copyright (C) 2016 the V8 project authors. All rights reserved.
// This code is governed by the BSD license found in the LICENSE file.
/*---
esid: sec-regular-expressions-patterns
es6id: B.1.4
description: Quantifiable assertions disallowed with `u` flag
info: |
    The `u` flag precludes quantifiable assertions (even when Annex B is
    honored)

    Term[U] ::
         [~U] QuantifiableAssertion Quantifier
negative:
  phase: parse
  type: SyntaxError
---*/

$DONOTEVALUATE();

/.(?!.){2,3}/u;

In 0.9.1, regjsparser throws

SyntaxError [Error]: Expected atom at position 6
      .(?!.){2,3}

In 0.10.0, it parses successfully. This is likely a regression introduced in #131.

c.f.
https://github.com/tc39/test262/blob/main/test/language/literals/regexp/u-invalid-range-negative-lookahead.js
mathiasbynens/regexpu-core#88
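For reference, the expected accept/reject behavior can be observed with the built-in RegExp constructor in a spec-compliant engine. The sketch uses only the native constructor; `engineAccepts` is a helper name invented for this example.

```javascript
// Returns true if the engine accepts the pattern, false on a SyntaxError.
function engineAccepts(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Annex B allows a quantifier after a lookahead when the u flag is absent.
console.log(engineAccepts('.(?!.){2,3}', ''));

// With the u flag, the quantified lookahead must be rejected.
console.log(engineAccepts('.(?!.){2,3}', 'u'));
```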

Merge type:character and type:escape

Okay, maybe I am in extreme AST cleanup mode at the moment, but:

      return addRaw({
        type: 'character',
        codePoint: first,
        from: pos - 1,
        to: pos
      });

and

      return addRaw({
        type: 'escape',
        codePoint: codePoint,
        name: name,
        from: pos - (value.length + fromOffset),
        to: pos
      });

just look too much the same to me. From the perspective of implementing a matcher, they are the same - here's the original lines from RegExp.JS:

        case 'character':
        case 'escape':
            res = bText(nodeToChar(node, ignoreCase));
            break;

So, therefore, I propose to make them equal ;) I just cannot come up with a good name for the type. Something along the lines of:

      return addRaw({
        type: 'value',
        name: 'character',
        codePoint: codePoint,
        from: pos - (value.length + fromOffset),
        to: pos
      });

but value is such a generic word :/ Also, I think name is too unspecific. Something more along the lines of sub-type or kind?

Backreferences are sometimes parsed as octal escapes

Originally reported by @nhahtdh, here: mathiasbynens/regexpu#19

$ regjsparser '(\1)+\1\1'
{
  "type": "alternative",
  "body": [
    {
      "type": "quantifier",
      "min": 1,
      "max": null,
      "greedy": true,
      "body": [
        {
          "type": "group",
          "behavior": "normal",
          "body": [
            {
              "type": "value",
              "kind": "octal", // ←
              "codePoint": 1,
              "range": [
                1,
                3
              ],
              "raw": "\\1"
            }
          ],
          "range": [
            0,
            4
          ],
          "raw": "(\\1)"
        }
      ],
      "range": [
        0,
        5
      ],
      "raw": "(\\1)+"
    },
    {
      "type": "reference",
      "matchIndex": 1,
      "range": [
        5,
        7
      ],
      "raw": "\\1"
    },
    {
      "type": "reference",
      "matchIndex": 1,
      "range": [
        7,
        9
      ],
      "raw": "\\1"
    }
  ],
  "range": [
    0,
    9
  ],
  "raw": "(\\1)+\\1\\1"
}
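Under Annex B, whether \N is a backreference depends on the total number of capturing groups in the whole pattern, not only on the groups seen so far. The sketch below shows that rule in isolation; both functions are hypothetical, deliberately simplified, and not taken from regjsparser.

```javascript
// Simplified scan: counts capturing groups, skipping escaped characters and
// the contents of character classes. Named groups (?<name>...) are counted;
// lookbehinds (?<= and (?<! are not.
function countCapturingGroups(pattern) {
  let count = 0;
  let inClass = false;
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === '\\') { i++; continue; }
    if (inClass) {
      if (ch === ']') inClass = false;
      continue;
    }
    if (ch === '[') { inClass = true; continue; }
    if (ch !== '(') continue;
    if (pattern[i + 1] !== '?') {
      count++;
    } else if (pattern[i + 2] === '<' && pattern[i + 3] !== '=' && pattern[i + 3] !== '!') {
      count++;
    }
  }
  return count;
}

// A decimal escape is a backreference when a group with that number exists
// anywhere in the pattern.
function isBackreference(number, pattern) {
  return number >= 1 && number <= countCapturingGroups(pattern);
}

const reported = '(\\1)+\\1\\1';
console.log(countCapturingGroups(reported), isBackreference(1, reported));
```

With this rule the \1 inside the group in the pattern above is also a backreference, because the pattern contains one capturing group overall.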

Add new "verbose" mode

This was brought up before and also arises in #54. The idea is to have a verbose mode, which spits out the current form of the AST. If the verbose mode is turned off, the following properties are omitted from the AST:

range
raw
kind

The current parse function accepts as second argument flags:

  function parse(str, flags) {

To turn on verbose mode, I think the flags string could just contain a v, like:

  parse('/hello/', 'v')

Thoughts?
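To get a feeling for what the non-verbose output would look like, the three properties can be stripped from an existing AST after parsing. `stripVerbose` below is a hypothetical post-processing helper, not a parser option that exists today.

```javascript
// Hypothetical helper: returns a copy of the AST without the properties
// that the proposed non-verbose mode would omit.
const VERBOSE_ONLY = new Set(['range', 'raw', 'kind']);

function stripVerbose(node) {
  if (Array.isArray(node)) return node.map(stripVerbose);
  if (node === null || typeof node !== 'object') return node;
  const result = {};
  for (const key of Object.keys(node)) {
    if (VERBOSE_ONLY.has(key)) continue;
    result[key] = stripVerbose(node[key]);
  }
  return result;
}

// Hand-written verbose AST for /a/ wrapped in an alternative.
const verboseAst = {
  type: 'alternative',
  body: [{ type: 'value', kind: 'symbol', codePoint: 97, range: [0, 1], raw: 'a' }],
  range: [0, 1],
  raw: 'a'
};

const compactAst = stripVerbose(verboseAst);
console.log(JSON.stringify(compactAst));
```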

Decimal escape in capture group is incorrectly parsed as a reference

AST Explorer.

Currently,

/([\1])/

is parsed as

{
  "type": "group",
  "behavior": "normal",
  "body": [
    {
      "type": "characterClass",
      "body": [
        {
          "type": "reference",
          "matchIndex": 1,
          "range": [
            2,
            4
          ],
          "raw": "\\1"
        }
      ],
      "negative": false,
      "range": [
        1,
        5
      ],
      "raw": "[\\1]"
    }
  ],
  "range": [
    0,
    6
  ],
  "raw": "([\\1])"
}

where \1 should have been parsed as a value.