regexpp's Introduction

regexpp

A regular expression parser for ECMAScript.

💿 Installation

$ npm install regexpp
  • Requires Node.js 8 or newer.

📖 Usage

import {
    AST,
    RegExpParser,
    RegExpValidator,
    RegExpVisitor,
    parseRegExpLiteral,
    validateRegExpLiteral,
    visitRegExpAST
} from "regexpp"

parseRegExpLiteral(source, options?)

Parse a given regular expression literal and return its AST object.

This is equivalent to new RegExpParser(options).parseLiteral(source).

  • Parameters:
    • source (string | RegExp) The source code to parse.
    • options? (RegExpParser.Options) The options to parse.
  • Return:
    • The AST of the regular expression.

validateRegExpLiteral(source, options?)

Validate a given regular expression literal.

This is equivalent to new RegExpValidator(options).validateLiteral(source).

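  • Parameters:
    • source (string) The source code to validate.
    • options? (RegExpValidator.Options) The options to validate.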

visitRegExpAST(ast, handlers)

Visit each node of a given AST.

This is equivalent to new RegExpVisitor(handlers).visit(ast).

RegExpParser

new RegExpParser(options?)

parser.parseLiteral(source, start?, end?)

Parse a regular expression literal.

  • Parameters:
    • source (string) The source code to parse. E.g. "/abc/g".
    • start? (number) The start index in the source code. Default is 0.
    • end? (number) The end index in the source code. Default is source.length.
  • Return:
    • The AST of the regular expression.

parser.parsePattern(source, start?, end?, uFlag?)

Parse a regular expression pattern.

  • Parameters:
    • source (string) The source code to parse. E.g. "abc".
    • start? (number) The start index in the source code. Default is 0.
    • end? (number) The end index in the source code. Default is source.length.
    • uFlag? (boolean) The flag to enable Unicode mode.
  • Return:
    • The AST of the regular expression pattern.

parser.parseFlags(source, start?, end?)

Parse regular expression flags.

  • Parameters:
    • source (string) The source code to parse. E.g. "gim".
    • start? (number) The start index in the source code. Default is 0.
    • end? (number) The end index in the source code. Default is source.length.
  • Return:
    • The AST of the regular expression flags.

RegExpValidator

new RegExpValidator(options?)

validator.validateLiteral(source, start?, end?)

Validate a regular expression literal.

  • Parameters:
    • source (string) The source code to validate.
    • start? (number) The start index in the source code. Default is 0.
    • end? (number) The end index in the source code. Default is source.length.

validator.validatePattern(source, start?, end?, uFlag?)

Validate a regular expression pattern.

  • Parameters:
    • source (string) The source code to validate.
    • start? (number) The start index in the source code. Default is 0.
    • end? (number) The end index in the source code. Default is source.length.
    • uFlag? (boolean) The flag to enable Unicode mode.

validator.validateFlags(source, start?, end?)

Validate regular expression flags.

  • Parameters:
    • source (string) The source code to validate.
    • start? (number) The start index in the source code. Default is 0.
    • end? (number) The end index in the source code. Default is source.length.

RegExpVisitor

new RegExpVisitor(handlers)

visitor.visit(ast)

Visit each node of a given AST.

  • Parameters:
    • ast (AST.Node) The AST to visit.

📰 Changelog

๐Ÿป Contributing

Welcome contributing!

Please use GitHub's Issues/PRs.

Development Tools

  • npm test runs tests and measures coverage.
  • npm run build compiles TypeScript source code to index.js, index.js.map, and index.d.ts.
  • npm run clean removes the temporary files which are created by npm test and npm run build.
  • npm run lint runs ESLint.
  • npm run update:test updates test fixtures.
  • npm run update:ids updates src/unicode/ids.ts.
  • npm run watch runs tests with --watch option.

regexpp's People

Contributors

bluelovers, mysticatea, ota-meshi, validark

regexpp's Issues

Export ast types in index

Most types from regexpp/ast are not exported in the index, which makes dealing with them awkward.

Visitor: improve usability

The current visitor has some problems. The biggest is that there's no way to visit every expression, where an expression might be the content of a group or of a lookahead or lookbehind. All expressions have common needs -- in particular the alternatives need to be evaluated to know if the expression matches, after which different actions are appropriate depending on the specific type. The worst offender is the assertion type, because it may or may not be an expression at all, depending on whether the kind is lookahead or lookbehind. This may be fine for a backtracking engine since these can evaluate the assertion inside their own block scope, but it makes life fabulously difficult for non-backtracking engines which must always track the states until the input reveals whether the subexpression matches or fails.
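
As a rough illustration of the request, the sketch below walks an AST-shaped object and reports every node that owns a list of alternatives, whatever its concrete type. It assumes the node shapes regexpp produces (alternatives, elements, element). The helper name visitExpressions and the hand-built sample tree are hypothetical and are not part of regexpp.

```javascript
// Hypothetical helper, not part of regexpp: calls `onExpression` for every
// node that carries an `alternatives` array (Pattern, Group, CapturingGroup,
// and lookahead/lookbehind Assertion nodes).
function visitExpressions(node, onExpression) {
    if (Array.isArray(node.alternatives)) {
        onExpression(node)
        for (const alternative of node.alternatives) {
            visitExpressions(alternative, onExpression)
        }
    } else if (Array.isArray(node.elements)) {
        // Alternative and CharacterClass nodes hold a flat element list.
        for (const element of node.elements) {
            visitExpressions(element, onExpression)
        }
    } else if (node.element) {
        // Quantifier nodes wrap a single element.
        visitExpressions(node.element, onExpression)
    }
}

// Hand-built tree shaped like the AST for the pattern `a(?=b)`.
const sampleAst = {
    type: "Pattern",
    alternatives: [{
        type: "Alternative",
        elements: [
            { type: "Character", value: 97 },
            {
                type: "Assertion",
                kind: "lookahead",
                negate: false,
                alternatives: [{
                    type: "Alternative",
                    elements: [{ type: "Character", value: 98 }],
                }],
            },
        ],
    }],
}

const found = []
visitExpressions(sampleAst, (node) => found.push(node.kind || node.type))
console.log(found) // [ 'Pattern', 'lookahead' ]
```

A single callback like this lets a consumer treat group bodies and lookaround bodies uniformly, then branch on node.type or node.kind afterwards.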

bug about x?y*z+

[ { type: 'Quantifier',
    start: 2,
    end: 3,
    raw: '?', // <====== 'x?'
    min: 0,
    max: 1,
    greedy: true,
    element: 
     { type: 'Character',
       parent: [Circular],
       start: 1,
       end: 2,
       raw: 'x',
       value: 120 } },
  { type: 'Quantifier',
    start: 4,
    end: 5,
    raw: '*', // <====== 'y*'
    min: 0,
    max: Infinity,
    greedy: true,
    element: 
     { type: 'Character',
       parent: [Circular],
       start: 3,
       end: 4,
       raw: 'y',
       value: 121 } },
  { type: 'Quantifier',
    start: 6,
    end: 7,
    raw: '+', // <====== 'z+'
    min: 1,
    max: Infinity,
    greedy: true,
    element: 
     { type: 'Character',
       parent: [Circular],
       start: 5,
       end: 6,
       raw: 'z',
       value: 122 } } ]

I think you could add a new node type for the quantifier mark ({n,m}, +, *, ?).

This is the output from another, older parser:

[ Quantified {
    type: 'quantified',
    offset: 0,
    text: 'u{1,2}',
    body: Literal { type: 'literal', offset: 0, text: 'u', body: 'u' },
    quantifier: 
     Quantifier {
       type: 'quantifier',
       offset: 1,
       text: '{1,2}',
       min: 1,
       max: 2,
       greedy: true } },
  Quantified {
    type: 'quantified',
    offset: 6,
    text: 'u{1,}',
    body: Literal { type: 'literal', offset: 6, text: 'u', body: 'u' },
    quantifier: 
     Quantifier {
       type: 'quantifier',
       offset: 7,
       text: '{1,}',
       min: 1,
       max: Infinity,
       greedy: true } },
  Literal { type: 'literal', offset: 11, text: 'i', body: 'i' },
  Quantified {
    type: 'quantified',
    offset: 12,
    text: 'x?',
    body: Literal { type: 'literal', offset: 12, text: 'x', body: 'x' },
    quantifier: 
     Quantifier {
       type: 'quantifier',
       offset: 13,
       text: '?',
       min: 0,
       max: 1,
       greedy: true } },
  Quantified {
    type: 'quantified',
    offset: 14,
    text: 'y*',
    body: Literal { type: 'literal', offset: 14, text: 'y', body: 'y' },
    quantifier: 
     Quantifier {
       type: 'quantifier',
       offset: 15,
       text: '*',
       min: 0,
       max: Infinity,
       greedy: true } },
  Quantified {
    type: 'quantified',
    offset: 16,
    text: 'd+?',
    body: Literal { type: 'literal', offset: 16, text: 'd', body: 'd' },
    quantifier: 
     Quantifier {
       type: 'quantifier',
       offset: 17,
       text: '+',
       min: 1,
       max: Infinity,
       greedy: false } } ]
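
To make the comparison concrete, the following sketch converts a node shaped like the regexpp output above (where raw holds only the quantifier mark) into the nested form printed by the older parser. toQuantified is a hypothetical helper, and the input object is copied by hand from the first dump.

```javascript
// Hypothetical helper: rebuilds the "quantified" shape from a Quantifier node
// whose `raw` contains only the mark, as in the output reported above.
function toQuantified(node) {
    return {
        type: "quantified",
        offset: node.element.start,
        text: node.element.raw + node.raw,
        body: node.element,
        quantifier: {
            type: "quantifier",
            offset: node.start,
            text: node.raw,
            min: node.min,
            max: node.max,
            greedy: node.greedy,
        },
    }
}

// First node from the dump above (the `parent` link is omitted).
const xOptional = {
    type: "Quantifier", start: 2, end: 3, raw: "?",
    min: 0, max: 1, greedy: true,
    element: { type: "Character", start: 1, end: 2, raw: "x", value: 120 },
}

const quantified = toQuantified(xOptional)
console.log(quantified.text, quantified.quantifier.text) // x? ?
```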

RegExpVisitor class is not exported

Hello, and thanks for this library.

It seems the RegExp visitor class is not exported at runtime.
However the documentation seems to imply it is.

regexpp/src/index.ts

Lines 1 to 6 in 4bcab0b

import * as AST from "./ast"
import { RegExpParser } from "./parser"
import { RegExpValidator } from "./validator"
import { RegExpVisitor } from "./visitor"
export { AST, RegExpParser, RegExpValidator }

I need the class itself to be exposed in order to modify the traversal, e.g:

  • Halt the traversal when something has been detected.
  • Avoid traversing certain sub-nodes of the AST.

Cheers.
Shahar.

printing

It would be pretty useful for debugging to be able to print nodes!
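
A minimal sketch of what such a printer could look like, assuming nodes expose type, raw, and one of alternatives, elements, or element, as in the dumps on this page. printNode is a hypothetical name, the sample tree is hand-built, and node kinds with other child properties (for example character class ranges) are not descended into.

```javascript
// Hypothetical debug printer: one line per node, indented by depth.
function printNode(node, depth = 0, lines = []) {
    lines.push(`${"  ".repeat(depth)}${node.type} ${JSON.stringify(node.raw)}`)
    const children = node.alternatives || node.elements ||
        (node.element ? [node.element] : [])
    for (const child of children) {
        printNode(child, depth + 1, lines)
    }
    return lines
}

// Hand-built tree shaped like the AST for the pattern `a+`.
const samplePattern = {
    type: "Pattern",
    raw: "a+",
    alternatives: [{
        type: "Alternative",
        raw: "a+",
        elements: [{
            type: "Quantifier",
            raw: "a+",
            element: { type: "Character", raw: "a" },
        }],
    }],
}

const printed = printNode(samplePattern)
console.log(printed.join("\n"))
// Pattern "a+"
//   Alternative "a+"
//     Quantifier "a+"
//       Character "a"
```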

`\c` is parsed incorrectly

RegExpp: v3.1.0
NodeJS: v13.12.0


The following code:

const { RegExpParser } = require("regexpp");

const parser = new RegExpParser();
const ast = parser.parsePattern(/[\c]/.source);
console.log(JSON.stringify(ast, (key, value) => key === "parent" ? null : value, 4));

will output the following

{
    "type": "Pattern",
    "parent": null,
    "start": 0,
    "end": 4,
    "raw": "[\\c]",
    "alternatives": [
        {
            "type": "Alternative",
            "parent": null,
            "start": 0,
            "end": 4,
            "raw": "[\\c]",
            "elements": [
                {
                    "type": "CharacterClass",
                    "parent": null,
                    "start": 0,
                    "end": 4,
                    "raw": "[\\c]",
                    "negate": false,
                    "elements": [
                        {
                            "type": "Character",
                            "parent": null,
                            "start": 1,
                            "end": 2,
                            "raw": "\\",
                            "value": 92
                        },
                        {
                            "type": "Character",
                            "parent": null,
                            "start": 2,
                            "end": 3,
                            "raw": "c",
                            "value": 99
                        }
                    ]
                }
            ]
        }
    ]
}

As you can see, \c is parsed as a backslash character and the character c. This happens both inside and outside of character classes.
Instead, it should be parsed as a single character c.
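
One way to cross-check the expected behaviour is to ask the host engine directly how its own RegExp treats a lone \c. The sketch below only reports what the runtime does and deliberately states no expected result. probeControlEscape is a hypothetical helper name.

```javascript
// Reports how the host engine's own RegExp treats a `\c` that is not
// followed by a control letter, inside and outside a character class.
function probeControlEscape() {
    const inClass = new RegExp("[\\c]")
    const bare = new RegExp("\\c")
    return {
        classMatchesBackslash: inClass.test("\\"),
        classMatchesC: inClass.test("c"),
        bareMatchesC: bare.test("c"),
        bareMatchesBackslashC: bare.test("\\c"),
    }
}

const probe = probeControlEscape()
console.log(probe)
```

Comparing these four booleans with the parsed AST shows whether the parser and the engine agree on the input.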

Separate method for parsing unicode

Unicode support is easily the most expensive part of using this package, but there are many possible usages of it that do not require unicode at all, mine included. Unfortunately it is impossible to avoid the cost of your unicode implementation as there is no way to import the library that does not also import unicode, and no tree-shaking engine can remove code that is triggered by a runtime option to a used function.

This would require a breaking change to fix, but it is worth fixing.

Invitation to move to official `eslint-community` org

We would love to have this repo added to the official @eslint-community organization on GitHub.

As you can read in the '@eslint-community GitHub organization' RFC, the goal of this new org is to have a place where community members can help ensure that widely depended-upon ESLint-related packages stay up to date with newer ESLint releases and don't hold the wider community back, without depending on one person's GitHub/npm account.

This plugin is really popular (23M+ downloads/week) and is used by the main ESLint repo, but it is currently unmaintained: the latest commit is 1y+ old, the latest interaction is 1y+ old as well, and the main ESLint repo is considering using a different package because #27 still isn't merged. We'd therefore love you, @mysticatea, to transfer this repository to the new org, where it would have a better home.

Include documentation in `index.d.ts`

I noticed that index.d.ts doesn't include any documentation even though a lot of useful comments are defined in the source files. Having inline documentation in your IDE of choice is very useful, so could it please be added in the next release?

The only thing that has to change for this is the removeComments setting. This means that comments will also be included in the JS files, but that shouldn't be an issue since they aren't minified anyway.

Should I make a PR?

Update Unicode to 13

ES2020 should use Unicode 13 for ID_Start, ID_Continue, and Unicode escape sequences (\p{}).

Expose parents in regexp visitor

Right now in the visitor it's not possible to see the chain of parent nodes, unless you add a handler for every single node type and maintain your own array. It'd be useful to pass the parents in as a second parameter.
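
Until such a parameter exists, one possible workaround is to follow the parent link that nodes carry (the dumps elsewhere on this page show a parent property on each node). The sketch below assumes that property. getAncestors is a hypothetical helper and the sample nodes are hand-built.

```javascript
// Hypothetical helper: collects the chain of ancestors, nearest first,
// by following the `parent` link of each node.
function getAncestors(node) {
    const ancestors = []
    for (let current = node.parent; current; current = current.parent) {
        ancestors.push(current)
    }
    return ancestors
}

// Hand-built nodes linked the way the AST for the pattern `a` would be.
const pattern = { type: "Pattern", parent: null }
const alternative = { type: "Alternative", parent: pattern }
const character = { type: "Character", parent: alternative }

const chain = getAncestors(character).map((n) => n.type)
console.log(chain) // [ 'Alternative', 'Pattern' ]
```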

Missing support for js-supported character class

I want to parse this regex:

/\p{ID_Start}\p{ID_Continue}+/u

Node 16 accepts it:

const idRegex = /^\p{ID_Start}\p{ID_Continue}+$/u
console.log(idRegex.test("anIdentifier"))
// > true
console.log(idRegex.test("not an Identifier"))
// > false

regexpp 3.2.0 does not:

regexpp.validateRegExpLiteral(/^\p{ID_Start}\p{ID_Continue}+$/u)
// > RangeError: Invalid code point -1
at Function.fromCodePoint in ECMAScript
at RegExpValidator.validateLiteral in regexpp/index.js - line 411
at Object.validateRegExpLiteral in regexpp/index.js - line 2084

These two character classes are important for programming language parsing: https://unicode.org/reports/tr31/

Bug: Named backreferences will always cause a syntax error for non-Unicode regexes in strict parsing mode

When parsing a non-Unicode regex that contains named backreferences with the strict: true option, a syntax error is always thrown, regardless of whether the regex is actually correct.

Example:

const { RegExpValidator } = require("regexpp")

const validator = new RegExpValidator({ strict: true, ecmaVersion: 2020 })
validator.validatePattern(/(?<foo>A)\k<foo>/.source, undefined, undefined, false)

This produces the following error:

SyntaxError: Invalid regular expression: /(?<foo>A)\k<foo>/: Invalid escape
    at RegExpValidator.raise ([...]\regexpp\.temp\src\validator.ts:847:15)
    at RegExpValidator.consumeAtomEscape ([...]\regexpp\.temp\src\validator.ts:1475:18)
    at RegExpValidator.consumeReverseSolidusAtomEscape ([...]\regexpp\.temp\src\validator.ts:1245:22)
    at RegExpValidator.consumeAtom ([...]\regexpp\.temp\src\validator.ts:1213:18)
    at RegExpValidator.consumeTerm ([...]\regexpp\.temp\src\validator.ts:1027:23)
    at RegExpValidator.consumeAlternative ([...]\regexpp\.temp\src\validator.ts:1000:53)
    at RegExpValidator.consumeDisjunction ([...]\regexpp\.temp\src\validator.ts:976:18)
    at RegExpValidator.consumePattern ([...]\regexpp\.temp\src\validator.ts:901:14)
    at RegExpValidator.validatePattern ([...]\regexpp\.temp\src\validator.ts:531:14)
    at validateRegExpPattern (my-project\app.ts:12:75)

However, the regex /(?<foo>A)\k<foo>/ is valid. As stated in the proposal:

In this proposal, \k<foo> in non-Unicode RegExps will continue to match the literal string "k<foo>" unless the RegExp contains a named group, in which case it will match that group or be a syntax error, depending on whether or not the RegExp has a named group named foo.

Since the regex contains a named capturing group, \k<foo> has to be parsed as a backreference. Since Annex B doesn't say anything about named backreferences, regexpp should parse this regex even with strict: true.

However, regexpp parses it as an invalid(?) escape and throws an error in strict mode. This is because validation is done in two passes (1, 2). The bug occurs because the n flag isn't set in the first pass, which causes the syntax error. This can be seen in the stack trace: the second-to-last line, at RegExpValidator.validatePattern ([...]\validator.ts:531:14), is the first parsing pass.

The fix for this bug is to determine whether the regex contains named groups ahead of time, similar to how the number of capturing groups is counted before parsing. I will make a PR.
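
The pre-scan idea can be sketched roughly as follows. This is a simplified illustration and not regexpp's implementation: hasNamedGroup is a hypothetical function that skips escapes and character classes and treats (?<= and (?<! as lookbehinds.

```javascript
// Simplified sketch, not regexpp's implementation: reports whether a pattern
// source contains a named capturing group such as (?<name>...).
function hasNamedGroup(pattern) {
    let inClass = false
    for (let i = 0; i < pattern.length; i++) {
        const ch = pattern[i]
        if (ch === "\\") {
            i++ // skip the escaped character
        } else if (ch === "[") {
            inClass = true
        } else if (ch === "]") {
            inClass = false
        } else if (!inClass && pattern.startsWith("(?<", i)) {
            const next = pattern[i + 3]
            // (?<= and (?<! are lookbehind assertions, not named groups.
            if (next !== "=" && next !== "!") {
                return true
            }
        }
    }
    return false
}

console.log(hasNamedGroup("(?<foo>A)\\k<foo>")) // true
console.log(hasNamedGroup("(?<=A)\\k<foo>"))    // false
console.log(hasNamedGroup("[(?<foo>]"))         // false
```

With a result like this available before the first pass, \k<foo> could be treated as a backreference from the start whenever the pattern contains a named group.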

Allow traversal in RegExpVisitor

Estraverse allows the visitor to do several things in its functions:

  • return skip to avoid recursion
  • return break to stop iteration
  • call a replace function to replace the current node

It would be useful to implement these in regexpp as well.
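
The skip and break behaviours could look roughly like the sketch below. It is a standalone walk over AST-shaped objects, not a regexpp API: SKIP, BREAK, and traverse are hypothetical names, the sample tree is hand-built, and node replacement is not covered.

```javascript
// Hypothetical control signals, modelled on the Estraverse idea.
const SKIP = Symbol("skip")
const BREAK = Symbol("break")

// Depth-first walk. `enter` may return SKIP to ignore a node's children
// or BREAK to stop the whole traversal. Returns false once BREAK was seen.
function traverse(node, enter) {
    const signal = enter(node)
    if (signal === BREAK) {
        return false
    }
    if (signal === SKIP) {
        return true
    }
    const children = node.alternatives || node.elements ||
        (node.element ? [node.element] : [])
    for (const child of children) {
        if (!traverse(child, enter)) {
            return false
        }
    }
    return true
}

// Hand-built tree shaped like the AST for the pattern `(a)b`.
const tree = {
    type: "Pattern",
    alternatives: [{
        type: "Alternative",
        elements: [
            {
                type: "CapturingGroup",
                alternatives: [{
                    type: "Alternative",
                    elements: [{ type: "Character", raw: "a" }],
                }],
            },
            { type: "Character", raw: "b" },
        ],
    }],
}

const seen = []
traverse(tree, (node) => {
    seen.push(node.type)
    return node.type === "CapturingGroup" ? SKIP : undefined
})
console.log(seen) // [ 'Pattern', 'Alternative', 'CapturingGroup', 'Character' ]
```

Returning BREAK from the same callback would stop the walk at the capturing group, so the trailing Character node would never be visited.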

Some questions about reusing this library

Hello.

I'd like to reuse your library in a parsing library I've authored.
I am currently using a RegExp parser that I wrote myself,
but your library seems to be of higher quality, and it would save me the trouble of maintaining my own RegExp parser 😄.

Questions:

  1. Can this library be run in the browser? Are there any limitations on bundling it for browser usage, including IE11 😢?
  2. Why is node >= 6.5 needed?
  3. Are there any limitations or constraints I should be aware of?

Cheers.
