pug-lexer

The pug lexer. This module is responsible for taking a string and converting it into an array of tokens.


Installation

npm install pug-lexer

Usage

var lex = require('pug-lexer');

lex(str, options)

Convert Pug string to an array of tokens.

options can contain the following properties:

  • filename (string): The name of the Pug file; it is used in error handling if provided.
  • plugins (array): An array of plugins, in the order they should be applied.
console.log(JSON.stringify(lex('div(data-foo="bar")', {filename: 'my-file.pug'}), null, '  '))
[
  {
    "type": "tag",
    "line": 1,
    "val": "div",
    "selfClosing": false
  },
  {
    "type": "attrs",
    "line": 1,
    "attrs": [
      {
        "name": "data-foo",
        "val": "\"bar\"",
        "escaped": true
      }
    ]
  },
  {
    "type": "eos",
    "line": 1
  }
]

new lex.Lexer(str, options)

Constructor for a Lexer class. This is not meant to be used directly unless you know what you are doing.

options may contain the following properties:

  • filename (string): The name of the Pug file; it is used in error handling if provided.
  • interpolated (boolean): whether the Lexer is created as a child lexer for inline tag interpolation (e.g. #[p Hello]). Defaults to false.
  • startingLine (integer): the real line number of the first line in the input. It is also used for inline tag interpolation. Defaults to 1.
  • plugins (array): An array of plugins, in the order they should be applied.

License

MIT

pug-lexer's People

Contributors

alubbe, evanw, forbeslindesay, hemanth, rzara, timothygu

pug-lexer's Issues

Should `text-html` capture any lines that contain HTML?

Currently `p This is <strong>html</strong> text` yields the following tokens:

[ { type: 'tag', line: 1, val: 'p', selfClosing: false }
, { type: 'text', line: 1, val: 'This is <strong>html</strong> text' }
, { type: 'eos', line: 1 }
]

But should the text token be text-html, or is text-html only reserved for lines of text that start with HTML tags?

Riot compatibility

Greetings, I hope this is the right place for this issue.

In Riot.js, we write <ul each={item in items}></ul>, but in Pug this code is invalid: ul(each={item in items}) {item}, so we have to wrap the Riot expression in a string like so: ul(each="{item in items}") {item}.

But then we lose linting.

Would it be possible to have an option that allows spaces within the braces, or some other solution, so I can have Riot + Pug + linting?

Many thanks in advance

Identifiers starting with 'of' in the 'each' value variable break the lexer

This was reported against my eslint pug plugin:
valpackett/eslint-plugin-pug#10

In the value-variable position, an identifier like offers (in fact, anything that matches of.+, i.e. not of itself) makes the lexer throw:

> const lex = require('pug-lexer')
> lex('each x in of')
[{"type":"each","loc":{"start":{"line":1,"column":1},"end":{"line":1,"column":13}},"val":"x","key":null,"code":"of"},{"type":"eos","loc":{"start":{"line":1,"column":13},"end":{"line":1,"column":13}}}]
> lex('each x in ofX')
Uncaught Error: Pug:1:14
  > 1| each x in ofX
--------------------^

The value variable for each must either be a valid identifier (e.g. `item`) or a pair of identifiers in square brackets (e.g. `[key, value]`).
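A plausible cause (purely a guess from the symptoms, not drawn from pug-lexer's actual source) is a keyword check for `of` that lacks a word boundary, so any identifier beginning with "of" is mistaken for the keyword. A minimal sketch of the difference, using hypothetical patterns:

```javascript
// Hypothetical illustration of the suspected bug: a regex that matches
// the "of" keyword without a word boundary also matches identifiers
// that merely start with "of". These patterns are illustrative only,
// not the lexer's actual source.
var loose  = /in\s+of/;    // also matches "in ofX" (false positive)
var strict = /in\s+of\b/;  // \b requires "of" to end at a word boundary

console.log(loose.test('each x in ofX'));   // true
console.log(strict.test('each x in ofX'));  // false
console.log(strict.test('each x in of'));   // true
```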

Attribute names starting with ":" are not allowed anymore

In vue.js templates, HTML-attributes starting with : are used for passing property values to a vue component.

An example is here:

https://github.com/vuejs/vue/blob/next/examples/select2/index.html (line 27)

With jade, they were accepted without any problems, but pug does not allow them anymore:

  1| 
  2| #app
> 3|   hello(:name='name')
--------------------^
  4| 

":" is not valid as the start or end of an un-quoted attribute.
at makeError (/home/egon/dev/electron-vue/node_modules/pug-error/index.js:32:13)
at Lexer.error (/home/egon/dev/electron-vue/node_modules/pug-lexer/index.js:52:15)
at Lexer.attrs (/home/egon/dev/electron-vue/node_modules/pug-lexer/index.js:1044:18)

Should new lines after pipeless text be reported?

e.g.

.
  foo
    bar
      baz
  .

yields

{"type":"dot","line":1}
{"type":"start-pipeless-text","line":1}
{"type":"text","line":2,"val":"foo"}
{"type":"newline","line":3}
{"type":"text","line":3,"val":"  bar"}
{"type":"newline","line":4}
{"type":"text","line":4,"val":"    baz"}
{"type":"newline","line":5}
{"type":"text","line":5,"val":"."}
{"type":"end-pipeless-text","line":5}
{"type":"eos","line":5}

I would expect the eos token to be preceded by a newline token, and the eos token to be reported as being on line 6, as per

p test

yields

{"type":"tag","line":1,"col":1,"val":"p","selfClosing":false}
{"type":"text","line":1,"col":3,"val":"test"}
{"type":"newline","line":2,"col":1}
{"type":"eos","line":2,"col":1}

case parsing

Reported by @neochrome in pugjs/pug#2235


Hi,
I run into problems parsing this kind of case/when

case "a:b"
  when "a:b"
    p a:b
  default
    p default

The error is expected "indent", but got "filter", but really it should handle "a:b" as a string literal, right?

This workaround is a bit of a kludge, but at least it works:

-var test = "a:b";
case "a:b"
  when test
    p a:b
  default
    p default

Adding source endings to tokens

Currently working on adding source location endings to all tokens, as discussed here. However, I wanted to move this to its own issue instead of hijacking the other.

Going to make it match babylon's loc format which looks like:

var token = {
    loc: {
        start: { line:1, column:1 },
        end: { line:1, column:13 }
    }
};

Question: which line number should an 'outdent' return?

@ForbesLindesay: While looking at the reported columns for tokens, I noticed a difference in the line reporting of 'outdent' tokens depending on where they occur, and was wondering whether this was intentional.

Take the following jade as an example:

foo
  bar
foz
  baz

Which yields the following tokens:

[ { type: 'tag', line: 1, val: 'foo', selfClosing: false },
  { type: 'indent', line: 2, val: 2 },
  { type: 'tag', line: 2, val: 'bar', selfClosing: false },
  { type: 'outdent', line: 3 },
  { type: 'tag', line: 3, val: 'foz', selfClosing: false },
  { type: 'indent', line: 4, val: 2 },
  { type: 'tag', line: 4, val: 'baz', selfClosing: false },
  { type: 'outdent', line: 4 },
  { type: 'eos', line: 4 } ]

The first outdent is reported as being on line 3, essentially before foz. However, the final outdent is reported as being on line 4, essentially after baz, which is correct as there is no line 5.

So is this intentional, or should the first outdent be reported as being on line 2?

Switch to generators

Currently, the lexer returns an array containing all the tokens. For very large jade files, this means that the array has to contain ALL of the tokens, and therefore has to use a lot of RAM. For instance, it takes 112 megabytes of RAM to lex and parse a 756-kilobyte test file created by concatenating mixin.attrs.jade (if we don't copy the tokens in Lexer#getTokens it still takes 100 megabytes), when measured using GNU time.

To reduce this memory usage, we could consider using ES2015 generator functions on platforms where these are supported. In my preliminary tests, the same file only takes 91 megabytes to lex and parse, while being only marginally slower (~2%).

What's your opinion on this? Do you think the gains are enough to warrant the additional complexity?


The diffs I used:

jade-lexer:

diff --git a/index.js b/index.js
index 54badd7..d76bc91 100644
--- a/index.js
+++ b/index.js
@@ -3,6 +3,7 @@
 var assert = require('assert');
 var characterParser = require('character-parser');
 var error = require('jade-error');
+var GeneratorFunction = require('generator-function');

 module.exports = lex;
 module.exports.Lexer = Lexer;
@@ -10,6 +11,14 @@ function lex(str, filename) {
   var lexer = new Lexer(str, filename);
   return JSON.parse(JSON.stringify(lexer.getTokens()));
 }
+if (GeneratorFunction) {
+  module.exports.lexIterator = Function('Lexer',
+    'return function* (str, filename) {\n' +
+    '  var lexer = new Lexer(str, filename);\n' +
+    '  yield* lexer.getIterator();\n' +
+    '}'
+  )(Lexer);
+}

 /**
  * Initialize `Lexer` with the given `str`.
@@ -1088,5 +1097,18 @@ Lexer.prototype = {
       this.advance();
     }
     return this.tokens;
-  }
+  },
+
+  getIterator: (function () {
+    if (GeneratorFunction) {
+      return GeneratorFunction('',
+        'while (!this.ended) {\n' +
+        '  this.advance();\n' +
+        '  if (this.tokens.length === 1) yield this.tokens[0];\n' +
+        '  else yield* this.tokens[Symbol.iterator]();\n' +
+        '  this.tokens = [];\n' +
+        '}'
+      );
+    }
+  })()
 };

token-stream:

--- lib/array.js        2015-10-11 10:38:15.384840871 -0700
+++ lib/iterator.js     2015-10-11 10:56:40.112840871 -0700
@@ -2,28 +2,41 @@

 module.exports = TokenStream;
 function TokenStream(tokens) {
-  if (!Array.isArray(tokens)) {
-    throw new TypeError('tokens must be passed to TokenStream as an array.');
+  if (!tokens || !tokens[Symbol.iterator]) {
+    throw new TypeError('tokens must be passed to TokenStream as an iterable.');
   }
-  this._tokens = tokens;
+  this._iterator = tokens[Symbol.iterator]();
+  this._tokens = [];
 }
 TokenStream.prototype.lookahead = function (index) {
   if (this._tokens.length <= index) {
-    throw new Error('Cannot read past the end of a stream');
+    var j = index + 1 - this._tokens.length;
+    while (j--) {
+      var res = this._iterator.next();
+      if (res.done) throw new Error('Cannot read past the end of a stream');
+      this._tokens.push(res.value);
+    }
   }
   return this._tokens[index];
 };
 TokenStream.prototype.peek = function () {
-  if (this._tokens.length === 0) {
-    throw new Error('Cannot read past the end of a stream');
+  if (this._tokens.length) {
+    return this._tokens[0];
+  } else {
+    var res = this._iterator.next();
+    if (res.done) throw new Error('Cannot read past the end of a stream');
+    this._tokens[0] = res.value;
+    return res.value;
   }
-  return this._tokens[0];
 };
 TokenStream.prototype.advance = function () {
-  if (this._tokens.length === 0) {
-    throw new Error('Cannot read past the end of a stream');
+  if (this._tokens.length) {
+    return this._tokens.shift();
+  } else {
+    var res = this._iterator.next();
+    if (res.done) throw new Error('Cannot read past the end of a stream');
+    return res.value;
   }
-  return this._tokens.shift();
 };
 TokenStream.prototype.defer = function (token) {
   this._tokens.unshift(token);

The json files under test/cases are not really json

The files are not really JSON; rather, each line is a separate JSON object.

{"type":"newline","line":3,"col":1}
{"type":"tag","line":3,"col":1,"val":"ul"}
{"type":"indent","line":4,"col":1,"val":2}
...

With a little trick you could turn them into valid JSON: wrap the lines in an array:

[
    {"type":"newline","line":3,"col":1},
    {"type":"tag","line":3,"col":1,"val":"ul"},
    {"type":"indent","line":4,"col":1,"val":2}
    ...
]

And JSON files can be read with the require() function. ;)
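The conversion the issue suggests is a one-liner per line. A minimal sketch (illustrative only, not part of pug-lexer):

```javascript
// Sketch: turn a line-delimited token dump into a real JSON array,
// as suggested above. Illustration only, not part of pug-lexer.
function linesToJsonArray(text) {
  var tokens = text
    .split('\n')
    .filter(function (line) { return line.trim() !== ''; })
    .map(function (line) { return JSON.parse(line); });
  return JSON.stringify(tokens, null, 4);
}

var dump =
  '{"type":"newline","line":3,"col":1}\n' +
  '{"type":"tag","line":3,"col":1,"val":"ul"}\n' +
  '{"type":"indent","line":4,"col":1,"val":2}\n';

console.log(linesToJsonArray(dump));
```

The result parses with JSON.parse and loads with require().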

Class name rules are too strict

From HTML 4.01 onwards, the class attribute is allowed to have weird values, including Unicode symbols. The following code is perfectly valid:

<p class="#">Foo.
<p class="##">Bar.
<p class="">Baz.
<p class="©">Inga.
<p class="{}">Lorem.
<p class="“‘’”">Ipsum.
<p class="⌘⌥">Dolor.
<p class="{}">Sit.
<p class="[attr=value]">Amet.

However, the pug lexer restricts class names to values beginning with -, _, or a letter, and only containing _, -, a-z, and 0-9. Is there a reason for it not being more lenient?

Plugin API

Add an extra option, plugins, which should be an array of plugin objects. Plugins can define methods that "override" any of the methods of the lexer, which normally return true or false to indicate whether they should fall through. For example, if you wanted to implement the "id" token as a plugin, you could use:

var opts = {
  plugins: [
    {
      advance: function (lexer) {
        var tok = lexer.scan(/^#([\w-]+)/, 'id');
        if (tok) {
          lexer.tokens.push(tok);
          lexer.incrementColumn(tok.val.length);
          return true;
        }
        if (/^#/.test(lexer.input)) {
          lexer.error('INVALID_ID', '"' + /.[^ \t\(\#\.\:]*/.exec(lexer.input.substr(1))[0] + '" is not a valid ID.');
        }
      }
    }
  ]
};

Plugins should be called in sequence until one of them returns true. In this way plugins can be combined relatively safely.
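The "call in sequence until one returns true" rule can be sketched as a small dispatch loop. callPlugins is a hypothetical helper for illustration, not the lexer's actual internals:

```javascript
// Sketch of the dispatch rule described above: call each plugin's
// override in order and stop at the first one that returns true.
// callPlugins is a hypothetical helper, not the lexer's internals.
function callPlugins(lexer, method, plugins) {
  for (var i = 0; i < plugins.length; i++) {
    var fn = plugins[i][method];
    if (fn && fn(lexer) === true) return true;
  }
  return false; // no plugin handled it; fall back to the built-in method
}

// Usage with three toy plugins: the first declines, the second handles,
// so the third never runs.
var calls = [];
var plugins = [
  { advance: function () { calls.push('a'); return false; } },
  { advance: function () { calls.push('b'); return true; } },
  { advance: function () { calls.push('c'); return true; } }
];
console.log(callPlugins({}, 'advance', plugins)); // true
console.log(calls.join(','));                     // "a,b"
```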

Nested template literals inside pug block return syntax error

I'm using babel-plugin-transform-react-pug in a React project where I'm passing dynamic class names. I use nested template literals like so:

return pug`
  div(className=${`${blockClass}__element`})
    | Whatever content  
`;

This works fine. But JSLint throws an error that seems to come from pug-lexer:

Syntax Error: Unexpected token
Error: Pug:1:16
  > 1| div(className=${`${blockClass}__element`})
----------------------^
    2|       | Whatever content

Syntax Error: Unexpected token
    at makeError (/home/deploy/homestars-www/node_modules/pug-error/index.js:32:13)
    at Lexer.error (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:58:15)
    at Lexer.assertExpression (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:86:12)
    at Lexer.attrs (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:1089:18)
    at Lexer.callLexerFunction (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:1319:23)
    at Lexer.advance (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:1356:15)
    at Lexer.callLexerFunction (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:1319:23)
    at Lexer.getTokens (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:1375:12)
    at lex (/home/deploy/homestars-www/node_modules/pug-lexer/index.js:12:42)
    at findVariablesInTemplate (/home/deploy/homestars-www/node_modules/pug-uses-variables/lib/findVariablesInTemplate.js:31:20)

If I remove the nested literal, this error doesn't occur.

Is this just because of the difference in interpolation syntax between babel-plugin-transform-react-pug and pugjs? Namely:

babel-plugin-transform-react-pug: `${}`
pugjs: `#{}` 

Remove "pipeless" state

Rather than setting the various state options for pipeless text, we should just call pipelessText from tokens that we expect to be followed by pipeless text. This would simplify our state model and thus make plugins much less brittle.

Make attributes part of the token stream

Let's define three new token types:

{type: 'start-attributes'}
{type: 'attribute', name: 'string', val: 'string', mustEscape: true}
{type: 'end-attributes'}

This way, we can stop making attributes such a uniquely vast token, and it will be easy/obvious how to put line numbers on them.
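For concreteness, here is a sketch of how today's monolithic attrs token would map onto the three proposed token types. expandAttrsToken is an illustration of the proposed shape only, not the lexer's implementation:

```javascript
// Sketch of the proposed shape: expand a legacy "attrs" token into the
// three new token types defined above. Illustration only; field names
// follow the proposal, not the lexer's implementation.
function expandAttrsToken(tok) {
  var out = [{ type: 'start-attributes' }];
  tok.attrs.forEach(function (attr) {
    out.push({
      type: 'attribute',
      name: attr.name,
      val: attr.val,
      mustEscape: attr.escaped
    });
  });
  out.push({ type: 'end-attributes' });
  return out;
}

// The attrs token from the lex() example earlier in this README:
var legacy = {
  type: 'attrs',
  line: 1,
  attrs: [{ name: 'data-foo', val: '"bar"', escaped: true }]
};
console.log(JSON.stringify(expandAttrsToken(legacy)));
```

Each attribute then becomes its own token, so per-attribute line numbers fall out naturally.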
