yahoo / context-parser Goto Github PK



A robust HTML5 context parser that parses HTML 5 web pages and reports the execution context of each character.

License: BSD 3-Clause "New" or "Revised" License

JavaScript 14.66% Handlebars 0.01% HTML 85.33%

context-parser's People

dmitris yukinying maditya imclab neraliu mozii ihacku adon-at-work chestercai luiseduardohdbackup phishing-factory caomw captainbarber99

context-parser's Issues

npm install shows deprecated packages warnings

$ npm install
npm WARN deprecated [email protected]: CoffeeScript on NPM has moved to "coffeescript" (no hyphen)
npm WARN deprecated [email protected]: Please update to minimatch 3.0.2 or higher to avoid a RegExp DoS issue
npm WARN deprecated [email protected]: to-iso-string has been deprecated, use @segment/to-iso-string instead.
npm WARN deprecated [email protected]: Jade has been renamed to pug, please install the latest version of pug instead of jade
npm WARN deprecated [email protected]: Please update to minimatch 3.0.2 or higher to avoid a RegExp DoS issue
npm WARN deprecated [email protected]: please upgrade to graceful-fs 4 for compatibility with current and future versions of Node.js
npm WARN deprecated [email protected]: Please update to minimatch 3.0.2 or higher to avoid a RegExp DoS issue

grunt test does not work

I would suggest stating explicitly that grunt-cli must be installed first (npm install -g grunt-cli), so that users do not have to research the grunt CLI installation themselves 😄

I'm trying to follow the "How to build" instructions (https://github.com/yahoo/context-parser#how-to-build) but it does not work for me:

$ grunt
grunt-cli: The grunt command line interface. (v0.1.13)

Fatal error: Unable to find local grunt.

If you're seeing this message, either a Gruntfile wasn't found or grunt
hasn't been installed locally to your project. For more information about
installing and configuring grunt, please see the Getting Started guide:

http://gruntjs.com/getting-started
$ npm install grunt
[email protected] node_modules/grunt
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected] ([email protected])
├── [email protected]
├── [email protected]
├── [email protected] ([email protected], [email protected])
├── [email protected] ([email protected], [email protected])
├── [email protected]
├── [email protected] ([email protected], [email protected])
├── [email protected] ([email protected], [email protected])
└── [email protected] ([email protected], [email protected])
$ grunt
>> Local Npm module "grunt-mocha-istanbul" not found. Is it installed?
>> Local Npm module "grunt-contrib-jshint" not found. Is it installed?
>> Local Npm module "grunt-contrib-clean" not found. Is it installed?
Warning: Task "clean:buildResidues" not found. Use --force to continue.

Aborted due to warnings.
$ npm install grunt-mocha-instanbul grunt-contrib-jshint grunt-contrib-clean
npm ERR! Darwin 14.1.0
npm ERR! argv "/Users/dmitris/.nvm/v0.12.0/bin/node" "/Users/dmitris/.nvm/v0.12.0/bin/npm" "install" "grunt-mocha-instanbul" "grunt-contrib-jshint" "grunt-contrib-clean"
npm ERR! node v0.12.0
npm ERR! npm  v2.5.1
npm ERR! code E404

npm ERR! 404 Not Found: grunt-mocha-instanbul
npm ERR! 404
npm ERR! 404 'grunt-mocha-instanbul' is not in the npm registry.
npm ERR! 404 You should bug the author to publish it (or use the name yourself!)
npm ERR! 404 It was specified as a dependency of 'context-parser'
npm ERR! 404
npm ERR! 404 Note that you can also install from a
npm ERR! 404 tarball, folder, http url, or git url.

npm ERR! Please include the following file with any support request:
npm ERR!     /Users/dmitris/dev/hack/context-parser/npm-debug.log
$ grunt
>> Local Npm module "grunt-mocha-istanbul" not found. Is it installed?
>> Local Npm module "grunt-contrib-jshint" not found. Is it installed?
>> Local Npm module "grunt-contrib-clean" not found. Is it installed?
Warning: Task "clean:buildResidues" not found. Use --force to continue.

Aborted due to warnings.
includesoft-lm:context-parser dmitris$ grunt test
>> Local Npm module "grunt-mocha-istanbul" not found. Is it installed?
>> Local Npm module "grunt-contrib-jshint" not found. Is it installed?
>> Local Npm module "grunt-contrib-clean" not found. Is it installed?
Warning: Task "clean:buildResidues" not found. Use --force to continue.

Aborted due to warnings.
$

Let me know if you need the npm-debug.log file.

Need to expose start and end tag names

Currently the start tag name and end tag name are stored in this.tags[0] and this.tags[1] respectively, and the upstream project html-purify uses these internal variables directly. Unfortunately, this.tags, even when it holds a value, is valid for use only in certain HTML states.

We should add the functions getStartTagName and getEndTagName, which return a tag name only when the parser is in the correct HTML state.

I will push a few commits to make this happen.
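
The proposed accessors could look roughly like the sketch below. This is an illustration only, not the code that was committed: the class name, the `tagKind` field, the string state labels, and the set of states in which a tag name counts as valid are all assumptions made for the example (the real parser tracks numeric HTML5 tokenizer states).

```javascript
'use strict';

// Hypothetical state labels, chosen for readability. Which states count as
// "correct" for reading a tag name is an assumption of this sketch.
const STATES_WITH_VALID_TAG_NAME = new Set([
  'tag-name',
  'before-attribute-name',
  'attribute-name',
  'attribute-value-double-quoted',
]);

class TagNameSketch {
  constructor() {
    this.state = 'data';   // current (hypothetical) tokenizer state
    this.tags = ['', ''];  // [start tag name, end tag name], as in the issue
    this.tagKind = null;   // 'start', 'end', or null (hypothetical field)
  }

  // Return the start tag name only while the parser is inside a start tag.
  getStartTagName() {
    const ok = this.tagKind === 'start' &&
      STATES_WITH_VALID_TAG_NAME.has(this.state);
    return ok ? this.tags[0] : undefined;
  }

  // Return the end tag name only while the parser is inside an end tag.
  getEndTagName() {
    const ok = this.tagKind === 'end' &&
      STATES_WITH_VALID_TAG_NAME.has(this.state);
    return ok ? this.tags[1] : undefined;
  }
}

const p = new TagNameSketch();
p.tags[0] = 'script';               // stale value left from an earlier tag
console.log(p.getStartTagName());   // undefined: not inside a start tag

p.state = 'tag-name';
p.tagKind = 'start';
console.log(p.getStartTagName());   // script
```

Callers such as html-purify would then go through the accessors and never read this.tags directly, so a stale value cannot leak out in an unrelated state.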

@maditya @neraliu @adon-at-work

contextparse finds only one state (state 0) in a large HTML document

I copied the HTML source of https://www.yahoo.com and saved it in /tmp/yahoo.html, then ran contextparse on it, expecting a long list of states - but it finds only one of them, HTML-State 0:

$ wc /tmp/yahoo.html
     666    9290  322062 /tmp/yahoo.html
$ ./node_modules/context-parser/bin/contextparse /tmp/yahoo.html
  HTML-State { statesSize: 1 } +0ms
  HTML-State { ch: 0, state: 1, symbol: 0 } +2ms

I get the same result for any other HTML file - for example, this minimal test case:

<!DOCTYPE html>
<html>
    <head></head>
    <body>Do you yahoo?<p>
        <A href="https://github.com">GitHub Homepage</A>
        <script>console.log("Test message");</script>
    </body>
</html>

That should give some states for the tags, script etc., right? But the output of contextparse is the same as above:

$ wc /tmp/foo.html
       8      14     178 /tmp/foo.html
$ ./node_modules/context-parser/bin/contextparse /tmp/foo.html
  HTML-State { statesSize: 1 } +0ms
  HTML-State { ch: 0, state: 1, symbol: 0 } +2ms

Let me know if there are any other details I can provide to troubleshoot (regarding environment etc.) or if I'm missing something obvious! 😄

Canonicalizing comments in RAWTEXT state

@yukinying

I've found a bug in how comments are canonicalized in the RAWTEXT state. Documented here are the observations, the bug, and suggested fixes.

Facts/Observation:

HTML5: no HTML comments are assumed inside RAWTEXT <style>  {{insideData}} </style>
HTML4: comments can nullify the effect of end tag <style>  {{insideStyle}} </style>

Imagine the placeholder {{insideData}} holds the attack vector x:expression(alert(1)). The filter applied by contextual escaping has no effect on the value, since the inHTMLData/yd filter does not touch it. In an HTML5 browser there is no XSS concern. But in HTML4 this placeholder could be interpreted as being inside the style tag and become vulnerable to XSS, which hurts users of older browsers.

The current mitigation is exactly the same as in the RCDATA state. The core principle is to turn <! into &lt;!. The replacement fulfills the HTML5 spec since it is an equivalent representation, and the HTML5 spec won't treat the result as a comment anyway (i.e., no transitions to comment states are possible). In addition, the resulting HTML cannot be interpreted as entering a comment state when rendered in an HTML4 browser. As a result, parsing is aligned across HTML4 and HTML5 browsers.

This implementation works properly for the inputs mentioned. But what I just realized is that it may break usability in the following use case: <style> div:after{content:'yo <!'} </style>, which becomes <style> div:after{content:'yo &lt;!'} </style> after canonicalization. When rendered in a browser, the user will literally see &lt; instead of <. This is bad.

tl;dr.

The cause is that RAWTEXT does not support character references, according to the HTML5 spec (which we overlooked). This motivates me to revisit each RAWTEXT element for a better fix.

The HTML5 spec documents that the following elements result in the RAWTEXT state:
style, xmp, iframe, noembed, noframes, plus noscript

Inside <style>, we should turn <! into \3c !, i.e. use the CSS-escaped version. That solves the problem above¹.
Inside <iframe>, no handling is needed, or we can simply drop <! if found, since ultimately all inner HTML is discarded: the browser renders the content supplied via the src attribute.
Inside <noscript>, no handling is needed. HTML5 leaves the element at the first close tag in <noscript> B </noscript> and proper contextual filtering can be applied after it, i.e. in both positions A and B. In HTML4, by contrast, the first close tag inside a comment is ignored, which means more content (A up to B) is considered to be inside the noscript tag. So no matter what filters are applied there, the content won't be rendered if script is enabled. And when script is disabled, we can be sure there is no XSS anyway.

The remaining ones <xmp>, <noembed>, and <noframes> are considered obsolete according to HTML5.

<xmp> works as if it were a <pre> with every < escaped as &lt;, so no tags inside it can render. Similar to <noscript>, HTML4 ignoring a commented-out end tag means more content is treated as inside xmp, i.e., more content becomes inert.
My final recommendation is to drop <noframes> and <noembed> (as if they were blacklisted tags). Rationale: (1) we do not (and hardly can) support contextual filtering for contents inside these tags. Imagine <noframes><a href="{{url}}">vulnerable</a></noframes>. We would need to hack the state machine to treat the inner contents as placed in the DATA state for contextual filtering. (2) The browsers that can possibly render the inner contents are very old and are not in our supported browser list.
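
The <style> fix suggested above can be sketched in a few lines. The helper name canonicalizeStyleRawtext is hypothetical and not part of context-parser's API. The sketch only shows the replacement of <! with the CSS escape \3c followed by a space, which a CSS parser reads back as a literal < (the single space after a hex escape is consumed as its terminator).

```javascript
'use strict';

// Hypothetical helper: rewrite "<!" inside <style> content using a CSS
// escape instead of an HTML character reference, because RAWTEXT content
// does not decode character references such as &lt;.
function canonicalizeStyleRawtext(css) {
  // "\3c " is the CSS hex escape for "<"; the trailing space ends the escape.
  return css.replace(/<!/g, '\\3c !');
}

const input = "div:after{content:'yo <!'}";
console.log(canonicalizeStyleRawtext(input));
// prints: div:after{content:'yo \3c !'}
```

The output no longer contains the sequence <!, so an HTML4 parser cannot treat it as the start of a comment, while the CSS string still renders as yo <! to the user.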

¹ Discussed with @neraliu. He also thinks the new approach makes sense, but it adds the extra consideration of allowing \3c when we apply the strict CSS parser in style tags in the future.

note: tested samples

Canonicalization should take care of EOF as it would lead to XSS in IE8 or below.

Here's a sample of an EOF in Attribute value (double-quoted) state:

hello <a href="<script>{{untrusted}}</script>

According to the spec, encountering EOF here is a parse error, and a compliant browser switches to the DATA state. When rendered in recent browsers such as Chrome and Firefox, only hello is rendered, and parsing ends in the DATA state. The incomplete tag is NOT emitted to the DOM/output.

Unfortunately, older browsers such as IE7-8 behave differently: the incomplete tag gets rendered, and the string <a href="<script>{{untrusted}}</script> is treated as beginning in the DATA state and somehow transitioning into the SCRIPT state.

Context parser currently considers the placeholder {{untrusted}} to be in the attribute value (double-quoted) state, but it ignores the consequence of EOF. That leads the downstream project secure-handlebars to simply insert a filter equivalent to uriInDoubleQuotedAttr() for that placeholder. An attacker supplying alert(1) will be able to launch XSS.

The EOF problem was marked as TODO inside the source code.
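
One way to picture the missing check is a guard that refuses input whose final state at EOF is not DATA. The sketch below is a deliberately tiny scanner with four made-up states, not the HTML5 tokenizer and not context-parser's code. The names stateAtEof and endsInDataState are hypothetical.

```javascript
'use strict';

// Simplified, hypothetical states; the real tokenizer has many more.
const DATA = 'data';
const TAG = 'tag';
const ATTR_DQ = 'attribute-value-double-quoted';
const ATTR_SQ = 'attribute-value-single-quoted';

// Walk the input and report the state the scanner is in when it hits EOF.
function stateAtEof(html) {
  let state = DATA;
  for (const ch of html) {
    switch (state) {
      case DATA:
        if (ch === '<') state = TAG;
        break;
      case TAG:
        if (ch === '>') state = DATA;
        else if (ch === '"') state = ATTR_DQ;
        else if (ch === "'") state = ATTR_SQ;
        break;
      case ATTR_DQ:
        if (ch === '"') state = TAG;
        break;
      case ATTR_SQ:
        if (ch === "'") state = TAG;
        break;
    }
  }
  return state;
}

// Canonicalization could treat any non-DATA state at EOF as unsafe.
function endsInDataState(html) {
  return stateAtEof(html) === DATA;
}

const sample = 'hello <a href="<script>{{untrusted}}</script>';
console.log(stateAtEof(sample));      // attribute-value-double-quoted
console.log(endsInDataState(sample)); // false
```

With such a guard, a template that ends inside an unterminated attribute value would be rejected or repaired before any filter is chosen for its placeholders, instead of relying on how a particular browser recovers from the parse error.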

@neraliu @yukinying @maditya

installation instructions need adjustment

The installation / run instructions currently do not work as given:

$ npm install context-parser
[email protected] node_modules/context-parser
└── [email protected] ([email protected])
$ ./bin/contextparse /tmp/test.html
-bash: ./bin/contextparse: No such file or directory

The instructions should either use npm install -g context-parser or give the full path to the binary under node_modules:

$ ./node_modules/context-parser/bin/contextparse /tmp/test.html
