fb55 / domhandler

Handler for htmlparser2, to get a DOM

Home Page: https://domhandler.js.org

License: BSD 2-Clause "Simplified" License

TypeScript 100.00%

dom dom-builder domhandler htmlparser2 tree

domhandler's Issues

Domhandler is not a constructor

I followed the README, but when I use const handler = new DomHandler(() => {}); I get TypeError: DomHandler is not a constructor
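
One possible cause (an assumption, since the report does not show its import line) is treating the whole module object as the class instead of picking the named DomHandler export. The sketch below uses a hypothetical stand-in object, fakeModule, in place of the real package so that it runs on its own:

```javascript
// Hypothetical stand-in for the object returned by require("domhandler")
// in releases that expose DomHandler as a named export.
const fakeModule = {
    DomHandler: class DomHandler {
        constructor(callback) {
            this.callback = callback;
        }
    },
};

// Treating the module object itself as the class fails with a TypeError.
const Wrong = fakeModule;
let message = null;
try {
    new Wrong(() => {});
} catch (error) {
    message = `${error.name}: ${error.message}`;
}

// Destructuring the named export gives a usable constructor.
const { DomHandler } = fakeModule;
const handler = new DomHandler(() => {});
```

If the installed version exports the class differently, the import line has to match that version's export shape.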

Warning about mutating [[Prototype]] of elements Objects

Hello.

When using Firefox together with other libraries that depend on domhandler, the following warning is sometimes printed to the console:

: mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the correct initial [[Prototype]] value using Object.create

This comes from these lines:

if (this._options.withDomLvl1) {
    element.__proto__ = element.type === "tag" ? ElementPrototype : NodePrototype;
}

More information about this warning can be found on the MDN documentation or in this SO question.

I would like to know whether you were aware of this and whether you think it needs a fix, or whether implementing the prototype mutation this way is an intended choice.
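
For illustration, here is a minimal sketch of the approach the warning recommends: choosing the prototype when the object is created rather than assigning __proto__ afterwards. The prototypes are simplified stand-ins for ElementPrototype and NodePrototype, and createNode is a hypothetical helper, not the library's code:

```javascript
// Simplified stand-ins for the prototypes named in the snippet above.
const NodePrototype = {
    get firstChild() {
        const children = this.children;
        return (children && children[0]) || null;
    },
};
const ElementPrototype = Object.create(NodePrototype, {
    tagName: {
        get() {
            return this.name;
        },
    },
});

// Hypothetical helper: pick the prototype at creation time,
// then copy the plain properties onto the new object.
function createNode(properties, withDomLvl1) {
    if (!withDomLvl1) return { ...properties };
    const proto = properties.type === "tag" ? ElementPrototype : NodePrototype;
    return Object.assign(Object.create(proto), properties);
}

const div = createNode({ type: "tag", name: "div", children: [] }, true);
const text = createNode({ type: "text", data: "hi" }, true);
```

Because the prototype never changes after creation, the engine has no reason to emit the warning.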

Publish new npm version

The current npm version is behind master and doesn't have the most recent bug fix. Could you publish a new version? Thanks!

serializer serializes elements removed through DomUtils

When you parse this code that contains a duplicate HTML element through parser.parseDocument():

<html>
  <body>
      <h1>Foo</h1>
  </body>
</html>

<html>
  <body>
      <h1>Bar</h1>
  </body>
</html>

... and run this code over the returned document which removes every html element after the first one...

const elements = DomUtils.getChildren(document)

let found = false
for (const child of elements) {
	if (found) {
		DomUtils.removeElement(child)
		continue
	}

	if (child.tagName === 'html') found = true
}

The children do not exist on the syntax tree anymore. But if you then serialize the document using dom-serializer the last HTML tag is back.

I think this has to do with the prev and next helper functions still having a reference to the second html element, but I am unable to confirm this as the htmlparser2 playground (https://astexplorer.net/#/2AmVrGuGVJ) is not capable of outputting json and uses version 5.0.1.

I know HTML documents are not supposed to have more than one html element, which is why I want to remove the extras automatically.
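
One thing worth checking, as an assumption rather than a confirmed diagnosis: if the removal helper splices the same array that the loop is iterating, the loop can skip elements. The sketch below reproduces that effect with plain objects. Here link, removeElement and removeAfterFirstHtml are hypothetical stand-ins, not the DomUtils implementations:

```javascript
// Hypothetical helper that wires up parent/prev/next like a parsed tree.
function link(parent, children) {
    parent.children = children;
    children.forEach((child, i) => {
        child.parent = parent;
        child.prev = children[i - 1] ?? null;
        child.next = children[i + 1] ?? null;
    });
    return parent;
}

// Stand-in removal helper: detaches the node and splices the live array.
function removeElement(node) {
    if (node.prev) node.prev.next = node.next;
    if (node.next) node.next.prev = node.prev;
    if (node.parent) {
        const siblings = node.parent.children;
        const index = siblings.lastIndexOf(node);
        if (index >= 0) siblings.splice(index, 1);
    }
    node.prev = node.next = node.parent = null;
}

function makeDocument() {
    return link({ type: "root" }, [
        { type: "tag", name: "html" },
        { type: "text", data: "\n\n" },
        { type: "tag", name: "html" },
        { type: "text", data: "\n" },
    ]);
}

// Removes every child that follows the first html element.
function removeAfterFirstHtml(children) {
    let found = false;
    for (const child of children) {
        if (found) removeElement(child);
        else if (child.name === "html") found = true;
    }
}

const live = makeDocument();
removeAfterFirstHtml(live.children); // iterates the array being spliced

const copied = makeDocument();
removeAfterFirstHtml([...copied.children]); // iterates a snapshot
```

In this sketch the live iteration leaves the second html element in place, because removing the whitespace text node shifts it into an index the loop has already passed. Iterating over a copy removes everything after the first html element.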

Update npm version

I'm using your great product, and I really need the new "EndIndices" feature.
I'm using the code as an npm package, but currently I must point my config file directly at the git repository because the version on npm is out of date.

Can you please update the npm package to the latest version?

Many thanks
Ysrael

"attribs" is an uncomfortable compromise between "attributes" and "attrs"

This could very well be a 'close, won't fix' issue, and if so, that's ok.

But, I wanted to point out how unusual it is to use attribs as the property name that stores HTML attributes. I would much prefer attrs or attributes. If this could be changed, it would be nice.

Like I said though, I understand how complex it can be to change after release.

Document synchronous access

Add synchronous version, please

normalizeWhitespace with attributes and CDATA

They should be preserved, or provide a preserveSensitiveWhitespace option.

The issue I'm having is with CSS styles within CDATA; content and URLs need those spaces.

How to delete an element?

What is the best way to delete an element?

Example: Remove the <book id="bk102"> element from

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
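
As a rough sketch of one way to do this on an already parsed tree: filter the matching node out of its parent's children and repair the sibling links. The removeWhere helper and the hand-built catalog object are hypothetical stand-ins. With the real packages, a helper such as DomUtils.removeElement (used in another issue in this list) serves the same purpose.

```javascript
// Hypothetical helper: drops every descendant for which predicate returns true
// and rebuilds the parent/prev/next links of the nodes that remain.
function removeWhere(node, predicate) {
    if (!Array.isArray(node.children)) return node;
    node.children = node.children.filter((child) => !predicate(child));
    node.children.forEach((child, i) => {
        child.parent = node;
        child.prev = node.children[i - 1] ?? null;
        child.next = node.children[i + 1] ?? null;
        removeWhere(child, predicate);
    });
    return node;
}

// Hand-built stand-in for the parsed catalog (text nodes omitted for brevity).
const catalog = {
    type: "tag",
    name: "catalog",
    attribs: {},
    children: [
        { type: "tag", name: "book", attribs: { id: "bk101" }, children: [] },
        { type: "tag", name: "book", attribs: { id: "bk102" }, children: [] },
    ],
};

removeWhere(catalog, (node) => node.name === "book" && node.attribs?.id === "bk102");
```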

Why is the AST different between parse5 (as expected) and htmlparser2 (not as expected)?

<em>hello</em> world

In htmlparser2:

But in parse5:

I think the AST that parse5 outputs is the expected one?

Distinguish between Node.childNodes and Element.children

This package is proving useful in a script I'm writing to convert SVG files to config files! One issue I noticed, though, is that it appears to conflate children and childNodes.

The childNodes getter should be available on all node types and return a list of nodes (Node[]). (See https://developer.mozilla.org/en-US/docs/Web/API/Node/childNodes)

A children getter, however, should only be available on Elements, and it should only return a list of Element child nodes (aka HTMLCollection), excluding Text and Comment nodes. If a setter is provided, it should only take HTMLCollection as an argument (See https://developer.mozilla.org/en-US/docs/Web/API/Element/children)

In src/node.ts, there's a comment above the childNodes getter that reads "Same as children. DOM spec-compatible alias", but I don't see where that's stated in https://dom.spec.whatwg.org. On the contrary, at https://dom.spec.whatwg.org/#dom-parentnode-children, it appears to be in agreement with the Mozilla developer documentation.
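
To make the distinction concrete, here is a small sketch on plain node objects. The helper names childNodes and elementChildren are hypothetical. The element type names "tag", "script" and "style" follow the node types that appear elsewhere in these issues.

```javascript
// Node types used for elements in domhandler-style trees.
const ELEMENT_TYPES = new Set(["tag", "script", "style"]);

// All child nodes, in the sense of Node.childNodes.
function childNodes(node) {
    return node.children ?? [];
}

// Element children only, in the sense of Element.children.
function elementChildren(node) {
    return childNodes(node).filter((child) => ELEMENT_TYPES.has(child.type));
}

const svg = {
    type: "tag",
    name: "svg",
    children: [
        { type: "text", data: "\n  " },
        { type: "tag", name: "path", children: [] },
        { type: "comment", data: " note " },
        { type: "tag", name: "circle", children: [] },
    ],
};

const allNodes = childNodes(svg);
const onlyElements = elementChildren(svg);
```

Here allNodes has four entries, while onlyElements contains just the path and circle elements.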

How to get modified HTML?

I am modifying all objects from inside the handler function, and I want to return the modified HTML. How do I get that?
Neither this documentation nor the htmlparser2 documentation shows a complete usage example.
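
In practice a serializer package such as dom-serializer (mentioned in another issue in this list) turns the tree back into markup. To show the idea without any dependency, here is a deliberately tiny, hypothetical serializer, toHtml. It handles only text, comment and element nodes and ignores void elements and raw-text elements such as script:

```javascript
function escapeText(value) {
    return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function escapeAttribute(value) {
    return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Hypothetical minimal serializer for an array of plain node objects.
function toHtml(nodes) {
    return nodes
        .map((node) => {
            if (node.type === "text") return escapeText(node.data);
            if (node.type === "comment") return `<!--${node.data}-->`;
            const attribs = Object.entries(node.attribs ?? {})
                .map(([name, value]) => ` ${name}="${escapeAttribute(value)}"`)
                .join("");
            return `<${node.name}${attribs}>${toHtml(node.children ?? [])}</${node.name}>`;
        })
        .join("");
}

// Stand-in for a DOM produced by the handler, then modified in place.
const dom = [
    { type: "tag", name: "p", attribs: { class: "a" }, children: [{ type: "text", data: "old" }] },
];
dom[0].children[0].data = "new & improved";
const html = toHtml(dom);
```

The modified tree is serialized after the changes, so html reflects the new text.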

Use `type: 'tag'` for all elements in XML mode

Right now, we always use a special type for script and style tags. This is unexpected in XML mode.
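
A sketch of the proposed behaviour, assuming the decision is driven by an xmlMode option. The helper elementTypeFor is hypothetical and only illustrates the rule:

```javascript
// Hypothetical helper: decide the node type for an opening tag.
function elementTypeFor(name, options = {}) {
    // In XML mode, script and style carry no special meaning.
    if (options.xmlMode) return "tag";
    if (name === "script") return "script";
    if (name === "style") return "style";
    return "tag";
}

const htmlScript = elementTypeFor("script");
const xmlScript = elementTypeFor("script", { xmlMode: true });
```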

Cannot read property 'Tag' of undefined.

I have a React.js project which indirectly uses domhandler v4.2.0, via cheerio I believe.

It has worked fine for months, and then suddenly my project started throwing this error when I try to build it.

C:\product-app\node_modules\domhandler\lib\node.js:32
    [domelementtype_1.ElementType.Tag, 1],
                                  ^

TypeError: Cannot read property 'Tag' of undefined
    at Object.<anonymous> (C:\product-app\node_modules\domhandler\lib\node.js:32:35)
    at Module._compile (internal/modules/cjs/loader.js:1158:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)
    at Module.load (internal/modules/cjs/loader.js:1002:32)
    at Function.Module._load (internal/modules/cjs/loader.js:901:14)
    at Module.require (internal/modules/cjs/loader.js:1044:19)
    at require (internal/modules/cjs/helpers.js:77:18)
    at Object.<anonymous> (C:\product-app\node_modules\domhandler\lib\index.js:15:14)
    at Module._compile (internal/modules/cjs/loader.js:1158:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)

Any ideas what might be causing it?

I'm running node.js 12.16.1 with typescript 3.9.10.

Typescript: Add "types": "lib/index.d.ts" to package.json

This would enable type inference in IDEs such as WebStorm and VS Code.

Option to include character indexes in output?

Would it be possible to add an option to include the start index of each node? That is, the node's starting character index in the original markup string.

For example...

Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->

[{
    data: 'Xyz ',
    type: 'text',
    startIndex: 0
}, {
    type: 'script',
    name: 'script',
    startIndex: 4,
    attribs: {
        language: 'javascript'
    },
    children: [{
        data: 'var foo = \'<bar>\';<',
        type: 'text',
        startIndex: 33
    }]
}, {
    data: '<!-- Waah! -- ',
    type: 'comment',
    startIndex: 65
}]

I know this value is available on the htmlparser.Parser instance (you can look in parser.startIndex during an onopentag call)... but I don't know if a DomHandler instance could access this property, because its onopentag function doesn't have access to the parser instance that's using it... is there a way?
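
One way this could work, assuming the parser hands itself to the handler through an initialisation hook (called onparserinit here): the handler keeps the reference and reads startIndex whenever it creates a node. The sketch below drives a hypothetical IndexingHandler with a stub parser object instead of a real Parser, and builds a flat list rather than a nested tree:

```javascript
// Minimal handler sketch: remembers the parser and stamps startIndex on nodes.
class IndexingHandler {
    constructor() {
        this.dom = [];
        this.parser = null;
    }

    // Assumed hook: called once with the parser instance before parsing starts.
    onparserinit(parser) {
        this.parser = parser;
    }

    onopentag(name, attribs) {
        this.dom.push({ type: "tag", name, attribs, startIndex: this.parser.startIndex });
    }

    ontext(data) {
        this.dom.push({ type: "text", data, startIndex: this.parser.startIndex });
    }
}

// Stub standing in for a real parser, just enough to drive the handler.
const stubParser = { startIndex: 0 };
const handler = new IndexingHandler();
handler.onparserinit(stubParser);
handler.ontext("Xyz ");
stubParser.startIndex = 4;
handler.onopentag("script", { language: "javascript" });
```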

[QUESTION] Is it possible to manipulate the DOM in the DomHandler callback?

Is it possible to manipulate the DOM in the DomHandler callback, like Babel's Path in its AST tree?

Domhandler types do not expose all supported elements

Element types supported:

Element types exposed:

I could raise a PR for this; I need to get elements with type: "text".

Open Source license?

Are you giving people permission to use this code in their projects? If so, please let us know. The best way is to put an open source license in your code project, or to indicate that you are giving permission for people to copy, use, and modify your code.

Here's the license text. You can just add a file to your project, or put this in the README. This way we'll know you are giving permission to use the code, and it will also require users of your code to maintain your copyright, so that when your code is used you get credit for the code you created and shared.

http://opensource.org/licenses/BSD-2-Clause

Thanks!
Gil

Feature request: Identify missing ending tags

With a document such as this:

<!doctype html>
<html lang="en">
<title>My Document</title>
<h1>Title</h1>

Notice that it is missing the </html> end tag. Since this tag is optional I often omit it. It would be great if domhandler had some way to indicate that a tag is missing the ending. Something like closing: false would suffice.

Node#cloneNode does not inherit source indices

Thanks for continuing to maintain this project and adding new features. Currently, Node#cloneNode does not copy the indices from the original object, regardless of whether they are set.

Current behaviour

const [elm] = parseDOM(
  `<div>
    <p>
      Hello world
    </p>
  </div>`, {
  withEndIndices: true,
  withStartIndices: true,
});

assert(elm.startIndex === 0);
assert(elm.endIndex   === 48);

const newElm = elm.cloneNode(true);
newElm.startIndex // --> null :(
newElm.endIndex // --> null :(

Expected behaviour

The cloned node should inherit startIndex and endIndex from the original object.
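
A sketch of the expected behaviour on plain node objects. The cloneWithIndices helper is hypothetical and is not the library's cloneNode, and the text node's indices in the sample are illustrative values:

```javascript
// Hypothetical deep clone for plain node objects that also copies source indices.
function cloneWithIndices(node, parent = null) {
    const clone = {
        ...node,
        parent,
        prev: null,
        next: null,
        // Carry the source positions over explicitly.
        startIndex: node.startIndex ?? null,
        endIndex: node.endIndex ?? null,
    };
    if (node.attribs) clone.attribs = { ...node.attribs };
    if (Array.isArray(node.children)) {
        clone.children = node.children.map((child) => cloneWithIndices(child, clone));
        clone.children.forEach((child, i) => {
            child.prev = clone.children[i - 1] ?? null;
            child.next = clone.children[i + 1] ?? null;
        });
    }
    return clone;
}

const original = {
    type: "tag",
    name: "div",
    attribs: {},
    startIndex: 0,
    endIndex: 48,
    children: [{ type: "text", data: "Hello world", startIndex: 5, endIndex: 42 }],
};
const copy = cloneWithIndices(original);
```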

TypeError: document.children.find is not a function

This happens when running serverside.
In my case inside Java Nashorn.
I typically have to Polyfill stuff.
So that might be the case here too.

Going to continue looking for "DOM" polyfills. Any suggestions on the way? :)
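
If the cause is that the engine predates Array.prototype.find (an assumption; the report does not confirm what document.children is in that environment), a small polyfill along these lines may be enough. It is a simplified stand-in written in ES5 style, not a full spec-compliant implementation:

```javascript
// Simplified stand-in for Array.prototype.find, for engines that lack it.
function findPolyfill(predicate, thisArg) {
    if (typeof predicate !== "function") {
        throw new TypeError("predicate must be a function");
    }
    for (var i = 0; i < this.length; i++) {
        if (predicate.call(thisArg, this[i], i, this)) return this[i];
    }
    return undefined;
}

// Install only when the engine does not already provide it.
if (!Array.prototype.find) {
    Object.defineProperty(Array.prototype, "find", {
        value: findPolyfill,
        configurable: true,
        writable: true,
    });
}

// Direct use of the stand-in, independent of whether it was installed.
var firstTag = findPolyfill.call(
    [{ type: "text" }, { type: "tag", name: "html" }],
    function (node) {
        return node.type === "tag";
    }
);
```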

version 4.2.0 error

$ npm update             
npm ERR! code ETARGET
npm ERR! notarget No matching version found for domhandler@^4.2.0.
npm ERR! notarget In most cases you or one of your dependencies are requesting
npm ERR! notarget a package version that doesn't exist.

Feature request: attribs indices

Would it be possible, when withStartIndices and withEndIndices are set, for a node to have an attribsIndices property (or something like that) containing the start and end indices of attribute names and values?

It also would be great to have not only offsets but line and column numbers as well.

children and childNodes should not be identical

Currently, children and childNodes refer to the same thing, which is what browser-DOM calls childNodes. That's not spec compliant - children is an HTMLCollection containing only elements, without things like text nodes and comments (spec).

I can PR this if you're interested. The change would break backwards compatibility for this library, but the current behaviour is not spec-compliant and surprised me quite a bit.

Add support for DOM standard interfaces

This is a great library and it's already been extremely useful for a number of things. 🙂👍

Is there a design reason why the APIs deviate from the DOM standard, or is it just the way it ended up?

I won't go into too much detail here, but the fact that property names and functions have different names and types is extremely disruptive, if you're trying to integrate with existing code and tests, etc. - and especially in TypeScript.

To be clear, I'm not asking for or expecting a full implementation of the DOM standard - I'm not asking for any new features per se. But even code that requires a subset of a DOM interface does not immediately work without remapping the node model to something compatible first.

Would you be at all open to changing this? I might be able to help. (It would be a breaking change, of course.)

(I apologize if this has already been asked and answered - it seems unlikely I could be the first person to ask, but I did search your issues and, to my surprise, I didn't find anything.)

patch release - Add npmignore test

The fix made in #51 hasn't been applied to major version 2 of this package. It would be nice for a version 2.4.3 to be published with such a fix as well so that dependents of v2 can also reap the benefit.

Decoding failure with decodeEntities enabled

https://github.com/kpdecker/cheerio/blob/master/test/api.manipulation.js#L795-798

The test above fails when the decodeEntities flag is enabled. The assert gets a value of "MM&M".

different parsed result from README

Hi, I tried the example code in the README, but there is no comment element in the parsed result.

in README:

[
    // ignoring first element
    {
        type: "script",
        name: "script",
        attribs: {
            language: "javascript",
        },
        children: [
            {
                data: "var foo = '<bar>';<",
                type: "text",
            },
        ],
    },
    {
        data: "<!-- Waah! -- ",
        type: "comment",
    },
];

with [email protected] & [email protected]

[
    // ignoring first element
    {
        type: "script",
        name: "script",
        attribs: {
            language: "javascript",
        },
        children: [
            {
                data: "var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->",
                type: "text",
            },
        ],
    }
];

I am not sure which result I should expect.

Thanks in advance!

DomHandler with async callback

const handler = new DomHandler(null, null, async (element) => {
Currently I need a DomHandler that also accepts an async function (one that uses await).
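
As a possible workaround, and assuming the handler invokes its callbacks synchronously without awaiting them, the completion callback can be wrapped in a Promise so that the async work happens after parsing instead of inside the handler. The helper domAsPromise and the stand-in fakeParse are hypothetical names:

```javascript
// Hypothetical helper: `start` receives a Node-style (error, dom) callback,
// for example the one that would be handed to new DomHandler(callback).
function domAsPromise(start) {
    return new Promise((resolve, reject) => {
        start((error, dom) => (error ? reject(error) : resolve(dom)));
    });
}

// Stand-in for "create handler, create parser, write, end".
function fakeParse(callback) {
    callback(null, [{ type: "text", data: "Xyz " }]);
}

async function main() {
    const dom = await domAsPromise(fakeParse);
    // Async work on the finished DOM can be awaited here.
    return dom.length;
}

const result = main();
```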

incorrect version of @types/domhandler

You use "@types/htmlparser2": "^3.10.1", which depends on @types/[email protected].
@types/domhandler declares a DomElement interface, but your library's implementation does not export it, which is why the build always fails.

https://github.com/DefinitelyTyped/DefinitelyTyped/blob/02db5ccb68be79df3f24cfc323bad5a609ff4d5f/types/domutils/index.d.ts

node_modules/@types/domutils/index.d.ts:6:10 - error TS2614: Module '"project/node_modules/domhandler/lib"' has no exported member 'DomElement'. Did you mean to use 'import DomElement from "project/node_modules/domhandler/lib"' instead?

6 import { DomElement } from "domhandler";
           ~~~~~~~~~~

node_modules/@types/htmlparser2/index.d.ts:17:10 - error TS2614: Module '"project/node_modules/domhandler/lib"' has no exported member 'DomElement'. Did you mean to use 'import DomElement from "project/node_modules/domhandler/lib"' instead?

17 export { DomElement, DomHandlerOptions, DomHandler, Element, Node } from 'domhandler';
            ~~~~~~~~~~

node_modules/@types/sanitize-html/index.d.ts:17:10 - error TS2459: Module '"project/node_modules/htmlparser2/lib"' declares 'Options' locally, but it is not exported.

17 import { Options } from "htmlparser2";
            ~~~~~~~

  node_modules/htmlparser2/lib/index.d.ts:5:14
    5 declare type Options = ParserOptions & DomHandlerOptions;
                   ~~~~~~~
    'Options' is declared here.

test/cases/24-with-start-indices failing

Hi, the last test fails for me with Node.js v4.6.1:

  1)  withStartIndices adds correct startIndex properties:
     TypeError: Cannot read property 'startIndex' of null
    at DomHandler._addDomElement (/root/debian/node-cheerio/node-domhandler/index.js:71:36)
    at DomHandler.onprocessinginstruction (/root/debian/node-cheerio/node-domhandler/index.js:175:7)
    at Parser.ondeclaration (/usr/lib/nodejs/htmlparser2/lib/Parser.js:254:13)
    at Tokenizer._stateInDeclaration (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:336:13)
    at Tokenizer._parse (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:674:9)
    at Tokenizer.write (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:627:7)
    at Tokenizer.end (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:820:17)
    at Parser.end (/usr/lib/nodejs/htmlparser2/lib/Parser.js:322:18)
    at Parser.parseComplete (/usr/lib/nodejs/htmlparser2/lib/Parser.js:314:7)
    at Context.<anonymous> (/root/debian/node-cheerio/node-domhandler/test/tests.js:46:10)
    at callFn (/usr/lib/nodejs/mocha/lib/runnable.js:223:21)
    at Test.Runnable.run (/usr/lib/nodejs/mocha/lib/runnable.js:216:7)
    at Runner.runTest (/usr/lib/nodejs/mocha/lib/runner.js:373:10)
    at /usr/lib/nodejs/mocha/lib/runner.js:451:12
    at next (/usr/lib/nodejs/mocha/lib/runner.js:298:14)
    at /usr/lib/nodejs/mocha/lib/runner.js:308:7
    at next (/usr/lib/nodejs/mocha/lib/runner.js:246:23)
    at Immediate._onImmediate (/usr/lib/nodejs/mocha/lib/runner.js:275:5)
    at processImmediate [as _immediateCallback] (timers.js:383:17)

Any idea why this could happen? Thanks, Paolo

`withDomLvl1` option support prevents use in some browsers.

The withDomLvl1 option uses code, particularly the declarative syntax for [NodePrototype https://github.com/fb55/domhandler/blob/master/index.js#L78-L90], that causes issues in some browsers.

Can this be abstracted away to another module?

CDATA Nodes wrong type?

Seems to me CDATA input should be a DataNode rather than a NodeWithChildren?

Incorrect implementation in README

This code imports DomHandler, however it is never used in the following code.

const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const rawHtml =
    "Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->";
const handler = new htmlparser.DomHandler(function(error, dom) {
    if (error) {
        // Handle error
    } else {
        // Parsing completed, do something
        console.log(dom);
    }
});
const parser = new Parser(handler);
parser.write(rawHtml);
parser.end();

Is there a way to "keepClosingSlash: true"

rawHtml:
<wxs ... />...

result:
<wxs ...>...

Updating html-webpack-plugin pulls in domhandler 4 and breaks all tests

I updated Webpack from 4 to 5.
This required updating html-webpack-plugin from v3 to v5.
Now all my Jest v26 tests fail:

I asked in html-webpack-plugin but they were not helpful:
jantimon/html-webpack-plugin#1733

Do I need to update Jest or something else?

version 4.2.0 not tagged

can't install it with npm

run:
npm view domhandler

Normalize HTML entities

It would be a great feature if the parser could resolve HTML entities. For example, if the parser is passed the character reference for a, the resulting "data" for the text node would be 'a' instead of the entity. Similarly, named entities, like &nbsp;, could be resolved to their Unicode equivalent characters.
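
To illustrate the requested behaviour, here is a hand-rolled sketch that resolves numeric character references and a handful of named entities. The resolveEntities helper and its five-entry table are hypothetical; a real decoder needs the full HTML entity list:

```javascript
// Tiny illustrative table; the real HTML entity list is far longer.
const NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', nbsp: "\u00a0" };

// Hypothetical helper: resolves numeric and a few named entities in text.
function resolveEntities(text) {
    return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
        if (body[0] === "#") {
            const hex = body[1] === "x" || body[1] === "X";
            const codePoint = parseInt(body.slice(hex ? 2 : 1), hex ? 16 : 10);
            // Leave out-of-range references untouched.
            return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
        }
        // Unknown names are left as written.
        return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
            ? NAMED_ENTITIES[body]
            : match;
    });
}

const decimal = resolveEntities("&#97;");
const hexadecimal = resolveEntities("&#x61;");
const named = resolveEntities("a&nbsp;b");
```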

Consider removing domelementtype dependency

There is a request opened two years ago to add license metadata to their npm distribution:
fb55/domelementtype#7

Use LF instead of CRLF

warning: CRLF will be replaced by LF in 
jshint/node_modules/htmlparser2/node_modules/domhandler/index.js.

Is there a way to "keepClosingSlash: true"?

Here is my rawHtml and result after write and end:
raw:
...
result:

<wxs src="../../../../wxs/imgUtil.wxs" module="imgUtil">...</wxs>