fb55 / domhandler
Handler for htmlparser2, to get a DOM
Home Page: https://domhandler.js.org
License: BSD 2-Clause "Simplified" License
I followed the readme, but when I use const handler = new DomHandler(() => {});
I get TypeError: DomHandler is not a constructor
Hello.
I'm using Firefox and other libraries that make use of domhandler, and the following warning is sometimes printed to the console:
mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the correct initial [[Prototype]] value using Object.create
This comes from these lines:
if (this._options.withDomLvl1) {
    element.__proto__ = element.type === "tag" ? ElementPrototype : NodePrototype;
}
More information about this warning can be found on the MDN documentation or in this SO question.
I would like to know whether you were aware of this and whether you think it needs a fix, or whether implementing the prototype mutation like this is an intentional choice.
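A minimal sketch of the alternative the warning suggests, using plain stand-in objects rather than domhandler's real internals. NodePrototype, ElementPrototype and createNode below are hypothetical names modelled on the snippet above; the point is only that the object is created with its final prototype instead of having __proto__ reassigned afterwards.

```javascript
// Hypothetical stand-ins for the prototypes named in the issue.
const NodePrototype = {
  get firstChild() {
    return this.children ? this.children[0] || null : null;
  },
};
const ElementPrototype = Object.create(NodePrototype, {
  tagName: {
    get() {
      return this.name;
    },
  },
});

// Create the object with its final prototype up front instead of
// assigning to __proto__ after the fact.
function createNode(properties) {
  const proto = properties.type === "tag" ? ElementPrototype : NodePrototype;
  return Object.assign(Object.create(proto), properties);
}

const element = createNode({ type: "tag", name: "div", children: [] });
const text = createNode({ type: "text", data: "hello" });
```

Here element picks up the tagName getter through ElementPrototype, while text only sees what NodePrototype provides.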
The current npm version is behind master and doesn't have the most recent bug fix. Could you publish a new version? Thanks!
When you parse this code, which contains a duplicate html element, through parser.parseDocument():
<html>
<body>
<h1>Foo</h1>
</body>
</html>
<html>
<body>
<h1>Bar</h1>
</body>
</html>
... and run this code over the returned document which removes every html element after the first one...
const elements = DomUtils.getChildren(document)
let found = false
for (const child of elements) {
    if (found) {
        DomUtils.removeElement(child)
        continue
    }
    if (child.tagName === 'html') found = true
}
The removed children no longer exist in the syntax tree. But if you then serialize the document using dom-serializer, the last html tag is back.
I think this has to do with the prev and next helpers still holding a reference to the second html element, but I am unable to confirm this, as the htmlparser2 playground (https://astexplorer.net/#/2AmVrGuGVJ) cannot output JSON and uses version 5.0.1.
I know HTML documents are not supposed to have more than one html element; that is why I want to remove the extras automatically.
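One possible explanation, which this issue does not confirm: if the array returned by getChildren is the live children array and removeElement splices nodes out of it, then removing inside a for...of loop shifts the later items and the iterator skips the one after each removal. The sketch below reproduces that effect with plain arrays only. removeFrom is a hypothetical stand-in, not the DomUtils function.

```javascript
// Hypothetical stand-in: removes `item` from `list` in place.
function removeFrom(list, item) {
  const index = list.indexOf(item);
  if (index !== -1) list.splice(index, 1);
}

// Top-level children as a parser might produce them: two html elements,
// each followed by a whitespace text node.
const makeChildren = () => [
  { type: "tag", name: "html" },
  { type: "text", data: "\n" },
  { type: "tag", name: "html" },
  { type: "text", data: "\n" },
];

// Removing while iterating the same array skips the item after each removal.
const live = makeChildren();
let found = false;
for (const child of live) {
  if (found) {
    removeFrom(live, child);
    continue;
  }
  if (child.name === "html") found = true;
}

// Iterating over a copy visits every original child.
const safe = makeChildren();
found = false;
for (const child of [...safe]) {
  if (found) {
    removeFrom(safe, child);
    continue;
  }
  if (child.name === "html") found = true;
}
```

In the first loop the second html element survives because the whitespace node before it was removed while the iterator was already past that index. Iterating over a copy leaves only the first html element.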
Hi
I'm using your great product, and I really need the new "EndIndices" feature.
I'm using the code as an npm package, but currently I have to point my config file directly at the git repository because the version on npm is out of date.
Can you please update the npm package to the latest version?
Many thanks
Ysrael
This could very well be a 'close, won't fix' issue, and if so, that's ok.
But I wanted to point out how unusual it is to use attribs as the property name that stores HTML attributes. I would much prefer attrs or attributes. If this could be changed, it would be nice.
Like I said though, I understand how complex it can be to change after release.
Add synchronous version, please
They should be preserved, or there should be a preserveSensitiveWhitespace option.
The issue I'm having is with CSS styles within CDATA; content and URLs need those spaces.
What is the best way to delete an element?
Example: Remove the <book id="bk102"> element from
<catalog>
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>An in-depth look at creating applications
        with XML.</description>
    </book>
    <book id="bk102">
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-12-16</publish_date>
        <description>A former architect battles corporate zombies,
        an evil sorceress, and her own childhood to become queen
        of the world.</description>
    </book>
</catalog>
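A rough sketch of what deleting a node involves in a tree whose nodes carry parent, prev, next and children links. It works on hand-built plain objects, and detach is a hypothetical helper written for this illustration, so treat it as a model of the bookkeeping rather than the library's own removal routine.

```javascript
// Detach `node` from a tree of plain objects that use parent, prev, next
// and children links (a hypothetical model, not the library's classes).
function detach(node) {
  if (node.prev) node.prev.next = node.next;
  if (node.next) node.next.prev = node.prev;
  if (node.parent) {
    const siblings = node.parent.children;
    const index = siblings.indexOf(node);
    if (index !== -1) siblings.splice(index, 1);
  }
  node.parent = node.prev = node.next = null;
}

// Build a catalog with two book elements by hand.
const catalog = { name: "catalog", attribs: {}, children: [], parent: null, prev: null, next: null };
for (const id of ["bk101", "bk102"]) {
  const last = catalog.children[catalog.children.length - 1] || null;
  const book = { name: "book", attribs: { id }, children: [], parent: catalog, prev: last, next: null };
  if (last) last.next = book;
  catalog.children.push(book);
}

// Find the book by its id attribute and remove it.
const target = catalog.children.find((child) => child.attribs.id === "bk102");
detach(target);
const remainingIds = catalog.children.map((child) => child.attribs.id);
```

After the call, only bk101 remains under catalog and its next link no longer points at the removed book.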
This package is proving useful in a script I'm writing to convert SVG files to config files! One issue I noticed, though, is that it appears to conflate children and childNodes.
The childNodes getter should be available on all node types and return a list of nodes (Node[]). (See https://developer.mozilla.org/en-US/docs/Web/API/Node/childNodes)
A children getter, however, should only be available on Elements, and it should only return a list of Element child nodes (aka HTMLCollection), excluding Text and Comment nodes. If a setter is provided, it should only take an HTMLCollection as an argument. (See https://developer.mozilla.org/en-US/docs/Web/API/Element/children)
In src/node.ts, there's a comment above the childNodes getter that reads "Same as children. DOM spec-compatible alias", but I don't see where that's stated in https://dom.spec.whatwg.org. On the contrary, https://dom.spec.whatwg.org/#dom-parentnode-children appears to agree with the Mozilla developer documentation.
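The distinction can be sketched with plain objects. The helper names childNodesOf and elementChildrenOf are made up for this example, and the set of type strings counted as elements is an assumption based on the node shapes quoted elsewhere in these issues.

```javascript
// Types treated as elements here; "script" and "style" are included because
// other issues on this page mention that those tags get their own type.
const ELEMENT_TYPES = new Set(["tag", "script", "style"]);

// childNodes-like view: every child, whatever its type.
const childNodesOf = (node) => node.children || [];

// children-like view: element children only, as the DOM spec describes.
const elementChildrenOf = (node) =>
  childNodesOf(node).filter((child) => ELEMENT_TYPES.has(child.type));

const svg = {
  type: "tag",
  name: "svg",
  children: [
    { type: "text", data: "\n  " },
    { type: "tag", name: "path", children: [] },
    { type: "comment", data: " icon " },
    { type: "tag", name: "circle", children: [] },
  ],
};

const names = elementChildrenOf(svg).map((child) => child.name);
```

With this input, the childNodes-like view has four entries, while the children-like view contains only path and circle.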
I am modifying all the objects from inside the handler function, and I want to return the modified HTML. How do I get that?
There is no complete usage example in this documentation or in the htmlparser2 documentation.
Right now, we always use a special type for script and style tags. This is unexpected in XML mode.
I have a react.js project which indirectly uses domhandler v4.2.0, via cheerio I believe.
It has worked fine for months, and then suddenly my project started throwing this error when I try to build it.
C:\product-app\node_modules\domhandler\lib\node.js:32
[domelementtype_1.ElementType.Tag, 1],
^
TypeError: Cannot read property 'Tag' of undefined
at Object.<anonymous> (C:\product-app\node_modules\domhandler\lib\node.js:32:35)
at Module._compile (internal/modules/cjs/loader.js:1158:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)
at Module.load (internal/modules/cjs/loader.js:1002:32)
at Function.Module._load (internal/modules/cjs/loader.js:901:14)
at Module.require (internal/modules/cjs/loader.js:1044:19)
at require (internal/modules/cjs/helpers.js:77:18)
at Object.<anonymous> (C:\product-app\node_modules\domhandler\lib\index.js:15:14)
at Module._compile (internal/modules/cjs/loader.js:1158:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)
Any ideas what might be causing it?
I'm running node.js 12.16.1 with typescript 3.9.10.
This would allow code inference in IDEs such as WebStorm and VS Code.
Would it be possible to add an option to include the start index of each node? That is, the node's starting character index in the original markup string.
For example...
Xyz <script language= javascript>var foo = '<<bar>>';< / script><!--<!-- Waah! -- -->
[{
    data: 'Xyz ',
    type: 'text',
    startIndex: 0
}, {
    type: 'script',
    name: 'script',
    startIndex: 4,
    attribs: {
        language: 'javascript'
    },
    children: [{
        data: 'var foo = \'<bar>\';<',
        type: 'text',
        startIndex: 33
    }]
}, {
    data: '<!-- Waah! -- ',
    type: 'comment',
    startIndex: 65
}]
I know this value is available on the htmlparser.Parser instance (you can look in parser.startIndex during an onopentag call)... but I don't know if a DomHandler instance could access this property, because its onopentag function doesn't have access to the parser instance that's using it... is there a way?
Is it possible to manipulate the DOM in the DomHandler callback, like Babel's Path in its AST tree?
Are you giving people permission to use this code in their projects? If so, please let us know. The best way is to put an open source license in your code project, or to indicate that you are giving permission for people to copy, use, and modify your code.
Here's the license text. You can just add a file to your project, or put this in the README. This way we'll know you are giving permission to use the code, and it will also require users of your code to maintain your copyright, so that when your code is used you get credit for the code you created and shared.
http://opensource.org/licenses/BSD-2-Clause
Thanks!
Gil
With a document such as this:
<!doctype html>
<html lang="en">
<title>My Document</title>
<h1>Title</h1>
Notice that it is missing the </html> end tag. Since this tag is optional, I often omit it. It would be great if domhandler had some way to indicate that a tag is missing its end tag. Something like closing: false would suffice.
Thanks for continuing to maintain this project and adding new features. Currently, Node#cloneNode does not clone the indices from the original object, regardless of whether they are set.
const [elm] = parseDOM(
`<div>
<p>
Hello world
</p>
</div>`, {
withEndIndices: true,
withStartIndices: true,
});
assert(elm.startIndex === 0);
assert(elm.endIndex === 48);
const newElm = elm.cloneNode(true);
newElm.startIndex // --> null :(
newElm.endIndex // --> null :(
The cloned node should inherit startIndex and endIndex from the original object.
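Until cloning carries the indices over, one workaround is to copy them after the fact. The sketch below does that on plain objects: copyIndices and cloneWithoutIndices are hypothetical helpers, the latter only imitating the behaviour described above, and the index values on the inner p node are illustrative rather than taken from a real parse.

```javascript
// Copy startIndex/endIndex from `source` onto `clone`, walking both trees in
// parallel. Assumes the two trees have the same shape, as after a deep clone.
function copyIndices(source, clone) {
  clone.startIndex = source.startIndex ?? null;
  clone.endIndex = source.endIndex ?? null;
  const sourceChildren = source.children || [];
  const cloneChildren = clone.children || [];
  sourceChildren.forEach((child, i) => {
    if (cloneChildren[i]) copyIndices(child, cloneChildren[i]);
  });
  return clone;
}

// Stand-in for a deep clone that drops the indices, as the issue describes.
function cloneWithoutIndices(node) {
  const { startIndex, endIndex, children, ...rest } = node;
  const copy = { ...rest, startIndex: null, endIndex: null };
  if (children) copy.children = children.map(cloneWithoutIndices);
  return copy;
}

const original = {
  type: "tag",
  name: "div",
  startIndex: 0,
  endIndex: 48,
  // Illustrative values for the nested node.
  children: [{ type: "tag", name: "p", startIndex: 8, endIndex: 39, children: [] }],
};
const fixed = copyIndices(original, cloneWithoutIndices(original));
```

Whether a clone should keep the source positions is a design question, since a cloned node no longer corresponds to any location in the original markup once it is moved or edited.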
This happens when running serverside.
In my case inside Java Nashorn.
I typically have to Polyfill stuff.
So that might be the case here too.
Going to continue looking for "DOM" polyfills. Any suggestions on the way? :)
$ npm update
npm ERR! code ETARGET
npm ERR! notarget No matching version found for domhandler@^4.2.0.
npm ERR! notarget In most cases you or one of your dependencies are requesting
npm ERR! notarget a package version that doesn't exist.
Would it be possible, with withStartIndices and withEndIndices set, for a node to have an attribsIndices property (or something like that) containing the start and end indices of attribute names and values?
It also would be great to have not only offsets but line and column numbers as well.
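Given a character offset such as startIndex and the original markup string, line and column numbers can be derived by the caller. The function below is a small sketch of that conversion; offsetToPosition is a made-up name and the one-based numbering is a choice made for this example.

```javascript
// Convert a zero-based character offset into one-based line and column
// numbers by counting the newlines that come before the offset.
function offsetToPosition(source, offset) {
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < offset && i < source.length; i++) {
    if (source[i] === "\n") {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: offset - lineStart + 1 };
}

const markup = "<div>\n  <p>Hello</p>\n</div>";
const position = offsetToPosition(markup, markup.indexOf("<p>"));
// position is { line: 2, column: 3 }
```

This scans from the start on every call. For many lookups on a large document, precomputing the offsets of all line starts and binary-searching them would be cheaper.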
Currently, children and childNodes refer to the same thing, which is what the browser DOM calls childNodes. That's not spec compliant: children is an HTMLCollection containing only elements, without things like text nodes and comments (spec).
I can PR this if you're interested; while it breaks backwards-compat with this library, it's breaking compliance to spec, and surprised me quite a bit.
This is a great library and it's already been extremely useful for a number of things.
Is there a design reason why the APIs deviate from the DOM standard, or is it just the way it ended up?
I won't go into too much detail here, but the fact that property names and functions have different names and types is extremely disruptive, if you're trying to integrate with existing code and tests, etc. - and especially in TypeScript.
To be clear, I'm not asking for or expecting a full implementation of the DOM standard - I'm not asking for any new features per se. But even code that requires a subset of a DOM interface does not immediately work without remapping the node model to something compatible first.
Would you be at all open to changing this? I might be able to help. (It would be a breaking change, of course.)
(I apologize if this has already been asked and answered - it seems unlikely I could be the first person to ask, but I did search your issues and, to my surprise, I didn't find anything.)
The fix made in #51 hasn't been applied to major version 2 of this package. It would be nice for a version 2.4.3 to be published with such a fix as well so that dependents of v2 can also reap the benefit.
The test at https://github.com/kpdecker/cheerio/blob/master/test/api.manipulation.js#L795-798 is failing when the decodeEntities flag is enabled. The assert is getting a value of "MM&M".
Hi, I tried the example code in the README, but there is no comment element after parsing.
in README:
[
    // ignoring first element
    {
        type: "script",
        name: "script",
        attribs: {
            language: "javascript",
        },
        children: [
            {
                data: "var foo = '<bar>';<",
                type: "text",
            },
        ],
    },
    {
        data: "<!-- Waah! -- ",
        type: "comment",
    },
];
with [email protected]
& [email protected]
[
    // ignoring first element
    {
        type: "script",
        name: "script",
        attribs: {
            language: "javascript",
        },
        children: [
            {
                data: "var foo = '<<bar>>';< / script><!--<!-- Waah! -- -->",
                type: "text",
            },
        ],
    }
];
I am not sure which result I should expect.
Thanks in advance!
const handler = new DomHandler(null, null, async (element) => {
Currently I need a DomHandler that also allows an async function (one that uses await) as its callback.
You use "@types/htmlparser2": "^3.10.1", which in turn uses @types/[email protected]. @types/domhandler declares an interface for DomElement, but the implementation in your library doesn't have it, which is why the build always fails.
node_modules/@types/domutils/index.d.ts:6:10 - error TS2614: Module '"project/node_modules/domhandler/lib"' has no exported member 'DomElement'. Did you mean to use 'import DomElement from "project/node_modules/domhandler/lib"' instead?
6 import { DomElement } from "domhandler";
~~~~~~~~~~
node_modules/@types/htmlparser2/index.d.ts:17:10 - error TS2614: Module '"project/node_modules/domhandler/lib"' has no exported member 'DomElement'. Did you mean to use 'import DomElement from "project/node_modules/domhandler/lib"' instead?
17 export { DomElement, DomHandlerOptions, DomHandler, Element, Node } from 'domhandler';
~~~~~~~~~~
node_modules/@types/sanitize-html/index.d.ts:17:10 - error TS2459: Module '"project/node_modules/htmlparser2/lib"' declares 'Options' locally, but it is not exported.
17 import { Options } from "htmlparser2";
~~~~~~~
node_modules/htmlparser2/lib/index.d.ts:5:14
5 declare type Options = ParserOptions & DomHandlerOptions;
~~~~~~~
'Options' is declared here.
Hi, I'm having this failure of the last test with nodejs v4.6.1:
1) withStartIndices adds correct startIndex properties:
TypeError: Cannot read property 'startIndex' of null
at DomHandler._addDomElement (/root/debian/node-cheerio/node-domhandler/index.js:71:36)
at DomHandler.onprocessinginstruction (/root/debian/node-cheerio/node-domhandler/index.js:175:7)
at Parser.ondeclaration (/usr/lib/nodejs/htmlparser2/lib/Parser.js:254:13)
at Tokenizer._stateInDeclaration (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:336:13)
at Tokenizer._parse (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:674:9)
at Tokenizer.write (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:627:7)
at Tokenizer.end (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:820:17)
at Parser.end (/usr/lib/nodejs/htmlparser2/lib/Parser.js:322:18)
at Parser.parseComplete (/usr/lib/nodejs/htmlparser2/lib/Parser.js:314:7)
at Context.<anonymous> (/root/debian/node-cheerio/node-domhandler/test/tests.js:46:10)
at callFn (/usr/lib/nodejs/mocha/lib/runnable.js:223:21)
at Test.Runnable.run (/usr/lib/nodejs/mocha/lib/runnable.js:216:7)
at Runner.runTest (/usr/lib/nodejs/mocha/lib/runner.js:373:10)
at /usr/lib/nodejs/mocha/lib/runner.js:451:12
at next (/usr/lib/nodejs/mocha/lib/runner.js:298:14)
at /usr/lib/nodejs/mocha/lib/runner.js:308:7
at next (/usr/lib/nodejs/mocha/lib/runner.js:246:23)
at Immediate._onImmediate (/usr/lib/nodejs/mocha/lib/runner.js:275:5)
at processImmediate [as _immediateCallback] (timers.js:383:17)
Any idea why this could happen? Thanks, Paolo
The withDomLvl1 option uses code, particularly the declarative syntax for NodePrototype (https://github.com/fb55/domhandler/blob/master/index.js#L78-L90), that causes issues in some browsers.
Can this be abstracted away to another module?
Seems to me CDATA input should be a DataNode rather than a NodeWithChildren?
This code imports DomHandler, however it is never used in the following code.
const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const rawHtml =
    "Xyz <script language= javascript>var foo = '<<bar>>';< / script><!--<!-- Waah! -- -->";
const handler = new htmlparser.DomHandler(function(error, dom) {
    if (error) {
        // Handle error
    } else {
        // Parsing completed, do something
        console.log(dom);
    }
});
const parser = new Parser(handler);
parser.write(rawHtml);
parser.end();
rawHtml:
<wxs ... />...
result:
<wxs ...>...
I asked in html-webpack-plugin but they were not helpful:
jantimon/html-webpack-plugin#1733
Do I need to update Jest or something else?
It would be a great feature if the parser could resolve HTML entities. For example, if the parser is passed a numeric character reference for the letter a, the resulting "data" for the text node would be 'a' instead of the entity. Similarly, named entities could be resolved to their Unicode equivalent characters.
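As an illustration of the numeric half of this request, the sketch below decodes decimal and hexadecimal character references in a string. decodeNumericEntities is a name invented for this example, and named entities are deliberately left untouched because resolving them needs a lookup table.

```javascript
// Decode decimal (&#97;) and hexadecimal (&#x61;) character references.
// Named entities such as &amp; are left alone.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, (match, hex, dec) => {
    const codePoint = hex ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references as written rather than throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

const decoded = decodeNumericEntities("&#97;&#x62;c &amp; &#x1F600;");
```

A full decoder would also need the rules the HTML specification defines for invalid and legacy references, which this sketch ignores.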
There is a request opened two years ago to add license metadata to their npm distribution:
fb55/domelementtype#7
warning: CRLF will be replaced by LF in
jshint/node_modules/htmlparser2/node_modules/domhandler/index.js.
Here is my rawHtml and result after write and end:
raw:
...
result:
<wxs src="../../../../wxs/imgUtil.wxs" module="imgUtil">...</wxs>