Comments (2)
I agree that we should handle this, but I am more curious to know how you think we should handle it.
I have a few ideas, but I am not fully aware of how browsers do this. If you have any insight or ideas, please let me know.
Thanks
from php-html-parser.
Ultimately it comes down to detecting the `>` that ends the start tag, which can be tricky if there are multiple `>` characters inside other attributes in the tag, or an uneven number of `"` marks spread across one or more attributes. I assume browsers use some complicated heuristics, but from messing around with various malformed attributes in Firefox, it looks like it treats the first `>` that's not inside an attribute (that is, not between a matching pair of `"` marks) as the end of the open tag. Everything after that (until a `<`) is considered part of the body of the element.
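As a rough illustration of that rule, here is a minimal Python sketch (not the library's code, and not what any browser actually runs). The function name `find_start_tag_end` is hypothetical, and the sketch assumes that a `"` only opens an attribute value when it directly follows an `=`, so a stray quote elsewhere in the tag is ignored.

```python
from typing import Optional


def find_start_tag_end(html: str, start: int = 0) -> Optional[int]:
    """Return the index of the '>' that ends the start tag beginning at `start`.

    Assumption (hypothetical rule for this sketch): a double quote opens an
    attribute value only when it follows '=' (optionally after whitespace).
    Returns None if a quoted value never closes or no '>' is found.
    """
    i = start
    n = len(html)
    while i < n:
        ch = html[i]
        if ch == ">":
            # First '>' that is not inside a quoted attribute value.
            return i
        if ch == "=":
            # Look past optional whitespace for an opening quote.
            j = i + 1
            while j < n and html[j].isspace():
                j += 1
            if j < n and html[j] == '"':
                close = html.find('"', j + 1)
                if close == -1:
                    return None  # quoted value is never terminated
                i = close + 1  # resume scanning after the closing quote
                continue
        i += 1
    return None
```

For example, `find_start_tag_end('<a title="x > y">text</a>')` skips the `>` inside the quoted value and returns the index of the one that follows the closing quote.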
So there'd be two main (non-nominal) cases: an attribute with an extra pair of unescaped `"` inside of it (e.g. `<a title="This "is" an attribute">`), and an attribute with one extra `"` inside it (e.g. `<a title="This "is an attribute">`). The former is obviously easier; the latter could be treated as having one attribute called `title` with value `This`, and three other attributes named `is`, `an`, and `attribute`, each with empty values (the `"` immediately after `attribute` would be discarded as an invalid token, since you can't have a double quote after an attribute name; there has to be an `=` first). The `>` is thus not inside an attribute, from the parser's POV, and ends the tag.
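The treatment proposed above could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not php-html-parser's implementation: `parse_start_tag` is a hypothetical name, whitespace around `=` is not handled, and stray `"` or `=` characters are simply dropped.

```python
from typing import Dict, Optional, Tuple


def parse_start_tag(tag: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Leniently parse a start tag such as '<a title="x" y>'.

    Returns (tag_name, attributes), or None if a quoted value never closes.
    Hypothetical sketch: a quote opens a value only right after '='.
    """
    n = len(tag)
    i = 1  # skip the leading '<'
    j = i
    while j < n and not tag[j].isspace() and tag[j] != ">":
        j += 1
    name = tag[i:j]
    attrs: Dict[str, str] = {}
    i = j
    while i < n and tag[i] != ">":
        ch = tag[i]
        if ch.isspace() or ch in '"=':
            # Whitespace separates attributes; a stray quote or '=' is discarded.
            i += 1
            continue
        # Read the attribute name.
        j = i
        while j < n and not tag[j].isspace() and tag[j] not in '=>"':
            j += 1
        attr = tag[i:j]
        if j < n and tag[j] == "=":
            k = j + 1
            if k < n and tag[k] == '"':
                close = tag.find('"', k + 1)
                if close == -1:
                    return None  # unterminated quoted value
                attrs[attr] = tag[k + 1:close]
                i = close + 1
            else:
                # Unquoted value: runs until whitespace or '>'.
                e = k
                while e < n and not tag[e].isspace() and tag[e] != ">":
                    e += 1
                attrs[attr] = tag[k:e]
                i = e
        else:
            attrs[attr] = ""  # attribute name with no value
            i = j
    return name, attrs
```

With this sketch, `<a title="This "is an attribute">` yields a `title` attribute plus empty `is`, `an`, and `attribute` attributes; note that the `title` value keeps its trailing space (`"This "`) unless it is trimmed separately. The variant with the extra pair of quotes happens to produce the same attribute set here, which is a property of this sketch rather than a claim about browser behaviour.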
Another edge case is something like this: `<span title=">something</span>`. Firefox discards the entire thing as unparseable.
Presumably the corresponding logic can be found (somewhere) in the vast Firefox (Gecko?) and Chrome (Chromium) codebases.