Comments (3)
Some thoughts on this, and why I've added the `maybe` label.
In general, there is a lot of work to do here. None of it is impossible, but some of it is an excessive amount of work.
Validating emails is easy. Validating URLs is a bit more work, and validating patterns is a ton.
Why is validating patterns a ton of work? Well, HTML basically uses the JavaScript regular expression engine to evaluate the patterns. We use Python, and Python `re` != JavaScript `RegExp`. JavaScript adds `\cXX` escapes and `\u{xxxx}` escapes, it doesn't have lookbehinds, etc.
So how do you get over this hurdle?
- Option 1 would be to preprocess the patterns: invalidate a pattern if it contains syntax that JS regexp does not support, escape anything unsupported by JS regexp that Python would trigger on, and translate things like `\cXX` and `\u{xxxx}` to the equivalent Python `re` syntax. I've done similar things in https://github.com/facelessuser/backrefs. The work wouldn't be as big as it was in backrefs, as I would not need to implement Unicode properties, just exclude certain syntax via failure or escaping, and translate a few other syntax tokens.
- Option 2 would be to require (optionally) some library that provides bindings to something like the V8 JavaScript engine in order to tap into a JavaScript regexp library.
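A minimal sketch of the translation step in Option 1 might look like the following. It is not Soup Sieve code: `translate_js_pattern` and `matches_pattern` are hypothetical names, only the `\cX` (control letter) and `\u{xxxx}` escapes are handled, and anything else is passed through so that Python's `re` either accepts it or rejects the pattern. The whole-value match via `re.fullmatch` is also an assumption about how a `pattern` attribute would be applied.

```python
import re

# Hypothetical helper, not part of Soup Sieve. Matches one backslash escape at
# a time so that an escaped backslash (`\\`) is consumed as a unit and never
# mistaken for the start of `\c` or `\u{...}`.
_JS_ESCAPE = re.compile(
    r'\\(?:c([A-Za-z])|u\{([0-9A-Fa-f]{1,6})\}|(.))',
    re.DOTALL
)


def translate_js_pattern(pattern):
    """Translate a few JavaScript-only escapes to Python `re` syntax."""

    def repl(m):
        if m.group(1) is not None:
            # `\cX`: control character whose code point is ord(X) modulo 32.
            return '\\x{:02x}'.format(ord(m.group(1)) % 32)
        if m.group(2) is not None:
            # `\u{xxxx}`: code point escape, written as Python's `\UXXXXXXXX`.
            code_point = int(m.group(2), 16)
            if code_point > 0x10FFFF:
                raise ValueError('Code point out of range: ' + m.group(0))
            return '\\U{:08x}'.format(code_point)
        # Any other escape is left untouched.
        return m.group(0)

    return _JS_ESCAPE.sub(repl, pattern)


def matches_pattern(js_pattern, value):
    """Return True if the translated pattern matches the entire value."""
    try:
        compiled = re.compile(translate_js_pattern(js_pattern))
    except (re.error, ValueError):
        # Treat anything Python cannot compile as an invalid pattern.
        return False
    return compiled.fullmatch(value) is not None
```

For example, `matches_pattern(r'a\cJb', 'a\nb')` translates `\cJ` to a line feed before matching. A real implementation would also need the "invalidate or escape" passes described above, which this sketch leaves out.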
Anyways, it's a lot of work. Some validation (like the work done for `:in-range` and `:out-of-range`) is easy, some is quite involved. The question is whether the payoff is worth the work required. All of it is doable and well within my skill set to implement, but we'll have to see whether the motivation and payoff justify the workload required.
from soupsieve.
JavaScript does allow lookbehinds. It didn't use to, but now it does in some browsers.
If this were going to be done with Python, it would have to be done with the `regex` library, as the `re` library doesn't support variable-width lookbehinds, but JavaScript does.
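The `re` limitation mentioned above can be checked with the standard library alone. This is only a small illustration; `compiles_with_re` is a hypothetical helper name, and the third-party `regex` library is deliberately not imported here.

```python
import re


def compiles_with_re(pattern):
    """Return True if the standard library `re` module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# A fixed-width lookbehind is accepted by `re`.
fixed_width = r'(?<=ab)c'
# A variable-width lookbehind (`a+` can match any number of characters) is not.
variable_width = r'(?<=a+)c'

print(compiles_with_re(fixed_width))
print(compiles_with_re(variable_width))
```

The second pattern is rejected by `re` because the lookbehind does not have a fixed width, which is the case that would require a different engine.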
from soupsieve.
After doing some work to address a bug in `:placeholder-shown`, it has come to my attention that browsers often do a bit of normalization before they compare things like the length of a value. Things like carriage returns in a real browser environment may actually get normalized if they are raw, and maybe not if they are inserted as an entity. I guess I kind of already knew this, but it's not something I really put much thought into until I had to deal with it directly by coding logic around it.
Soup Sieve doesn't control such things; this is all handled by Beautiful Soup and the parsers. By the time Soup Sieve gets to look at the content in an input, it has already had entities turned into Unicode characters and other characters normalized (`html5lib`) or not normalized (`html.parser` and `lxml`). It may be difficult to mimic exactly what a browser would do in all cases due to this fact. `html5lib` is probably the most likely to be the closest in terms of how characters are handled. This is assuming it is doing what browsers do, and not some generalized approximation.
In some implementations, we may just have to accept that we can only approximate how some selectors work based on the limitations of the environment.
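As a rough illustration of why normalization matters for length comparisons, the sketch below folds carriage returns into line feeds before measuring a value. The `normalize_newlines` helper is hypothetical, and treating CRLF and lone CR as a single LF is an assumption about what a browser might do, not a statement of what Soup Sieve or any parser actually does.

```python
def normalize_newlines(value):
    """Fold CRLF pairs and lone CRs into LF (assumed browser-like behavior)."""
    return value.replace('\r\n', '\n').replace('\r', '\n')


raw = 'line one\r\nline two'

# The same value has a different length depending on whether the
# carriage return survived parsing.
print(len(raw))
print(len(normalize_newlines(raw)))
```

A length-based check would therefore give different answers for the same markup depending on whether the parser had already normalized the value.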
from soupsieve.