facelessuser / soupsieve
A modern CSS selector implementation for BeautifulSoup
Home Page: https://facelessuser.github.io/soupsieve/
License: MIT License
This has been pulled out of the PySpelling project to be released as its own project. There is a bit of work to do before it can go on PyPI.
When raising syntax exceptions, some of the messages could be phrased better, and it may also be nice to display the character position.
It may also be nice to provide a DEBUG flag that causes verbose messages describing how the CSS pattern is tokenized.
Finish up testing uncovered areas of code. This is a requirement for 1.0.0.
There is nothing wrong with throwing an error for an unsupported pseudo-class, but currently, a supported pseudo-class with bad syntax may get caught and thrown as this unsupported error. This can be confusing as it isn't exactly true. We should probably still catch these, but then compare the name against our supported list and issue a more appropriate error for supported pseudo-classes.
Right now we have two iter functions: commentsiter and selectiter. For 1.0, should they be called comments_iter and select_iter? Or maybe icomments and iselect?
We have some flexibility before the 1.0 release, so we should really settle on something. commentsiter and selectiter just seem hard to read.
Fix issue noted here: MechanicalSoup/MechanicalSoup#263.
We were checking against the html namespace for type case insensitivity instead of whether the document was XML or not.
In [122]: xml = """<Envelope><Header>...</Header></Envelope>"""
In [123]: s = BeautifulSoup(xml, "xml")
In [124]: s.select("header")
Out[124]: [<Header>...</Header>]
In [125]: s.select("Header")
Out[125]: []
Before, BeautifulSoup accepted (and I think required) case-sensitive tag names in selectors.
Now that BeautifulSoup uses soupsieve, it seems that only lowercase selectors are supported.
I'm really not sure why or if I can change this behaviour.
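A minimal sketch of the intended fix (illustrative only, not soupsieve's actual code): case folding of tag names should key off whether the document is XML, not off the html namespace.

```python
def normalize_tag(name, is_xml_document):
    # Tag names are case-sensitive in XML but case-insensitive in HTML,
    # so only fold case when the document is not XML.
    return name if is_xml_document else name.lower()
```

For the XML example above, "Header" would stay distinct from "header"; for an HTML document, both would compare equal.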
According to the spec, [att~=val]:
Represents an element with the att attribute whose value is a whitespace-separated list of words, one of which is exactly "val". If "val" contains whitespace, it will never represent anything (since the words are separated by spaces). Also if "val" is the empty string, it will never represent anything.
We are currently not enforcing this (the whitespace part), but we should.
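The rule can be sketched like this (a minimal illustration of the spec wording; the function name is my own):

```python
def matches_word(attr_value, val):
    # [att~=val]: `val` must be a single, non-empty word. Per the spec,
    # whitespace inside `val`, or an empty `val`, can never match.
    if not val or any(c.isspace() for c in val):
        return False
    return val in attr_value.split()
```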
This is another selector that we just have to define well; once we understand the details, we can start work on this.
I've decided to move this to a separate issue as || is technically an "at risk" selector feature. It is very low priority, and if it doesn't make it into the spec, it will not be implemented. There still needs to be more clarification in the spec, or at least a reference implementation we can work off of.
Implement a closest API function. It would function no differently than noted here.
In short, given a selector and a tag, closest would return the closest tag ancestor (including the given tag) that matches the selector. See the link above for examples.
This is super easy to implement.
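The lookup itself is just a walk up the tree. A generic sketch (Node and predicate stand in for a Beautiful Soup tag and a compiled selector's match test):

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def closest(node, predicate):
    # Walk from the node itself up through its ancestors, returning the
    # first one that matches, or None if nothing matches.
    current = node
    while current is not None:
        if predicate(current):
            return current
        current = current.parent
    return None
```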
It might be nice to spawn a custom parser where you can just add your namespace mapping once, and all calls will pick it up. You could still do a one-off call and manually feed them in, but you could also create a custom parser and then just call as normal:
import soupsieve as sv
parser = sv.custom_parser()
parser.set_namespaces({"ns": "http://ns.com"})
parser.select(':header.class', soup)
Originally this was planned to be done with :valid and :invalid, but these can be handled separately.
I like the functions soup.select() and soup.select_one(). Maybe you can add select_one to this repo? If the result is None, I can easily know that there are no matching objects, just like https://docs.python.org/3/library/re.html#re.search. In fact, I don't know why you use limit to control the number of results. I think people would need one or all of the results.
SyntaxError is a builtin exception meant for Python syntax errors, and it is specialized for that task. To avoid confusion for people, we should derive our new SelectorSyntaxError from the general Exception instead. For the 1.0 series, we will leave it as SyntaxError to avoid breakage, but for 2.0, we will make this change.
Reference #105
Rework the parse engine to not assume that just because it is a "pseudo" class it needs to be closed. This will make some of our snippets less confusing, as we currently have to include a ) at the end as we send replacement patterns through.
- :playing / :paused are for media stuff. We can't play or pause in our environment, so these will just match nothing (#39).
- :local-link: this will most likely match nothing (#39).
- :user-invalid: there can be no user interaction in our environment, so this will match nothing (#39).
- :scope: currently behaves like :root, but I need to research this more, or understand what the future goal for this is. It will match :root if the document is under select or match, or match the tag if a tag is under match or select. It is used to make a pattern relative to the tag under evaluation. (#38)

For contributors coming to add improvements, it would be helpful to document the internal structure of how we construct our CSS selector structure. This should be on the development page.
The idea of allowing user definable custom selectors is a cool idea that is currently in draft: https://drafts.csswg.org/css-extensions/#typedef-custom-selector.
The idea is essentially to allow creating a custom pseudo-class as an alias for a more complex expression.
@custom-selector :--heading h1, h2, h3, h4, h5, h6;
:--heading { /* styles for all headings */ }
:--heading + p { /* more styles */ }
There seem to be some open issues asking for a lot more complexity than what is shown here, but I think it may be reasonable to allow stuff like this, and then a user could construct aliases for whatever they would like. That way we don't ever actually need to support such things directly ourselves.
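A naive textual sketch of the aliasing idea (illustrative only; a real implementation would substitute on the parsed selector tree, not on raw strings):

```python
def expand_custom(selector, aliases):
    # Replace each custom pseudo-class with its expansion wrapped in
    # :is() so combinators around it keep their meaning.
    for name, expansion in aliases.items():
        selector = selector.replace(name, ":is({})".format(expansion))
    return selector
```

For example, expand_custom(":--heading + p", {":--heading": "h1, h2, h3, h4, h5, h6"}) yields ":is(h1, h2, h3, h4, h5, h6) + p".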
https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity
"Using !important, however, is bad practice and should be avoided because it makes debugging more difficult by breaking the natural cascading in your stylesheets."
But sometimes when I use complex selectors, it is hard for me to review them.
Can we use something like parentheses (e.g. 3*(3+4)=21)?
This is planned, but before it can be implemented, we need to have a well defined understanding of what can be "read only" and what is "read write". Once we understand this, we should be able to implement this no problem.
Please, exclude all tests from the package. Otherwise they'll get installed in site-packages/tests.
--- setup.py.orig 2019-02-19 09:37:15.000000000 +0000
+++ setup.py
@@ -51,7 +51,7 @@ setup(
author='Isaac Muse',
author_email='[email protected]',
url='https://github.com/facelessuser/soupsieve',
- packages=find_packages(exclude=['tests', 'tools']),
+ packages=find_packages(exclude=['tests', 'tests.*', 'tools']),
install_requires=get_requirements(),
license='MIT License',
classifiers=[
We could probably implement some things like :hover, :active, :focus, :target, :visited, etc. Our environment (not in a browser) doesn't actually have these states, so we could just have them never match. The library cssselect apparently does this. We can pick up any CSS4 selectors like this (if any) and do the same.
We could also implement HTML-only selectors, or selectors that are really only defined for HTML. We would have to make some assumptions based on our environment. For instance, :link would match all links since, in our environment, all links are unvisited.
Basically, we want to work towards having all selectors defined in a way that makes sense for our environment if possible.
This is an initial list that is targeted. It may change. Not all undefined selectors are being targeted. Some may be implemented at a later time.
In an issue on the BS4 google groups, a discussion took place in regards to the change in how quoted attributes are treated (Soup Sieve follows the spec, while the old BS4 method is super lax), and that maybe a deprecation path should be provided. There are three cases that were noted:
- td.+.class: this case shows that the old BS4 select method would allow a class with no class name to be defined and would treat it as if no class was defined. CSS actually won't match anything; Soup Sieve won't match either, but instead opts to fail. If we implemented a QUIRKS mode, it would ignore classes like this, and most likely ids (#) as well.
- > p: BS4, before Soup Sieve, would treat this kind of like a relative selector. This was most likely an omission rather than by design, but at least one project exploited it. And where there is one, there are many. QUIRKS mode would most likely inject :scope for the user.
Note: If BS4 actually matched the siblings of the element select was called on in the case of + div, we will not emulate that. Select should match downwards, not laterally. This point I will not budge on, but I'm pretty sure BS4 didn't match these...probably.
- [attribute={}]: BS4 used to have a very lax rule for the attribute value. It would allow most anything unquoted as long as it wasn't a double quote or closing square bracket. We cannot relax to this degree. We will continue to recognize both single and double quoted values, but within an unquoted value we will allow a great deal more, except whitespace.
QUIRKS mode would only exist for as long as BS4 requires it. It is not guaranteed that we will do this, but it is being considered.
Support for this would require quite a bit of work. We would need to write proper validators for each kind of input type. I am not sure when this will get done, but it is large enough to be a case unto itself.
Moved to a separate issue. This would include :in-range and :out-of-range, though as :in-range and :out-of-range are simpler, it is possible they could get implemented first.
Python: 2.7.3
OS: RHEL
Dependency graph: MechanicalSoup → beautifulsoup4 → soupsieve
With beautifulsoup4 4.7.0, soupsieve is installed as a dependency. It seems like Python 2.7.3 has some issues parsing one of the regular expressions in css_parser.py (RE_LANG to be specific). Do you have any plans on supporting python 2.7.4 or lower?
Here is the full stack trace for this issue:
File "~/virtualenvs/<venv>/lib/python2.7/site-packages/mechanicalsoup/__init__.py", line 2, in <module>
from .browser import Browser
File "~/virtualenvs/<venv>/lib/python2.7/site-packages/mechanicalsoup/browser.py", line 2, in <module>
import bs4
File "~/virtualenvs/<venv>/lib/python2.7/site-packages/bs4/__init__.py", line 34, in <module>
from .builder import builder_registry, ParserRejectedMarkup
File "~/virtualenvs/<venv>/lib/python2.7/site-packages/bs4/builder/__init__.py", line 7, in <module>
from bs4.element import (
File "~/virtualenvs/<venv>/lib/python2.7/site-packages/bs4/element.py", line 12, in <module>
import soupsieve
File "~/virtualenvs/<venv>/lib/python2.7/site-packages/soupsieve/__init__.py", line 30, in <module>
from . import css_parser as cp
File "~/virtualenvs/<venv>/lib/python2.7/site-packages/soupsieve/css_parser.py", line 144, in <module>
RE_LANG = re.compile(r'(?:(?P<value>{value})|(?P<split>{ws}*,{ws}*))'.format(ws=WSC, value=VALUE), re.X)
File "~/virtualenvs/<venv>/lib/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "~/virtualenvs/<venv>/lib/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
Apparently, it is quite common to have the type attribute specifically treated as case insensitive. This seems to be the only attribute that follows this convention. For this reason, the s sensitivity flag has been added to the CSS4 spec: *[type="submit" s]. We should treat the type value as case insensitive, and also support s to force sensitivity. In addition, we should ensure that case is not enforced for the flag itself.
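A rough sketch of the comparison logic being described (names and structure are my own; the i flag is CSS4's existing explicit case-insensitive counterpart):

```python
def attr_value_matches(attr, actual, expected, flag=''):
    # `type` compares case-insensitively by default in HTML; an `s` flag
    # forces case sensitivity, `i` forces insensitivity. The flag itself
    # is matched without regard to case.
    flag = flag.lower()
    if flag == 's':
        fold = False
    elif flag == 'i' or attr.lower() == 'type':
        fold = True
    else:
        fold = False
    if fold:
        return actual.lower() == expected.lower()
    return actual == expected
```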
We must somehow be using a secondary pip, or calling the wrong Python, when calling tox.
We are careful to set our desired Python first in the path, and then use python -m (calling the Python installation we set first in the path) to ensure we use its tox. Yet, after upgrading to pip 18.1, we still get an install error for pip 7.x. It doesn't happen all the time, but it is frustrating as the cause is not understood. We need to get to the bottom of this moving forward so that we can have reliable, automated Windows testing.
We forgot to register NullSelector to be pickled. So if a compiled selector object contains a NullSelector, it will not pickle.
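For anyone curious what such a registration looks like, here is a generic copyreg sketch (the class is a stand-in for illustration, not soupsieve's actual NullSelector):

```python
import copyreg
import pickle

class NullSelector:
    """Stand-in for an internal selector type (illustrative only)."""

def _reduce_null(obj):
    # Tell pickle how to reconstruct instances: call the class with no args.
    return (NullSelector, ())

# Registering the reduction function makes instances reliably picklable.
copyreg.pickle(NullSelector, _reduce_null)

restored = pickle.loads(pickle.dumps(NullSelector()))
```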
For XML, you can use the its namespace to do dir. I don't have time right now to look into it, but maybe in the future.
Not sure where to file this bug as I'm using soupsieve via BeautifulSoup4, but here goes:
Using BeautifulSoup4 4.7.1, with SoupSieve 1.6.2, under Python 3.6.
import bs4
source = """<html><body>
<div>1</div>
<div>2</div>
<div>3</div>
<div>4</div>
<div>5</div>
<div>6</div>
<div>7</div>
<div>8</div>
<div>9</div>
<div>10</div>
<div>11</div>
<div>12</div>
<div>12</div>
<div>13</div>
<div>14</div>
<div>15</div>
<div>16</div>
</body></html>"""
soup = bs4.BeautifulSoup(source, 'lxml') # same result with html5lib
print(soup.select("div:nth-of-type(9)")) # Expect 9, is 9
print(soup.select("div:nth-child(9)")) # Expect 9, is 9
print(soup.select("div:nth-of-type(10)")) # Expect 10, is 15
print(soup.select("div:nth-child(10)")) # Expect 10, is 15
print(soup.select("div:nth-of-type(11)")) # Expect 11, is 16
print(soup.select("div:nth-child(11)")) # Expect 11, is 16
print(soup.select("div:nth-of-type(12)")) # Expect 12, finds nothing
print(soup.select("div:nth-child(12)")) # Expect 12, finds nothing
It seems to work well with single-digit index, but either returns empty or the wrong element for 10 and upwards.
(Also filed this on BS4 LaunchPad.)
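For reference, the expected matching semantics for :nth-child(an+b) reduce to a simple arithmetic check; the symptoms above suggest the parsing of multi-digit arguments, not the matching math, is at fault (my speculation):

```python
def nth_match(index, a, b):
    # A 1-based child `index` matches :nth-child(an+b) when
    # index = a*n + b for some integer n >= 0.
    if a == 0:
        return index == b
    n, rem = divmod(index - b, a)
    return rem == 0 and n >= 0

# :nth-child(10) should match exactly the 10th child,
# regardless of how many digits the argument has.
```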
Turns out, Beautiful Soup wraps its attribute keys in a special string-like object that exposes prefix and namespace:
>>> list(soup.use.attrs.keys())[0]
'xlink:href'
>>> list(soup.use.attrs.keys())[0].namespace
'http://www.w3.org/1999/xlink'
We were only checking the prefix as we didn't know this existed. We should be checking the namespace, as the prefix doesn't matter; prefixes can be redefined anywhere in the document.
Rework testing so that when we test HTML, we test all parsers, and when we test namespace-specific stuff, we test HTML5 and XHTML.
This selector is currently noted as "at risk" in the specifications. Basically it isn't well defined yet. It will most likely be fleshed out more at some point, but it cannot be implemented in its current state. Once it is actually fleshed out more, we can take a stab at implementing it.
I think we can simplify this and remove document flags.
This is a bug with handling valid XML namespaces; soupsieve assumes all namespaces have a prefix:
<prefix:tag xmlns:prefix="...">
but the prefix can be omitted to define a default namespace:
<tag xmlns="...">
meaning that any element without a prefix: prepended to the tag name is in that namespace. See section 6.2 of the XML Namespaces 1.1 spec.
During parsing, lxml passes in a default namespace under the None key, e.g. {None: "..."}, and unique keys are accumulated in the soup._namespaces dictionary. soupsieve assumes the dictionary only ever has string keys, so an XML document with a default namespace leads to an exception.
Test case (using BeautifulSoup 4.7 for convenience):
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.7.0'
>>> sample = b'''\
... <?xml version="1.1"?>
... <!-- unprefixed element types are from "books" -->
... <book xmlns='urn:loc.gov:books'
... xmlns:isbn='urn:ISBN:0-395-36341-6'>
... <title>Cheaper by the Dozen</title>
... <isbn:number>1568491379</isbn:number>
... </book>
... '''
>>> soup = BeautifulSoup(sample, 'xml')
>>> soup._namespaces
{'xml': 'http://www.w3.org/XML/1998/namespace', None: 'urn:loc.gov:books', 'isbn': 'urn:ISBN:0-395-36341-6'}
>>> soup.select_one('title')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1345, in select_one
value = self.select(selector, namespaces, 1, **kwargs)
File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1377, in select
return soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 108, in select
return compile(select, namespaces, flags).select(tag, limit)
File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 50, in compile
namespaces = ct.Namespaces(**(namespaces))
TypeError: __init__() keywords must be strings
where <title>Cheaper by the Dozen</title> was expected.
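One possible shape of a fix (a sketch under my own naming; soupsieve's actual handling may differ) is to remap the None key to the empty-string prefix before the mapping is expanded into keyword arguments:

```python
def sanitize_namespaces(namespaces):
    # lxml reports the default namespace under the key None; remap it to
    # '' so the mapping can safely be expanded as keyword arguments.
    return {('' if key is None else key): uri
            for key, uri in namespaces.items()}
```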
There are a couple of selectors found in jQuery's custom selector engine that might be useful, but there are some that I absolutely will not support. We already support [attr!=value] and the old, rejected CSS :contains(). Let's start off with what we won't support:

- :selected is already covered by CSS4's :checked. If I need to specifically target options, it is easy enough to do manually.

I'm not sure if I care to implement these, but these are possibilities:

- :parent: This seems like it may be useful. Essentially it would be an alias for :has(> *|*). It's a possibility.
- :header: Who doesn't hate doing this: h1, h2, h3, h4, h5, h6? This might be useful and would simply be an alias for :is(h1, h2, h3, h4, h5, h6).
- :checkbox for input[type=checkbox], etc. Button would be mildly more complicated: :button ~= :is(button, input[type=button]).
- :input would be a shortcut for all inputs: input, button, select, and textarea. I guess :is(input, button, select, textarea).

I'll at least leave this open for discussion.
Most of these can now be implemented with the coming custom selector support. You could have :--parent to implement parent, etc.
The jQuery selectors that cannot be supported with custom selectors are the following:
:first, :last, :even, :odd, :eq, :nth, :lt and :gt will not be supported in any way, shape, or form. This is because these would require us to preserve the order of the selectors in a compound selector and bubble up a list of elements that match each one, allowing these to then filter them. This adds a lot of complexity and code that I am not willing to take on.
This actually wouldn't be as difficult as once thought. The key is that we need to create a dictionary with each entry related to a unique indexing pseudo-class. There we could track the count for each one and simply apply a mod to tell whether we should return it. These types of selectors would have to be evaluated at the very end.
It would only work effectively in one direction (the positive direction), so something like :eq(-1) would not be feasible, as we don't accumulate a list of elements before yielding them; we yield them as we find them. I have no intention of changing this mechanic. The complexity of managing :eq(-1) nested in pseudo-classes such as :is() and :not() would be super complicated. But if we allowed positive numbers only, this would be easy to do. Positive values would be the only indexes I would consider.
Anyways, I would most likely have to receive requests for such a feature first.
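The counting idea above could be sketched like this (purely illustrative; the names are mine):

```python
from collections import defaultdict

class IndexFilter:
    """Track one running count per indexing pseudo-class as elements
    stream by, so e.g. :even / :eq(n) can be decided without buffering."""

    def __init__(self):
        self.counts = defaultdict(int)

    def accept_even(self, key):
        # :even -> keep elements at positions 0, 2, 4, ...
        i = self.counts[key]
        self.counts[key] += 1
        return i % 2 == 0

    def accept_eq(self, key, index):
        # :eq(index), with a non-negative index only.
        i = self.counts[key]
        self.counts[key] += 1
        return i == index
```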
As Soup Sieve is planned for inclusion in Beautiful Soup 4, we need to add Python 2 support.
For the most part, there shouldn't be any huge changes. I am assuming we can work primarily in Unicode. With that, we will have to be aware of wide and narrow characters when converting CSS Unicode escapes. I assume we'll use surrogate pairs for wide characters on narrow builds. Outside of that, it should be fairly straightforward.
I want to break tests up more into single cases, and maybe abstract the testing of different parsers, quirks, etc. Not a pressing issue, but something I do want to look at.
Right now we allow things like #3, which isn't really allowed in CSS. We should use proper patterns that mimic CSS appropriately.
It isn't bad, but it is very brief. It more or less says "hey, you could use these" with a brief example, without much context. I don't think I need to dedicate a page to each selector, but maybe it would be nice for each selector to have a section that describes it in more detail and provides an example with HTML and Soup Sieve parsing it.
It'd be a lot of work, but if I don't get any more Soup Sieve bugs, I could take my time on it. There probably aren't many more features to add per se. There are some selectors in the backlog, but I can't or won't implement them right now.
The :contains() selector came about a long time ago and was abandoned in the CSS spec. We currently mimic contains as it was described originally.
It is important to note that since the original spec is dead, there will be no updates. I imagine, if :contains() had not died way back when, it is not impossible to think it would have been expanded to allow a comma-separated list of content: p:contains("some text", "some other text").
As :contains() currently supports valid identifiers or quoted values, a comma-separated list would contain a list of valid identifiers or quoted values. :contains() would match if any of the items in the list match.
With things like :not() and :lang() moving towards comma-separated lists, and new functional selectors supporting comma-separated lists out of the box (except for things like :nth-child(), :dir(), etc.), I think this evolution for :contains() makes sense.
Over at csswg, it appears they plan to add :placeholder-select.
RESOLVED: Accept and add the :placeholder-select pseudo class and add a note for ::placeholder that we're interested in working on it
I don't think they plan to extend :placeholder-shown to include select options. Anyway, we'll have to wait until something is actually published before we even consider changing :placeholder-shown or adding :placeholder-select, but I want to at least track this so I remember to look at it in the future.
:placeholder-shown will be modified based on the wording in the CSS level 4 spec at this time.
When we speak of placeholders and select, we are specifically referring to this case: https://html.spec.whatwg.org/multipage/form-elements.html#placeholder-label-option.
I don't know if these will make it into version 1.0 or not. There are no real implementations of this available, and some things seem to still be in flux, such as its recent renaming.
In parent || child, is parent always compared against col and child against td? What happens if you specify something else: p || span?
The same question applies to :nth-col(): is the implied target td:nth-col() if you do something like .class:nth-col()?
There are a number of questions that I have which will need to be understood before this is implemented. There is also a bit of complexity involved here.
col
tags in table headertd
column and span based on captured infotd
fits within the relation.Add Python 3.8 to Travis and Appveyor, but allow it to fail to alert us if things change. Py 3.8 related issues were raised in #54.
'>+~' symbols at the beginning of selectors.
These selectors worked in Beautiful Soup 4.6.x, but in 4.7.x there is no support for such selectors.
For example, the code below causes a soupsieve.util.SelectorSyntaxError exception.
from bs4 import BeautifulSoup
BeautifulSoup('<a>test<b>test2</b></a>').a.select('> b')
Result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Programs\Programming\Python-3\lib\site-packages\bs4\element.py", line 1376, in select
return soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\__init__.py", line 112, in select
return compile(select, namespaces, flags, **kwargs).select(tag, limit)
File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\__init__.py", line 63, in compile
return cp._cached_css_compile(pattern, namespaces, custom, flags)
File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 205, in _cached_css_compile
CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 1010, in process_selectors
return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 888, in parse_selectors
sel, m, has_selector, selectors, relations, is_pseudo, index
File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 713, in parse_combinator
index
soupsieve.util.SelectorSyntaxError: The combinator '>' at postion 0, must have a selector before it
line 1:
> b
^
:defined is a selector that is not in the CSS spec, but is mentioned in the HTML5 living spec. It is implemented by most browsers in some form or another. Ultimately it selects non-custom elements, or custom elements that have been registered. Custom elements are those whose tag names contain hyphens. If a tag name contains a :, it is usually counted as having a prefix, and the fact that the name has hyphens is ignored.
We would select all non-custom tags as described above, and since we cannot register custom elements to any registry in BeautifulSoup or Soup Sieve, that will be all.
In XML, this has no meaning: would :defined match nothing? Or maybe everything? Maybe we'll just declare it an HTML selector, as it is browser specific, so it matches nothing in XML.
This should be equivalent to :not([attr=value]). It is not in the CSS specification, but it would be nice.