causal-agent / scraper
HTML parsing and querying with CSS selectors
Home Page: https://docs.rs/scraper
License: ISC License
Is there a way to get all elements with matching inner_html or innerText, or is the only option to iterate through the fragment?
Hello,
The following code cannot find frame (using scraper 0.12.0):
let fragment = Html::parse_fragment("<frameset><frame src='src1'></frameset>");
let selector = Selector::parse("frame").unwrap();
fragment.select(&selector).next().unwrap();
It panics with: thread 'main' panicked at 'called `Option::unwrap()` on a `None` value'
Am I missing something?
Thanks
Dear maintainer,
I just tried to serialize a <br> tag by creating the corresponding Node, appending it to the root node of an Html, and calling root_element().html() on it.
But the result gives me <br></br>.
Is this intended? Or do I need to set some configuration options for the serialization to make this serialize to <br>?
Best regards,
Lewin
Is there a way to access the method prev_sibling_element()?
It seems to be defined inside a private module, so it is not accessible.
I need to traverse HTML where the elements are organized in a long list: pairs of an <h4> with a title and a <div> with content, both of which I need to read.
Suppose I have the following HTML:
<body>
<article id="a">
<div>First</div>
<div>Before<div>Nested</div>After</div>
<div>Last</div>
</article>
<article id="b">
<div>Some other article</div>
</article>
</body>
My goal is to scrape the first article and get its id and the number of direct-child <div>s. Ignoring error handling, I could use something like:
pub fn scrape(doc: ElementRef) -> (String, usize) {
let first_article = doc.select(&Selector::parse("article").unwrap()).next().unwrap();
let id = first_article.value().attr("id").unwrap().to_string();
let content_count = doc.select(&Selector::parse("article:first-of-type > div").unwrap()).count();
(id, content_count) // ("a", 3)
}
There are a few issues with this code. The first is that article:first-of-type is not the same as taking the first article match. In this case it's OK, but in general it's more brittle. The next is that I am traversing more nodes than needed: I already have a reference to the first article and should be able to start the traversal from there with first_article.select.
The problem is that the > direct-child combinator needs a left-hand side. How can I reference the current scope? More precisely, what should be the value of ??? in the following code?
let content_count = first_article.select(&Selector::parse("??? > div").unwrap()).count();
I tried the following values:
- :root (the document root)
- :host (the shadow-element root)
- & (the Sass sigil referring to the current scope)
Is it possible to reference the current scope in the selector, or is this behavior unsupported altogether?
select.rs is able to serialize nodes back into HTML. It does this using the (undocumented, thanks Servo) html5ever::serialize module (node.rs). Scraper should be able to do the same:
- ElementRef::html
- ElementRef::inner_html
- ElementRef::text?
Python's BeautifulSoup library allows matching based on both the tag and attributes. For example, in this HTML fragment:
<a href="/contact" target="_blank">Contact</a>
<a href="/about">About</a>
Using Selector::parse("a").unwrap() as the selector would return both of these elements, but BeautifulSoup's soup.find_all("a", attrs={"target":"_blank"}) would only return the first one. This can probably be handled on the user's end, but a proper API within the crate, if possible, would greatly reduce friction when porting Python scraping code over to Rust.
Both Html and ElementRef support the same method, select. Having that method come from a trait would allow creating a function that accepts both, and would allow calling the method generically, which doesn't seem to be currently possible.
My current workaround is to call root_element() on Html to get its ElementRef, which fortunately works even if it's just an HTML fragment without the <html> tag... although its description didn't seem to indicate so.
The scraper::node::Element struct uses a HashMap to store the list of attributes, which means that attribute order is not preserved and that serializing an Element is not deterministic.
An easy fix would be to use an IndexMap for storing attributes in Elements.
Deterministic serialization is pretty useful for things like unit tests, so this would be a welcome change.
It would be good to get all elements with a certain attribute.
The current version of html5ever uses a since-removed method of the time crate (precise_time_ns):
Removed
v0.1 APIs, previously behind an enabled-by-default feature flag
...
- precise_time_ns
error[E0425]: cannot find function `precise_time_ns` in crate `time`
--> ...../cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.5.4/src/macros.rs:27:26
|
27 | let t0 = ::time::precise_time_ns();
| ^^^^^^^^^^^^^^^ not found in `time`
|
::: ...../.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.5.4/src/tokenizer/mod.rs:230:27
|
230 | let (_, dt) = time!(self.sink.process_token(token));
| ------------------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
Edit: looks like html5ever = "0.25.1" is still currently the latest published release of html5ever.
Still :\
This code panics (has a doctype):
let url = "https://stackoverflow.com/";
let html = reqwest::get(url).unwrap().text().unwrap();
let doc = Html::parse_document(&html);
doc.root_element();
while this does not (does not have doctype):
let url = "https://news.ycombinator.com/";
let html = reqwest::get(url).unwrap().text().unwrap();
let doc = Html::parse_document(&html);
doc.root_element();
After taking a quick look, I think the problem is that root_element() assumes that the first child is always an Element, so it panics when it is a Doctype.
I cannot get the test below to pass:
#[test]
fn test_render_with_parent() {
let test_html_source = r#"<p class='abc'><div class='c1'>c1 result</div></p><div class='c1'>at parent layer</div>"#;
let fragment = Html::parse_fragment(test_html_source);
let p_selector = Selector::parse("p").unwrap();
let c_selector = Selector::parse("div").unwrap();
let p = fragment.select(&p_selector).next().unwrap();
let mut element_count = 0i32;
for element in p.select(&c_selector) {
assert_eq!("div", element.value().name());
element_count += 1;
}
assert!(element_count > 0);
}
Could you help?
Currently, it doesn't appear to be possible to configure html5ever's parser. This is mostly okay, but if you want to override scripting_enabled or other such settings, you can't right now!
Hi there
For a small personal project I'm working on, I'd like to extract the recipe from this recipe website using scraper.
When I try to parse the website, however, I get a "not yet implemented" error in what I think is the get_template_contents function.
Is this error expected? I couldn't find anything obvious in the documentation about missing features, or that this kind of error might come up. I'm not sure whether there is something I am doing wrong, or something I can do to avoid this error.
If there's anything I can do to test further or help out to fix it, please let me know. I'm keen to help.
Thanks in advance!
A minimal way to reproduce the error is:
extern crate reqwest;
extern crate scraper;
use scraper::Html;
fn main() {
let url_text = reqwest::get("https://www.foodnetwork.com/recipes/alton-brown/peanut-brittle-recipe-1914388").unwrap().text().unwrap();
let _doc = Html::parse_document(url_text.as_str());
}
The backtrace when running RUST_BACKTRACE=1 cargo run is:
######@########:~/code/test_select$ RUST_BACKTRACE=1 cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.25s
Running `target/debug/test_select`
thread 'main' panicked at 'not yet implemented', /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/scraper-0.6.0/src/html/tree_sink.rs:186:9
stack backtrace:
0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
1: std::sys_common::backtrace::print
at libstd/sys_common/backtrace.rs:71
at libstd/sys_common/backtrace.rs:59
2: std::panicking::default_hook::{{closure}}
at libstd/panicking.rs:211
3: std::panicking::default_hook
at libstd/panicking.rs:227
4: std::panicking::rust_panic_with_hook
at libstd/panicking.rs:511
5: std::panicking::begin_panic
at /checkout/src/libstd/panicking.rs:445
6: scraper::html::tree_sink::<impl markup5ever::interface::tree_builder::TreeSink for scraper::html::Html>::get_template_contents
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/scraper-0.6.0/src/html/tree_sink.rs:186
7: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::appropriate_place_for_insertion
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:373
8: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::insert_element
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:1177
9: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::insert_element_for
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:1210
10: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::step
at ./target/debug/build/html5ever-3104504867a7d440/out/rules.rs:353
11: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::process_to_completion
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:312
12: <html5ever::tree_builder::TreeBuilder<Handle, Sink> as html5ever::tokenizer::interface::TokenSink>::process_token
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:474
13: <html5ever::tokenizer::Tokenizer<Sink>>::process_token
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:232
14: <html5ever::tokenizer::Tokenizer<Sink>>::emit_current_tag
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:425
15: <html5ever::tokenizer::Tokenizer<Sink>>::step
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:628
16: <html5ever::tokenizer::Tokenizer<Sink>>::run
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:361
17: <html5ever::tokenizer::Tokenizer<Sink>>::feed
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:219
18: <html5ever::driver::Parser<Sink> as tendril::stream::TendrilSink<tendril::fmt::UTF8>>::process
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/driver.rs:88
19: tendril::stream::TendrilSink::one
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/tendril-0.4.0/src/stream.rs:47
20: scraper::html::Html::parse_document
at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/scraper-0.6.0/src/html/mod.rs:55
21: test_select::main
at src/main.rs:8
22: std::rt::lang_start::{{closure}}
at /checkout/src/libstd/rt.rs:74
23: std::panicking::try::do_call
at libstd/rt.rs:59
at libstd/panicking.rs:310
24: __rust_maybe_catch_panic
at libpanic_unwind/lib.rs:105
25: std::rt::lang_start_internal
at libstd/panicking.rs:289
at libstd/panic.rs:374
at libstd/rt.rs:58
26: std::rt::lang_start
at /checkout/src/libstd/rt.rs:74
27: main
28: __libc_start_main
29: _start
document.select with Selector::parse is not working when there's a newline directly after the tag.
Code:
let a_sel = scraper::Selector::parse("a").unwrap();
for el in document.select(&a_sel) {
//...
}
HTML example that triggers this:
<a
href="...")"
When printing these affected elements:
Element(<a\n href="\\\"/...
Other elements in the query, of the form Element(<a href="\\\"/..., don't trigger this problem. Happy for a workaround in the meantime.
I am trying to parse a large number of HTML documents, and I have noticed that the parsing took most of the time, around 97% of the program. Is there any way to speed up the parsing process?
To give you a perspective, the average parsing time is around 9ms per document.
use scraper::Html;
use serde_json::Value;
use std::collections::HashMap;
use std::fs;
use std::time::Instant;

fn main() {
let now = Instant::now();
// I have 10_000 HTML documents
let paths = fs::read_dir("../data").unwrap();
let mut reports: Vec<HashMap<String, Value>> = Vec::new();
for path in paths {
let data = fs::read_to_string(path.unwrap().path()).expect("Unable to read file");
// This line took 97% of the running time
let document = Html::parse_document(&data);
}
println!("The program took {}", now.elapsed().as_secs());
}
As an example, the following code works correctly.
use scraper::{Html, Selector};
fn main() {
let html = r#"
<ul>
<li>Foo</li>
<li>Bar</li>
<li>Baz</li>
</ul>
"#;
let fragment = Html::parse_fragment(html);
let selector = Selector::parse("ul").unwrap();
assert!(fragment.select(&selector).next().is_some());
}
Although the following code is almost the same as the one above, it doesn't work.
use scraper::{Html, Selector};
fn main() {
let html = r#"
<tr>
<li>Foo</li>
<li>Bar</li>
<li>Baz</li>
</tr>
"#;
let fragment = Html::parse_fragment(html);
let selector = Selector::parse("tr").unwrap();
assert!(fragment.select(&selector).next().is_some());
}
LocalName::new called for something that isn't in https://github.com/servo/html5ever/blob/master/markup5ever/local_names.txt locks a global Mutex. As LocalName::new is called for every class in Element::new, and most classes are unlikely to be in that list, this means that multithreading is much less of a win for HTML parsing than it should be. While it's a breaking change and probably not the best approach, I've switched to using Strings locally, which gives me about a 10% performance improvement. My program's performance is still dominated by HashSet construction, though. It might be faster to intern the entire HashSet<String>s per-document.
Some Results returned by scraper functions use error types from the cssparser crate.
Unfortunately, in my project I cannot handle them (e.g. using thiserror) without adding cssparser to my dependencies as well, even though I don't need it anywhere else.
Please consider re-exporting cssparser, or alternatively define error types for the scraper crate and export those.
Let's define a common error struct for scraper so that crates using it can integrate it in their error handling.
For a project I need to modify the URL of an href. Maybe we could add a version of attr which returns a mutable reference to the content of the HashMap.
When we have a few tags that need to be removed before selecting a tag, for example:
use scraper::{Html, Selector};

fn main() {
let selector = Selector::parse("body").unwrap();
let html = r#"
<!DOCTYPE html>
<body>
Hello World
<script type="application/json" data-selector="settings-json">
{"test":"json"}
</script>
</body>
"#;
let document = Html::parse_document(html);
let body = document.select(&selector).next().unwrap();
let text = body.text().collect::<Vec<_>>();
println!("{:?}", text);
}
Output
["\n Hello World\n ", "\n {\"test\":\"json\"}\n ", "\n \n"]
The output includes the value from the script tag. Is there any way we can remove it?
Please add the ability to get an ElementRef of an ElementRef's parent (and any ancestor/sibling), so that one can call .html() (and other ElementRef methods) on the parent :)
I find it helps a lot in quickly judging how a library is intended to be used if the author presents a simple example in the README.
I noticed that the library is missing the ability to edit documents. This feature would be very useful, as we could remove ads and unwanted elements from a document before parsing the data.
Are there plans to add this feature?
Such an error has occurred.
rs:31:68: 31:79 error: mismatched types:
expected `&mut cssparser::parser::Parser<'_, '_>`,
found `&mut cssparser::parser::Parser<'_, '_>`
(expected struct `cssparser::parser::Parser`,
found a different struct `cssparser::parser::Parser`) [E0308]
Hey,
Because it is a common task to create the innerHTML representation, I would recommend a copy-and-paste example for users like me (and #3). It takes time to understand NodeRef, ElementRef, Node and Element. Since I didn't want to do anything other than getting the HTML, a simple example would help.
My try (optimizable: buf could hold &str instead of String; how do I convert scraper::node::Text into a String?):
use scraper::{Html, Selector};
use scraper::element_ref::ElementRef;
fn get_html(elem : &ElementRef) -> String {
let mut buf = Vec::<String>::with_capacity(1000);
get_html_rec(&elem, &mut buf);
let mut size = 0;
for s in buf.iter() {
size += s.len();
}
let mut mstr = String::with_capacity(size);
for s in buf.iter() {
mstr.push_str(&s);
}
mstr
}
fn get_html_rec(elem : &ElementRef, mut buf : &mut Vec<String>) {
buf.push(format!("{:?}", elem.value()));
for c in elem.children() {
// c : NodeRef<Node>
let n = c.value();
if n.is_document() {
panic!("Unimplemented");
} else if n.is_fragment() {
panic!("Unimplemented");
} else if n.is_doctype() {
panic!("Unimplemented");
} else if n.is_comment() {
} else if n.is_text() {
let t = n.as_text().unwrap();
buf.push(t.to_string()); // Text derefs to str
} else if n.is_element() {
//let e = c.as_element().unwrap();
let e = ElementRef::wrap(c).unwrap();
get_html_rec(&e, &mut buf);
}
}
buf.push(format!("</{}>", elem.value().name()));
}
async fn get_links() -> impl std::iter::Iterator {
let res = reqwest::get("http://site").await.unwrap();
let doc = Html::parse_document(&res.text().await.unwrap());
let sel: Selector = Selector::parse("a").expect("can't parse selector");
doc.select(&sel)
// ---- `sel` is borrowed here
// --- `doc` is borrowed here
}
Maybe I'm doing something wrong, or does something need to be added to the library to fix the problem?
It looks like only Html can own a document or a part of it. I need several separate HTML fragments. How do I create an Html from an ElementRef? Is this the only way?
let post_body = Selector::parse(".post_body").ok()?;
Html::parse_fragment(&element.select(&post_body).next()?.inner_html())
I'm writing a robot to fetch cn.etherscan.com's token data.
On their site, the transfers section has the content 939,005, while the following code gives me something different:
let transfers_selector = Selector::parse(
".card .card-body #ContentPlaceHolder1_trNoOfTxns #totaltxns",
)
.unwrap();
if let Some(overview) =
fragment.select(&overview_selector).next()
{
dbg!(&overview
.select(&transfers_selector)
.next()
.unwrap()
.html());
}
Right now, I can use attributes in the query string, but when I locate the node that I want, I have no way to read its attributes or iterate over them. Not even the HTML node ID can be read.
Or am I missing something?
(BTW thanks for the great work!)
I think we need a trait, with a name like ElemRef, like this:
trait ElemRef<'a> {
fn value(&self) -> &'a Element;
fn select<'b>(&self, selector: &'b Selector) -> Select<'a, 'b>;
fn html(&self) -> String;
fn inner_html(&self) -> String;
fn text(&self) -> Text<'a>;
}
And we can impl it for Html.
There are some reasons:
- Html and ElementRef both have the method fn select<'b>(&self, selector: &'b Selector) -> Select<'a, 'b>, but neither of them comes from a trait. Sometimes I only want to select, but cannot write code like fn get_elem_ref<T: ElemRef>(select: T, selector: &str) that works for either an Html document or a fragment.
- We could also impl AsRef<ElemRef> for Html.
Assuming I'm using the HTML from here: https://en.wiktionary.org/wiki/pes#Czech, loaded as a string into the variable res, this code panics even though the h2 definitely has siblings (#Czech selects a span, and I get its parent h2 successfully, but the h2 apparently has no siblings, which is wrong):
let doc = Html::parse_document(&res);
let h2_selector = Selector::parse("#Czech").unwrap();
let h2 = doc.select(&h2_selector).next().unwrap().parent().unwrap();
println!("{}", scraper::ElementRef::wrap(h2).unwrap().html());
let mut element = h2.next_sibling().unwrap();
scraper::ElementRef::wrap(element).unwrap().html(); // error here even though h2 should have a next_sibling
Environment:
Problem:
When calling rt.spawn(my_job::run(...)); I receive these compile errors:
generator cannot be sent between threads safely
within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for Cell<NonZeroUsize>
rustc: mod.rs(381, 25): required by a bound in Runtime::spawn
generator cannot be sent between threads safely
within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for UnsafeCell<tendril::tendril::Buffer>
rustc: mod.rs(381, 25): required by a bound in Runtime::spawn
generator cannot be sent between threads safely
within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for *mut tendril::fmt::UTF8
rustc: mod.rs(381, 25): required by a bound in Runtime::spawn
generator cannot be sent between threads safely
within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for Cell<tendril::tendril::PackedUsize>
rustc: mod.rs(381, 25): required by a bound in Runtime::spawn
UPDATE: The error only appears when I run this code within my_job.rs:
download::run(...).await.unwrap();
If I remove that line everything compiles. Can someone explain why?
Code:
main.rs
use tokio::runtime::Runtime;
...
let mut rt = Runtime::new().unwrap();
rt.spawn(my_job::run(...))
my_job.rs
let title_selector = scraper::Selector::parse("title").unwrap();
let title = document.select(&title_selector).next().unwrap().inner_html();
download::run("", "./Downloads", "").await.unwrap();
Question:
It appears scraper might not be thread-safe, according to the Rust compiler. Are there any workarounds so I can still use the scraper crate?
I haven't been actively using or developing this project for quite a while now and haven't had time to respond to new issues and PRs. I would like to find someone more qualified than me to continue maintenance and/or development of this project which seems to be quite popular. If anyone would like to help out, please comment here or send me an email.
I haven't found any function like this in the documentation. When dealing with <br> elements, it would be really handy if there were a way to just get the innerText, or at least the innerHTML, of a NodeRef.
Through the methods on ElementRef it's possible to find siblings and descendant nodes using the Deref<Target = NodeRef>; however, NodeRef doesn't have some of the convenience methods that ElementRef has, like inner_html(). Is it possible to add the ability to return an ElementRef, instead of a NodeRef, when searching for nodes related to the currently selected ElementRef node?
The intermediate parsing of the Selector itself is a bit annoying. I think it would be neat if it were part of select().
Current:
let fragment = Html::parse_fragment(html);
let selector = Selector::parse("li").unwrap();
for element in fragment.select(&selector) {}
Proposed a):
let fragment = Html::parse_fragment(html);
for element in fragment.select("li".into()?) {}
Proposed b):
let fragment = Html::parse_fragment(html);
for element in fragment.select("li")? {}
I would like to know what changed from version 0.11 to 0.12 :)
Inspired by this blog post, which mentions some API struggles with writing a Rust backend program. One of the things mentioned is that Selector::parse("...").unwrap() is not ergonomic when the selectors are constants and you are writing a lot of them.
How about a selector!() macro to define a selector from a string, which would remove the boilerplate for this case? There is precedent for this in the form of vec![] provided by the language and e.g. the regex!() macro provided by the regex crate.
I want to implement a function similar to the following:
fn count_div(html: &str) -> anyhow::Result<i32> {
let fragment = Html::parse_fragment(html).root_element();
let selector = Selector::parse("div")?;
let mut count = 0;
for e in fragment.children() {
count += selector.matches(&e);
}
return Ok(count)
}
I think ElementRef is just a wrapper around NodeRef, so is it possible to add such a conversion to support matches?
error[E0308]: mismatched types
--> src\parser\mod.rs:110:26
|
110 | selector.matches(&e);
| ^^ expected struct `scraper::ElementRef`, found struct `ego_tree::NodeRef`
|
= note: expected reference `&scraper::ElementRef<'_>`
found reference `&ego_tree::NodeRef<'_, scraper::Node>`
I'm trying to parse the price information from an amazon page:
https://www.amazon.de/Lenovo-Chromebook-1366x768-Ultraslim-UHD-Grafik/dp/B08CZWFCH7?ref_=Oct_DLandingS_D_4cba4dd4_61&smid=A3JWKAKR8XB7XF
There seems to be a <span> with the class priceBlockBuyingPriceString.
So I tried:
let html_text = get_html("https://www.amazon.de/Lenovo-Chromebook-1366x768-Ultraslim-UHD-Grafik/dp/B08CZWFCH7?ref_=Oct_DLandingS_D_4cba4dd4_61&smid=A3JWKAKR8XB7XF")
.await
.with_context(|| format!("Failed to parse {}", url)).ok()?
.text()
.await
.with_context(|| "Failed read html text").ok()?;
let document = Html::parse_document(&html_text);
let selector_element = Selector::parse("span.priceBlockBuyingPriceString").unwrap();
let price_field = document.select(&selector_element).next()?;
println!("Element {:?}", price_field.text());
But the result is not the price I expected; it isn't the value of the expected field:
I think I'm using the crate the wrong way...
Here a part of the result:
T
I figure I can't use this library to parse a .css file, right?
What is the recommended way to parse and modify a CSS file and the selectors inside it?
For ex, if I have a .css file with this inside of it,
.bc-ui2 {
background: url('/img/[email protected]') no-repeat;
background-size: 100px 200px;
}
I want to be able to query for the class 'bc-ui2' and get its properties.
Just discovered this crate today, really nice!
One question though: is it possible to combine selectors?
I wrote some code today to extract content from a WordPress page. First I selected the <article> tag, then all <p>, which basically does the trick. Yet titles and possibly images are missing.
So I was wondering if it is somehow possible to create a selector which selects p | h1 | h2 and so on?
Or if that isn't possible, what would be the recommended approach?
I'm currently working on a project that requires compilation to WASM, so I would like to know if it is supported.
Hi,
Is it possible to fetch pages that are rendered dynamically, like Angular or React? In my case I want to wait until the page loads completely, but one table is loaded dynamically after receiving the result of an API call.
Best Regards
This selector doesn't match the intended element (using Html::select):
"body > div.container.container-main > div.row:nth-child(2) > div.col-md-10 > a"
But this one matches:
"div.container.container-main > div.row:nth-child(2) > div.col-md-10 > a"
The only difference is that body > was removed.
It should also match with body >, because this div.container.container-main node is a direct child of the body element:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div id="bggrad"></div>
<div class="container container-header"></div>
<div class="container container-main">
<nav class="navbar navbar-default navbar-static-top"></nav>
<div class="row">
<div class="col-xs-12"></div>
<div class="col-xs-12"></div>
<div class="col-md-10">
<a href="#">foo</a>
</div>
</div>
</div>
</body>
</html>
Greetings!
First of all, I hope that you'll find a maintainer, because the library is awesome.
Unfortunately I can't use all possible attribute selectors:
fn main() {
use scraper::{Html, Selector};
let doc = Html::parse_fragment(r#"<a href="https://google.com""#);
let sel = Selector::parse(r#"[href=~"goo"]"#).unwrap();
println!("{:?}", doc.select(&sel).next());
}
The error is ParseError { kind: Custom(BadValueInAttr(Delim('~'))), location: SourceLocation { line: 0, column: 7 } }. And the same error appears with [href=*"goo"].
If I have this html:
<div id="content">
<div class="navigation">
<div class="navigation">
<span class="pages">Page 1 of 15</span>
<span aria-current='page' class='page-numbers current'>1</span>
<a class='page-numbers' href='http://example.com/page/2'>2</a>
<a class='page-numbers' href='http://example.com/page/3'>3</a>
<a class='page-numbers' href='http://example.com/page/4'>4</a>
<span class="page-numbers dots">…</span>
<a class='page-numbers' href='http://example.com/page/15'>15</a>
<a class="next page-numbers" href="http://example.com/page/2">»</a>
</div>
</div>
</div>
Then:
- #content > div.navigation matches the INNER <div class="navigation">, NOT the outer one! It should match the outer one, because > means "direct child".
- #content > div.navigation > div matches NOTHING (same as #content > div.navigation > div.navigation)! It should match the inner navigation div.
This seems to be a bug!
(Maybe related to #41 ?)
I'm trying to get all p tags within a website and get all unique direct-parent DOM elements. For example:
let selector = Selector::parse("p").unwrap();
let paragraphs = dom.select(&selector).map(|p| p.parent().unwrap());
// TODO: get all unique parent nodes for the paragraphs
This is my best shot yet, but I'm still not there. Do you have any idea about how I can achieve this?