scraper's People

Contributors

adamreichold, apiraino, arctic-penguin, causal-agent, cfvescovo, dependabot[bot], edjopato, g2p, gitter-badger, hexilee, j-mendez, jaboatman, kiwifuit, mlang, mohe2015, nasso, nuxeh, paolobarbolini, spegoraro, teymour-aldridge, tonalidadehidrica, unneon, volsa, xmakro, xzfc, yoursvivek


scraper's Issues

Selector doesn't work with newline after

document.select with Selector::parse is not working when there's a newline directly after the tag.

Code:

let a_sel = scraper::Selector::parse("a").unwrap();
for el in document.select(&a_sel) {
    //...
}

HTML example that triggers this:

<a
                            href="...")"

When printing these affected elements:

Element(<a\n href="\\\"/...

Other elements in the query that are of the form Element(<a href="\\\"/... don't trigger this problem. I'd be happy for a workaround in the meantime.

How do I remove certain tags/nodes before selecting text?

When there are a few tags that need to be removed before selecting a tag, for example:

fn main() {
    let selector = Selector::parse("body").unwrap();
    let html = r#"
    <!DOCTYPE html>
    <body>
    Hello World
    <script type="application/json" data-selector="settings-json">
    {"test":"json"}
    </script>
    </body>
"#;
    let document = Html::parse_document(html);
    let body = document.select(&selector).next().unwrap();
    let text = body.text().collect::<Vec<_>>();
    println!("{:?}", text);
}

Output

["\n Hello World\n ", "\n {\"test\":\"json\"}\n ", "\n \n"]

The output contains the value from the script tag. Is there any way to remove it?

Fetch Dynamic Pages

Hi,

Is it possible to fetch pages that are rendered dynamically, like Angular or React apps? In my case I want to wait until the page loads completely, but one table is loaded dynamically after receiving the result of an API call.

Best Regards

Self-closing elements are not rendered correctly

Dear maintainer,

I just tried to serialize a <br> tag by creating the corresponding Node, appending it to a root node of an Html, and calling root_element().html() on it.

But the result gives me <br></br>.

Is this intended? Or do I need to set some configuration options for the serialization to make this serialize to <br>?

Best regards,
Lewin

method select should be implemented as a trait

Both Html and ElementRef support the same method select. Having that method defined on a trait would allow creating a function that accepts both, and would allow calling the method generically, which doesn't seem to be currently possible.

My current workaround is to call root_element() on Html to get its ElementRef, which fortunately works even if it's just an HTML fragment without the <html> tag... although its documentation didn't seem to indicate so.

HashSet<LocalName> for classes is slow

LocalName::new called for something that isn't in https://github.com/servo/html5ever/blob/master/markup5ever/local_names.txt locks a global Mutex. As LocalName::new is called for every class in Element::new, and most classes are unlikely to be in that list, this means that multithreading is much less of a win for HTML parsing than it should be. While it's a breaking change and probably not the best approach, I've switched to using Strings locally, which gives me about a 10% performance improvement. My program's performance is still dominated by HashSet construction, though. It might be faster to intern the entire HashSet<String>s per-document.

Consider exporting common errors thrown by this crate

Some Results returned by scraper functions use error types from the cssparser crate.
Unfortunately in my project, I cannot handle them (e.g. using thiserror) without adding cssparser to my dependencies as well, even though I don't need it anywhere else.

Please consider reexporting cssparser, or alternatively define error types for the scraper crate and export those.

Let's define a common error struct for scraper so that crates using it can integrate it in their error handling.

Support to build ElementRef from NodeRef

I want to implement a function similar to the following:

fn count_div(html: &str) -> anyhow::Result<i32> {
    let fragment = Html::parse_fragment(html).root_element();
    let selector = Selector::parse("div")?;
    let mut count = 0;
    for e in fragment.children() {
        count += selector.matches(&e);
    }
    return Ok(count)
}

I think ElementRef is just a wrapper around NodeRef, so is it possible to add such a conversion to support matches?

error[E0308]: mismatched types
   --> src\parser\mod.rs:110:26
    |
110 |         selector.matches(&e);
    |                          ^^ expected struct `scraper::ElementRef`, found struct `ego_tree::NodeRef`
    |
    = note: expected reference `&scraper::ElementRef<'_>`
               found reference `&ego_tree::NodeRef<'_, scraper::Node>`

Consider adding support for reading attributes from ElementRef

Right now, I can use attributes in the query string, but once I locate the node I want, I have no way to read its attributes or iterate over them. Not even the HTML node id can be read.
Or am I missing something?

(BTW thanks for the great work!)

Descendant of body selector doesn't match

This selector doesn't match the intended element (using Html::select):

"body > div.container.container-main > div.row:nth-child(2) > div.col-md-10 > a"

But this one matches:

"div.container.container-main > div.row:nth-child(2) > div.col-md-10 > a"

The only difference is that body > was removed.
It should also match with body > because this div.container.container-main node is a direct descendant of the body element:

<!DOCTYPE html>
<html lang="en">
    <head></head>

    <body>
        <div id="bggrad"></div>
        <div class="container container-header"></div>
        <div class="container container-main">
            <nav class="navbar navbar-default navbar-static-top"></nav>
            <div class="row">
                <div class="col-xs-12"></div>
                <div class="col-xs-12"></div>
                <div class="col-md-10">
                    <a href="#">foo</a>
                </div>
            </div>
        </div>
    </body>
</html>

Add a mutable version of attr

For a project I need to modify the URL of a href attribute. Maybe we could add a version of attr which returns a mutable reference to the content of the underlying HashMap.

Parsing a stylesheet

I figure I can't use this library to parse a .css file, right?
What is the recommended way to parse and modify a CSS file and the selectors inside it?

For ex, if I have a .css file with this inside of it,

.bc-ui2 {
    background: url('/img/[email protected]') no-repeat;
    background-size: 100px 200px;
}

I want to be able to query for the class 'bc-ui2' and get its properties.

Allow configuration of html5ever::driver::ParseOpts

Currently, it doesn’t appear to be possible to configure html5ever’s parser.

This is mostly okay but if you want to override scripting_enabled or other such settings you can’t right now!

Question: how can I use prev_sibling_element?

Is there a way to access method prev_sibling_element()?
It seems to be defined inside a private module so it is not accessible.

I need to traverse HTML where the elements are organized in a long list, pairs of <h4> with title and <div> with content, both of which I need to read.

Performance issues

I am trying to parse a large number of HTML documents, and I have noticed that the parsing took most of the time, around 97% of the program. Is there any way to speed up the parsing process?

To give you a perspective, the average parsing time is around 9ms per document.

Code example

fn main() {
    let now = Instant::now();
    // I have 10_000 HTML documents
    let paths = fs::read_dir("../data").unwrap();
    let mut reports: Vec<HashMap<String, Value>> = Vec::new();
    for path in paths {
        let data = fs::read_to_string(path.unwrap().path()).expect("Unable to read file");
        // This line took 97% of the running time
        let document = Html::parse_document(&data);
    }
    println!("The program took {}", now.elapsed().as_secs());
}

Q: Combine selectors

Just discovered this crate today, really nice!
One question though: Is it possible to combine selectors?

I wrote some code today to extract content from a WordPress page. First I selected the <article> tag, then all <p> tags, which basically does the trick. Yet titles and possibly images are missing.

So I was wondering, if it was somehow possible to create a selector, which selects p | h1 | h2 and so on?

Or if that isn't possible, what would be the recommended approach?

selector accidentally edited html

I'm writing a robot to fetch cn.etherscan.com's token data.

On their site the transfers section has content: 939,005


while using the following code it gives me something different:

    let transfers_selector = Selector::parse(
        ".card .card-body #ContentPlaceHolder1_trNoOfTxns #totaltxns",
    )
    .unwrap();

    if let Some(overview) =
        fragment.select(&overview_selector).next()
    {
        dbg!(&overview
            .select(&transfers_selector)
            .next()
            .unwrap()
            .html());
    }


Cannot parse "frame"

Hello,

The following code cannot find the frame using scraper 0.12.0:

let fragment = Html::parse_fragment("<frameset><frame src='src1'></frameset>");
let selector = Selector::parse("frame").unwrap();
fragment.select(&selector).next().unwrap();

thread 'main' panicked at 'called Option::unwrap() on a None value'

Am I missing something?

Thanks

rust 1.6 compile error

Such an error has occurred.

rs:31:68: 31:79 error: mismatched types:
 expected `&mut cssparser::parser::Parser<'_, '_>`,
    found `&mut cssparser::parser::Parser<'_, '_>`
(expected struct `cssparser::parser::Parser`,
    found a different struct `cssparser::parser::Parser`) [E0308]

A way to get innerHTML or innerText?

I haven't found any function like this in the Documentation. When dealing with <br> elements, it would be really handy if there was a way to just get the innerText or at least innerHTML of a NodeRef.

A way to create an Html fragment from an ElementRef?

It looks like only Html can own the document or part of it. I need several separate HTML fragments. How do I create an Html from an ElementRef? Is this the only way?

let post_body = Selector::parse(".post_body").ok()?;
Html::parse_fragment(&element.select(&post_body).next()?.inner_html())

Example

Hey

Because it is a common task to create the innerHTML representation, I would recommend a copy-and-paste example for users like me and #3. It takes time to understand NodeRef, ElementRef, Node and Element. Since I didn't want to do anything other than get the HTML, a simple example would help.

My attempt, which could be optimized (buf could hold &str instead of String; how do I convert scraper::node::Text into a String?):

use scraper::{Html, Selector};
use scraper::element_ref::ElementRef;

fn get_html(elem : &ElementRef) -> String {
    let mut buf = Vec::<String>::with_capacity(1000);
    get_html_rec(&elem, &mut buf);

    let mut size = 0;
    for s in buf.iter() {
        size += s.len();
    }
    let mut mstr = String::with_capacity(size);
    for s in buf.iter() {
        mstr.push_str(&s);
    }
    mstr
}

fn get_html_rec(elem : &ElementRef, mut buf : &mut Vec<String>) {

    buf.push(format!("{:?}", elem.value()));

    for c in elem.children() {
        // c : NodeRef<Node>
        let n = c.value();
        if n.is_document() {
            panic!("Unimplemented");
        } else if n.is_fragment() {
            panic!("Unimplemented");
        } else if n.is_doctype() {
            panic!("Unimplemented");
        } else if n.is_comment() {

        } else if n.is_text() {
            let t = n.as_text().unwrap();
            buf.push(String::from_utf8_lossy(&t.as_bytes()).into_owned());
        } else if n.is_element() {
            //let e = c.as_element().unwrap();
            let e = ElementRef::wrap(c).unwrap();
            get_html_rec(&e, &mut buf);
        }
    }
    buf.push(format!("</{}>", elem.value().name()));
}

Failed to parse descendants

The test below does not pass:

    #[test]
    fn test_render_with_parent() {
        let test_html_source = r#"<p class='abc'><div class='c1'>c1 result</div></p><div class='c1'>at parent layer</div>"#;       

        let fragment = Html::parse_fragment(test_html_source);
        let p_selector = Selector::parse("p").unwrap();
        let c_selector = Selector::parse("div").unwrap();

        let p = fragment.select(&p_selector).next().unwrap();
        let mut element_count = 0i32;
        for element in p.select(&c_selector) {
            assert_eq!("div", element.value().name());
            element_count += 1;
        }
        assert!(element_count > 0);
    }

Could you help?

Some methods of ElementRef should be implemented for Html

I think we need a trait, perhaps named ElemRef, like this:

trait ElemRef<'a> {
    fn value(&self) -> &'a Element;
    fn select<'b>(&self, selector: &'b Selector) -> Select<'a, 'b>;
    fn html(&self) -> String;
    fn inner_html(&self) -> String;
    fn text(&self) -> Text<'a>;
}

And we can impl it for Html.
There are some reasons:

  • we should regard a html document or fragment as an element or a potential element
  • both Html and ElementRef have the method fn select<'b>(&self, selector: &'b Selector) -> Select<'a, 'b>, but neither of them comes from a trait. Sometimes I only want to select but cannot write code like fn get_elem_ref<T: ElemRef>(select: T, selector: &str).
  • I cannot get attr/innerText/html directly from an Html document or fragment.

We can also impl as_ref<ElemRef> for Html

Element which should have siblings returns None on wrapping result of next_sibling

assuming I'm using the HTML from here: https://en.wiktionary.org/wiki/pes#Czech and I have it loaded as a string into the variable res, this code panics even though the h2 definitely has siblings (#Czech selects a span, and I get its parent h2 successfully, but the h2 apparently has no siblings, which is wrong):

let doc = Html::parse_document(&res);

let h2_selector = Selector::parse("#Czech").unwrap();

let h2 = doc.select(&h2_selector).next().unwrap().parent().unwrap();

println!("{}", scraper::ElementRef::wrap(h2).unwrap().html());

let element = h2.next_sibling().unwrap();

scraper::ElementRef::wrap(element).unwrap().html(); // error here even though h2 should have a next_sibling

Html::root_element panics if the document has a Doctype node

This code panics (has a doctype):

let url = "https://stackoverflow.com/";
let html = reqwest::get(url).unwrap().text().unwrap();
let doc = Html::parse_document(&html);
doc.root_element();

while this does not (does not have doctype):

let url = "https://news.ycombinator.com/";
let html = reqwest::get(url).unwrap().text().unwrap();
let doc = Html::parse_document(&html);
doc.root_element();

After taking a quick look, I think the problem is that root_element() assumes that the first child node is always an Element, so it panics when it is a Doctype.

Anchor selector to current element ref

Suppose I have the following HTML:

<body>
<article id="a">
  <div>First</div>
  <div>Before<div>Nested</div>After</div>
  <div>Last</div>
</article>
<article id="b">
  <div>Some other article</div>
</article>
</body>

My goal is to scrape the first article and get its id and number of direct-child <div>s. Ignoring error handling, I could use something like:

pub fn scrape(doc: ElementRef) -> (String, usize) {
  let first_article = doc.select(&Selector::parse("article").unwrap()).next().unwrap();
  let id = first_article.value().attr("id").unwrap().to_string();

  let content_count = doc.select(&Selector::parse("article:first-of-type > div").unwrap()).count();

  (id, content_count) // ("a", 3)
}

There are a few issues with this code. The first one is that article:first-of-type is not the same as getting the first article match. In this case it's OK, but in general it's more brittle. The next one is that I am traversing more nodes than needed: I already have a reference to the first article and should be able to start the traversal from there with first_article.select.

The problem is that the > direct-child operator needs a left-hand side. How can I reference the current scope? More precisely, what should be the value of ??? in the following code?

  let content_count = first_article.select(&Selector::parse("??? > div").unwrap()).count();

I tried to use the following values:

  • :root: Document root
  • :host: Shadow-element root
  • &: Sass sigil referring to the current scope

Is it possible to reference the current scope in the selector or is this behavior unsupported at all?

Selector cannot capture <tr></tr>.

As the example, the following code works correctly.

use scraper::{Html, Selector};
fn main() {
    let html = r#"
    <ul>
        <li>Foo</li>
        <li>Bar</li>
        <li>Baz</li>
    </ul>
"#;

    let fragment = Html::parse_fragment(html);
    let selector = Selector::parse("ul").unwrap();
    assert!(fragment.select(&selector).next().is_some());
}

Although the following code is almost the same as the one above, it doesn't work.

use scraper::{Html, Selector};
fn main() {
    let html = r#"
    <tr>
        <li>Foo</li>
        <li>Bar</li>
        <li>Baz</li>
    </tr>
"#;

    let fragment = Html::parse_fragment(html);
    let selector = Selector::parse("tr").unwrap();
    assert!(fragment.select(&selector).next().is_some());
}

API idea: selector!() macro for Selector literals

Inspired by this blog post which mentions some API struggles with writing a Rust backend program, and one of the mentioned things is that Selector::parse('...').unwrap() is not ergonomic when the selectors are constants and you are writing a lot of them.

How about a selector!() macro to define a selector from a string, which removes the boilerplate for this case? There is precedent for this in the form of vec![] provided by the language and e.g. the regex!() macro provided by the regex crate.

Put an example in the README

I find it helps a lot to quickly judge a library on how it's intended to be used if the author presents a simple example on the README.

unimplemented "get_template_contents" function

Hi there

For a small personal project I'm working on I'd like to extract the recipe on this recipe website using scraper.
When I try to parse the website however, I get a "not yet implemented" error on, I think, the get_template_contents function.

Is this error expected? I couldn't find anything obvious in the documentation about features missing, or that this kind of error might come up. I'm not sure whether there is something I am doing wrong, or something I can do to avoid this error.

If there's anything I can do to test further or help out to fix it, please let me know. I'm keen to help.

Thanks in advance!

A minimal way to reproduce the error is:

extern crate reqwest;
extern crate scraper;

use scraper::Html;

fn main() {
    let url_text = reqwest::get("https://www.foodnetwork.com/recipes/alton-brown/peanut-brittle-recipe-1914388").unwrap().text().unwrap();          
    let _doc = Html::parse_document(url_text.as_str());
}

Backtrace when running RUST_BACKTRACE=1 cargo run is:

######@########:~/code/test_select$ RUST_BACKTRACE=1 cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `target/debug/test_select`
thread 'main' panicked at 'not yet implemented', /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/scraper-0.6.0/src/html/tree_sink.rs:186:9
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
	     at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
	     at libstd/sys_common/backtrace.rs:71
	     at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
	     at libstd/panicking.rs:211
   3: std::panicking::default_hook
	     at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
	     at libstd/panicking.rs:511
   5: std::panicking::begin_panic
	     at /checkout/src/libstd/panicking.rs:445
   6: scraper::html::tree_sink::<impl markup5ever::interface::tree_builder::TreeSink for scraper::html::Html>::get_template_contents
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/scraper-0.6.0/src/html/tree_sink.rs:186
   7: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::appropriate_place_for_insertion
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:373
   8: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::insert_element
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:1177
   9: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::insert_element_for
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:1210
  10: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::step
	     at ./target/debug/build/html5ever-3104504867a7d440/out/rules.rs:353
  11: <html5ever::tree_builder::TreeBuilder<Handle, Sink>>::process_to_completion
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:312
  12: <html5ever::tree_builder::TreeBuilder<Handle, Sink> as html5ever::tokenizer::interface::TokenSink>::process_token
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tree_builder/mod.rs:474
  13: <html5ever::tokenizer::Tokenizer<Sink>>::process_token
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:232
  14: <html5ever::tokenizer::Tokenizer<Sink>>::emit_current_tag
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:425
  15: <html5ever::tokenizer::Tokenizer<Sink>>::step
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:628
  16: <html5ever::tokenizer::Tokenizer<Sink>>::run
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:361
  17: <html5ever::tokenizer::Tokenizer<Sink>>::feed
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/tokenizer/mod.rs:219
  18: <html5ever::driver::Parser<Sink> as tendril::stream::TendrilSink<tendril::fmt::UTF8>>::process
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.22.3/src/driver.rs:88
  19: tendril::stream::TendrilSink::one
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/tendril-0.4.0/src/stream.rs:47
  20: scraper::html::Html::parse_document
	     at /home/########/.cargo/registry/src/github.com-1ecc6299db9ec823/scraper-0.6.0/src/html/mod.rs:55
  21: test_select::main
	     at src/main.rs:8
  22: std::rt::lang_start::{{closure}}
	     at /checkout/src/libstd/rt.rs:74
  23: std::panicking::try::do_call
	     at libstd/rt.rs:59
	     at libstd/panicking.rs:310
  24: __rust_maybe_catch_panic
	     at libpanic_unwind/lib.rs:105
  25: std::rt::lang_start_internal
	     at libstd/panicking.rs:289
	     at libstd/panic.rs:374
	     at libstd/rt.rs:58
  26: std::rt::lang_start
	     at /checkout/src/libstd/rt.rs:74
  27: main
  28: __libc_start_main
  29: _start

Error using scraper in spawned process via Tokio runtime

Environment:

  • Linux
  • Rust 1.60.0-nightly
  • Scraper 0.12.0

Problem:

When calling rt.spawn(my_job::run(...)); I receive these compile errors

generator cannot be sent between threads safely
within `ego_tree::Node<scraper::node::Node>`, the trait `Sync` is not implemented for `Cell<NonZeroUsize>`
mod.rs(381, 25): required by a bound in `Runtime::spawn`
generator cannot be sent between threads safely
within `ego_tree::Node<scraper::node::Node>`, the trait `Sync` is not implemented for `UnsafeCell<tendril::tendril::Buffer>`
mod.rs(381, 25): required by a bound in `Runtime::spawn`
generator cannot be sent between threads safely
within `ego_tree::Node<scraper::node::Node>`, the trait `Sync` is not implemented for `*mut tendril::fmt::UTF8`
mod.rs(381, 25): required by a bound in `Runtime::spawn`
generator cannot be sent between threads safely
within `ego_tree::Node<scraper::node::Node>`, the trait `Sync` is not implemented for `Cell<tendril::tendril::PackedUsize>`
mod.rs(381, 25): required by a bound in `Runtime::spawn`

UPDATE: The error only appears when I run this code within my_job.rs.

download::run(...).await.unwrap();

If I remove that line everything compiles. Can someone explain why?

Code:

main.rs

use tokio::runtime::Runtime;
...
let mut rt = Runtime::new().unwrap();
rt.spawn(my_job::run(...))

my_job.rs

let title_selector = scraper::Selector::parse("title").unwrap();
let title = document.select(&title_selector).next().unwrap().inner_html();
download::run("", "./Downloads", "").await.unwrap();

Question:

It appears scraper might not be thread-safe according to the Rust compiler. Are there any workarounds so I can still use the scraper crate?

how to get parent of elements?

I'm trying to get all p tags within a website, and get all unique direct parent DOM elements. For example:

let selector = Selector::parse("p").unwrap();
let paragraphs = dom.select(&selector).map(|p| p.parent().unwrap());
// TODO: get all unique parent nodes for the paragraphs

This is my best shot yet, but I'm still not there. Do you have any idea about how I can achieve this?

HTML serialization

select.rs is able to serialize nodes back into HTML. It does this using the (undocumented, thanks Servo) html5ever::serialize module (node.rs). Scraper should be able to do the same.

  • ElementRef::html
  • ElementRef::inner_html
  • How to unify with ElementRef::text?

how to return an iterator from a function

async fn get_links() -> impl std::iter::Iterator {
    let res = reqwest::get("http://site").await.unwrap();
    let doc = Html::parse_document(&res.text().await.unwrap());
    let sel: Selector = Selector::parse("a").expect("can't parse selector");
    doc.select(&sel)
    //          ---- `sel` is borrowed here
    // --- `doc` is borrowed here
}

Maybe I'm doing something wrong, or does something need to be added to the library to fix the problem?

Document manipulation is missing?

I noticed that the library is missing the ability to edit documents. This feature would be very useful, as we could remove ads and unwanted elements from a document before parsing the data.

Are there plans to add this feature?

Improve ergonomics by allowing to use a string input for fragment selections

The intermediate parsing of the Selector itself is a bit annoying. I think it would be neat if it were part of the select().

Current:

let fragment = Html::parse_fragment(html);
let selector = Selector::parse("li").unwrap();

for element in fragment.select(&selector) {}

Proposed a):

let fragment = Html::parse_fragment(html);
for element in fragment.select("li".into()?) {}

Proposed b):

let fragment = Html::parse_fragment(html);
for element in fragment.select("li")? {}

Get a value from a nested html tag

I'm trying to parse the price information from an amazon page:
https://www.amazon.de/Lenovo-Chromebook-1366x768-Ultraslim-UHD-Grafik/dp/B08CZWFCH7?ref_=Oct_DLandingS_D_4cba4dd4_61&smid=A3JWKAKR8XB7XF

There seems to be a <span> with the class priceBlockBuyingPriceString.

So I tried:

    let html_text = get_html("https://www.amazon.de/Lenovo-Chromebook-1366x768-Ultraslim-UHD-Grafik/dp/B08CZWFCH7?ref_=Oct_DLandingS_D_4cba4dd4_61&smid=A3JWKAKR8XB7XF")
            .await
            .with_context(|| format!("Failed to parse {}", url)).ok()?
            .text()
            .await
            .with_context(|| "Failed read html text").ok()?;

    let document = Html::parse_document(&html_text);

    let selector_element = Selector::parse("span.priceBlockBuyingPriceString").unwrap();
    let price_field = document.select(&selector_element).next()?;
    println!("Element {:?}", price_field.text());

But the result is not the price I expected; the output isn't the value of the targeted field.

I think I'm using the crate the wrong way...

Bump html5ever for removed time methods

Bug Info

The currently used version of html5ever relies on the since-removed time method precise_time_ns:

error[E0425]: cannot find function `precise_time_ns` in crate `time`
   --> ...../cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.5.4/src/macros.rs:27:26
    |
27  |         let t0 = ::time::precise_time_ns();
    |                          ^^^^^^^^^^^^^^^ not found in `time`
    |
   ::: ...../.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.5.4/src/tokenizer/mod.rs:230:27
    |
230 |             let (_, dt) = time!(self.sink.process_token(token));
    |                           ------------------------------------- in this macro invocation
    |
    = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

edit: looks like

html5ever = "0.25.1"

is still the latest published release of html5ever :\

Looking for maintainer(s)

I haven't been actively using or developing this project for quite a while now and haven't had time to respond to new issues and PRs. I would like to find someone more qualified than me to continue maintenance and/or development of this project which seems to be quite popular. If anyone would like to help out, please comment here or send me an email.

Bug: Selector doesn't match if nested divs have same class

If I have this html:

<div id="content">
	<div class="navigation">
		<div class="navigation">
			<span class="pages">Page 1 of 15</span> 
			<span aria-current='page' class='page-numbers current'>1</span>
			<a class='page-numbers' href='http://example.com/page/2'>2</a>
			<a class='page-numbers' href='http://example.com/page/3'>3</a>
			<a class='page-numbers' href='http://example.com/page/4'>4</a>
			<span class="page-numbers dots">&hellip;</span>
			<a class='page-numbers' href='http://example.com/page/15'>15</a>
			<a class="next page-numbers" href="http://example.com/page/2">&raquo;</a>
		</div>
	</div>
</div>

Then:

  • The selector #content > div.navigation matches the INNER <div class="navigation">, NOT the outer one! It should match the outer one because > means "direct descendant".
  • The selector #content > div.navigation > div matches NOTHING (same as #content > div.navigation > div.navigation)! It should match the inner navigation div.

This seems to be a bug!

(Maybe related to #41 ?)

Returning ElementRef instead of NodeRef

In the methods for ElementRef it's possible to find sibling and descendant nodes using the Deref<Target = NodeRef>; however, NodeRef doesn't have some of the convenience methods that ElementRef has, like inner_html(). Is it possible to add the ability to return an ElementRef when searching for nodes related to the currently selected ElementRef node, instead of a NodeRef?

HTML serialization is not deterministic.

The scraper::node::Element struct uses a HashMap to store the list of attributes, which means that the attribute order is not preserved and that serializing an Element is not deterministic.

An easy fix for this would be to use an IndexMap for storing attributes in Elements.

Having deterministic serialization is pretty useful for things like unit tests, so this would be a welcome change.

can't select with ~ and *

Greetings!
First of all, I hope that you'll find a maintainer, because the library is awesome.

Unfortunately, I can't use all possible attribute selectors:

fn main() {
    use scraper::{Html, Selector};

    let doc = Html::parse_fragment(r#"<a href="https://google.com""#);
    let sel = Selector::parse(r#"[href=~"goo"]"#).unwrap();
    println!("{:?}", doc.select(&sel).next());
}

The error is ParseError { kind: Custom(BadValueInAttr(Delim('~'))), location: SourceLocation { line: 0, column: 7 } }. And the same error appears with [href=*"goo"].

Ability to specify attribute constraints in selectors

Python's BeautifulSoup library allows to match based on both the tag as well as attributes. For example, in this HTML fragment

<a href="/contact" target="_blank">Contact</a>
<a href="/about">About</a>

Using Selector::parse("a").unwrap() as the selector would return both of these elements, but BeautifulSoup's soup.find_all("a", attrs={"target":"_blank"}) would only return the first one. This can probably be handled at the user end but a proper API within the crate, if possible, would greatly alleviate friction in porting Python scraping code over to Rust.
