Comments (8)

adumbidiot commented on June 17, 2024

I can't get that HTML to even parse. Are you sure that's what you used to trigger the issue?

from scraper.

David-OConnor commented on June 17, 2024

That's a minimal example. I'm not certain it's the cause, but it appears to be what separates the tags scraper finds from the ones it ignores.

Example link it finds:

<a href="https://github.com">

Example link it doesn't find:

<a
    href="https://github.com">

adumbidiot commented on June 17, 2024

That seems to work.

main.rs:

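fn main() {
    // Minimal reproduction: the `href` attribute sits on the line after `<a`.
    let html = r#"<a
href="https://github.com">"#;

    println!("Raw HTML: {:?}", html);

    // Parse the document and print every `a` element the selector matches.
    let document = scraper::Html::parse_document(html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}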

Output:

Raw HTML: "<a\nhref=\"https://github.com\">"
<a href="https://github.com"></a>

David-OConnor commented on June 17, 2024

Hmm. I'll dig deeper and report back; that's equivalent to the code I'm having trouble with.

David-OConnor commented on June 17, 2024

Hi - Sorry about the late reply. I have tried several troubleshooting approaches, and have not been able to narrow this down. I can provide this case to reproduce it:

https://www.anyleaf.org/blog

It correctly pulls the links in the header and footer of the page, but none of the article links in the middle show up when using the 'a' selector.

adumbidiot commented on June 17, 2024

I can't reproduce that.

main.rs:

fn main() {
    // Fetch the page reported in the issue.
    let url = "https://www.anyleaf.org/blog";
    let html = ureq::get(url).call().unwrap().into_string().unwrap();

    println!("Raw HTML: {:?}", html);

    // Parse the fetched HTML and print every `a` element the selector matches.
    let document = scraper::Html::parse_document(&html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}

Cargo.toml:

[package]
name = "scraper-issue-76"
version = "0.0.0"
edition = "2021"

[dependencies]
scraper = "0.13.0"
ureq = "2.4.0"

Output:

Raw HTML: "\n\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"utf-8\">\n    <meta name=\"viewport\" content=\"width=device-width\">\n\n    <sc
ript type=\"module\">\n        document.documentElement.classList.remove('no-js');\n        document.documentElement.classList.add('js');\n    </script>\n\n
<link rel=\"stylesheet\" href=\"/static/style.css\">\n\n\n    <meta name=\"description\" content=\"Sensors and measurement for science, hydroponics, and aquariu
ms\">\n    <meta property=\"og:locale\" content=\"en_US\">\n    <meta property=\"og:type\" content=\"website\">\n    <meta name=\"twitter:card\" content=\"summa
ry_large_image\">\n    <meta property=\"og:url\" content=\"https://www.anyleaf.org\">\n\n    \n    <link rel=\"shortcut icon\" type=\"image/png\" href=\"/static
/favicon.png\"/>\n\n    \n    \n    <link rel=\"apple-touch-icon\" href=\"/static/favicon.png\">\n    \n    <meta name=\"theme-color\" content=\"#a2c8a9\">\n\n
   \n    <meta name=\"description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <meta property=\"og:title\" content=\
"\">\n    <meta property=\"og:description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <title>AnyLeaf sensors: Artic
les</title>\n\n\n</head>\n<body>\n\n<div id=\"top-bar\">\n    <div id=\"menu\">\n        <a href=\"/\" class=\"menu-item\"><h3 class=\"menu-header\">Home</h3></
a>\n        <a href=\"/mercury-g4\" class=\"menu-item\"><h3 class=\"menu-header\">Quad FC</h3></a>\n        <a href=\"/stove-thermometer\" class=\"menu-item\"><
h3 class=\"menu-header\">Stove Thermometer</h3></a>\n        <a href=\"/water-monitor\" class=\"menu-item\"><h3 class=\"menu-header\">Water Monitor</h3></a>\n
      <a href=\"/ph-module\" class=\"menu-item\"><h3 class=\"menu-header\">pH</h3></a>\n        <a href=\"/ec-module\" class=\"menu-item\"><h3 class=\"menu-head
er\">Conductivity</h3></a>\n        <a href=\"/temp-module\" class=\"menu-item\"><h3 class=\"menu-header\">Temperature</h3></a>\n        <a class=\"menu-item\"
href=\"/about\"><h3 class=\"menu-header\">About</h3></a>\n        <a class=\"menu-item\" href=\"/checkout\"><h3 class=\"menu-header\">Checkout</h3></a>\n
 <a class=\"menu-item\" href=\"/blog\"><h3 class=\"menu-header\">Blog</h3></a>\n        <a class=\"menu-item\" href=\"mailto:[email protected]\"><h3 class=\"m
enu-header\">Contact</h3></a>\n    </div>\n</div>\n\n\n\n\n    <div class=\"home-body\">\n        <div style=\"text-align: center;\">\n        <img src=\"/stati
c/logo.png\" style = \"margin-top: 40px\" width=300 alt=\"AnyLeaf\" />\n        </div>\n\n        <h1>AnyLeaf Blog</h1>\n\n        <h2>Misc:</h2>\n        <ul>\
n            <li style=\"margin-bottom: 40px;\">\n                <a\n                        href=\"/filter-design\"\n                        style=\"font-size
: 1.5em;\"\n                >Digital filter design and response\n                </a>\n            </li>\n        </ul>\n\n        <h2>Articles:</h2>\n        <
ul>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/parts-you-need-for-a-qu
adcopter-in-2022\"\n                            style=\"font-size: 1.5em\">\n                        Parts you need for a quadcopter in 2022\n
  </a> - Feb. 24, 2022, 7:46 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n
                 href=\"/blog/writing-embedded-firmware-using-rust\"\n                            style=\"font-size: 1.5em\">\n                        Writing e
mbedded firmware using Rust\n                    </a> - Sept. 25, 2021, 5:45 p.m.\n                </li>\n            \n                <li style=\"margin-botto
m: 40px;\">\n                    <a\n                            href=\"/blog/measuring-ph-on-raspberry-pi\"\n                            style=\"font-size: 1.5
em\">\n                        Measuring pH on Raspberry Pi\n                    </a> - Feb. 6, 2021, 9:47 a.m.\n                </li>\n            \n
      <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/the-essence-of-embedded-computers\"\n
             style=\"font-size: 1.5em\">\n                        The essence of embedded computers\n                    </a> - Sept. 6, 2020, 7:09 p.m.\n
          </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/electrical-
conductivity-(ec)-for-hydroponics\"\n                            style=\"font-size: 1.5em\">\n                        Electrical Conductivity (EC) for Hydroponi
cs\n                    </a> - Aug. 22, 2020, 4 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n
    <a\n                            href=\"/blog/project:-building-an-automatic-ph-doser\"\n                            style=\"font-size: 1.5em\">\n
             Project: Building an automatic pH doser\n                    </a> - July 21, 2020, 7:33 p.m.\n                </li>\n            \n
<li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/ph-measurement-for-hydroponics\"\n
    style=\"font-size: 1.5em\">\n                        pH Measurement for Hydroponics\n                    </a> - July 19, 2020, 3:43 p.m.\n                </
li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/how-to-calibrate-ph-sen
sors\"\n                            style=\"font-size: 1.5em\">\n                        How to Calibrate pH Sensors\n                    </a> - July 17, 2020,
1:23 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"
/blog/temperature-sensors:-a-comparison\"\n                            style=\"font-size: 1.5em\">\n                        Temperature sensors: A comparison\n
                   </a> - July 15, 2020, 6:42 p.m.\n                </li>\n            \n        </ul>\n    </div>\n\n\n\n<div id=\"footer\">\n    <h4 style=\"m
argin-top: 30px\">Assembled in Raleigh, NC, USA.</h4>\n    <div style=\"margin-bottom: 30px\">\n        <a class=\"fineprint\" style=\"margin-right: 20px\" href
=\"/privacy\">Privacy policy</a>\n        <a class=fineprint href=\"/terms\">Terms and conditions</a>\n    </div>\n    <div style=\"display: flex; flex-directio
n: column\">\n        <h5 class=\"fineprint\">\n            All AnyLeaf products comply with the\n            <a href=\"https://en.wikipedia.org/wiki/Restrictio
n_of_Hazardous_Substances_Directive\">\n                Restriction of Hazardous Substances (RoHS) Directive</a>.</h5>\n        <h5 class=\"fineprint\">© 2022 A
nyLeaf</h5>\n    </div>\n</div>\n\n\n<script src=\"/static/js/main.js\"></script>\n<script src=\"/static/js/cart.js\"></script>\n\n</body>\n</html>"
<a href="/" class="menu-item"><h3 class="menu-header">Home</h3></a>
<a class="menu-item" href="/mercury-g4"><h3 class="menu-header">Quad FC</h3></a>
<a class="menu-item" href="/stove-thermometer"><h3 class="menu-header">Stove Thermometer</h3></a>
<a class="menu-item" href="/water-monitor"><h3 class="menu-header">Water Monitor</h3></a>
<a class="menu-item" href="/ph-module"><h3 class="menu-header">pH</h3></a>
<a class="menu-item" href="/ec-module"><h3 class="menu-header">Conductivity</h3></a>
<a href="/temp-module" class="menu-item"><h3 class="menu-header">Temperature</h3></a>
<a class="menu-item" href="/about"><h3 class="menu-header">About</h3></a>
<a class="menu-item" href="/checkout"><h3 class="menu-header">Checkout</h3></a>
<a class="menu-item" href="/blog"><h3 class="menu-header">Blog</h3></a>
<a href="mailto:[email protected]" class="menu-item"><h3 class="menu-header">Contact</h3></a>
<a style="font-size: 1.5em;" href="/filter-design">Digital filter design and response
                </a>
<a href="/blog/parts-you-need-for-a-quadcopter-in-2022" style="font-size: 1.5em">
                        Parts you need for a quadcopter in 2022
                    </a>
<a href="/blog/writing-embedded-firmware-using-rust" style="font-size: 1.5em">
                        Writing embedded firmware using Rust
                    </a>
<a href="/blog/measuring-ph-on-raspberry-pi" style="font-size: 1.5em">
                        Measuring pH on Raspberry Pi
                    </a>
<a href="/blog/the-essence-of-embedded-computers" style="font-size: 1.5em">
                        The essence of embedded computers
                    </a>
<a href="/blog/electrical-conductivity-(ec)-for-hydroponics" style="font-size: 1.5em">
                        Electrical Conductivity (EC) for Hydroponics
                    </a>
<a href="/blog/project:-building-an-automatic-ph-doser" style="font-size: 1.5em">
                        Project: Building an automatic pH doser
                    </a>
<a style="font-size: 1.5em" href="/blog/ph-measurement-for-hydroponics">
                        pH Measurement for Hydroponics
                    </a>
<a href="/blog/how-to-calibrate-ph-sensors" style="font-size: 1.5em">
                        How to Calibrate pH Sensors
                    </a>
<a href="/blog/temperature-sensors:-a-comparison" style="font-size: 1.5em">
                        Temperature sensors: A comparison
                    </a>
<a class="fineprint" style="margin-right: 20px" href="/privacy">Privacy policy</a>
<a class="fineprint" href="/terms">Terms and conditions</a>
<a href="https://en.wikipedia.org/wiki/Restriction_of_Hazardous_Substances_Directive">
                Restriction of Hazardous Substances (RoHS) Directive</a>

David-OConnor commented on June 17, 2024

Thanks for looking! Not sure what's up. I'll work between your code and mine and see where the disconnect is.

teymour-aldridge commented on June 17, 2024

I've also added a test for this (#82), so I'm reasonably confident it's not a bug. Please do let us know if this remains a problem.
