Comments (8)
I can't get that html to even parse. Are you sure that's what you used to trigger the issue?
from scraper.
That's a minimal example. I'm not certain it's the cause, but a line break between the tag name and the href attribute appears to be what separates the tags it finds from the ones it ignores.
Example link it finds:
<a href="https://github.com">
Example link it doesn't find:
<a
href="https://github.com">
That seems to work.
main.rs:
fn main() {
    // The tag name and the href attribute sit on separate lines.
    let html = r#"<a
href="https://github.com">"#;
    println!("Raw HTML: {:?}", html);
    let document = scraper::Html::parse_document(html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    // Print every <a> element the selector matches.
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}
Output:
Raw HTML: "<a\nhref=\"https://github.com\">"
<a href="https://github.com"></a>
Hmm. I'll dig deeper and report back; that's equivalent to the code I'm having trouble with.
Hi - Sorry about the late reply. I have tried several troubleshooting approaches, and have not been able to narrow this down. I can provide this case to reproduce it:
It will correctly pull the links at the header and footer of the page, but none of the articles linked in the middle will show up using the 'a' selector.
I can't reproduce that.
main.rs:
fn main() {
    // Fetch the page from the report and dump the raw HTML for comparison.
    let url = "https://www.anyleaf.org/blog";
    let html = ureq::get(url).call().unwrap().into_string().unwrap();
    println!("Raw HTML: {:?}", html);
    let document = scraper::Html::parse_document(&html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    // Print every <a> element the selector matches.
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}
Cargo.toml:
[package]
name = "scraper-issue-76"
version = "0.0.0"
edition = "2021"

[dependencies]
scraper = "0.13.0"
ureq = "2.4.0"
Output:
Raw HTML: "\n\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"utf-8\">\n <meta name=\"viewport\" content=\"width=device-width\">\n\n <sc
ript type=\"module\">\n document.documentElement.classList.remove('no-js');\n document.documentElement.classList.add('js');\n </script>\n\n
<link rel=\"stylesheet\" href=\"/static/style.css\">\n\n\n <meta name=\"description\" content=\"Sensors and measurement for science, hydroponics, and aquariu
ms\">\n <meta property=\"og:locale\" content=\"en_US\">\n <meta property=\"og:type\" content=\"website\">\n <meta name=\"twitter:card\" content=\"summa
ry_large_image\">\n <meta property=\"og:url\" content=\"https://www.anyleaf.org\">\n\n \n <link rel=\"shortcut icon\" type=\"image/png\" href=\"/static
/favicon.png\"/>\n\n \n \n <link rel=\"apple-touch-icon\" href=\"/static/favicon.png\">\n \n <meta name=\"theme-color\" content=\"#a2c8a9\">\n\n
\n <meta name=\"description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n <meta property=\"og:title\" content=\
"\">\n <meta property=\"og:description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n <title>AnyLeaf sensors: Artic
les</title>\n\n\n</head>\n<body>\n\n<div id=\"top-bar\">\n <div id=\"menu\">\n <a href=\"/\" class=\"menu-item\"><h3 class=\"menu-header\">Home</h3></
a>\n <a href=\"/mercury-g4\" class=\"menu-item\"><h3 class=\"menu-header\">Quad FC</h3></a>\n <a href=\"/stove-thermometer\" class=\"menu-item\"><
h3 class=\"menu-header\">Stove Thermometer</h3></a>\n <a href=\"/water-monitor\" class=\"menu-item\"><h3 class=\"menu-header\">Water Monitor</h3></a>\n
<a href=\"/ph-module\" class=\"menu-item\"><h3 class=\"menu-header\">pH</h3></a>\n <a href=\"/ec-module\" class=\"menu-item\"><h3 class=\"menu-head
er\">Conductivity</h3></a>\n <a href=\"/temp-module\" class=\"menu-item\"><h3 class=\"menu-header\">Temperature</h3></a>\n <a class=\"menu-item\"
href=\"/about\"><h3 class=\"menu-header\">About</h3></a>\n <a class=\"menu-item\" href=\"/checkout\"><h3 class=\"menu-header\">Checkout</h3></a>\n
<a class=\"menu-item\" href=\"/blog\"><h3 class=\"menu-header\">Blog</h3></a>\n <a class=\"menu-item\" href=\"mailto:[email protected]\"><h3 class=\"m
enu-header\">Contact</h3></a>\n </div>\n</div>\n\n\n\n\n <div class=\"home-body\">\n <div style=\"text-align: center;\">\n <img src=\"/stati
c/logo.png\" style = \"margin-top: 40px\" width=300 alt=\"AnyLeaf\" />\n </div>\n\n <h1>AnyLeaf Blog</h1>\n\n <h2>Misc:</h2>\n <ul>\
n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/filter-design\"\n style=\"font-size
: 1.5em;\"\n >Digital filter design and response\n </a>\n </li>\n </ul>\n\n <h2>Articles:</h2>\n <
ul>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/parts-you-need-for-a-qu
adcopter-in-2022\"\n style=\"font-size: 1.5em\">\n Parts you need for a quadcopter in 2022\n
</a> - Feb. 24, 2022, 7:46 p.m.\n </li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n
href=\"/blog/writing-embedded-firmware-using-rust\"\n style=\"font-size: 1.5em\">\n Writing e
mbedded firmware using Rust\n </a> - Sept. 25, 2021, 5:45 p.m.\n </li>\n \n <li style=\"margin-botto
m: 40px;\">\n <a\n href=\"/blog/measuring-ph-on-raspberry-pi\"\n style=\"font-size: 1.5
em\">\n Measuring pH on Raspberry Pi\n </a> - Feb. 6, 2021, 9:47 a.m.\n </li>\n \n
<li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/the-essence-of-embedded-computers\"\n
style=\"font-size: 1.5em\">\n The essence of embedded computers\n </a> - Sept. 6, 2020, 7:09 p.m.\n
</li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/electrical-
conductivity-(ec)-for-hydroponics\"\n style=\"font-size: 1.5em\">\n Electrical Conductivity (EC) for Hydroponi
cs\n </a> - Aug. 22, 2020, 4 p.m.\n </li>\n \n <li style=\"margin-bottom: 40px;\">\n
<a\n href=\"/blog/project:-building-an-automatic-ph-doser\"\n style=\"font-size: 1.5em\">\n
Project: Building an automatic pH doser\n </a> - July 21, 2020, 7:33 p.m.\n </li>\n \n
<li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/ph-measurement-for-hydroponics\"\n
style=\"font-size: 1.5em\">\n pH Measurement for Hydroponics\n </a> - July 19, 2020, 3:43 p.m.\n </
li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/how-to-calibrate-ph-sen
sors\"\n style=\"font-size: 1.5em\">\n How to Calibrate pH Sensors\n </a> - July 17, 2020,
1:23 p.m.\n </li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"
/blog/temperature-sensors:-a-comparison\"\n style=\"font-size: 1.5em\">\n Temperature sensors: A comparison\n
</a> - July 15, 2020, 6:42 p.m.\n </li>\n \n </ul>\n </div>\n\n\n\n<div id=\"footer\">\n <h4 style=\"m
argin-top: 30px\">Assembled in Raleigh, NC, USA.</h4>\n <div style=\"margin-bottom: 30px\">\n <a class=\"fineprint\" style=\"margin-right: 20px\" href
=\"/privacy\">Privacy policy</a>\n <a class=fineprint href=\"/terms\">Terms and conditions</a>\n </div>\n <div style=\"display: flex; flex-directio
n: column\">\n <h5 class=\"fineprint\">\n All AnyLeaf products comply with the\n <a href=\"https://en.wikipedia.org/wiki/Restrictio
n_of_Hazardous_Substances_Directive\">\n Restriction of Hazardous Substances (RoHS) Directive</a>.</h5>\n <h5 class=\"fineprint\">© 2022 A
nyLeaf</h5>\n </div>\n</div>\n\n\n<script src=\"/static/js/main.js\"></script>\n<script src=\"/static/js/cart.js\"></script>\n\n</body>\n</html>"
<a href="/" class="menu-item"><h3 class="menu-header">Home</h3></a>
<a class="menu-item" href="/mercury-g4"><h3 class="menu-header">Quad FC</h3></a>
<a class="menu-item" href="/stove-thermometer"><h3 class="menu-header">Stove Thermometer</h3></a>
<a class="menu-item" href="/water-monitor"><h3 class="menu-header">Water Monitor</h3></a>
<a class="menu-item" href="/ph-module"><h3 class="menu-header">pH</h3></a>
<a class="menu-item" href="/ec-module"><h3 class="menu-header">Conductivity</h3></a>
<a href="/temp-module" class="menu-item"><h3 class="menu-header">Temperature</h3></a>
<a class="menu-item" href="/about"><h3 class="menu-header">About</h3></a>
<a class="menu-item" href="/checkout"><h3 class="menu-header">Checkout</h3></a>
<a class="menu-item" href="/blog"><h3 class="menu-header">Blog</h3></a>
<a href="mailto:[email protected]" class="menu-item"><h3 class="menu-header">Contact</h3></a>
<a style="font-size: 1.5em;" href="/filter-design">Digital filter design and response
</a>
<a href="/blog/parts-you-need-for-a-quadcopter-in-2022" style="font-size: 1.5em">
Parts you need for a quadcopter in 2022
</a>
<a href="/blog/writing-embedded-firmware-using-rust" style="font-size: 1.5em">
Writing embedded firmware using Rust
</a>
<a href="/blog/measuring-ph-on-raspberry-pi" style="font-size: 1.5em">
Measuring pH on Raspberry Pi
</a>
<a href="/blog/the-essence-of-embedded-computers" style="font-size: 1.5em">
The essence of embedded computers
</a>
<a href="/blog/electrical-conductivity-(ec)-for-hydroponics" style="font-size: 1.5em">
Electrical Conductivity (EC) for Hydroponics
</a>
<a href="/blog/project:-building-an-automatic-ph-doser" style="font-size: 1.5em">
Project: Building an automatic pH doser
</a>
<a style="font-size: 1.5em" href="/blog/ph-measurement-for-hydroponics">
pH Measurement for Hydroponics
</a>
<a href="/blog/how-to-calibrate-ph-sensors" style="font-size: 1.5em">
How to Calibrate pH Sensors
</a>
<a href="/blog/temperature-sensors:-a-comparison" style="font-size: 1.5em">
Temperature sensors: A comparison
</a>
<a class="fineprint" style="margin-right: 20px" href="/privacy">Privacy policy</a>
<a class="fineprint" href="/terms">Terms and conditions</a>
<a href="https://en.wikipedia.org/wiki/Restriction_of_Hazardous_Substances_Directive">
Restriction of Hazardous Substances (RoHS) Directive</a>
Thanks for looking! Not sure what's up. I'll work between your code and mine and see where the disconnect is.
I've also added a test for this (#82) so I'm reasonably confident it's not a bug. Please do let us know if this remains a problem.