Comments (2)
This is a good idea. I'm not sure which solution I prefer. Maybe the time the robots.txt was last crawled could be cached? That way one could implement a behavior that sits in between, or combines, the two proposed solutions. But I'm not sure that is worth the added complexity.
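To make the caching idea a bit more concrete, here is a minimal sketch using only the standard library. `RobotsCache`, `get_fresh`, `store` and the `max_age` parameter are hypothetical names for illustration, not anything that exists in website-stalker, and persisting the cache between runs is left out entirely.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// One cached robots.txt body together with the time it was fetched.
struct CachedRobots {
    body: String,
    fetched_at: SystemTime,
}

/// Hypothetical per-host cache (an assumption, not website-stalker code).
pub struct RobotsCache {
    max_age: Duration,
    entries: HashMap<String, CachedRobots>,
}

impl RobotsCache {
    pub fn new(max_age: Duration) -> Self {
        Self { max_age, entries: HashMap::new() }
    }

    /// Returns the cached body for `host` if it is no older than `max_age` at `now`.
    /// A clock that went backwards is treated as stale, so the caller refetches.
    pub fn get_fresh(&self, host: &str, now: SystemTime) -> Option<&str> {
        let entry = self.entries.get(host)?;
        match now.duration_since(entry.fetched_at) {
            Ok(age) if age <= self.max_age => Some(entry.body.as_str()),
            _ => None,
        }
    }

    /// Remembers `body` for `host`, replacing any older entry.
    pub fn store(&mut self, host: &str, body: String, now: SystemTime) {
        self.entries
            .insert(host.to_owned(), CachedRobots { body, fetched_at: now });
    }
}
```

With something like this, a run would call `get_fresh` first and only fetch (and then `store`) the robots.txt when it returns `None`, which lands somewhere between "always fetch" and "fetch once".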
Probing the robots.txt when checking the config seems like behavior we should have regardless of the other behaviors under discussion. Not only do we check the robots.txt at least once, we can also use this to check whether the domains/hosts are actually reachable: a nice bit of UX candy for free. (This of course adds the deployment dependency of running on a connected/online machine, but given the nature of the tool, this is acceptable imho.)
from website-stalker.
Okay, I think our best bet is robotstxt. It has zero dependencies and the code looks well commented. An alternative could be robotparser-rs. It depends on url and percent-encoding, has slightly fewer "used-by" entries, but seems to be under more active development judging by the git history.
I also considered writing a parser ourselves, as I wanted to learn more about parser generators, but all the Rust-related projects seemed either overkill or a poor match for the use case (or what I imagined would be the use case for a parser generator 🤡). One example is pest, which has the very neat ability to take a PEG (think of it as (A)BNF) and generate a matching parser from it. Since the robots.txt RFC specifies the ABNF for parsing, this could be doable, although I didn't investigate the ABNF <-> PEG conversion. I believe pest is overkill.
Another alternative could be nom, which is a parser-combinator library. We would break the ABNF from the RFC down into small parsers, one per case, and nom would combine them into a complete and proper parser. The downside is that a lot more work is needed, and we lose the 1:1 "verification" that the implementation is correct according to the ABNF.
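To illustrate the "small parsers combined into a bigger one" idea, here is a rough sketch in plain std Rust. It deliberately does not use nom's API; it only mimics the combinator style with ordinary functions. All names (`Line`, `keyed`, `parse_line`, `parse`) are made up for this example, and it is a simplification: it ignores grouping and only recognizes user-agent, allow and disallow lines.

```rust
/// One parsed robots.txt line; the variants mirror the line kinds we care about.
#[derive(Debug, PartialEq)]
pub enum Line<'a> {
    UserAgent(&'a str),
    Allow(&'a str),
    Disallow(&'a str),
}

/// Drops everything from the first `#` onwards.
fn strip_comment(line: &str) -> &str {
    line.split_once('#').map_or(line, |(before, _)| before)
}

/// Building block: matches `key` case-insensitively, then optional
/// whitespace and a `:`. Returns the trimmed value on success.
fn keyed<'a>(key: &str, input: &'a str) -> Option<&'a str> {
    let input = input.trim_start();
    let head = input.get(..key.len())?;
    if !head.eq_ignore_ascii_case(key) {
        return None;
    }
    let rest = input[key.len()..].trim_start().strip_prefix(':')?;
    Some(rest.trim())
}

// One small parser per case, each built from the same building block.
fn user_agent(input: &str) -> Option<Line<'_>> {
    keyed("user-agent", input).map(Line::UserAgent)
}

fn allow(input: &str) -> Option<Line<'_>> {
    keyed("allow", input).map(Line::Allow)
}

fn disallow(input: &str) -> Option<Line<'_>> {
    keyed("disallow", input).map(Line::Disallow)
}

/// Combined parser: tries each small parser in turn, first match wins.
pub fn parse_line(line: &str) -> Option<Line<'_>> {
    let line = strip_comment(line);
    user_agent(line)
        .or_else(|| allow(line))
        .or_else(|| disallow(line))
}

/// Parses a whole file, silently skipping lines no parser recognizes.
pub fn parse(input: &str) -> Vec<Line<'_>> {
    input.lines().filter_map(parse_line).collect()
}
```

Even in this toy version the downside shows: nothing ties `keyed` back to the grammar in the RFC, so conformance has to be checked by hand or by tests.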