
Check robots.txt · website-stalker (open, 2 comments)

edjopato commented on September 24, 2024
Check robots.txt

from website-stalker.

Comments (2)

Teufelchen1 commented on September 24, 2024

This is a good idea. I'm not sure which solution I prefer. Maybe the time robots.txt was last crawled could be cached? That way one could implement a behavior that is in between, or a combination of, the two proposed solutions. But I'm not sure that is worth the added complexity.
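The caching idea above could be sketched roughly like this (all names here, such as `RobotsCache` and `needs_refetch`, are illustrative and not from the website-stalker codebase; the TTL value is an assumption):

```rust
use std::time::{Duration, Instant};

/// Minimal sketch of the caching idea: remember when robots.txt was last
/// fetched and only refetch once a time-to-live has expired. A real
/// implementation would also persist this across runs and store the parsed
/// rules alongside the timestamp.
struct RobotsCache {
    last_fetched: Option<Instant>,
    ttl: Duration,
}

impl RobotsCache {
    fn new(ttl: Duration) -> Self {
        Self { last_fetched: None, ttl }
    }

    /// True when robots.txt has never been fetched or the cached copy expired.
    fn needs_refetch(&self, now: Instant) -> bool {
        match self.last_fetched {
            None => true,
            Some(t) => now.duration_since(t) >= self.ttl,
        }
    }

    fn mark_fetched(&mut self, now: Instant) {
        self.last_fetched = Some(now);
    }
}

fn main() {
    // Refetch robots.txt at most once per day (assumed TTL).
    let mut cache = RobotsCache::new(Duration::from_secs(24 * 60 * 60));
    let start = Instant::now();
    assert!(cache.needs_refetch(start)); // never fetched yet
    cache.mark_fetched(start);
    assert!(!cache.needs_refetch(start + Duration::from_secs(60)));
    println!("cache behaves as expected");
}
```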

Probing the robots.txt when checking the config seems like behavior we should have regardless of the other behaviors under discussion. Not only would we check the robots.txt at least once, we could also leverage this to check whether the domains/hosts are actually reachable, a nice extra bit of UX for free. (This of course adds the deployment dependency of being run on a connected/online machine, but given the nature of the tool, that is acceptable imho.)


Teufelchen1 commented on September 24, 2024

Okay, I think our best bet is robotstxt. It has zero dependencies and the code looks well commented. An alternative would be robotparser-rs. It depends on url and percent-encoding and has slightly fewer "used by" entries, but seems to be under more active development judging by the git history.
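To make the comparison concrete, here is a deliberately simplified, std-only sketch of the kind of check these crates perform: find the rule group matching our user agent and apply the longest-matching Allow/Disallow rule. This ignores wildcards, crawl-delay, and many other details of RFC 9309 that a real crate handles; it is not the API of robotstxt or robotparser-rs.

```rust
/// Simplified robots.txt allow-check: longest matching rule wins,
/// Allow beats Disallow at equal length, no rule means allowed.
/// (Sketch only; wildcards and multi-agent groups are not handled.)
fn is_allowed(robots_txt: &str, user_agent: &str, path: &str) -> bool {
    let mut in_group = false;
    let mut best: Option<(usize, bool)> = None; // (rule length, allowed?)
    for line in robots_txt.lines() {
        // Strip comments and whitespace, then split "Key: value".
        let line = line.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else { continue };
        let (key, value) = (key.trim().to_ascii_lowercase(), value.trim());
        match key.as_str() {
            "user-agent" => {
                in_group = value == "*" || value.eq_ignore_ascii_case(user_agent);
            }
            "disallow" | "allow" if in_group => {
                if value.is_empty() {
                    continue; // an empty "Disallow:" disallows nothing
                }
                // Prefix match; keep the longest (most specific) rule.
                if path.starts_with(value)
                    && best.map_or(true, |(len, _)| value.len() >= len)
                {
                    best = Some((value.len(), key == "allow"));
                }
            }
            _ => {}
        }
    }
    best.map_or(true, |(_, allowed)| allowed)
}

fn main() {
    let robots = "User-agent: *\nDisallow: /private\nAllow: /private/ok\n";
    assert!(is_allowed(robots, "website-stalker", "/public/page"));
    assert!(!is_allowed(robots, "website-stalker", "/private/secret"));
    assert!(is_allowed(robots, "website-stalker", "/private/ok/page"));
    println!("robots.txt checks pass");
}
```

Even this toy version shows why pulling in a maintained crate is attractive: the edge cases (percent-encoding, `*` and `$` wildcards, group merging) are exactly where hand-rolled parsers go wrong.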

I also considered writing a parser ourselves, as I wanted to learn more about parser generators, but all Rust-related projects seemed either overkill or a poor match for the use case (or what I imagined the use case for a parser generator would be 🤡). One example is pest, which has the very neat ability to take a PEG (think of it as (A)BNF) and generate a matching parser. Since the robots.txt RFC specifies the BNF for parsing, this could be doable, though I didn't investigate the BNF <-> PEG conversion. I believe pest is overkill.
An alternative here would be nom, which is a parser-combinator library. We would break the ABNF from the RFC down into small parsers for each individual case, and nom would combine them into a complete and proper parser. The downside is that a lot more work is needed, and we lose the 1:1 "verification" that the implementation is correct according to the ABNF.
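The parser-combinator approach can be illustrated without nom itself: each small function consumes one piece of the grammar and returns the remaining input, and a higher-level function chains them into a full line parser. (The function names here are illustrative and std-only; nom's real combinators have richer error types and the same overall shape.)

```rust
/// Parse a known keyword case-insensitively, returning the remaining input.
fn keyword<'a>(expected: &str, input: &'a str) -> Option<&'a str> {
    match input.get(..expected.len()) {
        Some(prefix) if prefix.eq_ignore_ascii_case(expected) => {
            Some(&input[expected.len()..])
        }
        _ => None,
    }
}

/// Skip optional spaces and tabs (the RFC's optional whitespace).
fn whitespace(input: &str) -> &str {
    input.trim_start_matches(|c| c == ' ' || c == '\t')
}

/// Combine the small parsers into one for a `Disallow: <path>` line,
/// mirroring how the ABNF production would be broken down.
fn disallow_line(input: &str) -> Option<&str> {
    let rest = keyword("disallow", input)?;
    let rest = keyword(":", whitespace(rest))?;
    Some(whitespace(rest).trim_end())
}

fn main() {
    assert_eq!(disallow_line("Disallow: /private"), Some("/private"));
    assert_eq!(disallow_line("Allow: /x"), None);
    println!("parsers combined");
}
```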

