rep-cpp's Introduction

Robots Exclusion Protocol Parser for C++

Supports the 1996 RFC, as well as some modern conventions, including:

  • wildcard matching (* and $)
  • sitemap listing
  • crawl-delay

This library deals in UTF-8-encoded strings.

Matching

A path may match multiple directives. For example, /some/path/page.html matches all of these rules:

Allow: /some/
Disallow: /some/path/
Allow: /*/page.html

Each directive is given a priority, and the highest-priority matching directive is used. The priority is simply the length of the expression. In the above example, the priorities are:

Allow: /some/            (priority = 6)
Disallow: /some/path/    (priority = 11)
Allow: /*/page.html      (priority = 12)
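
Since the longest matching expression is an Allow, /some/path/page.html is allowed. Below is a minimal sketch of checking this with the API described later in this README; the User-agent: * grouping is an assumption added to make the snippet self-contained.

#include <cassert>
#include <string>

#include "robots.h"

int main()
{
    std::string content =
        "User-agent: *\n"
        "Allow: /some/\n"
        "Disallow: /some/path/\n"
        "Allow: /*/page.html\n";

    Rep::Robots robots(content);

    // The highest-priority (longest) matching rule is Allow: /*/page.html,
    // so this path is allowed despite matching Disallow: /some/path/.
    assert(robots.allowed("/some/path/page.html", "my-agent"));
    return 0;
}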

Classes

A Robots object is the result of parsing a single robots.txt file. It has a mapping of agent names to Agent objects, as well as a vector of the sitemaps listed in the file. An Agent object holds the crawl-delay and Directives associated with a particular user-agent.
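
As a rough sketch of walking these objects, the example below parses a file and queries the Agent for one user-agent. The accessors for the sitemap vector and the crawl-delay are assumptions (shown only as comments), since only agent(), allowed(), and url_allowed() appear in the examples below; check the headers for the actual names.

#include <iostream>
#include <string>

#include "robots.h"

int main()
{
    std::string content =
        "User-agent: my-agent\n"
        "Crawl-delay: 2\n"
        "Disallow: /private/\n"
        "Sitemap: http://example.com/sitemap.xml\n";

    Rep::Robots robots(content);

    // Look up the Agent for a particular user-agent string.
    Rep::Agent agent = robots.agent("my-agent");

    // Prints 0: /private/page.html matches Disallow: /private/.
    std::cout << agent.allowed("/private/page.html") << std::endl;

    // Hypothetical accessors for the sitemaps and the crawl-delay; the real
    // names live in the headers and may differ.
    // for (const auto& sitemap : robots.sitemaps()) { std::cout << sitemap << std::endl; }
    // std::cout << agent.delay() << std::endl;
    return 0;
}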

Parsing and Querying

Here's an example of parsing a robots.txt file:

#include "robots.h"

std::string content = "...";
Rep::Robots robots = Rep::Robots(content);

// Is this path allowed to the provided agent?
robots.allowed("/some/path", "my-agent");

// Is this URL allowed to the provided agent?
robots.url_allowed("http://example.com/some/path", "my-agent");

If a client is interested only in the exclusion rules of a single agent, then:

Rep::Agent agent = Rep::Robots(content).agent("my-agent");

// Is this path allowed to this agent?
agent.allowed("/some/path");

// Is this URL allowed to this agent?
agent.url_allowed("http://example.com/some/path");

Building

This library depends on url-cpp, which is included as a submodule. We provide two main targets, {debug,release}/librep.o:

git submodule update --init --recursive
make release/librep.o
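
A downstream program is then compiled and linked against that object together with url-cpp's. The include paths and the url-cpp object name below are assumptions about the checkout layout, not part of this repository's documentation, so adjust them to match your tree:

g++ -std=c++11 -Iinclude -Ideps/url-cpp/include \
    my_crawler.cpp release/librep.o deps/url-cpp/release/liburl.o \
    -o my_crawler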

Development

Environment

To launch the Vagrant image, run vagrant up (though you may have to provide a --provider flag):

vagrant up

With a running vagrant instance, you can log in and run tests:

vagrant ssh
cd /vagrant

make test

Running Tests

Tests are run with the top-level Makefile:

make test

PRs

These are not all hard-and-fast rules, but in general PRs have the following expectations:

  • pass Travis -- or more generally, whatever CI is used for the particular project
  • be a complete unit -- whether a bug fix or a feature, it should be submitted as a single, complete unit of work before consideration.
  • maintain code coverage -- some projects may include code coverage requirements as part of the build as well
  • maintain the established style -- this means the existing style of established projects, the established conventions of the team for a given language on new projects, and the guidelines of the community of the relevant languages and frameworks.
  • include failing tests -- in the case of bugs, failing tests demonstrating the bug should be included as one commit, followed by a commit making the test succeed. This allows us to jump to a world with a bug included, and prove that our test in fact exercises the bug.
  • be reviewed by one or more developers -- not all feedback has to be accepted, but it should all be considered.
  • avoid 'addressed PR feedback' commits -- in general, PR feedback should be rebased back into the appropriate commits that introduced the change. In cases where this is burdensome, PR feedback commits may be used, but they should still describe the changes contained therein.

PR reviews consider the design, organization, and functionality of the submitted code.

Commits

Certain types of changes should be made in their own commits to improve readability. When too many different types of changes happen simultaneously in a single commit, the purpose of each change is muddled. By giving each commit a single logical purpose, it is implicitly clear why the changes in that commit took place.

  • updating / upgrading dependencies -- this is especially true for invocations like bundle update or berks update.
  • introducing a new dependency -- often preceded by a commit updating existing dependencies, this should only include the changes for the new dependency.
  • refactoring -- these commits should preserve all the existing functionality and merely update how it's done.
  • utility components to be used by a new feature -- if introducing an auxiliary class in support of a subsequent commit, add this new class (and its tests) in its own commit.
  • config changes -- when adjusting configuration in isolation
  • formatting / whitespace commits -- when adjusting code only for stylistic purposes.

New Features

Small new features (where small refers to the size and complexity of the change, not the impact) are often introduced in a single commit. Larger features or components might be built up piecewise, with each commit containing a single part of it (and its corresponding tests).

Bug Fixes

In general, bug fixes should come in two-commit pairs: a commit adding a failing test demonstrating the bug, and a commit making that failing test pass.

Tagging and Versioning

Whenever the version included in setup.py is changed (and it should be changed when appropriate using http://semver.org/), a corresponding tag should be created with the same version number (formatted v<version>).

git tag -a v0.1.0 -m 'Version 0.1.0

This release contains an initial working version of Rep::Robots.'

git push origin

rep-cpp's People

Contributors

b4hand, dlecocq, tuxnco, tdufala

rep-cpp's Issues

Ignore initial BOM in robots.txt

seomoz/reppy#50 reported an issue with a robots.txt file that contained an initial BOM. This causes the User-Agent directive on the first line to be ignored, which in turn causes the file to be treated as invalid. Instead, we should ignore any initial BOM, even though, as far as I can tell, one is technically not supposed to be present.
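
One possible approach (a sketch, not necessarily how the fix should land in the parser) is to strip a leading UTF-8 BOM before the content is tokenized:

#include <string>

// Remove a leading UTF-8 BOM (0xEF 0xBB 0xBF), if present, so the
// User-Agent directive on the first line is parsed rather than ignored.
std::string strip_bom(const std::string& content)
{
    static const std::string bom("\xEF\xBB\xBF");
    if (content.compare(0, bom.size(), bom) == 0)
    {
        return content.substr(bom.size());
    }
    return content;
}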

Ignore absoluteURI for Allow/Disallow directives for out of context domains

While Allow and Disallow directives should, technically, always be relative paths according to the spec, it's not uncommon for sites to include absolute URIs in these directives.

Traditionally, reppy and rep-cpp have considered absolute URIs to be equivalent to their corresponding relative URLs, as though they were for the current domain; however, there are several examples on the web where sites list an absolute URI for an external domain that doesn't match the domain of the parsed robots.txt file. It's unclear what the intended meaning of such an absolute URI is. Google's spec for robots.txt indicates only path elements for Disallow and Allow directives, and it's unclear how Google handles absolute URIs in this context, but I can't imagine that Google would respect a Disallow directive for an external site, since that would mean arbitrary external sites could block crawling for any site.

One simple option for handling this case would be to discard any directives with absolute URIs. However, that means absolute URIs whose domain does match the current one, which were previously honored, would now be ignored as well, including Disallow directives. I'm reluctant to introduce that regression, so instead I propose we only ignore or discard directives whose domain doesn't match the domain the robots.txt was requested from. However, rep-cpp currently doesn't have this contextual information about the requested robots.txt URL, so it can't determine which directives are ignorable.
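
A sketch of the proposed behavior, assuming the caller could supply the host the robots.txt was fetched from; the helper below is purely illustrative and is not part of the current rep-cpp API (it also ignores ports and userinfo):

#include <string>

// Return true if a directive value is a relative path, or an absolute URI
// whose host matches the host the robots.txt was fetched from. Directives
// pointing at foreign hosts would be discarded.
bool directive_applies(const std::string& value, const std::string& robots_host)
{
    std::size_t scheme = value.find("://");
    if (scheme == std::string::npos)
    {
        return true;  // relative path -- always applies
    }
    std::size_t host_start = scheme + 3;
    std::size_t host_end = value.find('/', host_start);
    std::string host = (host_end == std::string::npos)
        ? value.substr(host_start)
        : value.substr(host_start, host_end - host_start);
    return host == robots_host;
}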

Path allowed despite Disallow for *

Hey,

I have written an R-based robots.txt parser (https://github.com/ropenscilabs/robotstxt). @hrbrmstr wrapped this library and suggested using it for a significant speedup (https://github.com/hrbrmstr/spiderbar).

Related issue: hrbrmstr/spiderbar#2

Now I have run my test cases against my own implementation and against the rep-cpp wrapper, and found a divergence which I think is a bug on your side. Consider the following robots.txt file:

User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml

In the example, some directories are forbidden for all robots, e.g. /temp/, but when rep-cpp is used for permission checking, the path is reported as allowed for the bot mein-Robot, which I am quite sure should not be the case. (rep-cpp is used for the function calls where check_method = "spiderbar".)

library(robotstxt)

rtxt <- "# robots.txt zu http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "mein-Robot"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "mein-Robot"
)
#> [1] TRUE
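
For reference, the same check expressed directly against the rep-cpp API (a sketch): per the report above, this prints 1 (allowed), presumably because mein-Robot matches its own User-agent group and therefore does not pick up Disallow: /temp/ from the * group.

#include <iostream>
#include <string>

#include "robots.h"

int main()
{
    std::string content =
        "User-agent: UniversalRobot/1.0\n"
        "User-agent: mein-Robot\n"
        "Disallow: /quellen/dtd/\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /unsinn/\n"
        "Disallow: /temp/\n"
        "Disallow: /newsticker.shtml\n";

    Rep::Robots robots(content);

    // Reported result: prints 1 (allowed) for mein-Robot, while the '*' group
    // disallows /temp/ for agents without their own group.
    std::cout << robots.allowed("/temp/some_file.txt", "mein-Robot") << std::endl;
    return 0;
}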
