
robotstxt's Introduction

Google Robots.txt Parser and Matcher Library

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant with C++14).

About the library

The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.

Because the REP was only a de facto standard for the past 25 years, different implementers parse robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files. The library is released open-source to help developers build tools that better reflect Google's robots.txt parsing and matching.

For webmasters, we included a small binary in the project that allows testing a single URL and user-agent against a robots.txt.

Building the library

Quickstart

We included with the library a small binary to test a local robots.txt against a user-agent and URL. Running the included binary requires:

  • A compatible platform (e.g. Windows, macOS, Linux, etc.). Most platforms are fully supported.
  • A compatible C++ compiler supporting at least C++14. Most major compilers are supported.
  • Git for interacting with the source code repository. To install Git, consult the Set Up Git guide on GitHub.
  • Although you are free to use your own build system, most of the documentation within this guide will assume you are using Bazel. To download and install Bazel (and any of its dependencies), consult the Bazel Installation Guide.

Building with Bazel

Bazel is the official build system for the library; it is supported on most major platforms (e.g. Linux, Windows, macOS) and compilers.

To build and run the binary:

$ git clone https://github.com/google/robotstxt.git robotstxt
Cloning into 'robotstxt'...
...
$ cd robotstxt/
bazel-robots$ bazel test :robots_test
...
//:robots_test                                                     PASSED in 0.1s

Executed 1 out of 1 test: 1 test passes.
...
bazel-robots$ bazel build :robots_main
...
Target //:robots_main up-to-date:
  bazel-bin/robots_main
...
bazel-robots$ bazel run robots_main -- ~/local/path/to/robots.txt YourBot https://example.com/url
  user-agent 'YourBot' with url 'https://example.com/url' allowed: YES

Building with CMake

CMake is the community-supported build system for the library.

To build the library using CMake, just follow the steps below:

$ git clone https://github.com/google/robotstxt.git robotstxt
Cloning into 'robotstxt'...
...
$ cd robotstxt/
...
$ mkdir c-build && cd c-build
...
$ cmake .. -DROBOTS_BUILD_TESTS=ON
...
$ make
...
$ make test
Running tests...
Test project robotstxt/c-build
    Start 1: robots-test
1/1 Test #1: robots-test ......................   Passed    0.02 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.02 sec
...
$ robots ~/local/path/to/robots.txt YourBot https://example.com/url
  user-agent 'YourBot' with url 'https://example.com/url' allowed: YES

Notes

Parsing of robots.txt files themselves is done exactly as in the production version of Googlebot, including how percent codes and unicode characters in patterns are handled. The user must however ensure that the URI passed to the AllowedByRobots and OneAgentAllowedByRobots functions, or to the URI parameter of the robots tool, follows the format specified by RFC 3986, since this library will not perform full normalization of those URI parameters. Only if the URI is in this format will the matching be done according to the REP specification.
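
For illustration, here is a minimal sketch of calling the matcher directly from C++. It assumes the RobotsMatcher class and the OneAgentAllowedByRobots method declared in robots.h, and a URL that has already been percent-encoded per RFC 3986:

#include <iostream>
#include <string>

#include "robots.h"

int main() {
  // A toy robots.txt body; in practice the caller fetches and supplies this.
  const std::string robots_body =
      "User-agent: YourBot\n"
      "Disallow: /private/\n";
  // The URL must already follow RFC 3986; the library does not normalize it.
  const std::string url = "https://example.com/private/page.html";

  googlebot::RobotsMatcher matcher;
  const bool allowed =
      matcher.OneAgentAllowedByRobots(robots_body, "YourBot", url);
  std::cout << "allowed: " << (allowed ? "YES" : "NO") << std::endl;  // NO
  return 0;
}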

Also note that the library, and the included binary, do not handle implementation logic that a crawler might apply outside of parsing and matching, for example: Googlebot-Image respecting the rules specified for User-agent: Googlebot if not explicitly defined in the robots.txt file being tested.

License

The robots.txt parser and matcher C++ library is licensed under the terms of the Apache license. See LICENSE for more information.

Links

To learn more about this project:

robotstxt's People

Contributors

alanyee, anubhavp28, devsdmf, dwsmart, edwardbetts, epere4, fridex, garyillyes, happyxgang, korilakkuma, lucasassisrosa, luchaninov, lvandeve, naveenarun, tomanthony


robotstxt's Issues

I think this comment is misleading

// Returns true iff 'url' is allowed to be fetched by any member of the

It says it returns true iff any user agent in the vector is allowed to crawl. In fact, what it appears to do is effectively collapse all rules that apply to any of the user agents in the vector into a single ruleset and then evaluate against that. That isn't always the same as any agent in the list being allowed.

e.g.

robots.txt:

User-agent: googlebot
Disallow: /foo/

If we call this method against the URL /foo/ with a vector containing both googlebot and otherbot, it will return FALSE even though otherbot is clearly allowed to crawl /foo/, because (as I understand it) it is doing the equivalent of finding all rules that apply to either user agent and collapsing them into a single ruleset like:

User-agent: googlebot
User-agent: otherbot
Disallow: /foo/

So I think the comment is misleading, but would appreciate more eyes on the question!
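
For concreteness, a small sketch of the scenario described above, assuming the RobotsMatcher API in robots.h (where AllowedByRobots takes a pointer to a vector of user-agents):

#include <iostream>
#include <string>
#include <vector>

#include "robots.h"

int main() {
  const std::string robots_body =
      "User-agent: googlebot\n"
      "Disallow: /foo/\n";
  const std::string url = "https://example.com/foo/";

  // Checked on its own, otherbot has no matching group, so it is allowed.
  googlebot::RobotsMatcher single;
  std::cout << "otherbot alone: "
            << single.OneAgentAllowedByRobots(robots_body, "otherbot", url)
            << std::endl;

  // Checked together, rules from every group that matches any of the agents
  // are applied, so (per the behaviour described in this issue) the combined
  // call reports "disallowed" even though otherbot alone is not.
  googlebot::RobotsMatcher combined;
  const std::vector<std::string> agents = {"googlebot", "otherbot"};
  std::cout << "googlebot+otherbot: "
            << combined.AllowedByRobots(robots_body, &agents, url)
            << std::endl;
  return 0;
}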

Consider a WASM build

Noticing that this is getting ported to Go, Rust, and others, would it be worth integrating a WebAssembly (WASM) build into the process?

Combination of Crawl-delay and badbot Disallow results in blocking of Googlebot

For example, Googlebot gets blocked by the following robots.txt (you can check it in Google's robots.txt testing tool):

# Slow down bots
User-agent: *
Crawl-delay: 10

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

If you remove the Crawl-delay directive, Googlebot will pass. This works:

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

And this too:

# Disallow: Badbot
User-agent: badbot
Disallow: /

If you would like to use the Crawl-delay directive without blocking Googlebot, you must add an Allow directive:

# Slow down bots
User-agent: *
Crawl-delay: 10

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

# allow explicitly all other bots (supported only by google and bing)
User-agent: *
Allow: /

Both Crawl-delay and Allow are unofficial directives. Crawl-delay is widely supported (except by Googlebot). Allow is supported only by Googlebot and Bingbot (AFAIK). Normally Googlebot should be allowed by all of the robots.txt files above. E.g. if you choose Adsbot-Google in the mentioned Google tool, it will pass for all of them; all other Google bots fail in the same way. We first noticed this unexpected behaviour at the end of 2021.

Is this a mistake in Googlebot's parsing of robots.txt, or am I just missing something?
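
To check this against the open-source parser itself (rather than the online testing tool), the bundled binary from the Quickstart above can be pointed at the same file; command shape only, the verdict is whatever the library reports:

$ bazel run robots_main -- ./robots.txt Googlebot https://example.com/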

Special characters * and $ not matched in URI

Section 2.2.3 Special Characters contains two examples about path matching for paths containing the special characters * and $. The two characters are percent-encoded in the allow/disallow rule but not encoded in the URL/URI to be matched. Looks like the robots.txt parser and matcher does not follow the examples in the RFC here and fails to match the percent-encoded characters in the rule with the unencoded ones in the URI. See the unit test below.

* and $ are among the reserved characters in URIs (RFC 3986, section 2.2) and therefore cannot be percent-encoded without potentially changing the semantics of the URI.

diff --git a/robots_test.cc b/robots_test.cc
index 35853de..3a37813 100644
--- a/robots_test.cc
+++ b/robots_test.cc
@@ -492,6 +492,19 @@ TEST(RobotsUnittest, ID_SpecialCharacters) {
     EXPECT_FALSE(
         IsUserAgentAllowed(robotstxt, "FooBot", "http://foo.bar/foo/quz"));
   }
+  {
+    const absl::string_view robotstxt =
+        "User-agent: FooBot\n"
+        "Disallow: /path/file-with-a-%2A.html\n"
+        "Disallow: /path/foo-%24\n"
+        "Allow: /\n";
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/file-with-a-*.html"));
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/foo-$"));
+  }
 }
 
 // Google-specific: "index.html" (and only that) at the end of a pattern is

bazel test failed with `bazelisk`: Repository '@bazel_skylib' is not defined

I installed Bazelisk for macOS with brew (https://bazel.build/install/bazelisk) and ran bazel test :robots_test. I received the following error:

ERROR: Analysis of target '//:robots_test' failed; build aborted: error loading package '@com_google_absl//absl': Unable to find package for @bazel_skylib//lib:selects.bzl: The repository '@bazel_skylib' could not be resolved: Repository '@bazel_skylib' is not defined.
INFO: Elapsed time: 0.742s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)
    currently loading: @com_google_absl//absl
    Fetching @local_config_xcode; fetching

Docker image for robots

Docker image

I have created an Alpine Linux-based Docker image that includes a binary of google/robotstxt. If you are considering publishing it officially, please let me know. I believe I can contribute to this project.

Dockerfile: https://github.com/peaceiris/docker-images/tree/main/images/robotstxt

Usage

docker run

$ wget https://github.com/robots.txt
$ docker run --rm -v ./robots.txt:/root/robots.txt peaceiris/robotstxt:v1.0.0 /root/robots.txt Googlebot "https://github.com/google"
user-agent 'Googlebot' with URI 'https://github.com/google': ALLOWED

GitHub Actions

name: robots.txt

on:
  workflow_dispatch:
  pull_request:

jobs:
  validate:
    runs-on: ubuntu-22.04
    timeout-minutes: 3
    permissions:
      contents: read
    container:
      image: ghcr.io/peaceiris/robotstxt:v1.0.0
    steps:
      - uses: actions/checkout@v4
      - name: Validate
        run: |
          # Validate a local robots.txt
          robots ./assets/robots.txt Googlebot "https://github.com/"

          # Download and validate
          curl -s https://github.com/robots.txt --output robots.txt
          robots robots.txt Googlebot "https://github.com/google"

CMAKE compilation not working on Ubuntu 16.04

I'm on Ubuntu 16.04.7 LTS and I tried compiling this project using CMake.

But I got the following problematic output when running the cmake command:

deploy@yamada:~/robotstxt/c-build⟫ cmake .. -DROBOTS_BUILD_TESTS=ON
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/deploy/robotstxt/c-build/libs
Scanning dependencies of target googletest
[  5%] Creating directories for 'googletest'
[ 11%] Performing download step (git clone) for 'googletest'
-- Avoiding repeated git clone, stamp file is up to date: '/home/deploy/robotstxt/c-build/libs'
[ 16%] No patch step for 'googletest'
[ 22%] Performing update step for 'googletest'
fatal: Needed a single revision
invalid upstream GIT_PROGRESS/master
No rebase in progress?
CMake Error at /home/deploy/robotstxt/c-build/libs/googletest-prefix/tmp/googletest-gitupdate.cmake:105 (message):


  Failed to rebase in: '/'.

  You will have to resolve the conflicts manually


CMakeFiles/googletest.dir/build.make:95: recipe for target 'googletest-prefix/src/googletest-stamp/googletest-update' failed
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/googletest.dir/all' failed
make[2]: *** [googletest-prefix/src/googletest-stamp/googletest-update] Error 1
make[1]: *** [CMakeFiles/googletest.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
CMake Error at CMakeLists.txt:73 (MESSAGE):
  Failed to download dependencies: 2


-- Configuring incomplete, errors occurred!
See also "/home/deploy/robotstxt/c-build/CMakeFiles/CMakeOutput.log".

Note: My version of git is 2.7.4.

Do you have any idea why this is not working?

Create git tag for the first release

I think it would be useful to specify which version users are building the robots tool with for the sake of reproducibility. Could you create a Git tag or GitHub release to indicate the release timing?

Update README build requirements

The README says that we should have a compatible C++ compiler supporting at least C++11 installed, but when I ran the tests, I got an error saying that C++ versions less than C++14 are not supported.

Genetic Cloning

Let's say a rogue nation state uses this as "entropy" for human cloning.

I do a text stream, and "fuzz" the output to your input.

:(

???

Allow wider range of chars for valid user-agent identifier / 'product' token

Hi,

I've just fixed an issue reported against my Go port of this library — according to the specs used (by the Google library here), valid chars for a user-agent string are "a-zA-Z_-", but RFC7231 defines the 'product' part of user-agent as being a 'token', defined as:

  token          = 1*tchar

  tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                 / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~" 
                 / DIGIT / ALPHA
                 ; any VCHAR, except delimiters

I think it's important to fix this, because, according to Wikipedia's robots.txt, there are bots in the wild that are using user-agent strings with characters outside of the characters permitted by the current RobotsMatcher::ExtractUserAgent implementation - which means that 'disallow' directives that would otherwise match are in fact failing to match. (Examples include: MJ12bot and k2spider)

See jimsmart/grobotstxt#4 for further details.

HTH
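
For reference, a rough sketch of what an RFC 7231 tchar-based product-token check could look like. This is illustrative only, not the library's current ExtractUserAgent implementation, and the helper names here are made up:

#include <cctype>
#include <cstring>
#include <string>

// token = 1*tchar
// tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
//         "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
static bool IsTchar(char c) {
  if (std::isalnum(static_cast<unsigned char>(c))) return true;
  return c != '\0' && std::strchr("!#$%&'*+-.^_`|~", c) != nullptr;
}

// Returns the leading RFC 7231 product token of 'user_agent'.
// Example: for "MJ12bot/v1.4.8" this returns "MJ12bot" ('/' is not a tchar),
// whereas a matcher restricted to [a-zA-Z_-] would stop at the first digit
// and see only "MJ".
static std::string ExtractProductToken(const std::string& user_agent) {
  size_t i = 0;
  while (i < user_agent.size() && IsTchar(user_agent[i])) ++i;
  return user_agent.substr(0, i);
}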

Issues with Bazel build

I get the following errors, which seem to relate to changes in Bazel (in particular referring to this issue):

ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: in alias rule @bazel_tools//platforms:windows: Constraints from @bazel_tools//platforms have been removed. Please use constraints from @platforms repository embedded in Bazel, or preferably declare dependency on https://github.com/bazelbuild/platforms. See https://github.com/bazelbuild/bazel/issues/8622 for details.
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/bazel_tools/platforms/BUILD:89:6: Analysis of target '@bazel_tools//platforms:windows' failed
ERROR: /private/var/tmp/_bazel_willcritchlow/04efdbe0626597a3c8cf0aa15fc82ba3/external/com_google_googletest/BUILD.bazel:57:11: errors encountered resolving select() keys for @com_google_googletest//:gtest
ERROR: Analysis of target '//:robots_test' failed; build aborted: 
INFO: Elapsed time: 0.378s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 1 target configured)
    currently loading: @com_google_absl//absl
ERROR: Couldn't start the build. Unable to run tests

An encoding test does not appear to match the RFC?

The first ID_Encoding test caught me by surprise, since it does not appear to match the RFC:

  // /foo/bar?baz=http://foo.bar stays unencoded.
  {
    const absl::string_view robotstxt =
        "User-agent: FooBot\n"
        "Disallow: /\n"
        "Allow: /foo/bar?qux=taz&baz=http://foo.bar?tar&par\n";
    EXPECT_TRUE(IsUserAgentAllowed(
        robotstxt, "FooBot",
        "http://foo.bar/foo/bar?qux=taz&baz=http://foo.bar?tar&par"));
  }

However, section 2.2.2 of the REP RFC seems to indicate that /foo/bar?baz=http://foo.bar should be encoded as /foo/bar?baz=http%3A%2F%2Ffoo.bar.

I can't decide if I'm mis-reading the RFC or if the test intentionally deviates from the RFC in this case.

Thanks!

Library ported to Go

Hi,

Just to let folk know that I have ported this library from its original C++ into Go.

https://github.com/jimsmart/grobotstxt

My conversion includes 100% of the original library's functionality, and 100% of the tests.

Because I approached this conversion function-by-function, some of the resulting code is not necessarily idiomatic Go — but, in places, I have made some cleanups, including renaming a few things.

But otherwise my package is a faithful reproduction of the code in this here repo.

I have licensed my code with Apache 2.0, as per this repo, and I have preserved existing copyright notices and licensing within the appropriate source files.

— Regarding this last matter, as my code is technically a derivative of this repo's code, would someone here please care to check my project with regards to the above-mentioned licensing requirements, to ensure that what I have done is correct? Many thanks.

/Jim

Crawl-Delay support?

There's a commonly-supported optional field, Crawl-Delay, which indicates the requested minimum time between bot requests on a site. It would be really nice if this library could parse that and provide a function to query the crawl delay for a specified user-agent.
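
A rough sketch of one way this could be approached with the existing parser: subclass the parse callback interface declared in robots.h and pick Crawl-delay out of HandleUnknownAction. The method names are taken from the current header and may differ between versions; this is an illustration of the idea, not an official feature:

#include <iostream>
#include <string>

#include "absl/strings/match.h"
#include "absl/strings/numbers.h"
#include "absl/strings/string_view.h"
#include "robots.h"

class CrawlDelayCollector : public googlebot::RobotsParseHandler {
 public:
  void HandleRobotsStart() override {}
  void HandleRobotsEnd() override {}
  void HandleUserAgent(int line_num, absl::string_view value) override {
    current_agent_ = std::string(value);
  }
  void HandleAllow(int line_num, absl::string_view value) override {}
  void HandleDisallow(int line_num, absl::string_view value) override {}
  void HandleSitemap(int line_num, absl::string_view value) override {}
  void HandleUnknownAction(int line_num, absl::string_view action,
                           absl::string_view value) override {
    double seconds = 0;
    // Crawl-delay is not a recognized directive, so it arrives here.
    if (absl::EqualsIgnoreCase(action, "crawl-delay") &&
        absl::SimpleAtod(value, &seconds)) {
      std::cout << "Crawl-delay " << seconds << " for user-agent group '"
                << current_agent_ << "'" << std::endl;
    }
  }

 private:
  std::string current_agent_;
};

// Usage sketch: googlebot::ParseRobotsTxt(robots_body, &collector);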

A rust ported robotstxt

I have recently ported this library to Rust. The Rust version keeps the same behavior as the original library, provides a consistent API, and passes 100% of the C++ test cases via Rust FFI.

The Rust version is also licensed under Apache 2.0, and I have preserved the existing copyright notices and licensing within the appropriate source files.

Here is my repository: https://github.com/Folyd/robotstxt

Installation for robots using Homebrew (brew)

I have created a Homebrew Formula to install robots on macOS and Linux using Homebrew. If you are considering adding it to homebrew-core, please let me know. I believe I can be of assistance.

brew install peaceiris/tap/robots
$ curl -s "https://github.com/robots.txt" --output robots.txt
$ robots robots.txt Googlebot "https://github.com/google"
user-agent 'Googlebot' with URI 'https://github.com/google': ALLOWED

The tap source: https://github.com/peaceiris/homebrew-tap

User-agent names in test ID_UserAgentValueCaseInsensitive to follow the standard

The robots.txt in the test "ID_UserAgentValueCaseInsensitive" (robots_test.cc, line 200) uses user-agent names that violate the standard, as they include whitespace (FOO BAR, foo bar, FoO bAr). User-agent names in the test are matched up to the first whitespace, a Google-specific feature following the comments in the test GoogleOnly_AcceptUserAgentUpToFirstSpace. What about using standard-conformant names in the user-agent directives, e.g. FOO or FOOBAR?
