
temoto / robotstxt


The robots.txt exclusion protocol implementation for Go language

License: MIT License

Go 96.53% Shell 3.47%
go golang golang-library robots-txt web production-ready go-library status-active

robotstxt's Introduction

What

This is a robots.txt exclusion protocol implementation for the Go language (golang).

Build

To build and run the tests, run go test in the source directory.

Contribute

Warm welcome.

  • If desired, add your name in README.rst, section Who.
  • Run script/test && script/clean && echo ok
  • You can ignore linter warnings, but everything else must pass.
  • Send your change as a pull request or as a regular patch to the current maintainer (see section Who).

Thank you.

Usage

As usual, no special installation is required, just

import "github.com/temoto/robotstxt"

run go get and you're ready.

1. Parse

First of all, you need to parse the robots.txt data. You can do it with FromBytes(body []byte) (*RobotsData, error) or its string counterpart FromString:

robots, err := robotstxt.FromBytes([]byte("User-agent: *\nDisallow:"))
robots, err := robotstxt.FromString("User-agent: *\nDisallow:")

As of 2012-10-03, FromBytes is the most efficient method; everything else is a wrapper around this core function.

There are a few convenience constructors for various purposes:

  • FromResponse(*http.Response) (*RobotsData, error) to initialize robots data from an HTTP response. It does not call response.Body.Close():

robots, err := robotstxt.FromResponse(resp)
resp.Body.Close()
if err != nil {
    log.Println("Error parsing robots.txt:", err.Error())
}
  • FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error) or FromStatusAndString if you prefer to read the bytes (string) yourself. Passing the status code applies the following logic, in line with Google's interpretation of robots.txt files (see the sketch after this list):

  • status 2xx -> parse body with FromBytes and apply rules listed there.
  • status 4xx -> allow all (even 401/403, as recommended by Google).
  • other (5xx) -> disallow all, consider this a temporary unavailability.
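
Below is a minimal sketch of this flow, fetching a robots.txt file and letting FromStatusAndBytes apply the status-code logic above. The URL and bot name are placeholders.

package main

import (
	"io"
	"log"
	"net/http"

	"github.com/temoto/robotstxt"
)

func main() {
	resp, err := http.Get("https://example.com/robots.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// 2xx parses the body, 4xx allows all, 5xx disallows all.
	robots, err := robotstxt.FromStatusAndBytes(resp.StatusCode, body)
	if err != nil {
		log.Fatal(err)
	}

	log.Println("allowed:", robots.TestAgent("/private/", "FooBot"))
}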

2. Query

Parsing robots.txt content builds a kind of logic database, which you can query with (r *RobotsData) TestAgent(url, agent string) bool.

Passing the agent explicitly is useful if you want to query for different agents. For a single agent there is a more efficient option: RobotsData.FindGroup(userAgent string) returns a structure with a .Test(path string) method and a .CrawlDelay time.Duration field.

A simple query with an explicit user agent. Each call scans all rules.

allow := robots.TestAgent("/", "FooBot")

Or query several paths against the same user agent for better performance.

group := robots.FindGroup("BarBot")
group.Test("/")
group.Test("/download.mp3")
group.Test("/news/article-2012-1")

Who

Honorable contributors (in no particular order):

  • Ilya Grigorik (igrigorik)
  • Martin Angers (PuerkitoBio)
  • Micha Gorelick (mynameisfiber)

Initial commit and other: Sergey Shepelev [email protected]

Flair

https://travis-ci.org/temoto/robotstxt.svg?branch=master https://goreportcard.com/badge/github.com/temoto/robotstxt

robotstxt's People

Contributors

brainm, igrigorik, lgtm-com[bot], mna, moredure, paulvollmer, temoto


robotstxt's Issues

Incorrect usage of $ symbols

Hi, I hit a problem when trying to adhere to https://developer.mozilla.org/robots.txt.

It contains lines like Disallow: /*$history, but this pkg still allows links like /en-US/docs/Web/HTML/Element/blink$history when it shouldn't.

I see you're following the Google recommendations and correctly parse and use the $ as an end-of-pattern anchor, so obviously I blame MDN for incorrect usage of it in their robots file.

Not sure how to handle this and if it's of any relevance to you at all?

No error is returned by FromResponse() when the file is incorrect

I've been working on a project which makes use of this library. I thank you greatly for its usefulness. However, I've been encountering a problem. example.com/robots.txt is, as it turns out, a redirect. Rather than throwing an error, FromResponse() seems to produce an object from which FindGroup().Test() causes a runtime panic.

To illustrate:

package main

import (
    "net/http"
    "github.com/temoto/robotstxt.go"
)

func main() {
    resp, err := http.Get("http://example.com/robots.txt")
    if err != nil {
        //This is not reached.
        println(err.Error())
    }
    robots, err := robotstxt.FromResponse(resp)
    if err != nil {
        //This is also not reached. This is the problem!
        println(err.Error())
    }

    //group will be nil after this line.
    group := robots.FindGroup("MyBot")

    //This will cause a panic.
    print(group.Test("/"))
}
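
A defensive workaround sketch (assuming the same imports as the snippet above; this is not the library's documented contract, just a guard):

robots, err := robotstxt.FromResponse(resp)
if err != nil {
	// Parsing failed; do not use the result.
	println("robots.txt error:", err.Error())
	return
}
if robots == nil {
	// Guard against a nil *RobotsData before calling FindGroup,
	// which would otherwise panic.
	return
}
group := robots.FindGroup("MyBot")
print(group.Test("/"))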

Parse rules for a given user agent

I'm wondering if it's worth optimizing memory-usage a bit by parsing just a single ruleset for a given user agent, so the signature might be one of:

rules, err := robotstxt.ForAgent(buf, "mybot")
rules, err := robotstxt.ParseAgent(buf, "mybot")

The parser would skip all non-matching user-agents (except for *); if a ruleset for mybot was found, it would return that ruleset, otherwise it would return the default * ruleset.

The method could accept multiple user agents, so for example a search engine crawler might do:

rules, err := robotstxt.Parse(buf, "Searchbot", "Googlebot")
// rules is the Searchbot ruleset if present,
// falling back to Googlebot,
// then falling back to *

LMK if you would consider a PR that implements this feature.

Parser not getting sitemap from robots.txt

This URL - https://www.zendesk.com/robots.txt - contains a sitemap, but the data is not getting collected.

Here's my code

resp, err := http.Get("https://www.zendesk.com/robots.txt")
if err != nil {
	log.Fatal(err)
}

robots, err := robotstxt.FromResponse(resp)
resp.Body.Close()
if err != nil {
	log.Fatal(err)
}

Collected Info
&{map[] true false []}

Please advise.
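
For reference, the parsed data does record Sitemap: lines in current versions of the library. A minimal sketch, assuming a Sitemaps []string field on RobotsData (verify against your version's API):

robots, err := robotstxt.FromString("Sitemap: https://www.example.com/sitemap.xml\nUser-agent: *\nDisallow:")
if err != nil {
	log.Fatal(err)
}
for _, sm := range robots.Sitemaps {
	fmt.Println("sitemap:", sm)
}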

Separate DNS resolve from connect

  • Separate resolve timeout
  • Limit concurrent resolves
  • Maybe cache IP addresses. Think about it, test a prototype; it may have a very low hit rate, making it inefficient. Running a local caching DNS server may be better.

Groups are not handled correctly

Based on my reading on robots.txt, only one group (a User-agent: line followed by one or more allow/disallow lines) can be applied for a given user-agent string. From Google's robots.txt spec:

"Only one group of group-member records is valid for a particular crawler. The crawler must determine the correct group of records by finding the group with the most specific user-agent that still matches. All other groups of records are ignored by the crawler. The user-agent is non-case-sensitive."

This is coherent with examples on robotstxt.org:

To allow a single robot

User-agent: Google
Disallow:

User-agent: *
Disallow: /

This case doesn't work as expected with the library (TestAgent("/", "Google") returns false). From a quick glance at the code, this seems to be because it keeps looking for matching rules (regardless of the specificity of the match, i.e. "*" versus a longer part of the user-agent string) until a disallow is found.

I will try to send a pull request your way in the coming days.

Resources:

http://www.robotstxt.org/robotstxt.html
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

TestAgent and Test (for the same user-agent) gives different results in case of temporary error when fetching the robots.txt file

diff --git a/robotstxt_test.go b/robotstxt_test.go
index 6ccb730..6cbda57 100644
--- a/robotstxt_test.go
+++ b/robotstxt_test.go
@@ -291,3 +291,38 @@ func newHttpResponse(code int, body string) *http.Response {
 		ContentLength: int64(len(body)),
 	}
 }
+
+func TestDisallowAll(t *testing.T) {
+	r, err := FromStatusAndBytes(500, nil) // We got a 500 response => Disallow all
+	require.NoError(t, err)
+
+	a := r.TestAgent("/", "*")
+	assert.False(t, a) // Resource access NOT allowed (EXPECTED)
+
+	b := r.FindGroup("*").Test("/")
+	assert.True(t, b) // Resource access allowed (UNEXPECTED)
+
+	assert.Equal(t, a, b) // Results for the Agent and Group tests are different...
+
+	/*
+		It's because the `disallowAll` is checked by `TestAgent` but not `Test`.
+
+		Because `TestAgent` also calls `FindGroup` internally but obfuscates the
+		value of `CrawlDelay`, users of this library might prefer to use
+		(`FindGroup` + `Test`) to have access to the `CrawlDelay` value in case the
+		path is allowed.
+
+		FindGroup -> Test (ok) -> check CrawlDelay
+
+		Unfortunately, the `Test` method does not use the `disallowAll` member set
+		on response with status in the range [500; 599]. This behavior is unexpected
+		and can lead to involuntary politeness policy violation.
+
+		Unless we resign to call `TestAgent` and `FindGroup` to get the `CrawlDelay`
+		value.
+
+		TestAgent (ok) -> FindGroup -> check CrawlDelay
+
+		This way, `FindGroup` has been called twice.
+		Is there a way to avoid it without risking politeness policy violation?
+	*/
+}

Run:

go test ./... -run TestDisallowAll

Reduce memory footprint of multiple logically same RobotsData acquired from different inputs

Currently FromResponseBytes returns a singleton RobotsData of allow-all for a 404 status code and disallow-all for 401/403 status codes. For any other input, a unique RobotsData is created even though they could share a subset of, or all, rules. Sharing all rules is equivalent to having another singleton RobotsData.

Plan:

  1. Get a representative subset of all robots.txt files (excluding non-200 responses since those are already covered)
  2. Parse
  3. Find clusters of unique rule sets, analyse the distribution
    A wild guess is that most (say 95%) fall into either "allow all" or "disallow all".

If the hypothesis is confirmed, some normalization technique could be applied to reduce the memory footprint and improve the cache locality of real-world web crawlers using the robotstxt.go library.

Possible normalizations:

  1. Post-process the array of rules after parsing and try to reduce it to predefined RobotsData singletons. Easy to implement, but useful only for predefined unique rule sets, and it relies on singletons. TODO: analyse output distribution, benchmark.
  2. Export a unique value representing the parsed rules. This does not reduce the memory footprint by itself, but provides an instrument for it, and it works for arbitrary repeating rule sets: the application may implement an arbitrary cache. One possible candidate for such a value is a pre-processed copy of the input text with comments and whitespace removed. TODO: analyse output distribution, benchmark.

These two techniques do not even conflict: post-processing parsed rules seems a worthy optimisation anyway, and exporting a unique value could further allow caching of non-trivial but still popular rule sets.

Even in the unlikely event that the rule-set distribution is closer to uniform, the distribution of individual rules must exhibit large spikes around agent=* and url=/. For that case, the library can return singleton popular rules. Now that I think of it, maintaining a few extremely popular individual Rule singletons could be a worthy optimisation on its own. TODO: benchmark.
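
The second normalization could be prototyped entirely outside the library. A rough sketch, assuming an application-level cache keyed by a hash of the comment-stripped input (all names below are illustrative, not part of the library):

package robotscache

import (
	"crypto/sha256"
	"strings"
	"sync"

	"github.com/temoto/robotstxt"
)

type Cache struct {
	mu    sync.Mutex
	items map[[32]byte]*robotstxt.RobotsData
}

func New() *Cache {
	return &Cache{items: make(map[[32]byte]*robotstxt.RobotsData)}
}

// normalize drops comments and blank lines so that logically identical
// robots.txt files map to the same cache key.
func normalize(body string) string {
	var b strings.Builder
	for _, line := range strings.Split(body, "\n") {
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if line != "" {
			b.WriteString(line)
			b.WriteByte('\n')
		}
	}
	return b.String()
}

// Parse returns a shared *RobotsData for logically identical inputs.
func (c *Cache) Parse(body string) (*robotstxt.RobotsData, error) {
	key := sha256.Sum256([]byte(normalize(body)))
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.items[key]; ok {
		return r, nil
	}
	r, err := robotstxt.FromString(body)
	if err != nil {
		return nil, err
	}
	c.items[key] = r
	return r, nil
}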

sitemaps?

Is there a way to find sitemap URLs using this library?

Crash with unicode symbols

It panics on http://perche.vanityfair.it/robots.txt. I see that there is a BOM character in the text; could that be the cause?

panic: runtime error: invalid memory address or nil pointer dereference

During crawling with https://github.com/gocolly/colly this would periodically be thrown:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x796daf]
goroutine 245473 [running]:
github.com/temoto/robotstxt.(*RobotsData).FindGroup(0x0, 0x9121e2, 0x28, 0xc42f2f4dd8)
	/mnt/hgfs/test-repo/go/src/github.com/temoto/robotstxt/robotstxt.go:166 +0x6f

I locally modified lines 166-180 of robotstxt.go as follows, and it appears to work okay now:

	if ret != nil {
		if ret == r.groups["*"] {
			// Weakest match possible
			prefixLen = 1
		}

		for a, g := range r.groups {
			if a != "*" && strings.HasPrefix(agent, a) {
				if l := len(a); l > prefixLen {
					prefixLen = l
					ret = g
				}
			}
		}
	}

List allow & disallow

Is it currently possible to just list allow and disallow paths along with their user agent without specifying a particular user agent?

Change import statement or change package name

The discrepancy between the package name and the suggested import statement is potentially confusing and problematic.

The suggested import path is "github.com/temoto/robotstxt.go", which implies that the go get call should be for the same path.

This becomes a bit of a problem when testing packages individually, because go test robotstxt.go will think we're referencing a file when we're really referencing a directory. Workarounds include using a get and import path that reflects the real package name (i.e. keeping the dash), or appending a slash to the test command (i.e. go test robotstxt.go/).

Is there a particular reason you chose to suggest importing the package as "robotstxt.go" instead of "robotstxt-go"?

Ultimately, I would suggest getting rid of the dash (and any reference to "go") from the package name altogether to avoid issues like this. That appears to be the more idiomatic approach to naming packages.

Different behavior on Google Webmaster Tools robots.txt checker and robotstxt-go

I noticed that on Google Webmaster Tools robots.txt checker, the following robots.txt:

User-agent: *
Allow: /
Allow: /blog/*
Disallow: /*/*

will allow website.com/blog/article, as well as website.com/blog/article/.

However, when tested against robotstxt-go, only website.com/blog/article is allowed through, and not website.com/blog/article/. I must add an additional line for robotstxt-go to allow the second URL through, so my robots.txt looks more like:

User-agent: *
Allow: /
Allow: /blog/*
Allow: /blog/*/
Disallow: /*/*

I'm running robotstxt-go with the GoogleBot user-agent. Any thoughts on whether this is expected behavior, or why this might be happening?

Thanks!
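
A quick reproduction sketch of the report, with the user agent and expectations taken from the issue text (assumes the robotstxt, fmt, and log packages are imported):

robots, err := robotstxt.FromString("User-agent: *\nAllow: /\nAllow: /blog/*\nDisallow: /*/*")
if err != nil {
	log.Fatal(err)
}
group := robots.FindGroup("GoogleBot")
fmt.Println(group.Test("/blog/article"))  // allowed, as in Webmaster Tools
fmt.Println(group.Test("/blog/article/")) // reported as disallowed by this library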
