gocolly / colly

Elegant Scraper and Crawler Framework for Golang

Home Page: https://go-colly.org/

License: Apache License 2.0

Go 99.37% HTML 0.63%
crawler crawling framework go golang scraper scraping spider

colly's People

Contributors

asalih, asciimoo, asood123, dependabot[bot], ferhatelmas, gsoec, guessi, gummiboll, hondajojo, i25959341, jbaxter-va, jlr52, johnzhao1208, kawakami-o3-2nd, kvch, llonchj, meehow, nange, peterhellberg, pyjac, rongyi, saladinobelisario, sdab, sharmi, sky126, smileboywtu, twiny, vosmith, wgh-, zyt312074545


colly's Issues

Warnings from golint

ID is part of the list of common initialisms used by the golint tool and thus we get warnings like this:

  • struct field Id should be ID
  • func parameter requestId should be requestID
  • func parameter collectorId should be collectorID

Unfortunately, fixing this would change the exported Collector and Request types. Would such a change even be considered?
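For illustration, the lint-compliant rename would look like this (the field type is an assumption based on the warnings above):

type Request struct {
	// Id uint32 // current exported name flagged by golint
	ID uint32 // lint-preferred name; renaming it breaks the public API
	// ...
}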

Data race

Example code:

package main

import (
	"fmt"

	"github.com/asciimoo/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Println(link)
		c.Visit(e.Request.AbsoluteURL(link))
	})

	c.Visit("https://en.wikipedia.org/")
}

Execution log:

==================
WARNING: DATA RACE
Write at 0x00c420083950 by main goroutine:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:598 +0x0
  net/textproto.MIMEHeader.Set()
      /usr/local/go/src/net/textproto/header.go:22 +0x60
  net/http.Header.Set()
      /usr/local/go/src/net/http/header.go:31 +0x60
  net/http.(*Request).AddCookie()
      /usr/local/go/src/net/http/request.go:385 +0x37c
  net/http.(*Client).send()
      /usr/local/go/src/net/http/client.go:170 +0x115
  net/http.(*Client).Do()
      /usr/local/go/src/net/http/client.go:602 +0x513
  crawler/vendor/github.com/asciimoo/colly.(*httpBackend).Do()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/http_backend.go:154 +0x105
  crawler/vendor/github.com/asciimoo/colly.(*httpBackend).Cache()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/http_backend.go:110 +0x9e
  crawler/vendor/github.com/asciimoo/colly.(*Collector).scrape()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/colly.go:226 +0x461
  crawler/vendor/github.com/asciimoo/colly.(*Collector).Visit()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/colly.go:157 +0x9b
  main.main()
      /home/skruglov/Projects/go/src/crawler/main.go:19 +0xe4

Previous read at 0x00c420083950 by goroutine 24:
  runtime.mapaccess1_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:208 +0x0
  net/http.http2isConnectionCloseRequest()
      /usr/local/go/src/net/http/h2_bundle.go:8652 +0xae
  net/http.(*http2clientConnReadLoop).endStreamError()
      /usr/local/go/src/net/http/h2_bundle.go:8288 +0xe2
  net/http.(*http2clientConnReadLoop).endStream()
      /usr/local/go/src/net/http/h2_bundle.go:8277 +0x54
  net/http.(*http2clientConnReadLoop).processData()
      /usr/local/go/src/net/http/h2_bundle.go:8267 +0x1ce
  net/http.(*http2clientConnReadLoop).run()
      /usr/local/go/src/net/http/h2_bundle.go:7896 +0x737
  net/http.(*http2ClientConn).readLoop()
      /usr/local/go/src/net/http/h2_bundle.go:7788 +0x11c

Goroutine 24 (running) created at:
  net/http.(*http2Transport).newClientConn()
      /usr/local/go/src/net/http/h2_bundle.go:7053 +0xe1a
  net/http.(*http2Transport).NewClientConn()
      /usr/local/go/src/net/http/h2_bundle.go:6991 +0x55
  net/http.(*http2addConnCall).run()
      /usr/local/go/src/net/http/h2_bundle.go:835 +0x55
==================

#mw-head
#p-search
/wiki/Wikipedia
...

Question: Handle encodings

I just tested colly and loved how fast it performed. I have a question about encodings. The docs say it automatically decodes non-Unicode responses. Can this be customized? I tried to grab content from https://www.nsd.ru/ru/db/news/ndcpress/, which is windows-1251 encoded, and the contents came back unreadable, so I was wondering how I can set up colly to grab content with a specific encoding.
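A minimal workaround sketch, assuming you decode the body yourself in OnResponse (which runs before HTML parsing) using golang.org/x/text/encoding/charmap:

c := colly.NewCollector()
c.OnResponse(func(r *colly.Response) {
	// decode windows-1251 into UTF-8 before colly parses the body
	decoded, err := charmap.Windows1251.NewDecoder().Bytes(r.Body)
	if err == nil {
		r.Body = decoded
	}
})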

Failed to login LinkedIn

Hi,

I don't get the jobs response, so it seems the login was not successful. I'm not sure what I missed; please point me to the right way to do it. Thanks.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	err := c.Post("https://www.linkedin.com/uas/login-submit", map[string]string{"session_key": "EMAIL", "session_password": "PASSWORD"})
	if err != nil {
		log.Fatal(err)
	}
	c.AllowedDomains = []string{"www.linkedin.com"}

	// attach callbacks after login
	c.OnResponse(func(r *colly.Response) {
		log.Println("response received", r.StatusCode)
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("element:", e)
		if strings.Contains(e.Attr("href"), "/jobs/view") {
			fmt.Println("replaced:", strings.Replace(e.Attr("href"), "https://www.linkedin.com/", "", -1))
			e.Request.Visit(e.Attr("href"))
		}
	})

	// start scraping
	c.Visit("https://www.linkedin.com/jobs/")
}

Expose context for visit/post

  • colly.Context should embed context.Context, or just use context.Context directly
  • collector.Visit/Post* should accept a context param, to pass extra information through
    • e.g. a POST request may need an ID for post-processing; see the sketch below
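For illustration only, the proposed signatures could look like this (hypothetical; not colly's current API):

// hypothetical signatures sketching the proposal
func (c *Collector) Visit(ctx context.Context, URL string) error
func (c *Collector) Post(ctx context.Context, URL string, requestData map[string]string) error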

Add some basic OnError support

Probably should break out OnRequestError/OnResponseError, but adding a basic OnError that receives the request, response, and error seems to make sense as a first pass.

Pull request on its way shortly

JavaScript

Does it handle SPA sites where JavaScript generates the site client-side?

Getting final request when there is a page redirect

When there is a page redirect, colly automatically follows it. In that case, I get a Request object in the OnHTML callback, but colly seems to provide the original Request, not the one after the redirect. Since I want to follow all the links on the HTML page, I use the Request object to build the absolute URL. However, this doesn't work as expected, since the Request object has the wrong URL. The example below illustrates the problem:

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	go func() {
		http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			http.Redirect(w, r, "/r/", http.StatusSeeOther)

		}))
		http.Handle("/r/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintf(w, `<a href="test">test</a>`)
		}))
		http.ListenAndServe("127.0.0.1:9999", nil)
	}()
	time.Sleep(500 * time.Millisecond)
	c := colly.NewCollector()
	c.AllowedDomains = []string{"127.0.0.1:9999"}
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println(e.Request.AbsoluteURL(e.Attr("href")))
	})
	c.Visit("http://127.0.0.1:9999/")
	c.Wait()
	time.Sleep(1000 * time.Hour)
}

The example prints "http://127.0.0.1:9999/test". However, when I go to "http://127.0.0.1:9999" in Firefox and click the link, I get redirected to "http://127.0.0.1:9999/r/test".

Is there a better way to mimic the behavior of the browser in this case?

Runtime Error: invalid memory address or nil pointer dereference

Hey mate!

I'm loving colly so far. I'm new to the Go programming language and I've just been messing around with your scraping library and found a weird bug.

I was just testing out scraping my website, and then allowing the scraper to scrape Medium. I end up with this error (screenshot omitted): panic: runtime error: invalid memory address or nil pointer dereference.
(I'm using Go 1.9 on Linux x86.)

This is the code:

package main

import (
	"fmt"

	"github.com/asciimoo/colly"
)

func main() {
	scraper := colly.NewCollector()
	scraper.AllowedDomains = []string{"onslow.io", "medium.com"}

	scraper.OnHTML("a[href]", func(element *colly.HTMLElement) {
		link := element.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", element.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		go scraper.Visit(element.Request.AbsoluteURL(link))
	})

	scraper.OnError(func(request *colly.Response, err error) {
		fmt.Println("Request URL:", request.Request.URL, "failed with response:", request, "\nError:", err)
	})

	scraper.OnRequest(func(request *colly.Request) {
		fmt.Println("Visiting", request.URL.String())
	})

	scraper.Visit("http://onslow.io")
	scraper.Wait()
}

From what I've gathered, it may have to do with the goroutines not syncing properly?

If you have any other ideas on the cause of this, it'd be great to hear them!

Cheers

How to ignore expired SSL certificates in Colly?

Since I need to visit an unsafe website over HTTPS, the Post/Get methods return an error: x509: certificate has expired or is not yet valid.
I know I can set InsecureSkipVerify: true when starting a request with the net/http package, but what should I do to skip the SSL certificate check in colly?
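A minimal sketch using the collector's WithTransport with a custom TLS config (assuming crypto/tls and net/http are imported; this disables certificate verification, so use it with care):

c := colly.NewCollector()
c.WithTransport(&http.Transport{
	// skip certificate verification for every request made by this collector
	TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
})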

c.Wait() never returns after the program has finished crawling; this may be a bug

package main

import (
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	urls := []string{
		"https://weibo.cn/repost/FBrYpiw8h?uid=1153760245&rl=1", "https://weibo.cn/repost/FBrXSqrIl?uid=2137005731&rl=1", "https://weibo.cn/repost/FBrXOlMmQ?uid=5131689041&rl=1", "https://weibo.cn/repost/FBrXJBCQs?uid=1701023441&rl=1", "https://weibo.cn/repost/FBrXg4ZuX?uid=5999431007&rl=1", "https://weibo.cn/repost/FBrXcuadg?uid=5819066338&rl=1", "https://weibo.cn/repost/FBrWEgEor?uid=3517902151&rl=1", "https://weibo.cn/repost/FBrWmuTYh?uid=2974402113&rl=1", "https://weibo.cn/repost/FBrVZtT1p?uid=5533885122&rl=1",
		"https://weibo.cn/repost/FBrVrqA5T?uid=1613781965&rl=1", "https://weibo.cn/repost/FBrYpiw8h?uid=1153760245&rl=1", "https://weibo.cn/repost/FBrXSqrIl?uid=2137005731&rl=1", "https://weibo.cn/repost/FBrXOlMmQ?uid=5131689041&rl=1", "https://weibo.cn/repost/FBrXJBCQs?uid=1701023441&rl=1", "https://weibo.cn/repost/FBrXg4ZuX?uid=5999431007&rl=1", "https://weibo.cn/repost/FBrXcuadg?uid=5819066338&rl=1", "https://weibo.cn/repost/FBrWEgEor?uid=3517902151&rl=1", "https://weibo.cn/repost/FBrWmuTYh?uid=2974402113&rl=1",
		"https://weibo.cn/repost/FBrVZtT1p?uid=5533885122&rl=1", "https://weibo.cn/repost/FBrVrqA5T?uid=1613781965&rl=1", "https://weibo.cn/repost/FBrUXncEG?uid=5046939400&rl=1",
	}

	// Instantiate default collector
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
		// Attach a debugger to the collector
		colly.Debugger(&debug.LogDebugger{}),
	)
	c.SetRequestTimeout(2 * time.Second)

	for _, v := range urls {
		c.Visit(v)
	}

	c.Wait()
}

ChildAttr only returns one child, not all children

[Not really an issue]
Hey mate

I've been using Colly for a small scraping project and I've come across a weird bit of behaviour.

The e.ChildText() function returns the text in all of the children as one string. However, using e.ChildAttr() only returns the first match. I read through the code in colly.go and understand this is the intended behaviour, but I was wondering why you wouldn't want to return all child attributes?
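A workaround sketch for collecting every matching child's attribute, assuming the element's ForEach helper is available in your version (the selector and attribute name are illustrative):

// collect all child href attributes instead of only the first match
attrs := []string{}
e.ForEach("a", func(_ int, el *colly.HTMLElement) {
	attrs = append(attrs, el.Attr("href"))
})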

Loving this package though, it's been a lot of fun to use. Thank you for keeping it up to date!
Cheers

Suggestion: Functional options for NewCollector

Implementing functional options for the NewCollector constructor would let the user set up the collector without manually setting field values on *Collector.

This would be a non-breaking change, since the options would be a variadic argument to NewCollector.

One example option could add a domain to the AllowedDomains field: NewCollector(AllowDomain("example.com")). A sketch of the pattern follows.
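A minimal sketch, assuming a hypothetical AllowDomain option (shown only to illustrate the proposal, not colly's actual API):

// functional options: each option mutates the collector under construction
type Option func(*Collector)

func AllowDomain(domains ...string) Option {
	return func(c *Collector) {
		c.AllowedDomains = append(c.AllowedDomains, domains...)
	}
}

func NewCollector(options ...Option) *Collector {
	c := &Collector{} // defaults would be applied here first
	for _, opt := range options {
		opt(c)
	}
	return c
}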

What do you think?

fixCharset() also has a bug: the response header Content-Type is "text/html" without a charset, but the real charset is GBK.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.Headers)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Headers)
	})

	c.Visit("https://weibo.cn/repost/FByvKgel6?uid=6049100503&rl=1")
}
Visiting &map[User-Agent:[colly - https://github.com/gocolly/colly]]
Visited &map[Content-Type:[text/html] Connection:[keep-alive] Vary:[Accept-Encoding] Expires:[Sat, 26 Jul 1997 05:00:00 GMT] Dpool_header:[luna139] Pragma:[no-cache] Sina-Lb:[aGEuMjAyLmcxLnloZy5sYi5zaW5hbm9kZS5jb20=] Server:[nginx/1.6.1] Date:[Thu, 28 Dec 2017 13:42:51 GMT] Cache-Control:[no-cache, must-revalidate] Sina-Ts:[N2FiMjljY2UgMCAxIDEgMiA2Cg==]]
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "" -> javascript:;
Link found: "\xbb\xbbһ\xd5\xc5" -> javascript:;
Link found: "\xb5\xc7¼" -> javascript:;
Link found: "\xb5\xda\xc8\xfd\xb7\xbd\xd5ʺ\xc5" -> https://passport.weibo.cn/signin/other?r=http%3A%2F%2Fweibo.cn
Link found: "ע\xb2\xe1\xd5ʺ\xc5" -> http://m.weibo.cn/reg/index?&vt=4&wm=3349&wentry=&backURL=http%3A%2F%2Fweibo.cn
Link found: "\xcd\xfc\xbc\xc7\xc3\xdc\xc2\xeb" -> https://passport.weibo.cn/forgot/forgot?entry=wapsso&from=0
Link found: "ȡ\xcf\xfb" -> javascript:;
Link found: "\xd1\xe9֤\xc2\xeb\xb5\xc7¼" -> javascript:;
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "ȷ\xc8\xcf" -> javascript:;
Link found: "ʹ\xd3\xc3\xc6\xe4\xcb\xfb\xd5ʺŵ\xc7¼" -> javascript:;

Runtime error: invalid memory address or nil pointer dereference (occurs on 32-bit x86)

package main

import (
	"github.com/asciimoo/colly"
)

func main() {
	c := colly.NewCollector()
	c.Visit("https://www.google.com")
}

If built for the 386 architecture, it crashes:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x4012bc]

goroutine 1 [running]:
sync/atomic.AddUint64(0x1284e304, 0x1, 0x0, 0x128c74c0, 0x0)
d:/soft/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/asciimoo/colly.(*Collector).scrape(0x1284e280, 0x73acb0, 0x16, 0x72f0c5, 0x3, 0x1, 0x0, 0x0, 0x128212f8, 0x0, ...)
d:/go/src/github.com/asciimoo/colly/colly.go:244 +0x244
github.com/asciimoo/colly.(*Collector).Visit(0x1284e280, 0x73acb0, 0x16, 0x69320b, 0x693524)
d:/go/src/github.com/asciimoo/colly/colly.go:175 +0x6b
main.main()
d:/go/src/playground/main.go:9 +0x31

From https://golang.org/pkg/sync/atomic/

On both ARM and x86-32, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.
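A sketch of the usual fix (the field name is illustrative, not colly's actual layout): keep atomically accessed 64-bit fields first in the struct so they are guaranteed 64-bit aligned on 386/ARM.

type Collector struct {
	// requestCount is updated with sync/atomic and must stay the first
	// field so it is 64-bit aligned on 32-bit platforms
	requestCount uint64
	// ... other fields follow
}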

Error: gzip: invalid header

package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
	"time"
)

func main() {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"),
		colly.AllowedDomains("irby.kz"),
		colly.Async(true),
		colly.Debugger(&debug.LogDebugger{}),
	)
	c.DisableCookies()

	c.Limit(&colly.LimitRule{
		Parallelism: 2,
		Delay:       1 * time.Second,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Set error handler
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
	})

	// Set HTML callback
	// Won't be called if error occurs
	c.OnHTML("*", func(e *colly.HTMLElement) {
		fmt.Println(e)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	c.Visit("http://irby.kz/ru/catalog/dlya_devochek/?SHOW_ALL=Y")
	c.Visit("http://irby.kz/ru/catalog/dlya_malchikov_1/?SHOW_ALL=Y")

	c.Wait()
}

Multipart support

Do you have any plans to support multipart POST requests?
I will code this feature later.

URL Filter to exclude

Currently you can specify URL filters to include URLs; is there any way to exclude URLs?
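A workaround sketch, assuming a hypothetical exclusion regexp and aborting matching requests in OnRequest (Request.Abort cancels a request before it is sent):

exclude := regexp.MustCompile(`/private/`) // hypothetical pattern to exclude
c.OnRequest(func(r *colly.Request) {
	if exclude.MatchString(r.URL.String()) {
		r.Abort() // cancel the request before it is sent
	}
})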

Custom same-request detection

Currently, visited-request tracking uses only the URL to detect duplicate requests. The collector should accept a custom request-uniqueness function (e.g. func(r *Request) string), because POST requests use an ID in the form data to distinguish requests, not just the URL.

BTW, if you accept PRs, I can contribute some work.

I think c.Limit could also be used to filter out non-matching URLs; it should support a regex check before any request is made.

The purpose is to filter out the mismatched URLs, for example:
package main

import (
	"fmt"
	"time"

	"github.com/asciimoo/colly"
)

func main() {
	urls := []string{"https://httpbin.org/hello", "https://httpbin.org/123", "https://httpbin.org/12xyz"}

	// Instantiate default collector
	c := colly.NewCollector()

	// when visiting links whose domains match the glob
	c.Limit(&colly.LimitRule{
		DomainGlob: "*https://httpbin.org/[a-z]+",
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL, time.Now())
	})

	for _, i := range urls {
		c.Visit(i)
	}

	c.Wait()
}
Starting https://httpbin.org/hello
Starting https://httpbin.org/123
Starting https://httpbin.org/12xyz
Finished https://httpbin.org/hello

trying to measure request timing

I'm doing some timing during a crawl, recording the start time in Ctx in OnRequest and calculating a duration in OnResponse. It all works very well until I try to throttle the crawler with a Limit with Delay, since the sleep is called by the backend after the OnRequest callbacks.

Is there another callback that can be used, or would you consider moving the sleep to before OnRequest is called?
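A sketch of the timing approach described above (the "start" key is arbitrary; assumes Ctx stores arbitrary values via Put/GetAny):

c.OnRequest(func(r *colly.Request) {
	r.Ctx.Put("start", time.Now()) // record the start time per request
})
c.OnResponse(func(r *colly.Response) {
	start := r.Ctx.GetAny("start").(time.Time)
	fmt.Println(r.Request.URL, "took", time.Since(start))
})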

Randomized Delay

Is it possible to create randomized delays, i.e., per-request delays selected from some range or based on some random factor? I couldn't think of a good way to do this, other than maybe cycling through several Collectors with different limit sets, which seems sub-optimal.

It seems like having LimitRule.DelayRange or LimitRule.RandomFactor options would be quite helpful.
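A hypothetical sketch of what such an option could look like (RandomDelay is an assumption, not a confirmed field in this version):

c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Delay:       2 * time.Second,
	RandomDelay: 1 * time.Second, // up to 1s of extra random jitter per request
})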

How to set up only one request to use the proxy?

I found

rp, err := proxy.RoundRobinProxySwitcher("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
if err != nil {
	log.Fatal(err)
}
c.SetProxyFunc(rp)

but if I have ten URLs to request, this way all requests use the proxy. I only want one request to use the proxy; what should I do?
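A sketch of a custom proxy function that proxies only one specific host (the host and proxy address are hypothetical):

c.SetProxyFunc(func(req *http.Request) (*url.URL, error) {
	if req.URL.Host == "example.com" { // only this request uses the proxy
		return url.Parse("socks5://127.0.0.1:1337")
	}
	return nil, nil // a nil proxy URL means a direct connection
})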

Add iteration api for context

When I use colly, I have a case where I need to iterate over the context's elements after putting things into it from multiple OnHTML callbacks on different HTML elements.
This is the simple function I wrote:

// ForEach iterates over the context's elements
func (c *Context) ForEach(fn func(k string, v interface{}) interface{}) []interface{} {
	c.lock.RLock()
	defer c.lock.RUnlock()

	ret := make([]interface{}, 0, len(c.contextMap))
	for k, v := range c.contextMap {
		cur := fn(k, v)
		ret = append(ret, cur)
	}

	return ret
}

Hope this can help someone when they also need to iterate context.

Setting AllowedDomains by pattern, the way Limit does, may be better

Some sites have many subdomains, so it would be convenient to set AllowedDomains by pattern match, like this:

// c.AllowedDomains = []string{"hackerspaces.org", "wiki.hackerspaces.org"}
c.AllowedDomains = []string{"*hackerspaces.org"}

Extend basic example with more callbacks

There are five main callbacks in colly and they are:

  1. OnRequest
  2. OnError
  3. OnResponse
  4. OnHTML
  5. OnScraped

We want to show readers these callbacks at the very beginning, so how about extending the basic example with all of them?
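A sketch of what the extended basic example could look like (the target URL is arbitrary):

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Got a response from", r.Request.URL)
	})
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("Link found:", e.Attr("href"))
	})
	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL)
	})

	c.Visit("http://go-colly.org/")
}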

Passing context between collectors

I was wondering how it's possible to pass context between collectors.

My use case is that I have a collector that collects links and then triggers another collector to visit each link. On the second collector, I'd like to pass some sort of context (like a parent category name that is not present in the child page's HTML), but I didn't find a way to achieve this, because Context seems to exist only for requests/responses within the same collector.
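A workaround sketch, assuming a second collector named detailCollector and handing over a pre-filled context via Collector.Request (the selectors are hypothetical):

c.OnHTML("a.category", func(e *colly.HTMLElement) {
	ctx := colly.NewContext()
	ctx.Put("category", e.Text) // carry the parent category to the child page
	detailCollector.Request("GET", e.Request.AbsoluteURL(e.Attr("href")), nil, ctx, nil)
})

detailCollector.OnHTML("div.product", func(e *colly.HTMLElement) {
	fmt.Println("category:", e.Response.Ctx.Get("category"))
})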

Race condition due to incorrect sync.WaitGroup usage

Hi,

It seems that sync.WaitGroup is not used correctly in the func (c *Collector) scrape(...) error method.

// colly.go:307
func (c *Collector) scrape(...) error {
    c.wg.Add(1)
    defer c.wg.Done()
    ...

Consider the following example (similar to http://go-colly.org/docs/examples/rate_limit/):

for _, url := range urls {
    go c.Visit(url)
}

c.Wait()

Here we call Visit (a scrape wrapper) in a separate goroutine for each URL and wait for their completion.

The problem is that it's not guaranteed that all goroutines will have finished after c.Wait() returns, since the "goroutines to wait for" count is incremented inside each new goroutine via c.wg.Add(1).

So it's up to the scheduler whether c.wg.Add(1) is called before or after c.Wait().

I think there are two ways this issue could be fixed:

  1. Let the caller handle sync.WaitGroup: just provide a pointer to it as a param.
  2. Provide an async API for scrape and its wrappers.

It could look like this:

func (c *Collector) scrapeAsync(...) <-chan error {
	errChan := make(chan error)
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()

		errChan <- c.scrape(...)
	}()
	return errChan
}

I can do a PR if you want.
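Until this is fixed, a caller-side workaround sketch: perform the Add before starting each goroutine yourself.

var wg sync.WaitGroup
for _, url := range urls {
	wg.Add(1) // increment before the goroutine starts, not inside it
	go func(u string) {
		defer wg.Done()
		c.Visit(u)
	}(url)
}
wg.Wait() // now guaranteed to cover every Visit call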

go get issue on n.Attr

On Mac with go version 1.9

Fils:collyIndexer dfils$ go get github.com/asciimoo/colly
# github.com/asciimoo/colly
../../../../github.com/asciimoo/colly/colly.go:302:16: cannot use n.Attr (type []"code.google.com/p/go.net/html".Attribute) as type []"golang.org/x/net/html".Attribute in field value
Fils:collyIndexer dfils$ go version
go version go1.9 darwin/amd64

Proxy Support

What are your thoughts on adding proxy support to the Collector? I see that one could just create a custom Transport and set the collector to use it, but it would be nice to have a SetProxy(url) method or something similar.
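For what it's worth, a sketch of the custom-Transport approach mentioned above (the proxy URL is hypothetical; assumes net/http and net/url are imported):

proxyURL, _ := url.Parse("http://127.0.0.1:8080")
c := colly.NewCollector()
c.WithTransport(&http.Transport{
	Proxy: http.ProxyURL(proxyURL), // route every request through the proxy
})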

Minor suggestions

Hello @asciimoo, good job with the lib!

I just wanted to say that I opened a PR 5 days ago to add your project to the "awesome-go" directory.

The README doesn't contain any quality references, like a report card, so I took the liberty of completing them inside the PR text. Please add them as badges to the README.md of this repository as well; the links are:

However, in order for the PR to be acceptable, we have to complete some other links as well, like the coverage service link, which looks like this: https://cover.run/go/github.com/asciimoo/colly.svg
But I couldn't find any tests (_test.go files) in "colly", so the PR was marked as "pending".

Please keep watching this and do your best to add some test files; those details matter there.

Thank you!
