gocolly / colly

Elegant Scraper and Crawler Framework for Golang

Home Page: https://go-colly.org/

License: Apache License 2.0

Go 99.37% HTML 0.63%
crawler crawling framework go golang scraper scraping spider

colly's People

Contributors

asalih, asciimoo, asood123, dependabot[bot], ferhatelmas, gsoec, guessi, gummiboll, hondajojo, i25959341, jbaxter-va, jlr52, johnzhao1208, kawakami-o3-2nd, kvch, llonchj, meehow, nange, peterhellberg, pyjac, rongyi, saladinobelisario, sdab, sharmi, sky126, smileboywtu, twiny, vosmith, wgh-, zyt312074545


colly's Issues

Warnings from golint

ID is part of the list of common initialisms used by the golint tool and thus we get warnings like this:

  • struct field Id should be ID
  • func parameter requestId should be requestID
  • func parameter collectorId should be collectorID

Unfortunately, fixing this would change the exported Collector and Request types. Would such a change even be considered?
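For illustration, the lint-compliant rename would look like this (the field type is an assumption based on the warnings above):

type Request struct {
	// Id uint32 // current exported name flagged by golint
	ID uint32 // lint-preferred name; renaming it breaks the public API
	// ...
}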

Data race

Example code:

package main

import (
	"fmt"

	"github.com/asciimoo/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Println(link)
		c.Visit(e.Request.AbsoluteURL(link))
	})

	c.Visit("https://en.wikipedia.org/")
}

Execution log:

==================
WARNING: DATA RACE
Write at 0x00c420083950 by main goroutine:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:598 +0x0
  net/textproto.MIMEHeader.Set()
      /usr/local/go/src/net/textproto/header.go:22 +0x60
  net/http.Header.Set()
      /usr/local/go/src/net/http/header.go:31 +0x60
  net/http.(*Request).AddCookie()
      /usr/local/go/src/net/http/request.go:385 +0x37c
  net/http.(*Client).send()
      /usr/local/go/src/net/http/client.go:170 +0x115
  net/http.(*Client).Do()
      /usr/local/go/src/net/http/client.go:602 +0x513
  crawler/vendor/github.com/asciimoo/colly.(*httpBackend).Do()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/http_backend.go:154 +0x105
  crawler/vendor/github.com/asciimoo/colly.(*httpBackend).Cache()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/http_backend.go:110 +0x9e
  crawler/vendor/github.com/asciimoo/colly.(*Collector).scrape()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/colly.go:226 +0x461
  crawler/vendor/github.com/asciimoo/colly.(*Collector).Visit()
      /home/skruglov/Projects/go/src/crawler/vendor/github.com/asciimoo/colly/colly.go:157 +0x9b
  main.main()
      /home/skruglov/Projects/go/src/crawler/main.go:19 +0xe4

Previous read at 0x00c420083950 by goroutine 24:
  runtime.mapaccess1_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:208 +0x0
  net/http.http2isConnectionCloseRequest()
      /usr/local/go/src/net/http/h2_bundle.go:8652 +0xae
  net/http.(*http2clientConnReadLoop).endStreamError()
      /usr/local/go/src/net/http/h2_bundle.go:8288 +0xe2
  net/http.(*http2clientConnReadLoop).endStream()
      /usr/local/go/src/net/http/h2_bundle.go:8277 +0x54
  net/http.(*http2clientConnReadLoop).processData()
      /usr/local/go/src/net/http/h2_bundle.go:8267 +0x1ce
  net/http.(*http2clientConnReadLoop).run()
      /usr/local/go/src/net/http/h2_bundle.go:7896 +0x737
  net/http.(*http2ClientConn).readLoop()
      /usr/local/go/src/net/http/h2_bundle.go:7788 +0x11c

Goroutine 24 (running) created at:
  net/http.(*http2Transport).newClientConn()
      /usr/local/go/src/net/http/h2_bundle.go:7053 +0xe1a
  net/http.(*http2Transport).NewClientConn()
      /usr/local/go/src/net/http/h2_bundle.go:6991 +0x55
  net/http.(*http2addConnCall).run()
      /usr/local/go/src/net/http/h2_bundle.go:835 +0x55
==================

#mw-head
#p-search
/wiki/Wikipedia
...

Question: Handle encodings

I just tested colly and loved how fast it performed. I have a question about encodings. The docs say it automatically decodes non-Unicode responses. Can this be customized? I tried to grab content from https://www.nsd.ru/ru/db/news/ndcpress/, which is windows-1251 encoded, and the contents came back unreadable, so I was wondering how I can set up colly to grab content with a specific encoding.
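A minimal workaround sketch, assuming you decode the body yourself in OnResponse (which runs before HTML parsing) using golang.org/x/text/encoding/charmap:

c := colly.NewCollector()
c.OnResponse(func(r *colly.Response) {
	// decode windows-1251 into UTF-8 before colly parses the body
	decoded, err := charmap.Windows1251.NewDecoder().Bytes(r.Body)
	if err == nil {
		r.Body = decoded
	}
})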

Failed to login LinkedIn

Hi,

I don't get the jobs response, so it seems the login was not successful. I'm not sure what I missed; please point me to the right way to do it. Thanks.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	err := c.Post("https://www.linkedin.com/uas/login-submit", map[string]string{"session_key": "EMAIL", "session_password": "PASSWORD"})
	if err != nil {
		log.Fatal(err)
	}
	c.AllowedDomains = []string{"www.linkedin.com"}

	// attach callbacks after login
	c.OnResponse(func(r *colly.Response) {
		log.Println("response received", r.StatusCode)
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("element:", e)
		if strings.Contains(e.Attr("href"), "/jobs/view") {
			fmt.Println("replaced:", strings.Replace(e.Attr("href"), "https://www.linkedin.com/", "", -1))
			e.Request.Visit(e.Attr("href"))
		}
	})

	// start scraping
	c.Visit("https://www.linkedin.com/jobs/")
}

Expose context for visit/post

  • colly.Context should embed context.Context, or just use context.Context directly
  • collector.Visit/Post* should accept a context param, to pass extra information through
    • e.g. a POST request may need an ID for post-processing; see the sketch below
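For illustration only, the proposed signatures could look like this (hypothetical; not colly's current API):

// hypothetical signatures sketching the proposal
func (c *Collector) Visit(ctx context.Context, URL string) error
func (c *Collector) Post(ctx context.Context, URL string, requestData map[string]string) error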

Add some basic OnError support

Probably should break out OnRequestError/OnResponseError, but adding a basic OnError that receives the request, response, and error seems to make sense as a first pass.

Pull request on its way shortly

JavaScript

Does it handle SPA sites where JavaScript generates the site client-side?

Getting final request when there is a page redirect

When there is a page redirect, colly automatically follows it. In that case, I get a Request object in the OnHTML callback, but colly seems to provide the original Request, not the one after the redirect. Since I want to follow all the links on the HTML page, I use the Request object to build the absolute URL. However, this doesn't work as expected, since the Request object has the wrong URL. The example below illustrates the problem:

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	go func() {
		http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			http.Redirect(w, r, "/r/", http.StatusSeeOther)

		}))
		http.Handle("/r/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintf(w, `<a href="test">test</a>`)
		}))
		http.ListenAndServe("127.0.0.1:9999", nil)
	}()
	time.Sleep(500 * time.Millisecond)
	c := colly.NewCollector()
	c.AllowedDomains = []string{"127.0.0.1:9999"}
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println(e.Request.AbsoluteURL(e.Attr("href")))
	})
	c.Visit("http://127.0.0.1:9999/")
	c.Wait()
	time.Sleep(1000 * time.Hour)
}

The example prints "http://127.0.0.1:9999/test". However, when I go to "http://127.0.0.1:9999" in Firefox and click the link, I get redirected to "http://127.0.0.1:9999/r/test".

Is there a better way to mimic the behavior of the browser in this case?

Runtime Error: invalid memory address or nil pointer dereference

Hey mate!

I'm loving colly so far. I'm new to the Go programming language and I've just been messing around with your scraping library and found a weird bug.

I was just testing out scraping my website, and then allowing the scraper to scrape Medium. I end up with this error (screenshot omitted): panic: runtime error: invalid memory address or nil pointer dereference.
(I'm using Go 1.9 on Linux x86.)

This is the code:

package main

import (
	"fmt"

	"github.com/asciimoo/colly"
)

func main() {
	scraper := colly.NewCollector()
	scraper.AllowedDomains = []string{"onslow.io", "medium.com"}

	scraper.OnHTML("a[href]", func(element *colly.HTMLElement) {
		link := element.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", element.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		go scraper.Visit(element.Request.AbsoluteURL(link))
	})

	scraper.OnError(func(request *colly.Response, err error) {
		fmt.Println("Request URL:", request.Request.URL, "failed with response:", request, "\nError:", err)
	})

	scraper.OnRequest(func(request *colly.Request) {
		fmt.Println("Visiting", request.URL.String())
	})

	scraper.Visit("http://onslow.io")
	scraper.Wait()
}

From what I've gathered, it may have to do with the goroutines not syncing properly?

If you have any other ideas on the cause of this, it'd be great to hear them!

Cheers

How to ignore expired SSL certificates in Colly?

Since I need to visit an unsafe website over HTTPS, the Post/Get methods return an error: x509: certificate has expired or is not yet valid.
I know I can set InsecureSkipVerify: true when starting a request with the net/http package, but what should I do to skip the SSL certificate check in colly?
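A minimal sketch using the collector's WithTransport with a custom TLS config (assuming crypto/tls and net/http are imported; this disables certificate verification, so use it with care):

c := colly.NewCollector()
c.WithTransport(&http.Transport{
	// skip certificate verification for every request made by this collector
	TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
})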

c.Wait() never returns after the program has finished crawling; this may be a bug

package main

import (
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	urls := []string{
		"https://weibo.cn/repost/FBrYpiw8h?uid=1153760245&rl=1", "https://weibo.cn/repost/FBrXSqrIl?uid=2137005731&rl=1", "https://weibo.cn/repost/FBrXOlMmQ?uid=5131689041&rl=1", "https://weibo.cn/repost/FBrXJBCQs?uid=1701023441&rl=1", "https://weibo.cn/repost/FBrXg4ZuX?uid=5999431007&rl=1", "https://weibo.cn/repost/FBrXcuadg?uid=5819066338&rl=1", "https://weibo.cn/repost/FBrWEgEor?uid=3517902151&rl=1", "https://weibo.cn/repost/FBrWmuTYh?uid=2974402113&rl=1", "https://weibo.cn/repost/FBrVZtT1p?uid=5533885122&rl=1",
		"https://weibo.cn/repost/FBrVrqA5T?uid=1613781965&rl=1", "https://weibo.cn/repost/FBrYpiw8h?uid=1153760245&rl=1", "https://weibo.cn/repost/FBrXSqrIl?uid=2137005731&rl=1", "https://weibo.cn/repost/FBrXOlMmQ?uid=5131689041&rl=1", "https://weibo.cn/repost/FBrXJBCQs?uid=1701023441&rl=1", "https://weibo.cn/repost/FBrXg4ZuX?uid=5999431007&rl=1", "https://weibo.cn/repost/FBrXcuadg?uid=5819066338&rl=1", "https://weibo.cn/repost/FBrWEgEor?uid=3517902151&rl=1", "https://weibo.cn/repost/FBrWmuTYh?uid=2974402113&rl=1",
		"https://weibo.cn/repost/FBrVZtT1p?uid=5533885122&rl=1", "https://weibo.cn/repost/FBrVrqA5T?uid=1613781965&rl=1", "https://weibo.cn/repost/FBrUXncEG?uid=5046939400&rl=1",
	}

	// Instantiate default collector
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
		// Attach a debugger to the collector
		colly.Debugger(&debug.LogDebugger{}),
	)
	c.SetRequestTimeout(2 * time.Second)

	for _, v := range urls {
		c.Visit(v)
	}

	c.Wait()
}

ChildAttr only returns one child, not all children

[Not really an issue]
Hey mate

I've been using Colly for a small scraping project and I've come across a weird bit of behaviour.

The e.ChildText() function returns the text in all of the children as one string. However, using e.ChildAttr() only returns the first match. I read through the code in colly.go and understand this is the intended behaviour, but I was wondering why you wouldn't want to return all child attributes?
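A workaround sketch for collecting every matching child's attribute, assuming the element's ForEach helper is available in your version (the selector and attribute name are illustrative):

// collect all child href attributes instead of only the first match
attrs := []string{}
e.ForEach("a", func(_ int, el *colly.HTMLElement) {
	attrs = append(attrs, el.Attr("href"))
})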

Loving this package though, it's been a lot of fun to use. Thank you for keeping it up to date!
Cheers

Suggestion: Functional options for NewCollector

Implementing functional options for the NewCollector constructor would let the user set up the collector without manually setting field values on *Collector.

This would be a non-breaking change, since the options would be a variadic argument to NewCollector.

One example option could add a domain to the AllowedDomains field: NewCollector(AllowDomain("example.com")). A sketch of the pattern follows.
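A minimal sketch, assuming a hypothetical AllowDomain option (shown only to illustrate the proposal, not colly's actual API):

// functional options: each option mutates the collector under construction
type Option func(*Collector)

func AllowDomain(domains ...string) Option {
	return func(c *Collector) {
		c.AllowedDomains = append(c.AllowedDomains, domains...)
	}
}

func NewCollector(options ...Option) *Collector {
	c := &Collector{} // defaults would be applied here first
	for _, opt := range options {
		opt(c)
	}
	return c
}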

What do you think?

fixCharset() also has a bug: the response header Content-Type is "text/html" without a charset, but the real charset is GBK.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.Headers)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Headers)
	})

	c.Visit("https://weibo.cn/repost/FByvKgel6?uid=6049100503&rl=1")
}
Visiting &map[User-Agent:[colly - https://github.com/gocolly/colly]]
Visited &map[Content-Type:[text/html] Connection:[keep-alive] Vary:[Accept-Encoding] Expires:[Sat, 26 Jul 1997 05:00:00 GMT] Dpool_header:[luna139] Pragma:[no-cache] Sina-Lb:[aGEuMjAyLmcxLnloZy5sYi5zaW5hbm9kZS5jb20=] Server:[nginx/1.6.1] Date:[Thu, 28 Dec 2017 13:42:51 GMT] Cache-Control:[no-cache, must-revalidate] Sina-Ts:[N2FiMjljY2UgMCAxIDEgMiA2Cg==]]
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "" -> javascript:;
Link found: "\xbb\xbbһ\xd5\xc5" -> javascript:;
Link found: "\xb5\xc7¼" -> javascript:;
Link found: "\xb5\xda\xc8\xfd\xb7\xbd\xd5ʺ\xc5" -> https://passport.weibo.cn/signin/other?r=http%3A%2F%2Fweibo.cn
Link found: "ע\xb2\xe1\xd5ʺ\xc5" -> http://m.weibo.cn/reg/index?&vt=4&wm=3349&wentry=&backURL=http%3A%2F%2Fweibo.cn
Link found: "\xcd\xfc\xbc\xc7\xc3\xdc\xc2\xeb" -> https://passport.weibo.cn/forgot/forgot?entry=wapsso&from=0
Link found: "ȡ\xcf\xfb" -> javascript:;
Link found: "\xd1\xe9֤\xc2\xeb\xb5\xc7¼" -> javascript:;
Link found: "\xb9ر\xd5" -> javascript:history.go(-1);
Link found: "ȷ\xc8\xcf" -> javascript:;
Link found: "ʹ\xd3\xc3\xc6\xe4\xcb\xfb\xd5ʺŵ\xc7¼" -> javascript:;

Runtime error: invalid memory address or nil pointer dereference (occurs on 32-bit x86)

package main

import (
	"github.com/asciimoo/colly"
)

func main() {
	c := colly.NewCollector()
	c.Visit("https://www.google.com")
}

If built for the 386 architecture, it crashes:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x4012bc]

goroutine 1 [running]:
sync/atomic.AddUint64(0x1284e304, 0x1, 0x0, 0x128c74c0, 0x0)
d:/soft/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/asciimoo/colly.(*Collector).scrape(0x1284e280, 0x73acb0, 0x16, 0x72f0c5, 0x3, 0x1, 0x0, 0x0, 0x128212f8, 0x0, ...)
d:/go/src/github.com/asciimoo/colly/colly.go:244 +0x244
github.com/asciimoo/colly.(*Collector).Visit(0x1284e280, 0x73acb0, 0x16, 0x69320b, 0x693524)
d:/go/src/github.com/asciimoo/colly/colly.go:175 +0x6b
main.main()
d:/go/src/playground/main.go:9 +0x31

From https://golang.org/pkg/sync/atomic/

On both ARM and x86-32, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.
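A sketch of the usual fix (the field name is illustrative, not colly's actual layout): keep atomically accessed 64-bit fields first in the struct so they are guaranteed 64-bit aligned on 386/ARM.

type Collector struct {
	// requestCount is updated with sync/atomic and must stay the first
	// field so it is 64-bit aligned on 32-bit platforms
	requestCount uint64
	// ... other fields follow
}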

Error: gzip: invalid header

package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
	"time"
)

func main() {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"),
		colly.AllowedDomains("irby.kz"),
		colly.Async(true),
		colly.Debugger(&debug.LogDebugger{}),
	)
	c.DisableCookies()

	c.Limit(&colly.LimitRule{
		Parallelism: 2,
		Delay:       1 * time.Second,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Set error handler
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
	})

	// Set HTML callback
	// Won't be called if error occurs
	c.OnHTML("*", func(e *colly.HTMLElement) {
		fmt.Println(e)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	c.Visit("http://irby.kz/ru/catalog/dlya_devochek/?SHOW_ALL=Y")
	c.Visit("http://irby.kz/ru/catalog/dlya_malchikov_1/?SHOW_ALL=Y")

	c.Wait()
}

Multipart support

Do you have any plans to support multipart POST requests?
I will code this feature later.

URL Filter to exclude

Currently you can specify URL filters to include URLs; is there any way to exclude URLs?
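A workaround sketch, assuming a hypothetical exclusion regexp and aborting matching requests in OnRequest (Request.Abort cancels a request before it is sent):

exclude := regexp.MustCompile(`/private/`) // hypothetical pattern to exclude
c.OnRequest(func(r *colly.Request) {
	if exclude.MatchString(r.URL.String()) {
		r.Abort() // cancel the request before it is sent
	}
})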

Custom same-request detection

Currently, visited-request tracking uses only the URL to detect duplicate requests. The collector should accept a custom request-uniqueness function (e.g. func(r *Request) string), because POST requests use an ID in the form data to distinguish requests, not just the URL.

BTW, if you accept PRs, I can contribute some work.

I think c.Limit could also be used to filter out non-matching URLs; it should support a regex check before any request is made.

The purpose is to filter out the mismatched URLs, for example:
package main

import (
	"fmt"
	"time"

	"github.com/asciimoo/colly"
)

func main() {
	urls := []string{"https://httpbin.org/hello", "https://httpbin.org/123", "https://httpbin.org/12xyz"}

	// Instantiate default collector
	c := colly.NewCollector()

	// when visiting links whose domains match the glob
	c.Limit(&colly.LimitRule{
		DomainGlob: "*https://httpbin.org/[a-z]+",
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL, time.Now())
	})

	for _, i := range urls {
		c.Visit(i)
	}

	c.Wait()
}
Starting https://httpbin.org/hello
Starting https://httpbin.org/123
Starting https://httpbin.org/12xyz
Finished https://httpbin.org/hello

trying to measure request timing

I'm doing some timing during a crawl, recording the start time in Ctx in OnRequest and calculating a duration in OnResponse. It all works very well until I try to throttle the crawler with a Limit with Delay, since the sleep is called by the backend after the OnRequest callbacks.

Is there another callback that can be used, or would you consider moving the sleep to before OnRequest is called?
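A sketch of the timing approach described above (the "start" key is arbitrary; assumes Ctx stores arbitrary values via Put/GetAny):

c.OnRequest(func(r *colly.Request) {
	r.Ctx.Put("start", time.Now()) // record the start time per request
})
c.OnResponse(func(r *colly.Response) {
	start := r.Ctx.GetAny("start").(time.Time)
	fmt.Println(r.Request.URL, "took", time.Since(start))
})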

Randomized Delay

Is it possible to create randomized delays, i.e., per-request delays selected from some range or based on some random factor? I couldn't think of a good way to do this, other than maybe cycling through several Collectors with different limit sets, which seems sub-optimal.

It seems like having LimitRule.DelayRange or LimitRule.RandomFactor options would be quite helpful.
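A hypothetical sketch of what such an option could look like (RandomDelay is an assumption, not a confirmed field in this version):

c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Delay:       2 * time.Second,
	RandomDelay: 1 * time.Second, // up to 1s of extra random jitter per request
})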

How to set up only one request to use the proxy?

I found

rp, err := proxy.RoundRobinProxySwitcher("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
if err != nil {
	log.Fatal(err)
}
c.SetProxyFunc(rp)

but if I have ten URLs to request, this way all requests use the proxy. I only want one request to use the proxy; what should I do?
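A sketch of a custom proxy function that proxies only one specific host (the host and proxy address are hypothetical):

c.SetProxyFunc(func(req *http.Request) (*url.URL, error) {
	if req.URL.Host == "example.com" { // only this request uses the proxy
		return url.Parse("socks5://127.0.0.1:1337")
	}
	return nil, nil // a nil proxy URL means a direct connection
})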

Add iteration api for context

When I use colly, I have a case where I need to iterate over the context's elements after putting things into it from multiple OnHTML callbacks on different HTML elements.
This is the simple function I wrote:

// ForEach iterates over the context's elements
func (c *Context) ForEach(fn func(k string, v interface{}) interface{}) []interface{} {
	c.lock.RLock()
	defer c.lock.RUnlock()

	ret := make([]interface{}, 0, len(c.contextMap))
	for k, v := range c.contextMap {
		cur := fn(k, v)
		ret = append(ret, cur)
	}

	return ret
}

Hope this can help someone when they also need to iterate context.

Setting AllowedDomains by pattern, the way Limit does, may be better

Some sites have many subdomains, so it would be convenient to set AllowedDomains by pattern match, like this:

// c.AllowedDomains = []string{"hackerspaces.org", "wiki.hackerspaces.org"}
c.AllowedDomains = []string{"*hackerspaces.org"}

Extend basic example with more callbacks

There are five main callbacks in colly and they are:

  1. OnRequest
  2. OnError
  3. OnResponse
  4. OnHTML
  5. OnScraped

We want to show readers these callbacks at the very beginning, so how about extending the basic example with all of them?
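A sketch of what the extended basic example could look like (the target URL is arbitrary):

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Got a response from", r.Request.URL)
	})
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("Link found:", e.Attr("href"))
	})
	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL)
	})

	c.Visit("http://go-colly.org/")
}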

Passing context between collectors

I was wondering how it's possible to pass context between collectors.

My use case is that I have a collector that collects links and then triggers another collector to visit each link. On the second collector, I'd like to pass some sort of context (like a parent category name that is not present in the child page's HTML), but I didn't find a way to achieve this, because Context seems to exist only for requests/responses within the same collector.
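A workaround sketch, assuming a second collector named detailCollector and handing over a pre-filled context via Collector.Request (the selectors are hypothetical):

c.OnHTML("a.category", func(e *colly.HTMLElement) {
	ctx := colly.NewContext()
	ctx.Put("category", e.Text) // carry the parent category to the child page
	detailCollector.Request("GET", e.Request.AbsoluteURL(e.Attr("href")), nil, ctx, nil)
})

detailCollector.OnHTML("div.product", func(e *colly.HTMLElement) {
	fmt.Println("category:", e.Response.Ctx.Get("category"))
})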

Race condition due to incorrect sync.WaitGroup usage

Hi,

It seems that sync.WaitGroup is not used correctly in the func (c *Collector) scrape(...) error method.

// colly.go:307
func (c *Collector) scrape(...) error {
    c.wg.Add(1)
    defer c.wg.Done()
    ...

Consider the following example (similar to http://go-colly.org/docs/examples/rate_limit/):

for _, url := range urls {
    go c.Visit(url)
}

c.Wait()

Here we call Visit (a scrape wrapper) in a separate goroutine for each URL and wait for their completion.

The problem is that it's not guaranteed that all goroutines will have finished after c.Wait() returns, since the "goroutines to wait for" count is incremented inside each new goroutine via c.wg.Add(1).

So it's up to the scheduler whether c.wg.Add(1) is called before or after c.Wait().

I think there are two ways this issue could be fixed:

  1. Let the caller handle sync.WaitGroup: just provide a pointer to it as a param.
  2. Provide an async API for scrape and its wrappers.

It could look like this:

func (c *Collector) scrapeAsync(...) <-chan error {
	errChan := make(chan error)
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()

		errChan <- c.scrape(...)
	}()
	return errChan
}

I can do a PR if you want.
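Until this is fixed, a caller-side workaround sketch: perform the Add before starting each goroutine yourself.

var wg sync.WaitGroup
for _, url := range urls {
	wg.Add(1) // increment before the goroutine starts, not inside it
	go func(u string) {
		defer wg.Done()
		c.Visit(u)
	}(url)
}
wg.Wait() // now guaranteed to cover every Visit call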

go get issue on n.Attr

On Mac with go version 1.9

Fils:collyIndexer dfils$ go get github.com/asciimoo/colly
# github.com/asciimoo/colly
../../../../github.com/asciimoo/colly/colly.go:302:16: cannot use n.Attr (type []"code.google.com/p/go.net/html".Attribute) as type []"golang.org/x/net/html".Attribute in field value
Fils:collyIndexer dfils$ go version
go version go1.9 darwin/amd64

Proxy Support

What are your thoughts on adding proxy support to the Collector? I see that one could just create a custom Transport and set the collector to use it, but it would be nice to have a SetProxy(url) method or something similar.
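For what it's worth, a sketch of the custom-Transport approach mentioned above (the proxy URL is hypothetical; assumes net/http and net/url are imported):

proxyURL, _ := url.Parse("http://127.0.0.1:8080")
c := colly.NewCollector()
c.WithTransport(&http.Transport{
	Proxy: http.ProxyURL(proxyURL), // route every request through the proxy
})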

Minor suggestions

Hello @asciimoo, good job with the lib!

I just wanted to say that I opened a PR 5 days ago to add your project to the "awesome-go" directory.

The README doesn't contain any quality references, like a report card, so I took the liberty of completing them inside the PR text. Please add them as badges to the README.md of this repository as well; the links are:

However, in order for the PR to be acceptable, we have to complete some other links as well, like the coverage service link, which looks like this: https://cover.run/go/github.com/asciimoo/colly.svg
But I couldn't find any tests (_test.go files) in "colly", so the PR was marked as "pending".

Please keep watching this and do your best to add some test files; those details matter there.

Thank you!
