xcrawler, a light-weight web crawler framework

Introduction

xcrawler, it's a light-weight web crawler framework. Some of its design concepts are borrowed from the well-known framework Scrapy. The downloader of the engine is implemented with the requests library. There are two different thread pools in the crawler's engine, one is for the downloader and the other for the processors (to extract data and so on).

I'm very interested in web crawling, however, I'm just a newbie to web scraping. I did this so that I can learn more basics of web crawling and Python language.

Features

Very simple;
Very easy to customize your own spider;
Process multiple requests and responses simultaneously.

TO-DO

Use priority queue instead;
Add more use cases;
Add docs and tests.

Examples

class BaiduNewsSpider(BaseSpider):
    name = 'baidu_news_spider'
    start_urls = ['http://news.baidu.com/']
    default_headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }

    def spider_started(self):
        self.file = open('items.jl', 'w')

    def spider_stopped(self):
        self.file.close()

    def spider_idle(self):
        # you can add new requests to the engine
        print('I am in idle mode')
        # self.crawler.crawl(new_request, spider=self)

    def make_requests_from_url(self, url):
        return Request(url, headers=self.default_headers)

    def parse(self, response):
        root = fromstring(response.content, base_url=response.base_url)
        for element in root.xpath('//a[@target="_blank"]'):
            title = self._extract_first(element, 'text()')
            link = self._extract_first(element, '@href').strip()
            if title:
                if link.startswith('http://') or link.startswith('https://'):
                    yield {'title': title, 'link': link}
                    yield Request(link, headers=self.default_headers, callback=self.parse_news,
                                  meta={'title': title})

    def parse_news(self, response):
        pass

    def process_item(self, item):
        print(item)
        print(json.dumps(item, ensure_ascii=False), file=self.file)

    @staticmethod
    def _extract_first(element, exp, default=''):
        r = element.xpath(exp)
        if len(r):
            return r[0]

        return default


def main():
    settings = {
        'download_delay': 1,
        'download_timeout': 6,
        'retry_on_timeout': True,
        'concurrent_requests': 16,
        'queue_size': 512
    }
    crawler = CrawlerProcess(settings, 'DEBUG')
    crawler.crawl(BaiduNewsSpider)
    crawler.start()

main()

License

xcrawler is licensed under the MIT license, please feel free and happy crawling!

wenyanningwork / xcrawler Goto Github PK

xcrawler's Introduction

xcrawler, a light-weight web crawler framework

Introduction

Features

TO-DO

Examples

License

xcrawler's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent