# A lightweight web crawler framework: xcrawler

## Introduction
xcrawler is a lightweight web crawler framework. Some of its design concepts are borrowed from the well-known Scrapy framework. I'm very interested in web crawling, but I'm still a newbie at web scraping; I built this project to learn more about the basics of web crawling and the Python language.
## Features

- Simple: extremely easy to customize your own spider;
- Fast: multiple requests are spawned concurrently with the `ThreadPoolDownloader` or the `ProcessPoolDownloader`;
- Flexible: different scheduling strategies are provided: FIFO, LIFO, or priority-based;
- Extensible: write your own extensions to make your crawler much more powerful.
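The three scheduling strategies order pending requests differently. As a rough illustration (using Python's standard `queue` module, not xcrawler's internal scheduler):

```python
import queue

# FIFO: requests come out in the order they were enqueued.
fifo = queue.Queue()
for url in ['page1', 'page2', 'page3']:
    fifo.put(url)
fifo_order = [fifo.get() for _ in range(3)]

# LIFO: the most recently enqueued request comes out first.
lifo = queue.LifoQueue()
for url in ['page1', 'page2', 'page3']:
    lifo.put(url)
lifo_order = [lifo.get() for _ in range(3)]

# Priority-based: the request with the smallest priority value comes out first.
prio = queue.PriorityQueue()
for entry in [(2, 'page2'), (1, 'page1'), (3, 'page3')]:
    prio.put(entry)
prio_order = [prio.get()[1] for _ in range(3)]
```

Here `fifo_order` is `['page1', 'page2', 'page3']`, `lifo_order` is `['page3', 'page2', 'page1']`, and `prio_order` is `['page1', 'page2', 'page3']`.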
## Install

- Create a virtual environment for your project, then activate it:

  ```shell
  virtualenv crawlenv
  source crawlenv/bin/activate
  ```

- Download and install this package:

  ```shell
  pip install git+https://github.com/0xE8551CCB/xcrawler.git
  ```
## Quick start

- Define your own spider:

  ```python
  from xcrawler import BaseSpider


  class DoubanMovieSpider(BaseSpider):
      name = 'douban_movie'
      custom_settings = {}
      start_urls = ['https://movie.douban.com']

      def parse(self, response):
          # extract items from the response,
          # then yield new requests and new items
          pass
  ```
- Define your own extension:

  ```python
  import logging

  logger = logging.getLogger(__name__)


  class DefaultUserAgentExtension(object):
      config_key = 'DEFAULT_USER_AGENT'

      def __init__(self):
          self._user_agent = ''

      def on_crawler_started(self, crawler):
          if self.config_key in crawler.settings:
              self._user_agent = crawler.settings[self.config_key]

      def process_request(self, request, spider):
          # leave the request alone if it already has a User-Agent,
          # or if no default one has been configured
          if not request or 'User-Agent' in request.headers or not self._user_agent:
              return request
          logger.debug('[{}]{} adds default user agent: '
                       '{!r}'.format(spider, request, self._user_agent))
          request.headers['User-Agent'] = self._user_agent
          return request
  ```
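The hook methods can be exercised in isolation. A minimal sketch (repeating the extension from above, with hypothetical `FakeCrawler`/`FakeRequest` stand-ins that expose only the `settings` and `headers` attributes the hooks actually touch):

```python
import logging

logger = logging.getLogger(__name__)


class DefaultUserAgentExtension(object):
    # the extension defined above
    config_key = 'DEFAULT_USER_AGENT'

    def __init__(self):
        self._user_agent = ''

    def on_crawler_started(self, crawler):
        if self.config_key in crawler.settings:
            self._user_agent = crawler.settings[self.config_key]

    def process_request(self, request, spider):
        if not request or 'User-Agent' in request.headers or not self._user_agent:
            return request
        request.headers['User-Agent'] = self._user_agent
        return request


class FakeCrawler(object):
    """Hypothetical stand-in exposing only a `settings` dict."""
    def __init__(self, settings):
        self.settings = settings


class FakeRequest(object):
    """Hypothetical stand-in exposing only a `headers` dict."""
    def __init__(self, headers=None):
        self.headers = headers or {}


ext = DefaultUserAgentExtension()
ext.on_crawler_started(FakeCrawler({'DEFAULT_USER_AGENT': 'xcrawler/0.1'}))

# A request without a User-Agent gets the default one...
patched = ext.process_request(FakeRequest(), spider=None)
# ...while a request with an explicit User-Agent is left untouched.
untouched = ext.process_request(FakeRequest({'User-Agent': 'custom'}), spider=None)
```

After this runs, `patched.headers['User-Agent']` is `'xcrawler/0.1'` and `untouched.headers['User-Agent']` is still `'custom'`.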
- Define a pipeline to store scraped items:

  ```python
  import json


  class JsonLineStoragePipeline(object):
      def __init__(self):
          self._file = None

      def on_crawler_started(self, crawler):
          path = crawler.settings.get('STORAGE_PATH', '')
          if not path:
              raise FileNotFoundError('missing config key: `STORAGE_PATH`')
          self._file = open(path, 'a+')

      def on_crawler_stopped(self, crawler):
          if self._file:
              self._file.close()

      def process_item(self, item, request, spider):
          if item and isinstance(item, dict):
              # append each item as one JSON document per line
              self._file.write(json.dumps(item) + '\n')
  ```
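The resulting file is in the JSON Lines (`.jl`) format: one JSON document per line, which is trivial to read back. A small round-trip sketch (using a temporary file rather than the pipeline itself):

```python
import json
import tempfile

items = [{'title': 'Movie A', 'rating': 8.9},
         {'title': 'Movie B', 'rating': 7.5}]

# Write one JSON document per line, as the pipeline above does.
with tempfile.NamedTemporaryFile('w', suffix='.jl', delete=False) as f:
    for item in items:
        f.write(json.dumps(item) + '\n')
    path = f.name

# Reading the file back line by line recovers the original items.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```

Here `loaded` equals the original `items` list.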
- Configure the crawler:

  ```python
  from xcrawler import Crawler

  settings = {
      'download_timeout': 16,
      'download_delay': .5,
      'concurrent_requests': 10,
      'storage_path': '/tmp/hello.jl',
      'default_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                            'AppleWebKit/603.3.8 (KHTML, like Gecko) Version'
                            '/10.1.2 Safari/603.3.8',
      'global_extensions': {0: DefaultUserAgentExtension},
      'global_pipelines': {0: JsonLineStoragePipeline}
  }

  crawler = Crawler('DEBUG', **settings)
  crawler.crawl(DoubanMovieSpider)
  ```
- Bingo, you are ready to go now:

  ```python
  crawler.start()
  ```
## License

xcrawler is licensed under the MIT license. Please feel free to use it, and happy crawling!