# A lightweight web crawler framework: xcrawler

## Introduction
xcrawler is a lightweight web crawler framework. Some of its design concepts are borrowed from the well-known Scrapy framework. I'm very interested in web crawling, but I'm still a newbie at web scraping; I built this project to learn more about the basics of web crawling and the Python language.
## Features

- Simple: extremely easy to customize your own spider;
- Fast: multiple requests are spawned concurrently with the `ThreadPoolDownloader` or the `ProcessPoolDownloader`;
- Flexible: different scheduling strategies are provided: FIFO, LIFO, or priority-based;
- Extensible: write your own extensions to make your crawler much more powerful.
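The three scheduling strategies order pending requests differently. As a rough illustration (using Python's standard `queue` module, not xcrawler's internal scheduler):

```python
import queue

# FIFO: requests come out in the order they were enqueued.
fifo = queue.Queue()
for url in ['page1', 'page2', 'page3']:
    fifo.put(url)
fifo_order = [fifo.get() for _ in range(3)]

# LIFO: the most recently enqueued request comes out first.
lifo = queue.LifoQueue()
for url in ['page1', 'page2', 'page3']:
    lifo.put(url)
lifo_order = [lifo.get() for _ in range(3)]

# Priority-based: the request with the smallest priority value comes out first.
prio = queue.PriorityQueue()
for entry in [(2, 'page2'), (1, 'page1'), (3, 'page3')]:
    prio.put(entry)
prio_order = [prio.get()[1] for _ in range(3)]
```

Here `fifo_order` is `['page1', 'page2', 'page3']`, `lifo_order` is `['page3', 'page2', 'page1']`, and `prio_order` is `['page1', 'page2', 'page3']`.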
## Install

- Create a virtual environment for your project, then activate it:

  ```shell
  virtualenv crawlenv
  source crawlenv/bin/activate
  ```

- Download and install this package:

  ```shell
  pip install git+https://github.com/0xE8551CCB/xcrawler.git
  ```
## Quick start

- Define your own spider:

  ```python
  from xcrawler import BaseSpider


  class DoubanMovieSpider(BaseSpider):
      name = 'douban_movie'
      custom_settings = {}
      start_urls = ['https://movie.douban.com']

      def parse(self, response):
          # extract items from the response,
          # then yield new requests and new items
          pass
  ```
- Define your own extension:

  ```python
  import logging

  logger = logging.getLogger(__name__)


  class DefaultUserAgentExtension(object):
      config_key = 'DEFAULT_USER_AGENT'

      def __init__(self):
          self._user_agent = ''

      def on_crawler_started(self, crawler):
          if self.config_key in crawler.settings:
              self._user_agent = crawler.settings[self.config_key]

      def process_request(self, request, spider):
          # leave the request alone if it already has a User-Agent,
          # or if no default one has been configured
          if not request or 'User-Agent' in request.headers or not self._user_agent:
              return request
          logger.debug('[{}]{} adds default user agent: '
                       '{!r}'.format(spider, request, self._user_agent))
          request.headers['User-Agent'] = self._user_agent
          return request
  ```
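The hook methods can be exercised in isolation. A minimal sketch (repeating the extension from above, with hypothetical `FakeCrawler`/`FakeRequest` stand-ins that expose only the `settings` and `headers` attributes the hooks actually touch):

```python
import logging

logger = logging.getLogger(__name__)


class DefaultUserAgentExtension(object):
    # the extension defined above
    config_key = 'DEFAULT_USER_AGENT'

    def __init__(self):
        self._user_agent = ''

    def on_crawler_started(self, crawler):
        if self.config_key in crawler.settings:
            self._user_agent = crawler.settings[self.config_key]

    def process_request(self, request, spider):
        if not request or 'User-Agent' in request.headers or not self._user_agent:
            return request
        request.headers['User-Agent'] = self._user_agent
        return request


class FakeCrawler(object):
    """Hypothetical stand-in exposing only a `settings` dict."""
    def __init__(self, settings):
        self.settings = settings


class FakeRequest(object):
    """Hypothetical stand-in exposing only a `headers` dict."""
    def __init__(self, headers=None):
        self.headers = headers or {}


ext = DefaultUserAgentExtension()
ext.on_crawler_started(FakeCrawler({'DEFAULT_USER_AGENT': 'xcrawler/0.1'}))

# A request without a User-Agent gets the default one...
patched = ext.process_request(FakeRequest(), spider=None)
# ...while a request with an explicit User-Agent is left untouched.
untouched = ext.process_request(FakeRequest({'User-Agent': 'custom'}), spider=None)
```

After this runs, `patched.headers['User-Agent']` is `'xcrawler/0.1'` and `untouched.headers['User-Agent']` is still `'custom'`.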
- Define a pipeline to store scraped items:

  ```python
  import json


  class JsonLineStoragePipeline(object):
      def __init__(self):
          self._file = None

      def on_crawler_started(self, crawler):
          path = crawler.settings.get('STORAGE_PATH', '')
          if not path:
              raise FileNotFoundError('missing config key: `STORAGE_PATH`')
          self._file = open(path, 'a+')

      def on_crawler_stopped(self, crawler):
          if self._file:
              self._file.close()

      def process_item(self, item, request, spider):
          if item and isinstance(item, dict):
              # append each item as one JSON document per line
              self._file.write(json.dumps(item) + '\n')
  ```
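The resulting file is in the JSON Lines (`.jl`) format: one JSON document per line, which is trivial to read back. A small round-trip sketch (using a temporary file rather than the pipeline itself):

```python
import json
import tempfile

items = [{'title': 'Movie A', 'rating': 8.9},
         {'title': 'Movie B', 'rating': 7.5}]

# Write one JSON document per line, as the pipeline above does.
with tempfile.NamedTemporaryFile('w', suffix='.jl', delete=False) as f:
    for item in items:
        f.write(json.dumps(item) + '\n')
    path = f.name

# Reading the file back line by line recovers the original items.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```

Here `loaded` equals the original `items` list.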
- Configure the crawler:

  ```python
  from xcrawler import Crawler

  settings = {
      'download_timeout': 16,
      'download_delay': .5,
      'concurrent_requests': 10,
      'storage_path': '/tmp/hello.jl',
      'default_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                            'AppleWebKit/603.3.8 (KHTML, like Gecko) Version'
                            '/10.1.2 Safari/603.3.8',
      'global_extensions': {0: DefaultUserAgentExtension},
      'global_pipelines': {0: JsonLineStoragePipeline}
  }

  crawler = Crawler('DEBUG', **settings)
  crawler.crawl(DoubanMovieSpider)
  ```
- Bingo, you are ready to go now:

  ```python
  crawler.start()
  ```
## License

xcrawler is licensed under the MIT license. Please feel free to use it, and happy crawling!