
A light-weight web crawler framework: xcrawler

Build Status Coverage Status

Introduction

xcrawler is a light-weight web crawler framework. Some of its design concepts are borrowed from the well-known framework Scrapy. I'm very interested in web crawling, but still a newbie to web scraping; I built this project to learn more about the basics of web crawling and the Python language.

(architecture diagram)

Features

  • Simple: extremely easy to customize your own spider;
  • Fast: multiple requests are spawned concurrently with the ThreadPoolDownloader or ProcessPoolDownloader;
  • Flexible: different scheduling strategies are provided -- FIFO/FILO/Priority based;
  • Extensible: write your own extensions to make your crawler much more powerful.
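The three scheduling strategies above can be sketched with the standard library's queue classes. This is an illustration of the ordering behavior only, using stdlib queues rather than xcrawler's actual scheduler classes:

```python
import queue

# FIFO: requests are served in the order they were enqueued.
urls = ['b', 'a', 'c']
fifo = queue.Queue()
for url in urls:
    fifo.put(url)
fifo_order = [fifo.get() for _ in urls]    # ['b', 'a', 'c']

# FILO (LIFO): the most recently enqueued request is served first.
filo = queue.LifoQueue()
for url in urls:
    filo.put(url)
filo_order = [filo.get() for _ in urls]    # ['c', 'a', 'b']

# Priority-based: a lower priority number is served first.
prio = queue.PriorityQueue()
for priority, url in [(2, 'b'), (1, 'a'), (3, 'c')]:
    prio.put((priority, url))
prio_order = [prio.get()[1] for _ in urls]  # ['a', 'b', 'c']
```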

Install

  1. Create a virtual environment for your project, then activate it:

    virtualenv crawlenv
    source crawlenv/bin/activate
    
  2. Download and install this package:

    pip install git+https://github.com/0xE8551CCB/xcrawler.git
    

Quick start

  1. Define your own spider:

    from xcrawler import BaseSpider
    
    
    class DoubanMovieSpider(BaseSpider):
        name = 'douban_movie'
        custom_settings = {}
        start_urls = ['https://movie.douban.com']
    
        def parse(self, response):
            # extract items from response
            # yield new requests
            # yield new items
            pass
    
  2. Define your own extension:

    import logging

    logger = logging.getLogger(__name__)


    class DefaultUserAgentExtension(object):
        config_key = 'DEFAULT_USER_AGENT'

        def __init__(self):
            self._user_agent = ''
    
        def on_crawler_started(self, crawler):
            if self.config_key in crawler.settings:
                self._user_agent = crawler.settings[self.config_key]
    
        def process_request(self, request, spider):
            if not request or 'User-Agent' in request.headers or not self._user_agent:
                return request
    
            logger.debug('[{}]{} adds default user agent: '
                         '{!r}'.format(spider, request, self._user_agent))
            request.headers['User-Agent'] = self._user_agent
            return request
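
The hook pattern above can be illustrated in isolation. This is a minimal sketch of how request-processing extensions can be chained, not xcrawler's actual dispatch code; the extension and request shape here are hypothetical:

```python
# Each extension receives the request and returns it (possibly modified),
# or None to drop the request entirely.
def apply_extensions(request, extensions):
    for ext in extensions:
        request = ext(request)
        if request is None:  # an extension may veto the request
            break
    return request


def add_default_user_agent(request):
    # Hypothetical extension: fill in a User-Agent only when it is missing.
    headers = request.setdefault('headers', {})
    headers.setdefault('User-Agent', 'xcrawler/1.0')
    return request


req = apply_extensions({'url': 'https://movie.douban.com'},
                       [add_default_user_agent])
# req['headers']['User-Agent'] is now 'xcrawler/1.0'
```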
    
  3. Define a pipeline to store scraped items:

    import json


    class JsonLineStoragePipeline(object):
        def __init__(self):
            self._file = None
    
        def on_crawler_started(self, crawler):
            path = crawler.settings.get('STORAGE_PATH', '')
            if not path:
                raise FileNotFoundError('missing config key: `STORAGE_PATH`')
    
            self._file = open(path, 'a+')
    
        def on_crawler_stopped(self, crawler):
            if self._file:
                self._file.close()
    
        def process_item(self, item, request, spider):
            if item and isinstance(item, dict):
                self._file.write(json.dumps(item) + '\n')
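
To make the storage format concrete, here is a standalone illustration of the JSON Lines (`.jl`) format the pipeline produces: one JSON object per line, appended as items arrive. The sample items are hypothetical:

```python
import json
import os
import tempfile

items = [{'title': 'Movie A', 'rating': 8.7},
         {'title': 'Movie B', 'rating': 9.0}]

# Append each item as one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), 'items.jl')
with open(path, 'a+') as f:
    for item in items:
        f.write(json.dumps(item) + '\n')

# Reading the file back recovers the original items line by line.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```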
    
  4. Configure the crawler:

    from xcrawler import Crawler

    settings = {
        'download_timeout': 16,
        'download_delay': .5,
        'concurrent_requests': 10,
        'storage_path': '/tmp/hello.jl',
        'default_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                              'AppleWebKit/603.3.8 (KHTML, like Gecko) Version'
                              '/10.1.2 Safari/603.3.8',
        'global_extensions': {0: DefaultUserAgentExtension},
        'global_pipelines': {0: JsonLineStoragePipeline}
    }
    crawler = Crawler('DEBUG', **settings)
    crawler.crawl(DoubanMovieSpider)
    
  5. Bingo, you are ready to go now:

    crawler.start()
    

License

xcrawler is licensed under the MIT license. Feel free to use it, and happy crawling!
