
Slybot crawler

Slybot is a Python web crawler for doing web scraping. It's implemented on top of the Scrapy web crawling framework and the Scrapely extraction library.

Requirements

Slybot requires the Scrapy and Scrapely libraries mentioned above.

Quick Usage

Change your working directory to the slybot base module folder (where the settings.py file is located), create a directory called slybot-project, place your slybot project specs there, and run:

scrapy list

to list the spiders in your project, or:

scrapy crawl <spider name>

to run your spider.

Configuration

In order to run Scrapy with the slybot spider, you just need to have the slybot library on your Python path and pass the appropriate settings. In slybot/settings.py you can find a sample settings file:

SPIDER_MANAGER_CLASS = 'slybot.spidermanager.SlybotSpiderManager'
EXTENSIONS = {'slybot.closespider.SlybotCloseSpider': 1}
ITEM_PIPELINES = ['slybot.dupefilter.DupeFilterPipeline']
SLYDUPEFILTER_ENABLED = True
PROJECT_DIR = 'slybot-project'

try:
    from local_slybot_settings import *
except ImportError:
    pass

The first line:

SPIDER_MANAGER_CLASS = 'slybot.spidermanager.SlybotSpiderManager'

is where the magic starts. It tells Scrapy to use the slybot spider manager, which is required to load and run slybot spiders.
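To illustrate what a spider manager does, here is a minimal sketch assuming the classic Scrapy spider manager interface (a from_settings() constructor plus list() and create() methods). It is not slybot's actual implementation (see slybot/spidermanager.py for that), and the SpecDrivenSpider class and the per-spider JSON files under a spiders folder are hypothetical stand-ins for building spiders from the project specs:

import json
import os


class SpecDrivenSpider(object):
    """Hypothetical stand-in for a spider built from a project JSON spec."""

    def __init__(self, name, spec, **kwargs):
        self.name = name
        self.spec = spec
        self.start_urls = spec.get('start_urls', [])


class ExampleSpiderManager(object):
    """Sketch of a spider manager that loads spiders from PROJECT_DIR specs."""

    def __init__(self, datadir):
        self.datadir = datadir

    @classmethod
    def from_settings(cls, settings):
        return cls(settings['PROJECT_DIR'])

    def list(self):
        # One spider per JSON spec file in the project's spiders folder.
        spiders_dir = os.path.join(self.datadir, 'spiders')
        return [os.path.splitext(fname)[0]
                for fname in os.listdir(spiders_dir) if fname.endswith('.json')]

    def create(self, spider_name, **spider_kwargs):
        # Read the spec for the requested spider and build a spider from it.
        spec_path = os.path.join(self.datadir, 'spiders', spider_name + '.json')
        with open(spec_path) as f:
            spec = json.load(f)
        return SpecDrivenSpider(spider_name, spec, **spider_kwargs)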

The line:

EXTENSIONS = {'slybot.closespider.SlybotCloseSpider': 1}

is optional, but recommended. Because slybot spiders cannot be customized as freely as regular Scrapy spiders, they can run into unexpected, uncontrollable situations that lead to never-ending crawls. This extension is a safety measure against that: at fixed intervals it checks that a minimum number of items has been scraped during the last period, and closes the spider otherwise. Refer to slybot/closespider.py for details.
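For illustration, here is a minimal sketch of such an extension using the standard Scrapy extension API (signals plus a from_crawler constructor). It is not slybot's actual implementation, and the two setting names used below are hypothetical:

from twisted.internet import task

from scrapy import signals
from scrapy.exceptions import NotConfigured


class PeriodicItemCountCloseSpider(object):
    """Close the spider if too few items are scraped in each check period."""

    def __init__(self, crawler, check_period, min_items):
        self.crawler = crawler
        self.check_period = check_period
        self.min_items = min_items
        self.items_in_period = 0
        crawler.signals.connect(self.spider_opened, signals.spider_opened)
        crawler.signals.connect(self.item_scraped, signals.item_scraped)
        crawler.signals.connect(self.spider_closed, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting names, for illustration only.
        period = crawler.settings.getint('EXAMPLE_CLOSESPIDER_PERIOD', 3600)
        min_items = crawler.settings.getint('EXAMPLE_CLOSESPIDER_MIN_ITEMS', 1)
        if not period:
            raise NotConfigured
        return cls(crawler, period, min_items)

    def spider_opened(self, spider):
        # Run the check at a fixed interval while the spider is open.
        self.task = task.LoopingCall(self.check, spider)
        self.task.start(self.check_period, now=False)

    def item_scraped(self, item, spider):
        self.items_in_period += 1

    def check(self, spider):
        # If the minimum was not reached during the last period, stop crawling.
        if self.items_in_period < self.min_items:
            self.crawler.engine.close_spider(spider, 'closespider_few_items')
        self.items_in_period = 0

    def spider_closed(self, spider):
        running_task = getattr(self, 'task', None)
        if running_task and running_task.running:
            running_task.stop()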

The DupeFilterPipeline, which is also optional and is enabled with the lines:

ITEM_PIPELINES = ['slybot.dupefilter.DupeFilterPipeline']
SLYDUPEFILTER_ENABLED = True

filters out duplicate items based on the item version, which is calculated from the version fields of the item definition. The pipeline maintains a set of the versions of all items issued by the spider; if the version of a new item is already in the set, the item is dropped.
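As an illustration, a version-based duplicate filter pipeline could look roughly like the following. This is a sketch rather than slybot's actual code (see slybot/dupefilter.py), and the way the version fields are obtained from the spider is an assumption:

from scrapy.exceptions import DropItem


class ExampleDupeFilterPipeline(object):
    """Sketch: drop items whose version has already been seen."""

    def __init__(self):
        self.seen_versions = set()

    def process_item(self, item, spider):
        # The version fields come from the item definition in the project
        # specs; here they are assumed to be exposed on the spider.
        version_fields = getattr(spider, 'version_fields', [])
        if not version_fields:
            return item
        version = tuple(item.get(field) for field in version_fields)
        if version in self.seen_versions:
            raise DropItem('Duplicate item with version %r' % (version,))
        self.seen_versions.add(version)
        return item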

The PROJECT_DIR setting defines where the slybot spider finds the project specifications (item definitions, extractors, spiders). It is a string containing the path of a folder in your filesystem, with a folder structure that is defined below in this document.
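For orientation, a slybot project directory might be laid out roughly as follows. This is an illustrative sketch only; the authoritative structure is the one defined below (the test project in slybot/tests/data/Plants is a complete example):

slybot-project/
    project.json        # project metadata
    items.json          # item definitions, including their version fields
    extractors.json     # extractor definitions
    spiders/
        <spider name>.json    # one spec file per spider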

So, if you know how to use Scrapy, you already know the alternatives for passing these settings to the crawler: use your own customized settings module containing all the settings you need, use the slybot.settings module and provide the remaining settings in a local_slybot_settings.py file somewhere on your Python path, or pass the additional settings on the command line. You can try this right away with the test project in slybot/tests/data/Plants by running, from the current folder:

scrapy list -s PROJECT_DIR=slybot/tests/data/Plants

and then use the scrapy crawl command to run any of the spiders in the resulting list.
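As an example of the second alternative, a local_slybot_settings.py module placed anywhere on your Python path only needs to contain the settings you want to override; the values below are illustrative:

# local_slybot_settings.py -- imported at the end of slybot/settings.py,
# so anything defined here overrides the sample defaults shown above.
PROJECT_DIR = '/path/to/your/slybot-project'
SLYDUPEFILTER_ENABLED = False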
