
A framework for creating semi-automatic web content extractors

Home Page: http://alexmathew.github.io/scrapple

License: MIT License


scrapple's Introduction

Scrapple


Scrapple is a framework for creating web scrapers and web crawlers based on a key-value configuration file. It provides a command line interface to run extractors from a JSON-based configuration input, as well as a web interface for providing the necessary input.

The primary goal of Scrapple is to abstract the process of designing web content extractors. The focus is on what to extract rather than how to do it. The user-specified configuration file contains selector expressions (XPath expressions or CSS selectors) and the attribute to be selected. Scrapple does the work of running this extractor, without the user having to write a program. Scrapple can also be used to generate a Python script that implements the desired extractor.

Installation

You can install Scrapple with pip, after installing the required system libraries (shown here for Debian-based systems):

$ sudo apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev
$ pip install scrapple

Alternatively, you can clone this repository and install the package:

$ git clone http://github.com/scrappleapp/scrapple scrapple
$ cd scrapple
$ pip install -r requirements.txt
$ python setup.py install

How to use Scrapple

Scrapple provides four commands to create and implement extractors: genconfig, run, generate and web (which starts the web interface).

Scrapple implements the desired extractor based on the user-specified configuration file. There are guidelines on how to write these configuration files.

The configuration file is the basic specification of the extractor required. It contains the URL for the web page to be loaded, the selector expressions for the data to be extracted and in the case of crawlers, the selector expression for the links to be crawled through.

The keys used in the configuration file are as follows (a minimal example follows the list):

  • project_name : Specifies the name of the project with which the configuration file is associated.
  • selector_type : Specifies the type of selector expressions used. This could be "xpath" or "css".
  • scraping : Specifies parameters for the extractor to be created.
    • url : Specifies the URL of the base web page to be loaded.

    • data : Specifies a list of selectors for the data to be extracted.

      • selector : Specifies the selector expression.
      • attr : Specifies the attribute to be extracted from the result of the selector expression.
      • field : Specifies the field name under which this data is to be stored.
      • connector : Specifies a text connector to join text from multiple tags (e.g., <li> tags).
      • default : Specifies the default value to be used if the selector expression fails.
    • table : Specifies a description for scraping tabular data.

      • table_type : Specifies the type of table ("rows" or "columns"). A row extraction maps a single extracted row to a set of headers. A column extraction extracts a set of rows, giving a list of header-value mappings.
      • header : Specifies the headers to be used for the table. This can be a list of headers, or a selector that gives the list of headers.
      • prefix : Specifies a prefix to be added to each header.
      • suffix : Specifies a suffix to be added to each header.
      • selector : Specifies the selector for the data. For row extraction, this is a selector that gives the row to be extracted. For column extraction, this is a list of selectors for each column.
      • attr : Specifies the attribute to be extracted from the selected tag.
      • connector : Specifies a text connector to join text from multiple tags (e.g., <li> tags).
      • default : Specifies the default value to be used if the selector does not return any data.
    • next : Specifies the crawler implementation.

      • follow_link : Specifies the selector expression for the <a> tags to be crawled through.
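
For instance, a minimal scraper configuration using these keys could look like this (the URL and selector are hypothetical placeholders):

{
    "project_name": "example",
    "selector_type": "css",
    "scraping": {
        "url": "http://example.com/page",
        "data": [
            {
                "field": "title",
                "selector": "h1.title",
                "attr": "text",
                "default": "<no_title>",
                "connector": ""
            }
        ]
    }
}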

The main objective of the configuration file is to specify extraction rules in terms of selector expressions and the attribute to be extracted. There are certain set forms of selector/attribute value pairs that perform various types of content extraction.

Selector expressions:

  • CSS selector or XPath expressions that specify the tag to be selected.
  • "url" to take the URL of the current page on which extraction is being performed.

Attribute selectors:

  • "text" to extract the textual content from that tag.
  • "href", "src" etc., to extract any of the other attributes of the selected tag.

Tutorials

For a more detailed walkthrough, check out the tutorial in the documentation.

In this simple example for using Scrapple, we'll extract NBA player information from the ESPN website.

To first create the skeleton configuration file, we use the genconfig command.

$ scrapple genconfig nba http://espn.go.com/nba/teams --type=crawler --levels=2

This creates nba.json, a sample Scrapple configuration file for a crawler that uses XPath expressions as selectors. This can be edited to specify the required follow-link selector, data selectors and attributes.

{
    "project_name": "nba",
    "selector_type": "xpath",
    "scraping": {
        "url": "http://espn.go.com/nba/teams",
        "data": [
            {
                "field": "",
                "selector": "",
                "attr": "",
                "default": "",
                "connector": ""
            }
        ],
        "next": [
            {
                "follow_link": "//*[@class='mod-content']//a[3]",
                "scraping": {
                    "data": [
                        {
                            "field": "team",
                            "selector": "//h2",
                            "attr": "text",
                            "default": "<no_team>",
                            "connector": ""
                        }
                    ],
                    "next": [
                        {
                            "follow_link": "//*[@class='mod-content']/table[1]//tr[@class!='colhead']//a",
                            "scraping": {
                                "data": [
                                    {
                                        "field": "name",
                                        "selector": "//h1",
                                        "attr": "text",
                                        "default": "<no_name>",
                                        "connector": ""
                                    },
                                    {
                                        "field": "headshot_link",
                                        "selector": "//*[@class='main-headshot']/img",
                                        "attr": "src",
                                        "default": "<no_image>",
                                        "connector": ""
                                    },
                                    {
                                        "field": "number & position",
                                        "selector": "//ul[@class='general-info']/li[1]",
                                        "attr": "text",
                                        "default": "<00> #<GFC>",
                                        "connector": ""
                                    }                                               
                                ],
                                "table": [
                                    {
                                        "table_type": "rows",
                                        "header": "//div[@class='player-stats']//table//th",
                                        "prefix": "season_",
                                        "suffix": "",
                                        "selector": "//div[@class='player-stats']//table//tr[1]/td",
                                        "attr": "text",
                                        "default": "",
                                        "connector": ""
                                    },
                                    {
                                        "table_type": "rows",
                                        "header": "//div[@class='player-stats']//table//th",
                                        "prefix": "career_",
                                        "suffix": "",
                                        "selector": "//div[@class='player-stats']//table//tr[@class='career']/td",
                                        "attr": "text",
                                        "default": "",
                                        "connector": ""
                                    }
                                ]
                            }
                        }
                    ]                   
                }
            }
        ]
    }
}

The extractor can be run using the run command:

$ scrapple run nba nba_players -o json

This creates nba_players.json, which contains the extracted data. An example snippet of this data:

{

    "project": "nba",
    "data": [

        # nba_players.json continues

        { 
            "career_APG" : "9.9",
            "career_PER" : "",
            "career_PPG" : "18.6",
            "career_RPG" : "4.4",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/2779.png&w=350&h=254",
            "name" : "Chris Paul",
            "number & position" : "#3 PG",
            "season_APG" : "9.2",
            "season_PER" : "23.49",
            "season_PPG" : "17.6",
            "season_RPG" : "3.5",
            "team" : "Los Angeles Clippers"
        },
        { 
            "career_APG" : "3.6",
            "career_PER" : "",
            "career_PPG" : "20.3",
            "career_RPG" : "5.8",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/662.png&w=350&h=254",
            "name" : "Paul Pierce",
            "number & position" : "#34 SF",
            "season_APG" : "0.9",
            "season_PER" : "7.55",
            "season_PPG" : "5.0",
            "season_RPG" : "2.6",
            "team" : "Los Angeles Clippers"
        },
        { 
            "career_APG" : "2.9",
            "career_PER" : "",
            "career_PPG" : "3.7",
            "career_RPG" : "1.8",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/4182.png&w=350&h=254",
            "name" : "Pablo Prigioni",
            "number & position" : "#9 PG",
            "season_APG" : "1.9",
            "season_PER" : "8.72",
            "season_PPG" : "2.3",
            "season_RPG" : "1.5",
            "team" : "Los Angeles Clippers"
        },
        { 
            "career_APG" : "2.0",
            "career_PER" : "",
            "career_PPG" : "11.1",
            "career_RPG" : "1.9",
            "headshot_link" : "http://a.espncdn.com/combiner/i?img=/i/headshots/nba/players/full/3024.png&w=350&h=254",
            "name" : "J.J. Redick",
            "number & position" : "#4 SG",
            "season_APG" : "1.6",
            "season_PER" : "18.10",
            "season_PPG" : "15.9",
            "season_RPG" : "1.5",
            "team" : "Los Angeles Clippers"
        },

        # nba_players.json continues
    ]

}

The run command can also be used to create a CSV file with the extracted data, using the --output_type=csv argument.
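
For example:

$ scrapple run nba nba_players --output_type=csv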

The generate command can be used to generate a Python script that implements this extractor. In essence, it replicates the execution of the run command.

$ scrapple generate nba nba_script -o json

This creates nba_script.py, which extracts the required data and stores the data in a JSON document.
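
The generated script is, broadly, of the following shape. This is a simplified sketch assuming the requests and lxml libraries (which Scrapple builds on), not the exact generated code:

import json

import requests
import lxml.html

# Load the base page (the first level of the crawl)
response = requests.get('http://espn.go.com/nba/teams')
tree = lxml.html.fromstring(response.text)

results = {'project': 'nba', 'data': []}

# Follow each link matched by the follow_link selector
# (relative hrefs would need to be resolved against the base URL)
for link in tree.xpath("//*[@class='mod-content']//a[3]"):
    team_page = requests.get(link.get('href'))
    team_tree = lxml.html.fromstring(team_page.text)
    record = {}
    # Apply each data selector, falling back to the default on failure
    try:
        record['team'] = team_tree.xpath('//h2')[0].text_content()
    except IndexError:
        record['team'] = '<no_team>'
    # ... the full script also follows the second-level links and
    # extracts the player fields and the table data ...
    results['data'].append(record)

with open('nba_players.json', 'w') as f:
    json.dump(results, f, indent=4)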

Documentation

You can read the complete documentation for extensive coverage of the background behind Scrapple, a thorough explanation of the package implementation, and tutorials on how to use Scrapple to run your scraper/crawler jobs.

Authors

Scrapple is maintained by Alex Mathew and Harish Balakrishnan.


scrapple's Issues

Update tests

Add new tests with dependable pages (pyvideo is pretty much dead)

Use generators in run command execution

Separate the config reading, scraping and output writing into different methods. This allows the scraping method to be used more generally, e.g., when you want the data scraped and loaded into a database.
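
A rough sketch of the shape this refactor could take (hypothetical function names, not the current Scrapple API):

import json

def read_config(path):
    with open(path) as f:
        return json.load(f)

def scrape(config):
    # Yield one extracted record at a time instead of building the
    # whole result list in memory. The real implementation would crawl
    # pages and apply the configured selectors here.
    for entry in config['scraping']['data']:
        yield {entry['field']: entry.get('default', '')}

def write_json(records, path):
    with open(path, 'w') as f:
        json.dump({'data': list(records)}, f, indent=4)

Any consumer can then drain the same generator, such as a CSV writer or a database loader:

for record in scrape(read_config('nba.json')):
    db.insert(record)  # db is a placeholder for any storage backend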

Connector for text attributes

Provide an option to specify the connector between text attributes. By default, a space separator is used, but some structures, such as unordered or ordered lists, may need a newline or comma separator.
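
As a sketch of the idea, using lxml (which Scrapple builds on) to join the text of several <li> tags with a configurable connector:

import lxml.html

html = '<ul><li>alpha</li><li>beta</li><li>gamma</li></ul>'
tree = lxml.html.fromstring(html)
texts = [li.text_content() for li in tree.xpath('//li')]

print(' '.join(texts))    # default space separator: 'alpha beta gamma'
print(', '.join(texts))   # connector from the config: 'alpha, beta, gamma'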

Handle exceptions in commands

The execute_command() method in the command classes should handle exceptions related to the arguments or the input config file.

Use kwargs for all extract methods

Using keyword arguments allows for flexibility in passing arguments to the method. For example, when a connector value has to be passed for a text attribute [ #74 ].
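
A hypothetical signature along these lines (for illustration, not the current Scrapple API):

class Selector:
    def extract_content(self, expression, **kwargs):
        # Optional arguments arrive as keyword arguments, so new options
        # (like 'connector' for text attributes) can be added without
        # changing every call site.
        attr = kwargs.get('attr', 'text')
        default = kwargs.get('default', '')
        connector = kwargs.get('connector', ' ')
        # ... apply the expression, join matched text with connector ...
        return default  # placeholder; real extraction logic would go here

A call site would then read, for example, extract_content("//ul/li", attr="text", connector=", ").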

Refactor all the implementation classes

2015 me wrote horrible OO code.

Actually properly use inheritance in the classes. Why is this needed?

  • Helps get rid of a major load of code duplication
  • Forces functions to be broken down, decreasing the cognitive complexity of the implementation functions (see the sketch below).
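
A sketch of the kind of structure this points to (class names for illustration only):

import lxml.html

class Selector:
    # Shared extraction logic lives here once, instead of being
    # duplicated in every selector implementation.
    def __init__(self, html):
        self.tree = lxml.html.fromstring(html)

    def query(self, expression):
        raise NotImplementedError

    def extract_text(self, expression, default='', connector=' '):
        tags = self.query(expression)
        return connector.join(t.text_content() for t in tags) or default

class XpathSelector(Selector):
    # Subclasses only override how an expression is evaluated
    def query(self, expression):
        return self.tree.xpath(expression)

class CssSelector(Selector):
    # (requires the cssselect package)
    def query(self, expression):
        return self.tree.cssselect(expression)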

scrapple.utils.exceptions.handle_exceptions(args) needs fixes

A few fixes are possible here (a sketch follows below):

  • The function name handle_exceptions(..) could be renamed to check_arguments(..) or something similar, because handling exceptions usually involves try-except blocks.
  • Gracefully handling TypeError and ValueError exceptions for int(args['--levels']) on #31.
  • Raising an ultra-generic Exception / BaseException is an anti-pattern and makes bugs difficult to debug.

If there are multiple fixes, they can be raised either as separate pull requests, or as a single pull request with the commits squashed before/while merging.
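
A sketch of what the first two fixes could look like (hypothetical, for illustration):

class InvalidArgumentError(Exception):
    # A specific exception type, instead of a bare Exception/BaseException
    pass

def check_arguments(args):
    # Renamed from handle_exceptions(..): this validates arguments,
    # rather than wrapping code in try-except blocks
    try:
        levels = int(args['--levels'])
    except (TypeError, ValueError):
        raise InvalidArgumentError('--levels should be an integer')
    if levels < 1:
        raise InvalidArgumentError('--levels should be positive')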

Add tests

Tests for verifying CSV output from run/generate need to be added. #44 and #46
