
Crawlab Mind

Crawlab Mind is a project that solves complex web crawling/scraping problems with intelligent features.

Installation

pip install crawlab-mind

List Extraction

List extraction is the most common scenario in web crawling: list pages of products, news articles, and any other items displayed in a list layout.

Crawlab Mind provides several simple API methods to auto-extract list items.

Basic Example

Below is a basic example that extracts list items automatically with a single method call.

from crawlab_mind import extract_list

html_list = extract_list('/path/to/html')
print(html_list)
for item in html_list.items:
    print(item)

Multi-List Extraction

Sometimes there are multiple lists in an HTML page. We can still extract their items.

from crawlab_mind import extract_list_items

items = extract_list_items('/path/to/html')
for item in items:
    print(item)

List Extraction with Different Selection Methods

We can extract the desired list items using one of the built-in selection methods.

from crawlab_mind.constants.list import ListSelectMethod
from crawlab_mind import extract_list

# using "Mean Max Text Length" (MMTL)
html_list = extract_list('/path/to/html', method=ListSelectMethod.MeanMaxTextLength)
for item in html_list.items:
    print(item)

# using "Mean Text Tag Count" (MTTC)
html_list = extract_list('/path/to/html', method=ListSelectMethod.MeanTextTagCount)
for item in html_list.items:
    print(item)
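The two method names suggest simple scoring heuristics. The sketch below shows one plausible reading (assumed semantics for illustration, not the library's actual implementation): MMTL scores a candidate list by the mean of each item's longest text fragment, while MTTC scores it by the mean number of text-bearing tags per item.

```python
from statistics import mean

def mean_max_text_length(items: list[list[str]]) -> float:
    """Score a candidate list by the mean, over items, of the longest
    text fragment in each item (assumed MMTL semantics)."""
    return mean(max((len(text) for text in item), default=0) for item in items)

def mean_text_tag_count(items: list[list[str]]) -> float:
    """Score a candidate list by the mean number of text-bearing
    tags per item (assumed MTTC semantics)."""
    return mean(sum(1 for text in item if text.strip()) for item in items)

# Each item is represented here by the text fragments of its tags.
news = [["Headline one", "2024-01-01"], ["Headline two", "2024-01-02"]]
print(mean_max_text_length(news))
print(mean_text_tag_count(news))
```

A list of long headlines would win under MMTL, while a list whose items each contain many small text fields (title, date, price) would rank higher under MTTC.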

Pagination Extraction

Pagination is another common element to scrape: extracting its "next" links lets the crawler continue to further pages.

Again, Crawlab Mind provides a way to auto-identify and extract pagination elements with some simple but smart algorithms.

from crawlab_mind import extract_pagination

html_list = extract_pagination('/path/to/html')
for item in html_list.all_items:
    print(item)

Auto-Extraction Algorithms

The methodology behind the auto-extraction functionality is quite simple. It builds on the tree structure of HTML. By encoding each HTML node as a high-dimensional vector based on its attributes (tag names and class names), we can apply a clustering algorithm to the resulting dataset to obtain candidate lists. Finally, the extractor applies a selection method to choose the best list element.
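The steps above can be sketched with only the standard library. This is an illustrative reimplementation under simplifying assumptions: a flat (tag, class) signature stands in for the high-dimensional node encoding, grouping identical-signature siblings stands in for clustering, and mean text length stands in for the selection method. All names here are hypothetical, not the library's API.

```python
from collections import defaultdict
from html.parser import HTMLParser
from statistics import mean

class ListFinder(HTMLParser):
    """Groups sibling elements by a (tag, class) signature -- a simplified
    stand-in for the high-dimensional node encoding described above.
    Assumes well-formed HTML without void tags, for brevity."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (signature, collected_texts)
        self.candidates = defaultdict(list)  # group key -> list of item texts

    def handle_starttag(self, tag, attrs):
        signature = (tag, dict(attrs).get("class", ""))
        self.stack.append((signature, []))

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if not self.stack:
            return
        signature, texts = self.stack.pop()
        if self.stack:
            # Siblings sharing a parent and a signature form one candidate list.
            key = (len(self.stack), self.stack[-1][0], signature)
            self.candidates[key].append(" ".join(texts))
            self.stack[-1][1].extend(texts)  # bubble text up to the parent

html = """
<html><body>
  <div class="nav"><a>Home</a><a>About</a></div>
  <ul class="news">
    <li class="item"><a>First headline of the day</a></li>
    <li class="item"><a>Second headline, longer still</a></li>
    <li class="item"><a>Third headline rounds it out</a></li>
  </ul>
</body></html>
"""

finder = ListFinder()
finder.feed(html)

# Candidate lists need at least three similar siblings; among those, pick
# the group with the highest mean text length (the MMTL idea in miniature).
groups = [items for items in finder.candidates.values() if len(items) >= 3]
best = max(groups, key=lambda items: mean(len(text) for text in items))
for item in best:
    print(item)
```

On this sample the two-link navigation bar is filtered out by the sibling-count threshold, and the news list wins the text-length comparison, so the three headlines are printed.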

Below is an illustration of the algorithm.
