
[Build Status](https://travis-ci.org/ViDA-NYU/ache)

ACHE Focused Crawler

Introduction

ACHE is an implementation of a focused crawler. A focused crawler is a web crawler that collects only Web pages that satisfy some specific property. ACHE differs from other crawlers in that it includes page classifiers, which allow it to distinguish between relevant and irrelevant pages in a given domain. The page classifier can range from a simple regular expression (one that matches every page containing a specific word, for example) to a sophisticated machine-learned classification model. ACHE also includes link classifiers, which allow it to decide the best order in which to download links, so that relevant content is found on the web as quickly as possible without wasting resources downloading irrelevant content.

Installation

You can either build ache from the source code or download an executable package using Conda.

Build from source with Gradle

To build ache from source, you can run the following commands in your terminal:

git clone https://github.com/ViDA-NYU/ache.git
cd ache
./gradlew clean installApp

which will generate an installation package under build/install/.

Learn more about Gradle: http://www.gradle.org/documentation.

Download with Conda

You can download ache from Binstar with Conda by running:

conda install -c memex ache

NOTE: Only tagged versions are published to Binstar, so the ache package from Binstar may be outdated. If you want to try the most recent version, please clone the repository, compile the code using the instructions above, and then start the crawler using the ache executable located in build/install/ache/bin.

Build a page classifier for ACHE

To focus on a certain topic, ACHE needs a page classifier to decide, given a newly crawled page, whether it is on-topic or not. A page classifier can be created with ache given positive and negative examples. Each training example corresponds to a web page whose HTML content is stored in a plain text file. Assume that you store positive and negative examples in two directories, positive and negative, which reside in a training_data directory (a sample layout is sketched after the parameter descriptions below). Here is how you build a model from these examples:

./build/install/ache/bin/ache buildModel -t <training data path> -o <output path for model> -c <stopwords file path>

<training data path> is the path to the directory containing positive and negative examples.

<output path> is the directory in which the generated model will be saved. The model consists of two files: pageclassifier.model and pageclassifier.features.

<stopwords file path> is a file containing a list of words that the classifier should ignore. Example: https://github.com/ViDA-NYU/ache/blob/master/config/sample_config/stoplist.txt
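For reference, a minimal sketch of how such a training data directory might be laid out, using the hypothetical directory names from above (the individual file names are arbitrary; each file just holds the HTML content of one example page):

training_data/
  positive/
    example_page_1.html
    example_page_2.html
  negative/
    example_page_1.html
    example_page_2.html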

Example of building a page classifier using test data:

./build/install/ache/bin/ache buildModel -c config/sample_config/stoplist.txt -o model_output -t config/sample_training_data

Start ACHE

After you generate a model, you need to prepare a seed file, in which each line is a URL. Then, to start the crawler, run:

./build/install/ache/bin/ache startCrawl -o <data output path> -c <config path> -s <seed path> -m <model path> [-e <elastic search index name>]

<config path> is the path to the config directory.

<seed path> is the path to the seed file (one URL per line; see the example after this list).

<model path> is the path to the model directory (containing pageclassifier.model and pageclassifier.features).

<data output path> is the path to the data output directory.
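For illustration, a seed file is just a plain text file with one URL per line. The URLs below are placeholders, not part of the sample configuration shipped with ACHE:

http://www.example.com/some-relevant-page.html
http://www.example.org/another-starting-point.html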

Example of running ACHE:

./build/install/ache/bin/ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model -e achecrawler

Data Formats

ACHE can store data in different data formats. The data format can be configured by changing the key DATA_FORMAT in the [target storage configuration file](https://github.com/ViDA-NYU/ache/blob/master/config/sample_config/target_storage/target_storage.cfg) (see the sketch after the list below). The data formats available now are:

  • FILE (default) -- only raw content is stored, in plain text files.
  • CBOR -- raw content and some metadata are stored in files using the CBOR format.
  • ELASTICSEARCH -- raw content and metadata are indexed in an ElasticSearch index. See ElasticSearch Integration for details about configuration.
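For example, to switch the crawler to the ELASTICSEARCH format, the relevant entry in target_storage.cfg might look like the minimal sketch below. This assumes a simple one-setting-per-line KEY VALUE format; check the sample configuration file linked above for the exact syntax used by your version:

DATA_FORMAT ELASTICSEARCH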

More information?

More documentation is available in the project's Wiki.

Where to report bugs?

We welcome user feedback. Please submit any suggestions or bug reports using the GitHub issue tracker (https://github.com/ViDA-NYU/ache/issues).

Contact?

Kien Pham [[email protected]]

Aecio Santos [[email protected]]

