eCrawler

Overview

For scalability and to make the most of parallelism, I developed the crawler as independent modules that communicate through shared memory.

  • Spider: decides how sites will be scraped
  • Downloader: performs the actual downloading
  • Text Extractor: extracts the text from the downloaded PDFs
  • Date Extractor: parses information from the extracted text
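
The module layout above can be sketched as a chain of stages that hand work to each other through shared queues. This is only an illustration under stated assumptions: the Run section starts a Redis server, which presumably plays the role of the shared store, but the sketch uses Python's in-process `queue.Queue` and threads so that it runs without a server. The function names, the example URL, and the fake PDF and text values are all hypothetical stand-ins, not eCrawler's code.

```python
import queue
import threading

# Sentinel that tells a stage to shut down.
STOP = object()


def stage(work, inbox, outbox):
    """Read items from inbox, apply work, and pass the results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


# Hypothetical stand-ins for the four modules (assumptions, not the real logic).
def spider(seed):
    return seed + "/report.pdf"  # decide which URL to fetch


def downloader(url):
    return {"url": url, "pdf": b"%PDF"}  # pretend to download the file


def text_extractor(doc):
    return {**doc, "text": "Published 2020-01-31"}  # pretend to extract text


def date_extractor(doc):
    return {"url": doc["url"], "date": doc["text"].split()[-1]}


def run_pipeline(seeds):
    """Run all four stages concurrently and collect the final records."""
    works = [spider, downloader, text_extractor, date_extractor]
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(works) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(works)
    ]
    for thread in threads:
        thread.start()
    for seed in seeds:
        queues[0].put(seed)
    queues[0].put(STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

Because each stage only touches its input and output queue, any stage can be stopped, restarted, or run in several copies without changing the others, which is the property the modular design is after.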

Requirements

  • Install the required packages by running the following commands:
$ conda create -n myenv python==3.7
$ conda activate myenv
$ pip3 install -r requirements.txt

Run

  • Redis
$ redis-server
  • Spider
$ python spider.py [--seeds_path "resources/seeds.json"] [--reset]
  • Downloader
$ python downloader.py --download_path "resources/downloaded"
  • Text Extractor
$ python text_extractor.py --download_path "resources/downloaded"
  • Date Extractor
$ python date_extractor.py --saved_path "resources/json"
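
As a rough illustration of the Spider command line shown above, the two optional flags could be declared with `argparse` as below. This is a sketch, not the project's actual parser. It assumes that `resources/seeds.json` is the default seeds file and that `--reset` is a plain on/off switch. The README does not say what `--reset` does.

```python
import argparse


def build_parser():
    """Build a parser for the two optional Spider flags shown in the README."""
    parser = argparse.ArgumentParser(
        description="Hypothetical command line for spider.py"
    )
    parser.add_argument(
        "--seeds_path",
        default="resources/seeds.json",  # assumed default, taken from the usage line
        help="path to the JSON file that lists the seed sites",
    )
    parser.add_argument(
        "--reset",
        action="store_true",  # assumed to be a boolean switch
        help="reset flag (its exact effect is not documented in the README)",
    )
    return parser


def parse_args(argv=None):
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

Calling `parse_args(["--reset"])` returns a namespace with `reset` set to `True` and `seeds_path` left at the assumed default.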

Scheduler

On Linux, crontab can be used to schedule the program automatically and to restart it after a crash. For example, the following crontab entry runs every 10 minutes, and flock ensures that only one instance of the crawler is running at a time.

*/10 * * * * /usr/bin/flock -n /tmp/lock_spider_1 bash [full-path]/spider_1.sh
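
`flock -n` tries to take an exclusive lock without waiting and gives up immediately if another process already holds it, so a cron run that starts while the previous crawler is still alive does nothing. The same single-instance guard can be sketched inside Python with `fcntl.flock`. The helper name `try_lock` is hypothetical, and the sketch assumes a Unix-like system, since `fcntl` is not available on Windows.

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock on path, or None if it is taken.

    Mirrors `flock -n`: the attempt never blocks. The lock is released when
    the returned file is closed or the process exits.
    """
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail at once instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

A script would call `try_lock("/tmp/lock_spider_1")` at startup, exit if it gets `None`, and otherwise keep the returned file open for as long as it runs.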

Contributors

hajipoor
