collie's Issues

Basic: crawling cache

once a url is crawled, we need to store a simple record in the crawling cache to indicate it has been crawled recently, to prevent duplicated crawling and to speed up job processing:

design:
store a record in redis:
1. keyed by UID (an md5 digest of the normalized url).
2. crawl time: when the url was crawled.
3. conditional-get related info: Last-Modified time, ETag.
4. web page size.
5. items 3 and 4 will be used to decide whether we can ping the url (using a HEAD request) instead of actually downloading the page again.
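
a minimal sketch (not the final implementation) of such a cache record, assuming redis-py and a 30-day TTL; the key prefix and field names are placeholders, not part of the design above:

import hashlib
import json
import time

import redis

CACHE_TTL = 30 * 24 * 3600  # assumed purge window of roughly a month

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def uid_for(normalized_url):
    """UID = md5 digest of the normalized url."""
    return hashlib.md5(normalized_url.encode('utf-8')).hexdigest()

def mark_crawled(normalized_url, last_modified=None, etag=None, page_size=0):
    """Record that a url was crawled, plus its conditional-get info."""
    record = {
        'crawl_time': int(time.time()),
        'last_modified': last_modified or '',
        'etag': etag or '',
        'page_size': page_size,
    }
    r.setex('cc:' + uid_for(normalized_url), CACHE_TTL, json.dumps(record))

def recently_crawled(normalized_url):
    """Return the cached record if the url was crawled recently, else None."""
    raw = r.get('cc:' + uid_for(normalized_url))
    return json.loads(raw) if raw else None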

Basic: downloaders

  1. model downloading worker threads as slots. slots are grouped into a pack, which maps to a process. Each downloading machine can host a couple of processes. Thus, downloading units are addressed as:
    [ machine_id, pack_id, slot_id]
  2. downloaders are stateless.
  3. downloaders are independent, and shall be capable of being spread over public clouds and of using proxies.

design:

  1. the crawl delay keeps each slot at 1 qps (1 url/sec) or below; use parallel downloading on multiple slots to support a higher aggregate qps.
  2. no concurrent downloading on the same slot.
  3. use gevent for the thread pool and networking.
  4. use python-requests (default), or urllib2, for http.
  5. store the downloaded page into hbase.
  6. store the url into the crawl cache (redis).
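
a rough sketch of one pack, assuming gevent plus python-requests; the slot count, queue type, and names are assumptions, and the hbase/redis writes are left as TODOs:

import gevent
from gevent import monkey
monkey.patch_all()          # make requests/sockets cooperative

from gevent.pool import Pool
import requests

SLOTS_PER_PACK = 8          # assumed; tune per machine
CRAWL_DELAY = 1.0           # at most 1 url/sec per slot

def run_slot(slot_id, url_queue):
    """One slot: fetch urls sequentially, never concurrently within the slot."""
    session = requests.Session()
    while True:
        url = url_queue.get()          # a gevent.queue.Queue; blocks when empty
        if url is None:                # sentinel: shut the slot down
            return
        try:
            resp = session.get(url, timeout=30)
            # TODO: store resp.content into hbase, then record the url
            # in the crawl cache (redis).
        except requests.RequestException:
            pass                        # failure reason would go to the meta family
        gevent.sleep(CRAWL_DELAY)       # keep the slot at or below 1 qps

def run_pack(url_queues):
    """Spawn one greenlet per slot; the pack corresponds to a single process."""
    pool = Pool(SLOTS_PER_PACK)
    for slot_id in range(SLOTS_PER_PACK):
        pool.spawn(run_slot, slot_id, url_queues[slot_id])
    pool.join()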

Basic: crawl scheduler

we need a way to schedule all the urls submitted by users, grouped into jobs, so that crawling is efficient while obeying the target web sites' crawling policies as much as possible.

basic functionality: [items 1 and 2 do not belong to this issue, but this issue depends on them]

  1. model downloading worker threads as slots. slots are grouped into a pack, which maps to a process. Each downloading machine can host a couple of processes. Thus, downloading units are addressed as:
    [ machine_id, pack_id, slot_id]
  2. downloaders are stateless.
  3. schedule urls in batches; the batch is the scheduling unit, chosen to amortize the communication overhead between scheduler and downloader and to make downloading more efficient through reduced DNS lookups and, possibly, reuse of existing connections.
  4. a strict failover mechanism shall be in place, guaranteeing that a batch will not be skipped due to machine failures or software failures.
  5. the urls shall be distributed over all downloading slots in the most efficient way. That means host mixing shall be done as well as we can.
  6. content store for the downloaded pages, in hbase:
    • meta: http headers and failure reasons
    • content: web page
    • xapp: application related, reserved column family.
  7. crawl cache in redis. stores all recently crawled urls.

design:

  1. use redis to implement the scheduling logic; redis also keeps all the state.
  2. data stored in hbase shall be accessible from other languages or systems.
  3. the crawl cache uses the UID, an md5 digest of the url string, as key. entries are automatically purged after a month.
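
an illustrative sketch of the redis-kept scheduling state, assuming one list per slot plus an in-flight hash for failover; all key names are assumptions, and a real failover path would move batches atomically (e.g. RPOPLPUSH) rather than in two steps:

import json
import uuid

import redis

r = redis.StrictRedis()

def slot_key(machine_id, pack_id, slot_id):
    return 'sched:slot:%s:%s:%s' % (machine_id, pack_id, slot_id)

def enqueue_batch(machine_id, pack_id, slot_id, urls):
    """Push a batch of urls (with hosts mixed) onto a slot's queue."""
    batch_id = uuid.uuid4().hex
    payload = json.dumps({'batch_id': batch_id, 'urls': urls})
    r.rpush(slot_key(machine_id, pack_id, slot_id), payload)
    return batch_id

def checkout_batch(machine_id, pack_id, slot_id):
    """Downloader pulls the next batch; record it as in flight for failover."""
    payload = r.lpop(slot_key(machine_id, pack_id, slot_id))
    if payload is None:
        return None
    batch = json.loads(payload)
    r.hset('sched:inflight', batch['batch_id'], payload)
    return batch

def ack_batch(batch_id):
    """Batch finished; drop it from the in-flight set so it is not re-queued."""
    r.hdel('sched:inflight', batch_id)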

Basic: Content Store

we need to store downloaded pages for further processing later.

design:

  1. use hbase to store the downloaded page and downloading info:
    meta: downloading status, http headers, and failure reasons.
    content: the raw page data.
    x-appl: application related info. reserved column family.
  2. use happybase to interface with hbase.
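
a minimal happybase sketch of this store; the table name 'pages' and the row key (the crawl-cache UID) are assumptions made for illustration:

import happybase

connection = happybase.Connection('localhost')

# one-time setup with the three column families from the design:
# connection.create_table('pages', {'meta': {}, 'content': {}, 'x-appl': {}})

pages = connection.table('pages')

def store_page(uid, status, headers, body):
    """Write downloading status/headers into meta and the raw page into content."""
    row = {b'meta:status': str(status).encode()}
    for name, value in headers.items():
        row[b'meta:' + name.lower().encode()] = value.encode()
    if body is not None:
        row[b'content:raw'] = body
    pages.put(uid.encode(), row)

def load_page(uid):
    """Fetch everything stored for a UID."""
    return pages.row(uid.encode())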

Expansion: XPath Template Store

The XPath Template Store is used to drive expansion (based on list templates) and content extraction (based on content templates).

Two kinds of templates are stored in the store (redis):

  • list templates. These templates are used to extract more link urls from the page; the urls are used to crawl deeper into more pages. For example, the template could be applied to "category listing" pages, "related contents" pages, "most popular items" pages, etc.
  • content templates. These templates are used to extract one or more entities from the page. The extracted data is structured and could be added or merged into an existing database.
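
a hedged sketch of how the two template kinds could be applied with lxml, assuming templates are stored in redis hashes keyed by host: a plain xpath string for list templates and a json field-to-xpath map for content templates (both layouts are assumptions):

import json

import lxml.html
import redis

r = redis.StrictRedis()

def apply_list_template(host, page_html, base_url):
    """A list template yields more urls to crawl deeper into the site."""
    xpath = r.hget('tpl:list', host)
    if not xpath:
        return []
    doc = lxml.html.fromstring(page_html, base_url=base_url)
    doc.make_links_absolute(base_url)
    return doc.xpath(xpath.decode())        # e.g. '//div[@class="listing"]//a/@href'

def apply_content_template(host, page_html):
    """A content template yields structured fields extracted from the page."""
    raw = r.hget('tpl:content', host)
    if not raw:
        return {}
    fields = json.loads(raw)                # e.g. {"title": "//h1/text()", ...}
    doc = lxml.html.fromstring(page_html)
    return {name: doc.xpath(xp) for name, xp in fields.items()}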

TODO:

  • microformats shall be supported as one kind of content template. Parsing of microformats is supported in the lxml library. We need to keep track of how many sites are using microformats now.

HPT: Host property table

store host related information to help get rid of duplicated urls and to optimize downloading:

  1. host normalization, e.g. a 301 decides that the target host is the winner. this info will be used to normalize urls and remove dups.
  2. host robots.txt info.
  3. host properties
    • friendliness: how the host behaves, based upon previous crawling.
    • stability: ranking helps here, also derived from previous crawling.
    • ranking: from other sources, or assigned.
    • other info, e.g. ip address caching.
  4. a timestamp to decide record freshness.

design:

  • this table is stored in redis.
  • it shall sync with the same table in hbase.
  • it might be loaded on demand from hbase into redis.
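
an illustrative sketch of the table as one redis hash per host, loaded on demand from hbase; the key layout, the 'hpt' table name, and the field names are assumptions:

import time

import happybase
import redis

r = redis.StrictRedis()
hbase = happybase.Connection('localhost')
hpt_hbase = hbase.table('hpt')

def hpt_key(host):
    return 'hpt:' + host

def get_host_properties(host):
    """Read host properties from redis, falling back to hbase on a miss."""
    props = r.hgetall(hpt_key(host))
    if props:
        return props
    row = hpt_hbase.row(host.encode())
    if row:
        props = {k.split(b':', 1)[1]: v for k, v in row.items()}
        r.hset(hpt_key(host), mapping=props)
    return props

def set_host_property(host, field, value):
    """Update one property and refresh the freshness timestamp."""
    r.hset(hpt_key(host), mapping={field: value,
                                   'updated_at': int(time.time())})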

The implementation will proceed in stages, and this issue will be split into sub-issues.

Basic: a job store

create and maintain user-submitted jobs. A job is defined by:

  1. a batch of urls
  2. user id
  3. user contact info (to send notifications)
  4. priority
  5. job deadline or cut-off time

design:

  1. the job store is in redis.
  2. it requires a python class with a simple api.
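
a minimal sketch of such a class, assuming one redis hash per job; the key layout and field names are placeholders:

import json
import time
import uuid

import redis

class JobStore(object):
    def __init__(self, conn=None):
        self.r = conn or redis.StrictRedis()

    def create(self, urls, user_id, contact, priority=0, deadline=None):
        """Create a job from a batch of urls plus user and scheduling info."""
        job_id = uuid.uuid4().hex
        self.r.hset('job:' + job_id, mapping={
            'urls': json.dumps(urls),
            'user_id': user_id,
            'contact': contact,
            'priority': priority,
            'deadline': deadline or '',
            'created_at': int(time.time()),
            'status': 'pending',
        })
        return job_id

    def get(self, job_id):
        return self.r.hgetall('job:' + job_id)

    def set_status(self, job_id, status):
        self.r.hset('job:' + job_id, 'status', status)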

todo:

  1. add a REST interface for job submission and status.
  2. add job result download mechanisms.

#4 SubIssue: HPT: Normalized Host Name

we need to know if a host always redirects to another host, e.g. gmail.com => mail.gmail.com

design:

  1. any http 301 will create an entry for host name normalization. we only support host mapping, not the path part.
  2. meta-refresh within an html page is not supported.
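
a small sketch of recording the 301-derived host mapping in redis; the key name is an assumption:

from urllib.parse import urlsplit

import redis

r = redis.StrictRedis()

def record_301(source_url, location_header):
    """On an http 301, map the source host to the redirect target host."""
    src_host = urlsplit(source_url).netloc
    dst_host = urlsplit(location_header).netloc
    if src_host and dst_host and src_host != dst_host:
        r.hset('hpt:normalized_host', src_host, dst_host)

def normalize_host(host):
    """Return the winning host if a mapping exists, else the host itself."""
    target = r.hget('hpt:normalized_host', host)
    return target.decode() if target else host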
