collie's Issues

Basic: crawling cache

once a url is crawled, we need to store a simple record in the crawling cache to indicate it has been crawled recently, to prevent duplicated crawling and to speed up job processing:

design:
store a record in redis:
1. keyed by UID (an md5 digest of the normalized url).
2. crawl time: when the url was crawled.
3. conditional-get related info: Last-Modified time, ETag.
4. web page size.
5. items 3 and 4 will be used to decide whether we can ping the url (using a HEAD request) instead of actually downloading the page again.
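
a minimal sketch (not the final implementation) of such a cache record, assuming redis-py and a 30-day TTL; the key prefix and field names are placeholders, not part of the design above:

import hashlib
import json
import time

import redis

CACHE_TTL = 30 * 24 * 3600  # assumed purge window of roughly a month

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def uid_for(normalized_url):
    """UID = md5 digest of the normalized url."""
    return hashlib.md5(normalized_url.encode('utf-8')).hexdigest()

def mark_crawled(normalized_url, last_modified=None, etag=None, page_size=0):
    """Record that a url was crawled, plus its conditional-get info."""
    record = {
        'crawl_time': int(time.time()),
        'last_modified': last_modified or '',
        'etag': etag or '',
        'page_size': page_size,
    }
    r.setex('cc:' + uid_for(normalized_url), CACHE_TTL, json.dumps(record))

def recently_crawled(normalized_url):
    """Return the cached record if the url was crawled recently, else None."""
    raw = r.get('cc:' + uid_for(normalized_url))
    return json.loads(raw) if raw else None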

Basic: downloaders

  1. model downloading worker threads as slots. slots are grouped into a pack, which maps to a process. Each downloading machine can host a couple of processes. Thus, downloading units are addressed as:
    [ machine_id, pack_id, slot_id]
  2. downloaders are stateless.
  3. downloaders are independent, and shall be capable of being spread over public clouds and of using proxies.

design:

  1. the crawl delay keeps each slot at 1 qps (1 url/sec) or below; use parallel downloading on multiple slots to support a higher aggregate qps.
  2. no concurrent downloading on the same slot.
  3. use gevent for the thread pool and networking.
  4. use python-requests (default), or urllib2, for http.
  5. store the downloaded page into hbase.
  6. store the url into the crawl cache (redis).
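
a rough sketch of one pack, assuming gevent plus python-requests; the slot count, queue type, and names are assumptions, and the hbase/redis writes are left as TODOs:

import gevent
from gevent import monkey
monkey.patch_all()          # make requests/sockets cooperative

from gevent.pool import Pool
import requests

SLOTS_PER_PACK = 8          # assumed; tune per machine
CRAWL_DELAY = 1.0           # at most 1 url/sec per slot

def run_slot(slot_id, url_queue):
    """One slot: fetch urls sequentially, never concurrently within the slot."""
    session = requests.Session()
    while True:
        url = url_queue.get()          # a gevent.queue.Queue; blocks when empty
        if url is None:                # sentinel: shut the slot down
            return
        try:
            resp = session.get(url, timeout=30)
            # TODO: store resp.content into hbase, then record the url
            # in the crawl cache (redis).
        except requests.RequestException:
            pass                        # failure reason would go to the meta family
        gevent.sleep(CRAWL_DELAY)       # keep the slot at or below 1 qps

def run_pack(url_queues):
    """Spawn one greenlet per slot; the pack corresponds to a single process."""
    pool = Pool(SLOTS_PER_PACK)
    for slot_id in range(SLOTS_PER_PACK):
        pool.spawn(run_slot, slot_id, url_queues[slot_id])
    pool.join()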

Basic: crawl scheduler

we need a way to schedule all the urls submitted by users, grouped into jobs, so that crawling is efficient while obeying the target web sites' crawling policies as much as possible.

basic functionality: [items 1 and 2 do not belong to this issue, but this issue depends on them]

  1. model downloading worker threads as slots. slots are grouped into a pack, which maps to a process. Each downloading machine can host a couple of processes. Thus, downloading units are addressed as:
    [ machine_id, pack_id, slot_id]
  2. downloaders are stateless.
  3. schedule urls in batches; the batch is the scheduling unit, chosen to amortize the communication overhead between scheduler and downloader and to make downloading more efficient through reduced DNS lookups and, possibly, reuse of existing connections.
  4. a strict failover mechanism shall be in place, guaranteeing that a batch will not be skipped due to machine failures or software failures.
  5. the urls shall be distributed over all downloading slots in the most efficient way. That means host mixing shall be done as well as we can.
  6. content store for the downloaded pages, in hbase:
    • meta: http headers and failure reasons
    • content: web page
    • xapp: application related, reserved column family.
  7. crawl cache in redis. stores all recently crawled urls.

design:

  1. use redis to implement the scheduling logic; redis also keeps all the state.
  2. data stored in hbase shall be accessible from other languages or systems.
  3. the crawl cache uses the UID, an md5 digest of the url string, as key. entries are automatically purged after a month.
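
an illustrative sketch of the redis-kept scheduling state, assuming one list per slot plus an in-flight hash for failover; all key names are assumptions, and a real failover path would move batches atomically (e.g. RPOPLPUSH) rather than in two steps:

import json
import uuid

import redis

r = redis.StrictRedis()

def slot_key(machine_id, pack_id, slot_id):
    return 'sched:slot:%s:%s:%s' % (machine_id, pack_id, slot_id)

def enqueue_batch(machine_id, pack_id, slot_id, urls):
    """Push a batch of urls (with hosts mixed) onto a slot's queue."""
    batch_id = uuid.uuid4().hex
    payload = json.dumps({'batch_id': batch_id, 'urls': urls})
    r.rpush(slot_key(machine_id, pack_id, slot_id), payload)
    return batch_id

def checkout_batch(machine_id, pack_id, slot_id):
    """Downloader pulls the next batch; record it as in flight for failover."""
    payload = r.lpop(slot_key(machine_id, pack_id, slot_id))
    if payload is None:
        return None
    batch = json.loads(payload)
    r.hset('sched:inflight', batch['batch_id'], payload)
    return batch

def ack_batch(batch_id):
    """Batch finished; drop it from the in-flight set so it is not re-queued."""
    r.hdel('sched:inflight', batch_id)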

Basic: Content Store

we need to store downloaded pages for further processing later.

design:

  1. use hbase to store the downloaded page and downloading info:
    meta: downloading status, http headers, and failure reasons.
    content: the raw page data.
    x-appl: application related info. reserved column family.
  2. use happybase to interface with hbase.
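
a minimal happybase sketch of this store; the table name 'pages' and the row key (the crawl-cache UID) are assumptions made for illustration:

import happybase

connection = happybase.Connection('localhost')

# one-time setup with the three column families from the design:
# connection.create_table('pages', {'meta': {}, 'content': {}, 'x-appl': {}})

pages = connection.table('pages')

def store_page(uid, status, headers, body):
    """Write downloading status/headers into meta and the raw page into content."""
    row = {b'meta:status': str(status).encode()}
    for name, value in headers.items():
        row[b'meta:' + name.lower().encode()] = value.encode()
    if body is not None:
        row[b'content:raw'] = body
    pages.put(uid.encode(), row)

def load_page(uid):
    """Fetch everything stored for a UID."""
    return pages.row(uid.encode())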

Expansion: XPath Template Store

The XPath Template Store is used to drive expansion (based on list templates) and content extraction (based on content templates).

Two kinds of templates are stored in the store (redis):

  • list templates. These templates are used to extract more link urls from the page; the urls are used to crawl deeper into more pages. For example, the template could be applied to "category listing" pages, "related contents" pages, "most popular items" pages, etc.
  • content templates. These templates are used to extract one or more entities from the page. The extracted data is structured and could be added or merged into an existing database.
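
a hedged sketch of how the two template kinds could be applied with lxml, assuming templates are stored in redis hashes keyed by host: a plain xpath string for list templates and a json field-to-xpath map for content templates (both layouts are assumptions):

import json

import lxml.html
import redis

r = redis.StrictRedis()

def apply_list_template(host, page_html, base_url):
    """A list template yields more urls to crawl deeper into the site."""
    xpath = r.hget('tpl:list', host)
    if not xpath:
        return []
    doc = lxml.html.fromstring(page_html, base_url=base_url)
    doc.make_links_absolute(base_url)
    return doc.xpath(xpath.decode())        # e.g. '//div[@class="listing"]//a/@href'

def apply_content_template(host, page_html):
    """A content template yields structured fields extracted from the page."""
    raw = r.hget('tpl:content', host)
    if not raw:
        return {}
    fields = json.loads(raw)                # e.g. {"title": "//h1/text()", ...}
    doc = lxml.html.fromstring(page_html)
    return {name: doc.xpath(xp) for name, xp in fields.items()}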

TODO:

  • microformats shall be supported as one kind of content template. Parsing of microformats is supported in the lxml library. We need to keep track of how many sites are using microformats now.

HPT: Host property table

store host related information to help get rid of duplicated urls and to optimize downloading:

  1. host normalization, e.g. a 301 decides that the target host is the winner. this info will be used to normalize urls and remove dups.
  2. host robots.txt info.
  3. host properties
    • friendliness: how the host behaves, based upon previous crawling.
    • stability: ranking helps here, also derived from previous crawling.
    • ranking: from other sources, or assigned.
    • other info, e.g. ip address caching.
  4. a timestamp to decide record freshness.

design:

  • this table is stored in redis.
  • it shall sync with the same table in hbase.
  • it might be loaded on demand from hbase into redis.
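
an illustrative sketch of the table as one redis hash per host, loaded on demand from hbase; the key layout, the 'hpt' table name, and the field names are assumptions:

import time

import happybase
import redis

r = redis.StrictRedis()
hbase = happybase.Connection('localhost')
hpt_hbase = hbase.table('hpt')

def hpt_key(host):
    return 'hpt:' + host

def get_host_properties(host):
    """Read host properties from redis, falling back to hbase on a miss."""
    props = r.hgetall(hpt_key(host))
    if props:
        return props
    row = hpt_hbase.row(host.encode())
    if row:
        props = {k.split(b':', 1)[1]: v for k, v in row.items()}
        r.hset(hpt_key(host), mapping=props)
    return props

def set_host_property(host, field, value):
    """Update one property and refresh the freshness timestamp."""
    r.hset(hpt_key(host), mapping={field: value,
                                   'updated_at': int(time.time())})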

The implementation will proceed in stages, and this issue will be split into sub-issues.

Basic: a job store

create and maintain user-submitted jobs. A job is defined by:

  1. a batch of urls
  2. user id
  3. user contact info (to send notifications)
  4. priority
  5. job deadline or cut-off time

design:

  1. the job store is in redis.
  2. it requires a python class with a simple api.
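
a minimal sketch of such a class, assuming one redis hash per job; the key layout and field names are placeholders:

import json
import time
import uuid

import redis

class JobStore(object):
    def __init__(self, conn=None):
        self.r = conn or redis.StrictRedis()

    def create(self, urls, user_id, contact, priority=0, deadline=None):
        """Create a job from a batch of urls plus user and scheduling info."""
        job_id = uuid.uuid4().hex
        self.r.hset('job:' + job_id, mapping={
            'urls': json.dumps(urls),
            'user_id': user_id,
            'contact': contact,
            'priority': priority,
            'deadline': deadline or '',
            'created_at': int(time.time()),
            'status': 'pending',
        })
        return job_id

    def get(self, job_id):
        return self.r.hgetall('job:' + job_id)

    def set_status(self, job_id, status):
        self.r.hset('job:' + job_id, 'status', status)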

todo:

  1. add a REST interface for job submission and status.
  2. add job result download mechanisms.

#4 SubIssue: HPT: Normalized Host Name

we need to know if a host always redirects to another host, e.g. gmail.com => mail.gmail.com

design:

  1. any http 301 will create an entry for host name normalization. we only support host mapping, not the path part.
  2. meta-refresh within an html page is not supported.
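
a small sketch of recording the 301-derived host mapping in redis; the key name is an assumption:

from urllib.parse import urlsplit

import redis

r = redis.StrictRedis()

def record_301(source_url, location_header):
    """On an http 301, map the source host to the redirect target host."""
    src_host = urlsplit(source_url).netloc
    dst_host = urlsplit(location_header).netloc
    if src_host and dst_host and src_host != dst_host:
        r.hset('hpt:normalized_host', src_host, dst_host)

def normalize_host(host):
    """Return the winning host if a mapping exists, else the host itself."""
    target = r.hget('hpt:normalized_host', host)
    return target.decode() if target else host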
