mfan / collie
A distributed, on-demand web crawler that uses Redis for the job and URL stores and HBase for the page store.
once a URL is crawled, we need to store a simple record in the crawling cache to indicate that it has been crawled recently. This prevents duplicate crawling and speeds up job processing:
design:
store a record in Redis:
1. keyed by UID (an MD5 digest of the normalized URL)
2. crawl-time: when the URL was crawled
3. conditional-GET related info: last-modified-time, ETag
4. web page size
5. 3 and 4 will be used to decide whether we ping the URL (using HEAD) to avoid actually downloading the page.
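The record layout above can be sketched as follows. This is a minimal illustration, not the project's implementation: the names `normalize_url`, `url_uid`, `make_record`, `decide`, and `unchanged`, the field names, and the `fresh_seconds` threshold are all assumptions, and a plain dict stands in for the Redis store.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Simplified normalization (assumption): lowercase the host,
    drop the fragment, and default an empty path to '/'."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def url_uid(url):
    """UID = MD5 hex digest of the normalized URL."""
    return hashlib.md5(normalize_url(url).encode("utf-8")).hexdigest()


def make_record(crawl_time, last_modified=None, etag=None, size=None):
    """Build a flat string-to-string record, the shape a Redis hash would hold."""
    record = {"crawl_time": str(crawl_time)}
    if last_modified is not None:
        record["last_modified"] = last_modified
    if etag is not None:
        record["etag"] = etag
    if size is not None:
        record["size"] = str(size)
    return record


def decide(record, now, fresh_seconds=3600):
    """Return 'skip' (crawled recently), 'head' (ping first), or 'get'."""
    if record is None:
        return "get"
    if now - float(record["crawl_time"]) < fresh_seconds:
        return "skip"
    if any(key in record for key in ("etag", "last_modified", "size")):
        return "head"
    return "get"


def unchanged(record, head_headers):
    """Compare a HEAD response against the stored validators and size."""
    headers = {k.lower(): v for k, v in head_headers.items()}
    if "etag" in record and "etag" in headers:
        return record["etag"] == headers["etag"]
    if "last_modified" in record and "last-modified" in headers:
        return record["last_modified"] == headers["last-modified"]
    if "size" in record and "content-length" in headers:
        return record["size"] == headers["content-length"]
    return False


# A dict stands in for Redis here; the UID is the key, the record is the value.
cache = {}
cache[url_uid("http://example.com/")] = make_record(1000, etag='"abc"', size=10)
```

When `decide` returns `'head'`, the crawler would issue a HEAD request and download the page only if `unchanged` reports a difference.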
design:
we need a way to schedule all the URLs submitted by users, which are grouped by jobs, so that they are crawled efficiently while obeying the target web sites' crawling policies as much as possible.
basic functionality: [1 and 2 do not belong to this issue; this issue depends on them]
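One possible shape for such a scheduler is sketched below, purely as an illustration. It assumes round-robin fairness across jobs and models a site's crawling policy only as a fixed per-host minimum delay; robots.txt handling is out of scope. The class name `PoliteScheduler` and the `min_delay` parameter are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteScheduler:
    """Round-robin over jobs, with a minimum delay between hits to one host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.jobs = {}        # job_id -> deque of pending URLs
        self.order = deque()  # round-robin order of job ids
        self.next_ok = {}     # host -> earliest time the host may be hit again

    def submit(self, job_id, urls):
        """Add URLs to a job, creating the job on first use."""
        if job_id not in self.jobs:
            self.jobs[job_id] = deque()
            self.order.append(job_id)
        self.jobs[job_id].extend(urls)

    def next_url(self, now):
        """Return (job_id, url) for a fetch allowed at time `now`, or None."""
        for _ in range(len(self.order)):
            job_id = self.order[0]
            self.order.rotate(-1)  # the next call starts with the following job
            queue = self.jobs[job_id]
            for i, url in enumerate(queue):
                host = urlsplit(url).hostname
                if self.next_ok.get(host, 0.0) <= now:
                    del queue[i]  # safe: we return before iterating further
                    self.next_ok[host] = now + self.min_delay
                    return job_id, url
        return None
```

Rotating the job order on every call keeps one large job from starving the others, while the per-host timestamp lets a job skip past URLs whose host is still cooling down.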
design:
we need to store downloaded pages for further processing later.
design:
The XPath Template Server drives expansion (based on list templates) and content extraction (based on content templates).
Two kinds of templates are stored in the store (Redis):
TODO:
store host-related information to help eliminate duplicate URLs and optimize downloading:
design:
The implementation will proceed in stages, and this issue will be split into sub-issues.
create and maintain user-submitted jobs. A job is defined as:
design:
todo:
we need to know whether a host always redirects to another host, e.g. gmail.com => mail.gmail.com
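One way to detect this is to count, per host, how often a fetch ended on a different host. The sketch below is only an illustration under that assumption; the class `HostRedirectTracker`, its methods, and the `min_samples` threshold are hypothetical, and the counters live in memory rather than in Redis.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


class HostRedirectTracker:
    """Track whether every observed fetch of a host redirected to one other host."""

    def __init__(self, min_samples=3):
        self.min_samples = min_samples
        self.fetches = Counter()               # host -> number of fetches observed
        self.redirects = defaultdict(Counter)  # host -> Counter of target hosts

    def observe(self, url, final_url):
        """Record one fetch: the requested URL and the URL after redirects."""
        src = urlsplit(url).hostname
        dst = urlsplit(final_url).hostname
        self.fetches[src] += 1
        if dst != src:
            self.redirects[src][dst] += 1

    def alias_for(self, host):
        """Return the target host if `host` has always redirected to it, else None."""
        total = self.fetches[host]
        if total < self.min_samples:
            return None
        targets = self.redirects.get(host)
        if not targets or len(targets) != 1:
            return None
        target, count = next(iter(targets.items()))
        return target if count == total else None


tracker = HostRedirectTracker(min_samples=2)
tracker.observe("http://gmail.com/", "http://mail.gmail.com/")
tracker.observe("http://gmail.com/a", "http://mail.gmail.com/a")
alias = tracker.alias_for("gmail.com")
```

Requiring a minimum number of samples avoids declaring an alias after a single, possibly incidental, redirect.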
design:
Deleted.