dbjpanda / advance_crawler
A nodejs based server to crawl both static and dynamic sites developed as a part of Google Summer of Code 2018 @Drupal
npm WARN deprecated [email protected]: CoffeeScript on NPM has moved to "coffeescript" (no hyphen)
npm WARN registry Unexpected warning for https://registry.npmjs.org/: Miscellaneous Warning EINTEGRITY: sha1-QFUCsAfzGcP0cXXER0UnMA8qta0= integrity checksum failed when using sha1: wanted sha1-QFUCsAfzGcP0cXXER0UnMA8qta0= but got sha512-zr6QQnzLt3Ja0t0XI8gws2kn7zV2p0l/D3kreNvS6hFZhVU5g+uY/30l42jbgt0XGcNBEmBDGJR71J692V92tA==. (260 bytes)
npm WARN registry Using stale package data from https://registry.npmjs.org/ due to a request error during revalidation.
/usr/local/bin/pm2 -> /usr/local/lib/node_modules/pm2/bin/pm2
/usr/local/bin/pm2-dev -> /usr/local/lib/node_modules/pm2/bin/pm2-dev
/usr/local/bin/pm2-docker -> /usr/local/lib/node_modules/pm2/bin/pm2-docker
/usr/local/bin/pm2-runtime -> /usr/local/lib/node_modules/pm2/bin/pm2-runtime
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: [email protected] (node_modules/pm2/node_modules/fsevents):
npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for [email protected]: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"x64"})
Some product items have a complete URL that includes the base URL (for example, https://amazon.in/lorem-ipsum-), while others have only a relative URL (for example, /lorem-ipsum-), so the scraped URLs do not follow a single fixed form.
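One possible way to handle both forms is to resolve every scraped link against the base URL of the page it came from. The sketch below uses the WHATWG `URL` class built into Node.js; the helper name `toAbsoluteUrl` is hypothetical and not part of the project.

```javascript
// Hypothetical helper (not from the project): resolve a scraped href
// against the base URL of the page it was found on.
function toAbsoluteUrl(href, baseUrl) {
  try {
    // Absolute hrefs are returned unchanged; relative ones are joined to baseUrl.
    return new URL(href, baseUrl).href;
  } catch (err) {
    // The URL constructor throws a TypeError on input it cannot parse.
    return null;
  }
}

const base = "https://amazon.in/";
console.log(toAbsoluteUrl("https://amazon.in/lorem-ipsum-", base));
console.log(toAbsoluteUrl("/lorem-ipsum-", base));
```

Both calls produce the same absolute URL, so downstream code only has to deal with one form.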
To avoid being blocked, we need to send a User-Agent header with each request, and the user agent should differ from one request to the next.
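A minimal sketch of one way to do this is a round-robin rotator, which guarantees that consecutive requests carry different values as long as the list has at least two distinct entries. The function name `createUserAgentRotator` and the user-agent strings are illustrative assumptions, not taken from the project.

```javascript
// Hypothetical helper (not from the project): cycle through a list of
// user-agent strings so that consecutive requests use different values.
function createUserAgentRotator(userAgents) {
  if (!Array.isArray(userAgents) || userAgents.length === 0) {
    throw new Error("at least one user agent is required");
  }
  let index = 0;
  return function nextUserAgent() {
    const agent = userAgents[index];
    index = (index + 1) % userAgents.length; // wrap around at the end
    return agent;
  };
}

// Illustrative placeholder values; a real list would hold current browser strings.
const nextUserAgent = createUserAgentRotator([
  "ExampleBrowser/1.0 (X11; Linux x86_64)",
  "ExampleBrowser/2.0 (Windows NT 10.0; Win64; x64)",
  "ExampleBrowser/3.0 (Macintosh; Intel Mac OS X 10_13)",
]);

// Build the headers for each outgoing request with a fresh value.
function buildRequestHeaders() {
  return { "User-Agent": nextUserAgent() };
}
```

The object returned by `buildRequestHeaders()` can be passed as the headers option of whichever HTTP client the scraper uses.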
The node scraper sits idle while the feeds module parses the HTML responses from the scraper. Performance can be improved by implementing a buffer: the node scraper dumps each HTML response into the buffer and continues scraping instead of waiting. On the other side, the feeds module takes an HTML response from the buffer, deletes the entry it took, and parses it.
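The buffer idea could be sketched as a simple in-memory first-in, first-out queue. This is only an illustration under stated assumptions: the class name `HtmlBuffer`, its methods, and the entry shape `{ url, html }` are hypothetical, and a real deployment would likely need persistent storage and an endpoint through which the feeds module can fetch entries.

```javascript
// Hypothetical in-memory buffer (not from the project). The scraper pushes
// HTML responses and keeps crawling; the consumer takes the oldest entry,
// which is removed from the buffer as it is handed over.
class HtmlBuffer {
  constructor() {
    this.entries = [];
  }

  // Called by the scraper: store the response and return immediately.
  push(url, html) {
    this.entries.push({ url, html });
  }

  // Called on behalf of the feeds module: remove and return the oldest
  // entry, or undefined when the buffer is empty.
  take() {
    return this.entries.shift();
  }

  get size() {
    return this.entries.length;
  }
}

const buffer = new HtmlBuffer();
buffer.push("https://example.com/a", "<html>A</html>");
buffer.push("https://example.com/b", "<html>B</html>");

const first = buffer.take(); // oldest entry, now deleted from the buffer
console.log(first.url, buffer.size);
```

Because `take()` deletes the entry it returns, each response is parsed at most once, and the scraper never waits on the parser.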