monzo-crawler's Introduction

monzo-crawler

Usage

$ make setup # uses glide to install all dependencies to vendor/
$ make build && ./monzo-crawler tomblomfield.com | tee sitemap.txt
$ less sitemap.txt

Testing

$ make test

Known Issues

No politeness delay is implemented: the crawler issues up to runtime.NumCPU() * 4 concurrent HTTP requests against the target site, with no pause between them.
HTML parsing uses the goquery library, which is slow on very large HTML pages (such as those served by amazon.com).

Problem Statement

We'd like you to write a simple web crawler in a programming language of your choice. Feel free to either choose one you're very familiar with or, if you'd like to learn some Go, you can also make this your first Go program! The crawler should be limited to one domain - so when crawling tomblomfield.com it would crawl all pages within the domain, but not follow external links, for example to the Facebook and Twitter accounts. Given a URL, it should output a site map, showing which static assets each page depends on, and the links between pages.

Ideally, write it as you would a production piece of code. Bonus points for tests and making it as fast as possible!
