
pixep / crowlet

22 stars · 2 forks · 94 KB

Tiny sitemap crawler for cache warming and website status monitoring

License: Apache License 2.0

Languages: Dockerfile 1.43% · Makefile 5.77% · Go 92.81%
Topics: cache, crawler, monitoring, sitemap, test-automation, testing

crowlet's People

Contributors

chris-scentregroup, pixep

crowlet's Issues

Memory consumption is not adequate

The crawler seems to be using a lot of memory.

I have a website with ~22K URLs that I need to crawl. I was running the crawler on an AWS EC2 t2.micro instance, which has ~1 GB of memory.
The crawler was causing the instance to freeze.

Testing locally with a 1 GB memory limit, the crawler hits it as well.

I think 1 GB of memory is too much for a simple crawler.

I'm not a Go expert, so I haven't been able to identify a memory leak in the code.
My assumption is that the response results are not being released/cleaned for each crawled URL.
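
If that suspicion is right, the usual Go culprit is keeping HTTP response bodies (or whole responses) alive instead of draining and closing them. Below is a minimal sketch of the pattern, assuming the crawler uses net/http directly; fetchStatus is a hypothetical name, not crowlet's actual code.

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // fetchStatus returns only the HTTP status code and discards the body.
    // Draining and closing the body lets the transport reuse the connection
    // and lets the garbage collector reclaim the buffers; holding on to
    // every response for ~22K URLs would grow memory with each page crawled.
    func fetchStatus(client *http.Client, url string) (int, error) {
        resp, err := client.Get(url)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()

        // Discard the payload instead of accumulating it in memory.
        if _, err := io.Copy(io.Discard, resp.Body); err != nil {
            return 0, err
        }
        return resp.StatusCode, nil
    }

    func main() {
        status, err := fetchStatus(http.DefaultClient, "https://example.com/")
        if err != nil {
            panic(err)
        }
        fmt.Println(status)
    }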

Ctrl+C doesn't exit when waiting for wait-interval

I used this:


docker run -it --rm aleravat/crowlet --forever --crawl-hyperlinks --summary-only  --debug   --wait-interval 1800 https://webapplicationconsultant.com/sitemap.xml

It didn't exit when I pressed Ctrl+C. This happens ONLY while it is waiting for the 1800 seconds to elapse, i.e. after the first crawl is done.
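
This is the classic symptom of a wait implemented as a plain time.Sleep, which cannot be interrupted. A minimal sketch of an interruptible wait between crawl rounds, making no claim about crowlet's actual internals (crawlOnce is illustrative):

    package main

    import (
        "context"
        "fmt"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        // Cancel the context on SIGINT (Ctrl+C) or SIGTERM.
        ctx, stop := signal.NotifyContext(context.Background(),
            syscall.SIGINT, syscall.SIGTERM)
        defer stop()

        waitInterval := 1800 * time.Second
        for {
            crawlOnce() // hypothetical: one pass over the sitemap

            // Unlike time.Sleep, this select wakes immediately on Ctrl+C.
            select {
            case <-time.After(waitInterval):
            case <-ctx.Done():
                fmt.Println("interrupted, exiting")
                return
            }
        }
    }

    func crawlOnce() {}

Note also that inside a container the process runs as PID 1, which only receives Ctrl+C if it installs its own signal handler; running with docker run --init is a common workaround worth trying here.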

--forever not working

Hello,

docker run -it --rm aleravat/crowlet --forever --wait-interval 5 https://website.com/sitemap.xml

I am running this command (5 seconds is just for testing, obviously), but the console is not displaying any new crawls after the first batch.

Any ideas? Thanks :)

Total Time is always 9223372036854

Not sure if I am doing anything wrong, but regardless of which crawl I perform I just get 9223372036854 as the time :-/

using aleravat/crowlet@sha256:1c23f62c5328f5e79c5b347e0eb4bf10223ebca0bbff80bc38d83206ef79141e
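
That number is suspiciously exact: 9223372036854 is math.MaxInt64 nanoseconds truncated to milliseconds, the usual signature of a minimum-duration sentinel that no sample ever lowered. A hypothetical illustration of the pattern (not crowlet's actual code):

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    func main() {
        // A common idiom: start the running minimum at the largest
        // possible duration so any real sample replaces it...
        minTime := time.Duration(math.MaxInt64)

        // ...but if no sample is ever recorded, printing it in
        // milliseconds yields exactly the value from this report.
        fmt.Println(minTime.Milliseconds()) // 9223372036854
    }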

Question about recursiveness

I have a quick question: does it, or could it be implemented to, recursively crawl sitemaps/XML files? Or does it only follow first-level links in the sitemaps? I'd like to create a public sitemap of sitemaps so I only need to update the container once, if that makes sense.

Thanks!
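
For context, the sitemap protocol already defines this nesting: a sitemap index file lists child sitemaps under <sitemapindex>/<sitemap>/<loc>, while a plain sitemap lists pages under <urlset>/<url>/<loc>. A minimal sketch of recursive resolution, independent of crowlet's implementation (collectURLs is an illustrative name):

    package main

    import (
        "encoding/xml"
        "fmt"
        "net/http"
    )

    // sitemapDoc matches both <urlset> and <sitemapindex> documents;
    // exactly one of URLs or Sitemaps will be populated.
    type sitemapDoc struct {
        URLs     []loc `xml:"url"`
        Sitemaps []loc `xml:"sitemap"`
    }

    type loc struct {
        Loc string `xml:"loc"`
    }

    // collectURLs fetches a sitemap and recurses into child sitemaps
    // whenever the document turns out to be a sitemap index.
    func collectURLs(url string) ([]string, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        var doc sitemapDoc
        if err := xml.NewDecoder(resp.Body).Decode(&doc); err != nil {
            return nil, err
        }

        urls := make([]string, 0, len(doc.URLs))
        for _, u := range doc.URLs {
            urls = append(urls, u.Loc)
        }
        for _, s := range doc.Sitemaps {
            child, err := collectURLs(s.Loc)
            if err != nil {
                return nil, err
            }
            urls = append(urls, child...)
        }
        return urls, nil
    }

    func main() {
        urls, err := collectURLs("https://example.com/sitemap.xml")
        if err != nil {
            panic(err)
        }
        fmt.Println(len(urls), "URLs found")
    }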
