mycelium's People

Contributors

aaasen

mycelium's Issues

Deploy Redis Datastore

Mycelium currently relies on Redis for task queuing and storing results.

Investigate ways to deploy Redis. Amazon EC2? Google Compute Engine?
A similar alternative?

Respect robots.txt

Mycelium does not currently respect robots.txt, which means that it shouldn't be deployed in the wild, or really used at all.
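A minimal sketch of what such a check could look like, assuming only `User-agent: *` groups and plain `Disallow` path prefixes are honored. The function names `disallowedPrefixes` and `allowed` are hypothetical and not part of Mycelium; a complete implementation would also need to handle `Allow` rules, wildcards, and agent-specific groups.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowedPrefixes collects the Disallow path prefixes from every group
// that applies to all user agents ("*"). Simplification: Allow rules,
// wildcards, and agent-specific groups are ignored.
func disallowedPrefixes(robotsTxt string) []string {
	var prefixes []string
	applies := false  // does the current group apply to "*"?
	inAgents := false // true while reading consecutive User-agent lines
	scanner := bufio.NewScanner(strings.NewReader(robotsTxt))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		key, value, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		switch key {
		case "user-agent":
			if !inAgents {
				applies = false // a new group starts here
			}
			inAgents = true
			if value == "*" {
				applies = true
			}
		case "disallow":
			inAgents = false
			if applies && value != "" {
				prefixes = append(prefixes, value)
			}
		default:
			inAgents = false
		}
	}
	return prefixes
}

// allowed reports whether path is outside every disallowed prefix.
func allowed(path string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	rules := disallowedPrefixes("User-agent: *\nDisallow: /private/\n")
	fmt.Println(allowed("/private/page", rules), allowed("/index.html", rules))
}
```

The crawler would fetch `/robots.txt` once per host, cache the parsed prefixes, and call `allowed` before queuing each URL.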

Allow depth specification

At the moment, Mycelium will recursively crawl forever.
A depth control should be implemented for tasks like crawling individual websites.
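One possible shape for a depth limit, sketched under assumptions: each task carries the depth at which its URL was discovered, and links are only followed while that depth is below a maximum. The `Task` struct, the `crawl` function, and the `links` callback (a stand-in for fetching and parsing a page) are all hypothetical names, not Mycelium's actual types.

```go
package main

import "fmt"

// Task pairs a URL with the number of links followed to reach it.
type Task struct {
	URL   string
	Depth int
}

// crawl visits pages breadth-first starting at seed and stops following
// links once a page sits at maxDepth. It returns the URLs in visit order.
func crawl(seed string, maxDepth int, links func(url string) []string) []string {
	visited := map[string]bool{seed: true}
	queue := []Task{{URL: seed, Depth: 0}}
	var order []string
	for len(queue) > 0 {
		t := queue[0]
		queue = queue[1:]
		order = append(order, t.URL)
		if t.Depth >= maxDepth {
			continue // depth limit reached: do not enqueue this page's links
		}
		for _, next := range links(t.URL) {
			if !visited[next] {
				visited[next] = true
				queue = append(queue, Task{URL: next, Depth: t.Depth + 1})
			}
		}
	}
	return order
}

func main() {
	// A tiny in-memory link graph standing in for real pages.
	graph := map[string][]string{"a": {"b"}, "b": {"c"}, "c": {"d"}}
	links := func(u string) []string { return graph[u] }
	fmt.Println(crawl("a", 1, links))
}
```

A maximum depth of 0 crawls only the seed page, which makes the setting easy to reason about when crawling a single website.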

Implement a non-redis task queue

Currently, the only TaskQueue implementation uses Redis. This is a pain to set up, and only really makes sense in distributed settings.
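A minimal in-memory alternative might look like the sketch below. The `TaskQueue` interface shown here is an assumption made for illustration; Mycelium's real interface may have different methods. `MemoryQueue` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskQueue is an assumed interface, not Mycelium's actual definition.
type TaskQueue interface {
	Push(url string)
	Pop() (string, bool)
}

// MemoryQueue is a FIFO queue held in process memory and guarded by a
// mutex so that several crawler goroutines can share it.
type MemoryQueue struct {
	mu    sync.Mutex
	tasks []string
}

// Push appends a task to the back of the queue.
func (q *MemoryQueue) Push(url string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.tasks = append(q.tasks, url)
}

// Pop removes and returns the oldest task, or false if the queue is empty.
func (q *MemoryQueue) Pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.tasks) == 0 {
		return "", false
	}
	url := q.tasks[0]
	q.tasks = q.tasks[1:]
	return url, true
}

func main() {
	var q TaskQueue = &MemoryQueue{}
	q.Push("http://example.com/")
	url, ok := q.Pop()
	fmt.Println(url, ok)
}
```

This needs no external service, at the cost of losing all queued tasks when the process exits, so it only suits single-machine crawls.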

Task ordering

Tasks are currently ordered by system time, with the oldest tasks being resolved first.

This ordering has the following problems, which a replacement should address:

  • directs many crawlers to the same domain at one time
  • potentially wastes time crawling many irrelevant pages from one domain
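One way to address the first problem is to rotate between domains instead of ordering purely by time. The sketch below is an illustration under assumptions, not a proposed final design: `DomainQueue` and its methods are hypothetical names, and it does nothing about page relevance, which the second problem would need.

```go
package main

import (
	"fmt"
	"net/url"
)

// DomainQueue hands out tasks round-robin across hosts, so consecutive
// tasks go to different domains whenever more than one domain is pending.
type DomainQueue struct {
	byHost map[string][]string // pending URLs per host, oldest first
	hosts  []string            // rotation order of hosts with pending URLs
}

// NewDomainQueue returns an empty queue.
func NewDomainQueue() *DomainQueue {
	return &DomainQueue{byHost: make(map[string][]string)}
}

// Push files rawURL under its host.
func (q *DomainQueue) Push(rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	if _, ok := q.byHost[u.Host]; !ok {
		q.hosts = append(q.hosts, u.Host)
	}
	q.byHost[u.Host] = append(q.byHost[u.Host], rawURL)
	return nil
}

// Pop returns the oldest URL of the next host in the rotation.
func (q *DomainQueue) Pop() (string, bool) {
	if len(q.hosts) == 0 {
		return "", false
	}
	host := q.hosts[0]
	q.hosts = q.hosts[1:]
	pending := q.byHost[host]
	next := pending[0]
	if len(pending) > 1 {
		q.byHost[host] = pending[1:]
		q.hosts = append(q.hosts, host) // move host to the back of the rotation
	} else {
		delete(q.byHost, host)
	}
	return next, true
}

func main() {
	q := NewDomainQueue()
	for _, u := range []string{"http://a.com/1", "http://a.com/2", "http://b.com/1"} {
		if err := q.Push(u); err != nil {
			fmt.Println("skipping:", err)
		}
	}
	for u, ok := q.Pop(); ok; u, ok = q.Pop() {
		fmt.Println(u)
	}
}
```

Within a single host the oldest task is still served first, so the existing time ordering is preserved per domain.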

SQLite data store

A SQLite (or other SQL) datastore would suit large-scale crawls better than Redis.

Better logging

The current logging system doesn't give much insight into how the crawler is operating.

A solution to this issue should show the following statistics:

  • number of backlogged tasks in each buffered channel
  • pages crawled per second
  • number of task requests per second
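These statistics could be gathered roughly as sketched below, assuming the crawler increments counters as it works and a reporter reads them periodically. The `Stats` type and its methods are hypothetical names. The backlog of a buffered channel is read with Go's built-in `len`, which returns the number of elements currently queued.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Stats holds counters that crawler goroutines update concurrently.
type Stats struct {
	pages        int64
	taskRequests int64
}

// PageCrawled records one finished page.
func (s *Stats) PageCrawled() { atomic.AddInt64(&s.pages, 1) }

// TaskRequested records one request for a new task.
func (s *Stats) TaskRequested() { atomic.AddInt64(&s.taskRequests, 1) }

// Snapshot returns the per-second rates over the elapsed interval and
// resets both counters so the next interval starts from zero.
func (s *Stats) Snapshot(elapsed time.Duration) (pagesPerSec, requestsPerSec float64) {
	secs := elapsed.Seconds()
	pages := atomic.SwapInt64(&s.pages, 0)
	requests := atomic.SwapInt64(&s.taskRequests, 0)
	return float64(pages) / secs, float64(requests) / secs
}

func main() {
	stats := &Stats{}
	tasks := make(chan string, 8) // stand-in for one of the crawler's buffered channels
	tasks <- "http://example.com/"
	stats.TaskRequested()
	stats.PageCrawled()

	pages, requests := stats.Snapshot(time.Second)
	fmt.Printf("backlog=%d pages/s=%.1f task requests/s=%.1f\n", len(tasks), pages, requests)
}
```

In a running crawler, a `time.Ticker` loop could call `Snapshot` at a fixed interval and log one line per tick.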
