crawlnpeek's Introduction

CrawlnPeek | A Micro WebCrawler and Network Visualizer

The program crawls a given website URL by following anchor tags in breadth-first order and indexes the pages it finds. It then saves the relevant crawled data as JSON and visualizes the domain's connectivity as a network graph.
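The breadth-first crawl described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `AnchorParser` and `crawl_bfs` are hypothetical, `fetch` is an assumed callable that maps a URL to an HTML string (the real tool presumably uses Requests for this), and restricting the crawl to the start URL's domain is an assumption as well.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_bfs(start_url, fetch, max_pages=100):
    """Visit pages breadth-first; return URLs in the order they were indexed.

    `fetch` is a hypothetical callable: URL -> HTML string.
    """
    domain = urlparse(start_url).netloc  # assumption: stay on one domain
    queue = deque([start_url])
    seen = {start_url}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # broken hyperlink: skip it and keep crawling
        order.append(url)
        parser = AnchorParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)  # resolves relative links like "/source"
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Because a FIFO queue is used, every page at one depth is indexed before any page at the next depth. Passing `fetch` in as a parameter keeps the sketch testable without network access.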

Usage:

python main.py http://www.example.com

Features:

- Creates a list of all pages indexed on a website
- Creates a list of indexed pages with their relative depths and respective predecessors
- Creates an image of the website network
- Saves the indexed URLs to a JSON file
- Added robustness to handle complex data parsing and broken hyperlinks
- Added limits for maxdepth and maxpages indexed
- Added support for relative links, e.g. (href = "/source")
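As a rough illustration of several features above (depths and predecessors, the maxdepth and maxpages limits, relative links, and JSON output), here is a small sketch that works on an already-extracted link map. The function names, the record layout (`depth`, `predecessor`), and the default `max_depth=3` are assumptions made for this example; the project's actual data format is not documented here.

```python
import json
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(base_url, href):
    """Resolve href against base_url; return None for non-HTTP links."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    return absolute


def index_pages(links, start, max_depth=3, max_pages=100):
    """Breadth-first walk over `links` (URL -> list of raw hrefs).

    Returns {url: {"depth": int, "predecessor": url or None}}.
    """
    index = {start: {"depth": 0, "predecessor": None}}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = index[url]["depth"]
        if depth >= max_depth:
            continue  # pages at the depth limit are not expanded
        for href in links.get(url, []):
            target = normalize(url, href)
            if target is None or target in index:
                continue
            if len(index) >= max_pages:
                return index  # page limit reached
            index[target] = {"depth": depth + 1, "predecessor": url}
            queue.append(target)
    return index


def save_index(index, path):
    """Write the index to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle, indent=2)
```

With this sketch, `normalize("http://www.example.com/a/b", "/source")` yields `http://www.example.com/source`, which is how a relative link such as `href = "/source"` ends up as a full URL in the index.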

Requirements:

- Requests or Requests[Security] to allow true SSL connections
- JSON library
- Matplotlib to allow plotting the graph
- Networkx to create the Graph using the list data
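To show how Networkx and Matplotlib might fit together here, the sketch below turns predecessor records into edges and renders them to an image. It assumes a hypothetical index shape (`{url: {"predecessor": ...}}`) and hypothetical function names; the project's real plotting code may differ.

```python
def edges_from_index(index):
    """Turn {url: {"predecessor": ...}} records into (parent, child) edges."""
    return [
        (record["predecessor"], url)
        for url, record in index.items()
        if record["predecessor"] is not None
    ]


def draw_network(edges, image_path):
    """Render the edges as an image (requires networkx and matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, so no display is needed
    import matplotlib.pyplot as plt
    import networkx as nx

    graph = nx.DiGraph()
    graph.add_edges_from(edges)
    plt.figure(figsize=(8, 8))
    nx.draw(graph, node_size=40, arrows=False, with_labels=False)
    plt.savefig(image_path)
    plt.close()
```

The third-party imports sit inside `draw_network` so that the edge-building step can be used and tested even where the plotting libraries are not installed.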

Examples:

An image of codeacademy.com Network:
An image of google.com Network @ 100 pages:

NOTE: Crawling can sometimes take a really long time, depending on the maxpages value specified. maxpages defaults to 100 pages.

Future:

- Set up a web app to perform crawling on a given user query and then present an interactive visualization.
- Use d3.js to visualize the website tree.

Author - bagarwa2
