
litlespiderbot's Introduction

Crawler Bot

  • Version: 0.5.0
  • Java Version: 11+
  • Author: TheNaciaaStrike
  • License: MIT
  • Description: A simple web crawler bot for Java using Spring Boot and Hibernate
  • Current test coverage: 54.0% according to SonarQube (all tests are in the test folder)
  • Database: Postgres 12

Current Features

  • Crawl a website
  • Save the crawled data in a database
  • Export the crawled data as a CSV file
  • Crawl multiple websites

How to use

  • Clone the project, then follow the steps for your platform:

    • Windows:
      • Run the start.bat file
      • In a browser, go to 127.0.0.1:8080/crawl
      • If it fails, you might need a different Java version (such as Java 17)
    • Linux:
      • Run the start.sh file
      • In a browser, go to 127.0.0.1:8080/crawl
        • If it fails, you might need to use the start2.sh file instead
    • Mac:
      • Untested (I don't have a Mac to test it)

API

    • GET /crawl - a basic page for calling the crawler
    • POST /crawl - starts the crawler
      • Request body:
        • seed: the seed to crawl
        • url: the starting URL of the crawl
        • breath: the breadth of the crawl
        • deapth: the depth of the crawl
    • GET /allcrawls - shows all the crawls in the database (not implemented yet)
    • GET /test - a test page
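
A POST /crawl request could be built roughly as in the sketch below. It assumes the body is sent as JSON and that the server listens on 127.0.0.1:8080 as in the How to use section; the README does not say which encoding the controller expects, so form parameters may be what it actually takes. The class name CrawlRequestSketch, its methods, and the sample values are hypothetical. The field names are copied exactly as listed above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class CrawlRequestSketch {

    // Builds the JSON body by hand. Values are not escaped, so this only
    // works for inputs without quotes or backslashes.
    public static String buildBody(String seed, String url, int breath, int deapth) {
        return "{\"seed\":\"" + seed + "\",\"url\":\"" + url
            + "\",\"breath\":" + breath + ",\"deapth\":" + deapth + "}";
    }

    // Assumption: the endpoint accepts application/json.
    public static HttpRequest buildRequest(String body) {
        return HttpRequest.newBuilder()
            .uri(URI.create("http://127.0.0.1:8080/crawl"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        String body = buildBody("spring", "https://example.com", 5, 2);
        HttpRequest request = buildRequest(body);
        // Only prints the request; nothing is sent over the network here.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

To actually send it while the bot is running, the request can be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), which is available from Java 11.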

Packages and Classes

frontend

  • Test - a controller for a test page
  • CrawlController - a controller for the crawl page and API
    • public ArrayList dataSorter (seedHitCount, seed, seeddata) a simple function to sort seed data from highest to lowest
    • public ArrayList deepCrawl (urls, death) a function to crawl through websites
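
The two CrawlController functions can be illustrated with a minimal sketch. This is not the project's code: the real signatures are only partly shown above, so the parameter types, the way breadth and depth limit the crawl, and the class name CrawlSketch are all assumptions. Link fetching is passed in as a function so that the example runs against an in-memory link graph instead of the network.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlSketch {

    // Level-by-level crawl: looks at no more than `breadth` links per page
    // and stops after `depth` levels. Returns pages in discovery order.
    public static List<String> deepCrawl(List<String> startUrls, int breadth, int depth,
                                         Function<String, List<String>> linkFetcher) {
        Set<String> visited = new LinkedHashSet<>(startUrls);
        List<String> frontier = new ArrayList<>(visited);
        for (int level = 0; level < depth && !frontier.isEmpty(); level++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                int considered = 0;
                for (String link : linkFetcher.apply(url)) {
                    if (considered >= breadth) break;
                    considered++;
                    if (visited.add(link)) next.add(link); // skip pages already seen
                }
            }
            frontier = next;
        }
        return new ArrayList<>(visited);
    }

    // Sorts (page, hit count) pairs from highest to lowest hit count.
    public static List<Map.Entry<String, Integer>> sortByHits(Map<String, Integer> hits) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(hits.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical link graph standing in for real websites.
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c", "d"),
            "b", List.of("e"),
            "c", List.of("a"));
        List<String> pages = deepCrawl(List.of("a"), 2, 2,
            url -> graph.getOrDefault(url, List.of()));
        System.out.println(pages); // [a, b, c, e]
        System.out.println(sortByHits(Map.of("a", 1, "b", 7, "e", 3)));
    }
}
```

With a breadth of 2, page "d" is never visited because it is the third link on page "a". In the real bot the link fetcher would presumably be backed by Yoinker.getHTML and Yoinker.getLinks.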

DataStores

  • CrawlEntity - a class for the crawl entity that keeps all the data and saves it to the database (not fully implemented yet)
  • CrawlRepository - a repository for the crawl entity (not fully implemented yet)
  • CrawlService - a service for the crawl entity (not fully implemented yet)

backend

  • Yoinker - a set of functions to make crawling simpler
    • public String getHTML(String urlToRead) a function to get the HTML of a website
    • public String[] getLinks(String html) a function to get all the links from a website
    • public String getTitle(String html) a function to get the title of a website
    • public static boolean validateURL(String url) a function to check whether a URL is valid
    • public int getData(String url, String seed) a function to get the data from a website based on a certain seed
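
To give a feel for what these helpers do, here is a standard-library-only sketch. The project lists jsoup and commons-validator under Built With, so the real Yoinker most likely relies on those; this stand-in swaps in regular expressions and java.net.URI instead, which is less robust on real-world HTML. The class name YoinkerSketch is hypothetical, and countSeed assumes that "data based on a certain seed" means the number of times the seed occurs in the page text.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class YoinkerSketch {

    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects every double-quoted href value, in document order.
    public static String[] getLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) links.add(m.group(1));
        return links.toArray(new String[0]);
    }

    // Returns the trimmed <title> text, or an empty string if there is none.
    public static String getTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Accepts only syntactically valid http/https URLs that have a host.
    public static boolean validateURL(String url) {
        if (url == null) return false;
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Assumption: counts case-insensitive, non-overlapping occurrences of the seed.
    public static int countSeed(String text, String seed) {
        if (seed.isEmpty()) return 0;
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = seed.toLowerCase(Locale.ROOT);
        int count = 0;
        for (int i = haystack.indexOf(needle); i >= 0;
             i = haystack.indexOf(needle, i + needle.length())) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Demo </title></head><body>"
            + "<a href=\"https://example.com/a\">A spider</a>"
            + "<a href=\"/b\">Spider web</a></body></html>";
        System.out.println(String.join(", ", getLinks(html))); // https://example.com/a, /b
        System.out.println(getTitle(html));                     // Demo
        System.out.println(validateURL("not a url"));           // false
        System.out.println(countSeed(html, "spider"));          // 2
    }
}
```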

Built With

  • Spring Framework
  • PostgreSQL
  • commons-validator
  • jacoco-maven-plugin
  • jsoup
  • Apache commons-csv 1.8
