web-scraper

Breadth-first search web scraper written in Java using the JavaFX toolkit.

demo

Features

  • Breadth-first search from a starting URL
  • Customizable parsing settings
    • Number of parallel threads
    • Maximum link traversal depth
    • Crawler timeout (lifetime)
    • Delay between requests
    • Optionally clear the parsing queue before finishing
      • This can take a long time
  • Keep track of parsing status with simple statistics
    • Total crawling time
    • Number of unique pages saved
    • Number of pages visited / number of pages queued
  • Output scraped data to a JSON file
  • View the base URL's HTML code to determine selectors
  • Selector View
    • Set the JSON output format by setting variable names and CSS selectors
    • Interactively test your selectors before starting the crawl
  • Graph View
    • Get a deep understanding of the path the crawler took in a visual format
    • Click any node to see the URL and data scraped from it
    • Entertaining to watch
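
The breadth-first traversal with a maximum link depth can be illustrated with a minimal sketch. This is not the project's actual crawler: the class name BfsCrawlSketch, the crawl method, and the in-memory link map (standing in for real HTTP requests) are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: depth-limited breadth-first traversal over an
// in-memory link graph. A real crawler would fetch each page instead.
class BfsCrawlSketch {

    // A queued page together with the depth at which it was discovered.
    private static final class Visit {
        final String url;
        final int depth;

        Visit(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Returns URLs in the order they are visited, never expanding past maxDepth.
    static List<String> crawl(String startUrl, int maxDepth, Map<String, List<String>> links) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();      // prevents visiting a page twice
        Queue<Visit> queue = new ArrayDeque<>(); // FIFO queue gives breadth-first order
        seen.add(startUrl);
        queue.add(new Visit(startUrl, 0));
        while (!queue.isEmpty()) {
            Visit current = queue.remove();
            order.add(current.url);
            if (current.depth == maxDepth) {
                continue; // depth limit reached: do not follow this page's links
            }
            for (String next : links.getOrDefault(current.url, List.of())) {
                if (seen.add(next)) { // add() returns false if already seen
                    queue.add(new Visit(next, current.depth + 1));
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("a", "d"),
                "d", List.of("e"));
        System.out.println(crawl("a", 2, links)); // prints [a, b, c, d]
    }
}
```

Because the queue is first-in, first-out, every page at depth n is visited before any page at depth n + 1. Page "e" is never reached in the example since it sits at depth 3.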

GUI

Settings View

settings view

HTML View

html view

JSON Selectors View

json view

Syntax

  • Special Selectors [type]
    • title - get the page's title
    • url - get the page's URL
  • Data Selectors [css selector]:[type]
    • CSS Selectors
      • Use css syntax to select elements
      • div.class_name > h1 selects an h1 with a parent div of class class_name
    • Types
      • text - get all text inside the given element, including text in child elements
        • <p>Hello <e>World!</e></p> -> Hello World!
      • owntext - get only the element's own text, excluding text in child elements
        • <p>Hello <e>World!</e></p> -> Hello
      • href - get the link in the element's href attribute
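
The selector syntax above can be split into its CSS part and its type with a few lines of plain Java. This is only a sketch: the class SelectorSketch and its parse method are hypothetical, and splitting on the last colon is an assumption (chosen so that CSS pseudo-classes such as :first-child stay in the CSS part), not necessarily how the application parses selectors.

```java
import java.util.Set;

// Hypothetical sketch: splits "[css selector]:[type]" into its two parts.
class SelectorSketch {
    private static final Set<String> SPECIAL = Set.of("title", "url");
    private static final Set<String> TYPES = Set.of("text", "owntext", "href");

    final String css;  // empty for the special selectors title and url
    final String type;

    SelectorSketch(String css, String type) {
        this.css = css;
        this.type = type;
    }

    static SelectorSketch parse(String raw) {
        String s = raw.trim();
        if (SPECIAL.contains(s)) {
            return new SelectorSketch("", s); // special selector, no CSS part
        }
        // Assumption: the type follows the LAST colon, so colons that belong
        // to CSS pseudo-classes remain part of the CSS selector.
        int idx = s.lastIndexOf(':');
        if (idx <= 0 || idx == s.length() - 1) {
            throw new IllegalArgumentException("Expected [css selector]:[type] but got: " + raw);
        }
        String type = s.substring(idx + 1).trim();
        if (!TYPES.contains(type)) {
            throw new IllegalArgumentException("Unknown type: " + type);
        }
        return new SelectorSketch(s.substring(0, idx).trim(), type);
    }

    public static void main(String[] args) {
        SelectorSketch heading = parse("div.class_name > h1:text");
        System.out.println(heading.css + " | " + heading.type); // prints div.class_name > h1 | text
    }
}
```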

Structure

  • The selector must be a valid JSON object with both a data entry and a links entry.
  • Each data element you want to extract has a unique title to identify it
  • The links tag is an array of selectors pointing to anchor tags
    • The crawler will use these links' href attribute to traverse from page to page.
    • If you do not care which links are selected, use the selector a to follow every anchor tag link.
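
To make the structure concrete, the sketch below assembles one possible selector object and prints it as JSON. The exact nesting the application expects is not spelled out here, so the shape shown (data as an object mapping each title to a selector string, links as an array of CSS selectors) is an assumption, as are the class name SelectorConfigSketch and the titles heading and page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: builds a selector object with "data" and "links" entries.
class SelectorConfigSketch {

    // Serializes the two parts into a JSON object string.
    static String toJson(Map<String, String> data, List<String> links) {
        StringJoiner dataJson = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, String> e : data.entrySet()) {
            dataJson.add(quote(e.getKey()) + ": " + quote(e.getValue()));
        }
        StringJoiner linksJson = new StringJoiner(", ", "[", "]");
        for (String link : links) {
            linksJson.add(quote(link));
        }
        return "{\"data\": " + dataJson + ", \"links\": " + linksJson + "}";
    }

    // Minimal JSON string quoting (escapes backslashes and double quotes only).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Map keys are unique by construction, matching the unique-title rule.
        Map<String, String> data = new LinkedHashMap<>();
        data.put("heading", "div.class_name > h1:text"); // data selector
        data.put("page", "url");                         // special selector
        System.out.println(toJson(data, List.of("a")));  // "a" follows every anchor tag
    }
}
```

Running the sketch prints {"data": {"heading": "div.class_name > h1:text", "page": "url"}, "links": ["a"]}.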

Scraped Graph

graph view

Usage

  • Nodes are added to the graph in real time and in the order they are traversed
  • Select a node to view the URL and data associated with it in the dropdown
  • Nodes are colored according to depth. Nodes of the same color were found at the same depth

Libraries

  • Gradle
  • JSoup
  • Guava
  • Lombok
  • Gson
  • JavaFX
  • GraphStream
  • SLF4J / Logback

Development

  • The project uses the Gradle build system. Simply import the project into any IDE and run the "application -> run" task

Usage

  • Download a prebuilt binary; it runs on any platform with no additional dependencies
  • Run it with java -jar [jarfile].jar

Initial Swing GUI

screenshot
