
html-parser's Introduction

HTML parsing using Jsoup library.

o Environment
  - Runtime: JDK 8
  - Build: Maven
  - HTML parser framework: Jsoup

o How to build and run the solution locally. This is a Maven project. In a terminal, go to the folder that contains pom.xml and run the following commands.
  - Build the project: mvn clean package
  - Run the application: mvn spring-boot:run

The application URL on the local machine is http://localhost:8080/

To browse the APIs written for the application, open http://localhost:8080/swagger-ui.html

o There is a PNG image file (Html Parser.png) at the root level of the project. It shows what the application landing page looks like.

o Assumptions and design decisions - Hyperlink validation - While analysing the hyperlinks on a page, I considered unique links only. It does not make sense to test the redirection behaviour of the same link again and again.

I deliberately did not use a timeout while checking hyperlinks for redirection. A small timeout may lead to timeout exceptions for many URLs, while a large timeout may keep an API call running for a long time.

o Logic to check if a page has a login box - Search for input fields of type password. If there is a password field, the page probably contains a login box. I could also have checked for a username field or for submit and login buttons, but their names and labels cannot be guessed reliably. Button text may also be in another language, such as German or Chinese, so the input field type is a better signal.
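The password-field check described above can be sketched without any dependencies as follows. This is only an approximation for illustration: it uses a regular expression from the JDK instead of the project's Jsoup-based parsing, and the class name LoginBoxDetector and method name hasLoginBox are hypothetical. With Jsoup, the same idea would typically be expressed as a selector query such as input[type=password].

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the project: a JDK-only approximation
// of "does this page contain an input field of type password?".
class LoginBoxDetector {

    // Matches an <input ...> tag that has a type attribute equal to "password"
    // (single-quoted, double-quoted, or unquoted; case-insensitive).
    private static final Pattern PASSWORD_INPUT = Pattern.compile(
            "<input\\b[^>]*\\stype\\s*=\\s*[\"']?password\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the HTML appears to contain a password input field.
    static boolean hasLoginBox(String html) {
        return html != null && PASSWORD_INPUT.matcher(html).find();
    }
}
```

Because the check looks only at the field type, it does not depend on field names or button labels. A regular expression can be fooled by commented-out markup or script content, which is one reason a real HTML parser is the better choice in practice.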

o Known constraints or limitations of the solution - Performance of the API may degrade when a page contains hundreds of hyperlinks.

o Implementation of hyperlink validation - I used CompletableFuture, which is part of the java.util.concurrent package, to validate URLs asynchronously. Sequential checks would have been expensive and would take a long time to validate all the links.
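A minimal sketch of this approach, combined with the unique-links decision described earlier, might look like the following. It is not the project's actual code: the class name LinkValidator, the validate method, the pool size of 16, and the pluggable checker function (which stands in for the real HTTP check) are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper, not taken from the project.
class LinkValidator {

    // Checks each distinct link once, in parallel, and returns link -> reachable.
    // "checker" is a placeholder for whatever performs the real HTTP request.
    static Map<String, Boolean> validate(List<String> links,
                                         Function<String, Boolean> checker) {
        // Drop duplicates while keeping the order in which links were found.
        Set<String> unique = new LinkedHashSet<>(links);

        // Bounded pool so hundreds of links do not create hundreds of threads.
        int threads = Math.min(16, Math.max(1, unique.size()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Start every check before waiting on any of them.
            Map<String, CompletableFuture<Boolean>> futures = new LinkedHashMap<>();
            for (String link : unique) {
                futures.put(link, CompletableFuture
                        .supplyAsync(() -> checker.apply(link), pool)
                        // A failed check marks the link as not reachable.
                        .exceptionally(ex -> false));
            }

            // Collect results; join() blocks until each check has finished.
            Map<String, Boolean> results = new LinkedHashMap<>();
            futures.forEach((link, future) -> results.put(link, future.join()));
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

All checks are submitted before any result is awaited, so the total time is closer to that of the slowest link than to the sum of all links. Since no timeout is applied, a single unresponsive host can still delay the overall result.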

o Ancient websites with HTML versions older than 5
  - HTML 4.01: http://www.dpgraph.com/
  - HTML 4.0: http://www.taco.com
  - No version: http://www.mcspotlight.org/index.shtml
  - HTML 1.0: http://www.agroweb.com/
