Crawler4j

Crawler4j is an open source Java crawler that provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!

How to use?

Basically, it is very simple. Crawler4j exposes an API with only two methods that you need to implement:

boolean shouldVisit(WebURL webUrl) - called before the page is fetched, so you can decide whether to allow the fetch by analyzing the URL (for example, returning false for all non-HTML pages, or returning true only when the URL belongs to specific domains).

void visit(Page page) - called after the page has been fetched, and lets you decide what to do with it.
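The kind of decision described for shouldVisit can be sketched in plain Java. This is a minimal, standalone illustration and not code from the project: it takes the URL as a String instead of the library's URL type so that it runs without crawler4j on the classpath, and the class name `VisitFilter`, the host `www.example.com`, and the list of file extensions are all assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.regex.Pattern;

/**
 * Standalone sketch of logic one might place inside shouldVisit.
 * Uses plain strings so it runs without the crawler4j library.
 */
class VisitFilter {

    // Hypothetical domain, for illustration only.
    private static final String ALLOWED_HOST = "www.example.com";

    // Extensions treated as "non-HTML" here; this list is an assumption.
    private static final Pattern NON_HTML =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip|gz|mp3|mp4)$");

    /** Returns true only for URLs on the allowed host whose path does not look like a static or binary file. */
    static boolean shouldVisit(String url) {
        try {
            URI uri = new URI(url);
            String host = uri.getHost();
            if (host == null) {
                return false; // relative or malformed URL
            }
            String path = uri.getPath() == null ? "" : uri.getPath();
            String lowerPath = path.toLowerCase(Locale.ROOT);
            return host.equalsIgnoreCase(ALLOWED_HOST)
                    && !NON_HTML.matcher(lowerPath).matches();
        } catch (URISyntaxException e) {
            return false; // unparsable URLs are skipped
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://www.example.com/index.html"));   // true
        System.out.println(shouldVisit("http://www.example.com/logo.png"));     // false
        System.out.println(shouldVisit("http://other.example.org/index.html")); // false
    }
}
```

In a real crawler the same checks would live in your shouldVisit implementation, with the URL string taken from the object that crawler4j passes in.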

Detailed examples can be found on this site's [wiki pages](https://github.com/Chaiavi/Crawler4j/wiki).

Code Examples

  • Basic crawler: a basic example use case of Crawler4j.
  • Image crawler: a simple image crawler that downloads image content from the crawling domain and stores it in a folder. This example demonstrates how binary content can be fetched using crawler4j.
  • Collecting data from threads: this example demonstrates how the controller can collect data/statistics from crawling threads.
  • Multiple crawlers: this sample shows how two distinct crawlers can run concurrently. For example, you might want to split your crawling across different domains and apply a different crawling policy to each group. Each crawling controller can have its own configuration.
  • Shutdown crawling: this example shows how crawling can be terminated gracefully by sending the 'shutdown' command to the controller.
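The general idea behind "collecting data from threads" and a graceful shutdown can be sketched with plain `java.util.concurrent`. This is not the crawler4j controller and does not use its API; it is a generic illustration in which each worker keeps its own local counter and a coordinator sums the results once the workers finish. The names `StatsCollector`, `WorkerStats`, and `crawlAndCollect` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Generic sketch: workers report local statistics, a coordinator aggregates them. */
class StatsCollector {

    /** Hypothetical per-thread statistic: how many pages a worker processed. */
    static final class WorkerStats {
        final int pagesProcessed;

        WorkerStats(int pagesProcessed) {
            this.pagesProcessed = pagesProcessed;
        }
    }

    /** Runs one worker per batch of URLs and returns the total pages processed. */
    static int crawlAndCollect(List<List<String>> batches) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, batches.size()));
        try {
            List<Future<WorkerStats>> futures = new ArrayList<>();
            for (List<String> batch : batches) {
                // Each worker only touches its own batch, so no shared mutable state is needed.
                Callable<WorkerStats> worker = () -> new WorkerStats(batch.size());
                futures.add(pool.submit(worker));
            }
            int total = 0;
            for (Future<WorkerStats> future : futures) {
                total += future.get().pagesProcessed; // blocks until that worker is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while collecting statistics", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A worker failed", e);
        } finally {
            // Graceful shutdown: accept no new tasks, let submitted ones finish.
            pool.shutdown();
        }
    }
}
```

Keeping statistics local to each worker and merging them at the end avoids locking during the crawl itself.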

CLARIFICATION

This project aims to be a continuation of the [Original Project](https://code.google.com/p/crawler4j/), which is the best [open source java web crawler](http://myblog.chaiware.org/2014/07/crawling-site-with-java.html).

This project began with the latest release of crawler4j (v3.5) and will add patches to the code, where each patch addresses one objective issue.

There is no custom code for specific needs here. This project aims for the good of all, so it will only accept code that makes crawler4j better for everyone's needs and not for any one company's specific needs.

I hope that the original author will recognize the value of my patches and will accept all of my changes into the original code repository.

RELEASE

For the latest stable release, download v3.5 from the maven repository or from the [original site](https://code.google.com/p/crawler4j/downloads/list).

For v3.5 with several bug fixes and good features (listed in the Changelog below), clone my project and build it.

If many bug fixes accumulate, I will build a jar for the benefit of all (maybe even put it on maven's repo?).

Can I participate in this project?

Yes! But I don't know you yet, so to show that you are a worthy committer, please follow the guidelines below.

Want to add your code?

  • Make sure it is good for everyone and does not only solve your specific needs.
  • Open an issue detailing the need and your suggested solution.
  • If this is a bug, please detail the scenario.
  • Fork my project and make the fix.
  • Make sure you didn't add any code other than this issue's specific code. No code styling, no unrelated additions: only ATOMIC changes tied to this specific issue.
  • Send me a pull request.

That's it. I will see the issue and probably reply, so a conversation might develop in the issues list. I will review the code, and if it is good I will merge it into my own. If it is not suitable, I might ask you to fix it accordingly. Once I see that you are a good fit, I will invite you to become a committer on this project. Welcome!

Changelog

  1. Updated all libraries to their latest versions (August 2014) (issue #21)
  2. Removed the log4j implementation and replaced it with slf4j (issue #1)
  3. Added more logs
  4. Many more to come; feel free to look into the [issues list](https://github.com/Chaiavi/Crawler4j/issues)

What about the original issues?

I went over all of the [original issues](https://code.google.com/p/crawler4j/issues/list) and copied only the relevant ones (about 15 issues). By relevant I mean issues that were actual bugs with a definitive scenario, or feature requests that were for the good of all and not for specific needs.

What about the other forks of Crawler4j?

I searched GitHub and found 41 forks of crawler4j! I went over all of them twice and found only two forks worth mentioning (the rest are either old or don't add anything), but both forks belong to private companies and cover their own custom requirements.

So I will keep monitoring them and take whatever helps my users most.
