
drbenway / siteresearch


PHP scripts to analyse web data. Currently it includes a configurable crawler with various export options.

Home Page: http://www.westworld.be/siteResearch/API/

License: Other

CSS 6.31% JavaScript 38.25% Shell 0.07% PHP 55.37%

siteresearch's Introduction

siteResearch:
=============


Technical requirements:
-----------------------
* PHP
* MySQL
* Terminal
* Composer (to install the Symfony dependencies)

Your environment should support running PHP from the command line.
PHP 5.3 or higher is required, as well as a recent version of MySQL.

Install:
--------
1. Copy siteresearch to a folder accessible to the PHP CLI.
2. Navigate to the root folder of siteResearch and install the dependencies by running
"php composer.phar install" from the command line (see the short example below).
This downloads all dependencies into a vendor folder in the same directory.
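
Assuming composer.phar is present in the siteResearch root folder as described above, the install boils down to something like:

```
cd /path/to/siteResearch      # your own path to the siteResearch root folder
php composer.phar install     # downloads all dependencies into vendor/
```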
In the siteResearch src directory you will find the Crawler directory. Read on to
understand the inner workings of the crawler.

How the crawler works:
----------------------
To start the crawler you call crawler.php via the command line:
php crawler.php --fromurl "a starting url for the domain you want to crawl"

The script then adds this url to a database and starts crawling it as the first url.
All urls found on this page are then added to the database as urls to crawl.
From there the process is repeated until no new urls are found.
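
As a rough illustration only (the real Crawler keeps its queue and results in MySQL and applies tweaks and filters), the loop described above could be sketched in plain PHP like this; the example.com url and the in-memory queue are purely illustrative:

```php
<?php
// Conceptual sketch of the crawl loop described above, not the actual
// Crawler code: the real crawler stores urls in a database, this keeps
// everything in memory.
$queue   = ['http://www.example.com/'];   // the starting url
$visited = [];

while (($url = array_shift($queue)) !== null) {
    if (isset($visited[$url])) {
        continue;                         // already crawled
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);     // fetch the page
    if ($html === false) {
        continue;                         // unreachable page
    }

    // Every url found on the page is added to the queue of urls to crawl.
    // (Relative-url resolution, tweaks and filters are omitted here.)
    if (preg_match_all('/href="([^"#]+)"/i', $html, $matches)) {
        foreach ($matches[1] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
}
// The loop stops once no new urls are found.
```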

The siteResearch/Crawler package also allows you to define tweaks and filters.
These let you customise your crawl results.

### Tweaks
Tweaks are php scripts that tweak the found urls.
An example could be "strip all parameters from the url".
This would turn http://www.bbc.co.uk/index.php?weather=true and
http://www.bbc.co.uk/index.php?sports=true into one url, saving you a lot of
crawling time. A full list of available tweaks can be found in the Tweaks folder.
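
As an illustration, a hypothetical "strip all parameters" tweak could look like the sketch below; the function name and structure are made up here, and the bundled scripts in the Tweaks folder may be organised differently.

```php
<?php
// Hypothetical tweak in the spirit of "strip all parameters from the url".
function stripParameters($url)
{
    $pos = strpos($url, '?');             // the first "?" marks the query string
    return $pos === false ? $url : substr($url, 0, $pos);
}

// Both variants collapse into one url, so the page is only crawled once.
echo stripParameters('http://www.bbc.co.uk/index.php?weather=true'), "\n"; // http://www.bbc.co.uk/index.php
echo stripParameters('http://www.bbc.co.uk/index.php?sports=true'), "\n";  // http://www.bbc.co.uk/index.php
```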

### Filters
Filters are php scripts that remove urls from the results of the crawler.
(Tweaks alter urls, filters remove them from the results.) A good example
is the filterExternalUrls script. It verifies each found url against
a set of given values. If none of the values are found in the url, the url is
stripped.
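
As an illustration, the core check of such a filter could look like the sketch below; the function name is made up, and the bundled filterExternalUrls script may be organised differently.

```php
<?php
// Hypothetical filter in the spirit of filterExternalUrls: keep a url only
// when it contains at least one of the configured values.
function keepUrl($url, array $allowedValues)
{
    foreach ($allowedValues as $value) {
        if (strpos($url, $value) !== false) {
            return true;                  // url contains a configured value
        }
    }
    return false;                         // no match: the url is stripped
}

var_dump(keepUrl('http://www.bbc.co.uk/weather/today.html', array('bbc.co.uk'))); // bool(true)
var_dump(keepUrl('http://www.bbc.com/test', array('bbc.co.uk')));                 // bool(false)
```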

## Setting up your crawler
First you should decide which urls to strip from the crawl. This is done by
providing a set of values to the filterExternalUrls filter.
Let's start by explaining how this works:

Call php crawler.php filterUrls.php with one of three parameters:
1. "php filterUrls -r" returns the contents of the filterExternalUrls
xml file.

The values between the domain tags are checked against each url
(e.g. <domain>bbc.co.uk</domain>). If none of the values are found in the
url, the url is stripped by the crawler and not added to the crawling queue.
In the above example http://www.bbc.com/test will be stripped but
http://www.bbc.co.uk/weather/today.html will not. (A sketch of what this xml
file might look like follows after this list.)


2. Updating this list can be done with the -a (append) parameter.
If we want to crawl bbc.co.uk we could, for example, run
php crawler.php filterurls -a "bbc.co.uk". The crawler would then accept any url that
contains bbc.co.uk. Crawling only the news section would be as simple as
php crawler.php filterurls.php -a "http://www.bbc.com/news/".
Beware that if you define multiple filters, the more specific ones overwrite the
more general ones. Thus if you add bbc.co.uk and bbc.co.uk/weather to the list
(filterurls -a "bbc.co.uk,bbc.co.uk/weather"), only the urls containing
bbc.co.uk/weather will be kept.

3. The -a / --append parameter adds urls to the existing list. If you want to create
a new list from scratch, use the -w or --write parameter.
The filterExternalUrls.xml file will be overwritten with the new values,
e.g. php crawler.php filterurls -w "www.bbc.co.uk,bbc.com".
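
The exact layout of filterExternalUrls.xml is not spelled out above, so the following is only a rough sketch; the <domain> tags are the documented part, the surrounding structure is an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical layout; only the <domain> tags are documented above -->
<domains>
    <domain>www.bbc.co.uk</domain>
    <domain>bbc.com</domain>
</domains>
```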

After configuring the crawler it's time to launch the script.
The crawler can be launched with php crawler.php fromurl "your url".
The "fromurl" parameter is mandatory for a basic crawl: it is the first page of the domain you want
to crawl (other options are described below). If all goes well, your terminal displays a message saying that
the crawler has started. After that a series of dots appears, indicating
that the crawler is crawling pages; one dot appears per finished page.
For more information about how to tweak the crawler, please see the wiki.

## fromsitemap alternative
Alternatively, you can start crawling from the data in a sitemap.xml file.
This is done with php crawler.php fromsitemap "path to a sitemap file".
The rest is exactly the same as "fromurl": "fromurl" puts one url in the database,
while "fromsitemap" puts all urls from the sitemap file in the database before crawling.

Exporting results
-----------------
After crawling a site you have the option to export the results to multiple formats.

## Broken links
With the broken links option you export all the urls that could not be found.
Every url that returned an http error code above 399 (e.g. 404 page not found)
will be in the list, as well as the page that contained the broken link.
To export the broken links, run php crawler.php brokenlinks pathtofile.csv.
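
The rule itself is simple; as a sketch (not the actual export code), with made-up result rows and column names:

```php
<?php
// Sketch of the broken-link rule described above: any url with an http
// status code above 399 is written to the csv together with the page
// that contained the link. The $results rows are invented examples.
$results = array(
    // array(url, page that contained the link, http status code)
    array('http://www.example.com/missing.html', 'http://www.example.com/index.html', 404),
    array('http://www.example.com/about.html',   'http://www.example.com/index.html', 200),
);

$out = fopen('pathtofile.csv', 'w');
fputcsv($out, array('url', 'found_on', 'status'));
foreach ($results as $row) {
    list($url, $foundOn, $status) = $row;
    if ($status > 399) {                  // 404, 500, ... count as broken
        fputcsv($out, array($url, $foundOn, $status));
    }
}
fclose($out);
```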

## GEXF export
GEXF is a file format for network analysis. It is supported by the open source
tool Gephi. This can help in visualising the interlinking of the pages of a site.
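
For reference, a minimal GEXF file generally looks like the snippet below, with pages as nodes and links as edges; the exact attributes written by the exporter may differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- minimal GEXF example: pages become nodes, links between them become edges -->
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="http://www.example.com/" />
      <node id="1" label="http://www.example.com/about.html" />
    </nodes>
    <edges>
      <edge id="0" source="0" target="1" />
    </edges>
  </graph>
</gexf>
```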

## Sitemap export
Exports the crawled pages to a sitemap file.
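
For reference, a standard sitemap file looks like the snippet below; the exact fields written by the exporter may differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>
```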

siteresearch's People

Contributors

drbenway

siteresearch's Issues

wiki

Update the wiki or project description to point to the API docs and designs.

minimum requirements

Do tests with different PHP and MySQL versions and settings to define a set of minimum requirements.

export csv

Option to export the crawler table to a CSV file.

import all urls from a domain

When setting up the crawler, provide an option to populate the url table with all urls known to Google
(Google search "site:www.yourdomain.com").
