
zrashwani / arachnid

Crawl all unique internal links found on a given website and extract SEO-related information. Supports JavaScript-based sites.

License: MIT License

PHP 88.24% HTML 11.76%
crawler php scraping seo

arachnid's People

Contributors

flangofas · howtomakeaturn · msjyoo · noplanman · onema · spekulatius · zrashwani


arachnid's Issues

filterLinks Issue

$links = $crawler
    ->filterLinks(function ($link) {
        return (bool) preg_match('/\/google\/(.*)/', $link);
    })
    ->traverse()
    ->getLinksArray();
print_r($links);

I have written this code to traverse only links that have google as the domain name, but it returns an empty array. What am I missing?
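
If the goal is to keep only links on a given domain, matching the host rather than the path may work better. A sketch, assuming the callback receives each discovered URL as a string (the start URL is a placeholder; relative links carry no host component, so they are kept here as internal):

    $crawler = new \Arachnid\Crawler('http://www.google.com/', 2);

    $links = $crawler
        ->filterLinks(function ($link) {
            $host = parse_url($link, PHP_URL_HOST);
            // Keep relative (host-less) links and any host containing "google".
            return $host === null || stripos($host, 'google') !== false;
        })
        ->traverse()
        ->getLinksArray();

    print_r($links);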

Adding a release

Hi, can you please add a release for commit 244180c?

We are not able to use HEAD anymore as we have Laravel 5.3 and your package now requires Illuminate/Support 5.4.

Thank you,
Anthony

'Don't crawl external links' option

It would be great to have the option to disable the crawling of external links.

I'd like to crawl an entire website, but am not interested in external links. By setting the 'depth' option to something suitably high in order to capture the entire website, I also end up doing a deep crawl of external websites.

Named array keys

What is the rationale for using named $nodeUrl and $nodeText array keys?

$childLinks[$hash]['original_urls'][$nodeUrl] = $nodeUrl;
$childLinks[$hash]['links_text'][$nodeText] = $nodeText;

Crawler.php Line 363 & 364

Would it not be more consistent and easier to parse if we changed to numerical keys?

$childLinks[$hash]['original_urls'][] = $nodeUrl;
$childLinks[$hash]['links_text'][] = $nodeText;
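
For what it's worth, one likely reason for keying on the value itself is deduplication: assigning with the value as its own key collapses repeated URLs/texts, whereas [] appends a new entry every time the same link is encountered. A small illustration:

    // Deduplication effect of value-as-key assignment vs. plain appending.
    $byKey  = array();
    $append = array();

    foreach (array('/about', '/about', '/contact') as $nodeUrl) {
        $byKey[$nodeUrl] = $nodeUrl; // the same URL simply overwrites itself
        $append[]        = $nodeUrl; // the same URL is added again
    }

    print_r($byKey);  // ['/about' => '/about', '/contact' => '/contact']
    print_r($append); // ['/about', '/about', '/contact']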

Timeout configuration for Goutte client

It's not currently possible to configure a timeout for the Guzzle client that is used to make HTTP requests when spidering a site. With nothing set, Guzzle falls back to its default timeout of 0, i.e. it will wait indefinitely for a response. (Which arguably isn't a sensible default anyway.)

I'm trying to spider a site which contains a link to a dead server. Requests to that URL never time out, meaning the spider process gets stuck on this URL and never proceeds.

The timeout is configured when constructing a new Guzzle client, which is currently done in Arachnid\Crawler::getScrapClient():

protected function getScrapClient()
{
    $client = new GoutteClient();
    $client->followRedirects();

    $guzzleClient = new \GuzzleHttp\Client(array(
        'curl' => array(
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_SSL_VERIFYPEER => false,
        ),
    ));
    $client->setClient($guzzleClient);

    return $client;
}

It would be really helpful if a timeout were configured here. To do that, all we need to do is change the configuration array that is passed to the Guzzle client constructor:

$guzzleClient = new \GuzzleHttp\Client(array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
));

I think a sensible default would be a 30-second timeout, but it would be great to have that configurable, either as an additional constructor parameter or as an object property that can be changed.

In fact, it might make sense to allow anything to be added to the Guzzle constructor configuration, perhaps again by means of a class property or constructor parameter through which we can pass an array of configuration options. This could be useful for configuring other client options, for example HTTP authentication:

$crawler = new Crawler($url, 3, array(
    'timeout' => 5,
    'connect_timeout' => 5,
    'auth' => array('username', 'password'),
));

Thoughts? I'd be happy to put together a PR for this, provided we can get some agreement on how this should be configured (class constructor, public property, static property, getter/setter, etc.)
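
For discussion, one possible shape for this is sketched below. It is purely illustrative and not the library's current API: $clientOptions stands for a hypothetical property (or constructor argument) holding an array of Guzzle options that gets merged over the defaults.

    // Sketch only: merge a hypothetical $clientOptions array into the defaults.
    protected function getScrapClient()
    {
        $defaults = array(
            'curl' => array(
                CURLOPT_SSL_VERIFYHOST => false,
                CURLOPT_SSL_VERIFYPEER => false,
            ),
            'timeout' => 30,
            'connect_timeout' => 30,
        );

        $guzzleClient = new \GuzzleHttp\Client(
            array_replace_recursive($defaults, $this->clientOptions)
        );

        $client = new GoutteClient();
        $client->followRedirects();
        $client->setClient($guzzleClient);

        return $client;
    }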

filterLinks not working

Hi, filterLinks does not work in this example:

$url = "http://uk.louisvuitton.com/eng-gb/men/men-s-bags/fashion-shows/_/N-54s1t";
$crawler = new Crawler($url, 2);
$links = $crawler
    ->filterLinks(function ($link) {
        return (bool) preg_match('/\/eng-gb\/products\/(.*)/', $link);
    })
    ->traverse()
    ->getLinks();

What is wrong?

Parent > Children > Grandchildren

How do I get the URL/page title of the parent page of a crawled link?

Suppose I have a URL "www.example.com", which is the parent; it has a child "www.example.com/pageone.html", and the grandchild is "www.example.com/pageone/pagetwo.html".

After traversing pagetwo.html, how do I get the page title/URL of the other two URLs?
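
If the collected link data includes a source_link entry for each URL (as the dumps in other issues here suggest), the parent can be looked up from the traversal results. A sketch against that assumption; these are not documented API guarantees:

    $crawler = new \Arachnid\Crawler('http://www.example.com', 3);
    $crawler->traverse();
    $links = $crawler->getLinks();

    // Assumed structure: each entry records the page it was discovered on
    // under 'source_link', and its own 'title' once it has been visited.
    $target = 'http://www.example.com/pageone/pagetwo.html';
    if (isset($links[$target]['source_link'])) {
        $parentUrl = $links[$target]['source_link'];
        echo 'Found on: ', $parentUrl, PHP_EOL;
        if (isset($links[$parentUrl]['title'])) {
            echo 'Parent title: ', $links[$parentUrl]['title'], PHP_EOL;
        }
    }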

problems with some of my websites

This is the error it gives me:

Array
(
    [http://www.*********.com] => Array
        (
            [links_text] => Array
                (
                    [0] => BASE_URL
                )

            [absolute_url] => http://www.*********.com
            [frequency] => 1
            [visited] => 
            [external_link] => 
            [original_urls] => Array
                (
                    [0] => http://www.***********.com
                )

            [status_code] => 404
            [error_code] => 0
            [error_message] => The current node list is empty.
        )

)

Another error:

Warning: array_replace(): Argument #2 is not an array in /var/www/test/includes/vendor/symfony/browser-kit/CookieJar.php on line 200

Thanks!!

The use of `$this` inside closures

Your package supports PHP 5.3, but there are several closures in the code that use $this, and $this cannot be used in anonymous functions before PHP 5.4.

https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L181
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L193
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L200
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L235

A simple solution would be to require version 5.4. This would also have a nice side effect, as it would allow using the latest version of Goutte.
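
The alternative, if 5.3 support had to stay, would be the classic pre-5.4 workaround of capturing the object in a local variable. A minimal sketch (the $domCrawler variable and handleNode() method are hypothetical names used only for illustration; the method would also have to be public):

    // Pre-PHP 5.4 workaround: closures cannot bind $this, so capture it
    // explicitly and import it with "use".
    $self = $this;
    $domCrawler->filter('a')->each(function ($node) use ($self) {
        $self->handleNode($node);
    });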

tel: links get crawled

Despite being blacklisted in checkIfCrawlable, tel: links get crawled.

Tested via:

$crawler = new \Arachnid\Crawler('https://www.handyflash.de/', 3);
$crawler->traverse();

In the Apache access log, a hit like

www.handyflash.de:443 213.XXX.YYY.ZZZ - - [10/Jun/2016:12:24:42 +0200] "GET /tel:+4923199778877 HTTP/1.1" 404 37312 "-" "Symfony2 BrowserKit" 0

is recorded.
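
The log entry above suggests the tel: URL is being resolved against the base URL into a plain path (/tel:+49...) before the blacklist is consulted, which would explain why a scheme-prefix check misses it. For reference, that kind of check looks roughly like the sketch below; this is illustrative only, and the actual implementation in Crawler.php may differ:

    // Illustrative scheme/prefix blacklist; the real checkIfCrawlable() may differ.
    protected function checkIfCrawlable($uri)
    {
        if (empty($uri)) {
            return false;
        }

        $stopPatterns = array(
            '@^javascript\:@i',
            '@^mailto\:@i',
            '@^tel\:@i',
            '@^#@',
        );

        foreach ($stopPatterns as $pattern) {
            if (preg_match($pattern, $uri)) {
                return false;
            }
        }

        return true;
    }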

Crawl the whole site, pages inside other pages

Hi.

Thanks for the script.

I can't figure out how to scan the site deeper. I mean, there is a front page like https://example.com, and on that page there are links to other pages, which in turn contain further links. In the code below, the crawler visits only the pages linked from the front page, not the links inside those pages.

E.g. the front page links to https://example.com/links, and on that page there are a few links, but the script doesn't visit them.

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;

set_time_limit(6000);

$linkDepth = 500;
// Initiate crawl    

$crawler = new \Arachnid\Crawler("https://example.com", $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();

It's possible to modify the code above, but if there is an out-of-the-box solution, that would be better.

Thx

Catchable fatal error

I'm getting this error while running the code:

Catchable fatal error: Argument 1 passed to Front\simpleCrawler::extractTitleInfo() must be an instance of Front\DomCrawler, instance of Symfony\Component\DomCrawler\Crawler given, called in C:\xampp\htdocs\webAN\src\Front\FrontController.php on line 336 and defined in C:\xampp\htdocs\webAN\src\Front\FrontController.php on line 453

Improvement suggestions

We at sulu-cmf want to use your crawler to create an HTTP cache warmer and website information extractor. I will start today to use your class in a new Symfony bundle.

For this reason, I would like to ask whether you have time to contribute to your class?

I will create a PR to include some improvements we need:

  • Extract metadata
  • Get the status_code of external links to check for broken links
  • Perhaps the possibility to add a "progress bar"

I hope you will be able to merge this PR. Thanks for your good work so far (= it saves me a lot of time.

With best regards
sulu-cmf

Does not support js-rendered sites

Sites like https://taxibambino.com, or other sites whose HTML is rendered by JavaScript, are not supported by this crawler, and I understand that by design (being a back-end based crawler) this is not fixable. I am afraid that without support for JS-only sites this crawler becomes obsolete.

Images treated as 404 - false positive

Hello, in this case images are reported as 404, while in reality their URLs are fine. This should be fixed.

array:8 [▼
  "/images/2017-putsschema-1.png" => array:9 [▼
    "original_urls" => array:1 [▼
      "/images/2017-putsschema-1.png" => "/images/2017-putsschema-1.png"
    ]
    "links_text" => array:1 [▼
      "PUTSSCHEMA 1" => "PUTSSCHEMA 1"
    ]
    "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-1.png"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "https://ssfonsterputs.se/putsschema/"
    "depth" => 2
    "status_code" => 404
  ]
  "/images/2017-putsschema-2.png" => array:9 [▼
    "original_urls" => array:1 [▶]
    "links_text" => array:1 [▶]
    "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-2.png"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "https://ssfonsterputs.se/putsschema/"
    "depth" => 2
    "status_code" => 404
  ]
  "/images/2017-putsschema-3.png" => array:9 [▼
    "original_urls" => array:1 [▶]
    "links_text" => array:1 [▶]
    "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-3.png"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "https://ssfonsterputs.se/putsschema/"
    "depth" => 2
    "status_code" => 404
  ]
  "/images/2017-putsschema-4.png" => array:9 [▶]
  "/images/2017-putsschema-5.png" => array:9 [▶]
  "/images/2017-putsschema-6.png" => array:9 [▶]
  "/images/2017-putsschema-7.png" => array:9 [▶]
  "/images/2017-putsschema-8.png" => array:9 [▶]
]

Abandoned?

I tried to pull this through composer, but it was flagged abandoned. Just checking if this was intentional :)

Update Goutte dependency version

Hi,

I'm trying to use Goutte and Arachnid together to crawl and then scrape content from a website. I've installed Goutte, which currently sits at version 3.1. I'm unable to install Arachnid alongside this version of Goutte because it requires Goutte version ~1.

Is there any chance we can get the composer.json requirements either updated, or loosened to accept any version of Goutte? Or is there a reason for this library to require that particular version of Goutte?

Arachnid's composer.json requirements:

"require": {
    "php": ">=5.4.0",
    "fabpot/goutte": "~1"
}

My composer.json requirements, using latest stable version of Goutte:

"require": {
    "fabpot/goutte": "^3.1"
}

Thanks

Sites like LinkedIn should probably be excluded

Some social giants should be excluded from the results if they require a login to be accessed, in order to only see the really broken links. With the current situation we get some false positives from sites like LinkedIn.
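
Until something is built in, one workaround is to drop known login-walled hosts from the results before reporting broken links. A sketch, assuming getLinks() returns entries that carry an 'absolute_url' field as shown in the dumps elsewhere in this issue list:

    // Hypothetical post-processing: drop login-walled hosts from the report.
    $loginWalledHosts = array('linkedin.com', 'www.linkedin.com');

    $filtered = array_filter($crawler->getLinks(), function ($info) use ($loginWalledHosts) {
        $host = isset($info['absolute_url'])
            ? parse_url($info['absolute_url'], PHP_URL_HOST)
            : null;
        return !in_array($host, $loginWalledHosts, true);
    });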

How to find out from which url the url was crawled?

So let's say I am crawling a website http://website.com and it has a broken link http://website.com/dir/subdir/red located on http://website.com/dir/subdir. Is there a way that, along with all the other data, there would also be a key "source" => "http://website.com/dir/subdir"?

Also,
is there a way to force all these keys on all of the crawled urls, not just a fraction of them as it is currently?

"original_urls" => 
    "links_text" =>
    "absolute_url" => 
    "external_link" => 
    "visited" => 
    "frequency" => 
    "depth" => 
    "status_code" => 
    "error_code" => 
    "error_message" =>

Undefined index: external_link

I'm getting an Undefined index: external_link error in LinksCollection.php (line 51). I'm suspecting it may be because the site I'm indexing has some "javascript:void(0)" links on buttons and so forth that are tied to jQuery events, etc. Wondering if you might have any insight or ideas. Any help would be greatly appreciated. Thanks.

Response 401 - Authentication

Hi, I need authentication against LDAP via HTTP Auth, and it gives me a 401 status code.

How can I do this using 'CookieJar'? Like in the comment:
http://zrashwani.com/simple-web-spider-php-goutte/#comment-92

It gives me:

Array
(
    [http://somehost] => Array
        (
            [links_text] => Array
                (
                    [0] => BASE_URL
                )
            [absolute_url] => http://somehost
            [frequency] => 1
            [visited] => 
            [external_link] => 
            [original_urls] => Array
                (
                    [0] => http://somehost
                )
            [status_code] => 401
        )
)
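
Since getScrapClient() is protected (see the timeout issue above), one thing to try, sketched here and untested, is to subclass the crawler and hand Goutte a Guzzle client configured with Guzzle's standard 'auth' option for HTTP Basic Auth:

    use Goutte\Client as GoutteClient;

    // Untested sketch: override the protected client factory and pass
    // Basic Auth credentials via Guzzle's 'auth' option.
    class AuthCrawler extends \Arachnid\Crawler
    {
        protected function getScrapClient()
        {
            $client = new GoutteClient();
            $client->followRedirects();

            $guzzleClient = new \GuzzleHttp\Client(array(
                'auth' => array('username', 'password'),
                'curl' => array(
                    CURLOPT_SSL_VERIFYHOST => false,
                    CURLOPT_SSL_VERIFYPEER => false,
                ),
            ));
            $client->setClient($guzzleClient);

            return $client;
        }
    }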

Absolute links and the actual URLs are in some cases rendered wrongly

For example, the page http://toastytech.com/evil/ with $linkDepth = 2; gives a lot of incorrect URLs. You may say that this webpage is very old and no one writes relative URLs like "../yourUrlPath" anymore, but I think this should still be fixed :)

"/evil/../links/index.html" => array:14 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://toastytech.com/evil/../links/index.html"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://toastytech.com/evil/"
    "depth" => 1
    "status_code" => 200
    "title" => "Nathan's Links"
    "meta_keywords" => ""
    "meta_description" => ""
    "h1_count" => 1
    "h1_contents" => array:1 [ …1]

404 error is hardcoded

Hello, so the error_code is hardcoded to always return a 404, but in real life we are often dealing with a 403, a 500, etc. It would be nice to see a bit more info; I know this is not difficult to check. :)

For example, the method could look something like this:

function check_http_code($url)
{
    // Request the URL and read the real HTTP status code from cURL
    // instead of hardcoding 404.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    $headers = curl_getinfo($ch);
    curl_close($ch);

    return $headers['http_code'];
}
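
Called with the absolute URL of a collected link, it returns the real status code, e.g.:

    echo check_http_code('http://www.example.com/some/page'); // e.g. 200, 403 or 500 instead of a hardcoded 404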
