zrashwani / arachnid
Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites
License: MIT License
Hi, can you please add a release for commit 244180c?
We are not able to use HEAD anymore as we have Laravel 5.3 and your package now requires Illuminate/Support 5.4.
Thank you,
Anthony
It would be great to have the option to disable the crawling of external links.
I'd like to crawl an entire website, but am not interested in external links. By setting the 'depth' option to something suitably high in order to capture the entire website, I also end up doing a deep crawl of external websites.
What is the rationale for using named $nodeUrl and $nodeText array keys?
$childLinks[$hash]['original_urls'][$nodeUrl] = $nodeUrl;
$childLinks[$hash]['links_text'][$nodeText] = $nodeText;
Would it not be more consistent and easier to parse if we changed to numerical keys?
$childLinks[$hash]['original_urls'][] = $nodeUrl;
$childLinks[$hash]['links_text'][] = $nodeText;
It's not currently possible to configure a timeout for the Guzzle client which is used to make HTTP requests when spidering a site. Without an explicit value, Guzzle defaults to a timeout of 0 – i.e. it will wait indefinitely until a response is received. (Which arguably isn't a sensible default anyway.)
I'm trying to spider a site which contains a link to a dead server. Requests to that URL never time out, meaning the spider process gets stuck on this URL and never proceeds.
The timeout is configured when constructing a new Guzzle client, which is currently done in Arachnid\Crawler::getScrapClient():
protected function getScrapClient()
{
    $client = new GoutteClient();
    $client->followRedirects();

    $guzzleClient = new \GuzzleHttp\Client(array(
        'curl' => array(
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_SSL_VERIFYPEER => false,
        ),
    ));
    $client->setClient($guzzleClient);

    return $client;
}
It would be really helpful if a timeout was configured here. To do that, all we need to do is change the configuration array which is passed to the Guzzle client constructor method:
$guzzleClient = new \GuzzleHttp\Client(array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
));
I think a sensible default would be a 30 second timeout, but it would be great to have that configurable. That could either be an additional parameter in the constructor method, or alternatively an object property which can be changed.
In fact – it might make sense to allow us to add anything to the Guzzle constructor configuration. Perhaps again by means of a class property or constructor parameter whereby we can pass in an array of configuration options. This could be useful when configuring other client options, for example HTTP authentication:
$crawler = new Crawler($url, 3, array(
    'timeout' => 5,
    'connect_timeout' => 5,
    'auth' => array('username', 'password'),
));
Thoughts? I'd be happy to put together a PR for this, provided we can get some agreement on how this should be configured (class constructor, public property, static property, getter/setter, etc.)
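For illustration, here is a minimal sketch of how such a configurable options array could be merged with the current defaults; the function name buildGuzzleConfig and the default values shown are assumptions of mine, not part of Arachnid's API:

```php
<?php
// Hypothetical sketch: merge caller-supplied Guzzle options over the defaults
// currently hardcoded in getScrapClient(). buildGuzzleConfig is an assumed
// helper name, not an existing Arachnid method.
function buildGuzzleConfig(array $overrides = array())
{
    $defaults = array(
        'curl' => array(
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_SSL_VERIFYPEER => false,
        ),
        'timeout'         => 30,
        'connect_timeout' => 30,
    );

    // Caller-supplied keys (e.g. 'auth', 'timeout') win over the defaults.
    return array_replace($defaults, $overrides);
}
```

The resulting array would then be passed straight to the \GuzzleHttp\Client constructor, so any Guzzle request option (HTTP auth, proxies, custom headers) would work without further changes to the crawler.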
Hi, filterLinks doesn't work in this example:
$url = "http://uk.louisvuitton.com/eng-gb/men/men-s-bags/fashion-shows/_/N-54s1t";
$crawler = new Crawler($url, 2);
$links = $crawler
    ->filterLinks(function ($link) {
        return (bool) preg_match('/\/eng-gb\/products\/(.*)/', $link);
    })
    ->traverse()
    ->getLinks();
What is wrong?
How do I get the URL/page title of the parent page of a crawled link?
Suppose I have the URL "www.example.com" and it is the parent, but it has a child "www.example.com/pageone.html", and the grandchild is "www.example.com/pageone/pagetwo.html".
After traversing pagetwo.html, how do I get the page title/URL of the other two URLs?
This is the error that gives me:
Array
(
    [http://www.*********.com] => Array
        (
            [links_text] => Array
                (
                    [0] => BASE_URL
                )
            [absolute_url] => http://www.*********.com
            [frequency] => 1
            [visited] =>
            [external_link] =>
            [original_urls] => Array
                (
                    [0] => http://www.***********.com
                )
            [status_code] => 404
            [error_code] => 0
            [error_message] => The current node list is empty.
        )
)
Another error:
Warning: array_replace(): Argument #2 is not an array in /var/www/test/includes/vendor/symfony/browser-kit/CookieJar.php on line 200
Thanks!!
Your package supports PHP version 5.3, but there are several closures that use $this, yet $this cannot be used in anonymous functions before PHP version 5.4.
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L181
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L193
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L200
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L235
A simple solution would be to require version 5.4. This would also have a nice side effect: it would allow using the latest version of Goutte.
Despite being blacklisted in checkIfCrawlable, tel: links get crawled.
Tested via:
$crawler = new \Arachnid\Crawler('https://www.handyflash.de/', 3);
$crawler->traverse();
In the Apache access log, a hit like
www.handyflash.de:443 213.XXX.YYY.ZZZ - - [10/Jun/2016:12:24:42 +0200] "GET /tel:+4923199778877 HTTP/1.1" 404 37312 "-" "Symfony2 BrowserKit" 0
is recorded.
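A possible fix, sketched below: reject non-HTTP schemes explicitly before the link is resolved against the base URL. isCrawlableUri is an illustrative helper name and the scheme list is an assumption; this is not the actual checkIfCrawlable code.

```php
<?php
// Illustrative sketch only: reject tel:, mailto:, javascript: and similar
// schemes before the link is resolved relative to the base URL.
// isCrawlableUri is a hypothetical name, not Arachnid's checkIfCrawlable.
function isCrawlableUri($uri)
{
    $uri = trim($uri);

    // Empty links and pure fragments are not crawlable.
    if ($uri === '' || $uri[0] === '#') {
        return false;
    }

    // Non-HTTP schemes must be rejected before relative-URL resolution,
    // otherwise "tel:+49..." gets resolved to "/tel:+49..." and requested.
    if (preg_match('/^(tel|mailto|javascript|sms|fax):/i', $uri)) {
        return false;
    }

    return true;
}
```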
Hi.
Thanks for the script.
I can't find how to scan the site deeper. I mean, there is a front page like https://example.com, and on that page there are links to other pages, which in turn contain further pages with links. With the code below, the crawler visits pages only via the links on the front page, but not the links inside those pages.
E.g. the front page has a link to https://example.com/links, and on that page there are a few more links; the script doesn't visit the links on that page.
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

set_time_limit(6000);
$linkDepth = 500;

// Initiate crawl
$crawler = new \Arachnid\Crawler("https://example.com", $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();
It's possible to modify the code above, but if a solution exists out of the box, that's better.
Thx
I'm getting this error while running the code:
Catchable fatal error: Argument 1 passed to Front\simpleCrawler::extractTitleInfo() must be an instance of Front\DomCrawler, instance of Symfony\Component\DomCrawler\Crawler given, called in C:\xampp\htdocs\webAN\src\Front\FrontController.php on line 336 and defined in C:\xampp\htdocs\webAN\src\Front\FrontController.php on line 453
We at sulu-cmf want to use your crawler to create an HTTP cache warmer and website-information extractor. I will start today using your class in a new Symfony bundle.
For that reason, I would like to ask whether you have time to keep contributing to your class.
I will create a PR to include some improvements we need.
I hope you will be able to merge this PR. Thanks for your good work so far; it saves me a lot of time.
With best regards
sulu-cmf
Sites like https://taxibambino.com, or other sites whose HTML is rendered from JS, are not supported by this crawler, and I understand that by design (being a back-end based crawler) this is not fixable. I am afraid that, lacking support for JS-only sites, this crawler becomes obsolete.
Hello, in this case images are reported as 404, while in reality they have good URLs. This should be fixed.
array:8 [
    "/images/2017-putsschema-1.png" => array:9 [
        "original_urls" => array:1 [
            "/images/2017-putsschema-1.png" => "/images/2017-putsschema-1.png"
        ]
        "links_text" => array:1 [
            "PUTSSCHEMA 1" => "PUTSSCHEMA 1"
        ]
        "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-1.png"
        "external_link" => false
        "visited" => false
        "frequency" => 1
        "source_link" => "https://ssfonsterputs.se/putsschema/"
        "depth" => 2
        "status_code" => 404
    ]
    "/images/2017-putsschema-2.png" => array:9 [
        "original_urls" => array:1 [ …1]
        "links_text" => array:1 [ …1]
        "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-2.png"
        "external_link" => false
        "visited" => false
        "frequency" => 1
        "source_link" => "https://ssfonsterputs.se/putsschema/"
        "depth" => 2
        "status_code" => 404
    ]
    "/images/2017-putsschema-3.png" => array:9 [
        "original_urls" => array:1 [ …1]
        "links_text" => array:1 [ …1]
        "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-3.png"
        "external_link" => false
        "visited" => false
        "frequency" => 1
        "source_link" => "https://ssfonsterputs.se/putsschema/"
        "depth" => 2
        "status_code" => 404
    ]
    "/images/2017-putsschema-4.png" => array:9 [ …9]
    "/images/2017-putsschema-5.png" => array:9 [ …9]
    "/images/2017-putsschema-6.png" => array:9 [ …9]
    "/images/2017-putsschema-7.png" => array:9 [ …9]
    "/images/2017-putsschema-8.png" => array:9 [ …9]
]
I tried to pull this in through Composer, but it was flagged as abandoned. Just checking whether this was intentional :)
Hi,
I'm trying to use Goutte and Arachnid together to crawl and then scrape content from a website. I've installed Goutte, which currently sits at version 3.1. I'm unable to install Arachnid alongside this version of Goutte because it requires Goutte version ~1.
Is there any chance we can get the composer.json requirements either updated, or loosened to accept any version of Goutte? Or is there a reason for this library to require that version of Goutte?
Arachnid's composer.json requirements:
"require": {
    "php": ">=5.4.0",
    "fabpot/goutte": "~1"
}
My composer.json requirements, using latest stable version of Goutte:
"require": {
    "fabpot/goutte": "^3.1"
}
Thanks
Some social giants should be excluded from the results if they require a login to be accessed, so that only the genuinely broken links show up. With the current situation we get some false positives with sites like LinkedIn.
So let's say I am crawling a website http://website.com, and it has a broken link http://website.com/dir/subdir/red located on http://website.com/dir/subdir. Is there a way to include, along with all the other data, a key "source" => "http://website.com/dir/subdir"?
Also, is there a way to force all these keys on all of the crawled URLs, not just a fraction of them, as is currently the case?
"original_urls" =>
"links_text" =>
"absolute_url" =>
"external_link" =>
"visited" =>
"frequency" =>
"depth" =>
"status_code" =>
"error_code" =>
"error_message" =>
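Until that's supported upstream, the keys can be forced on the caller's side by merging each entry over a defaults array. A sketch, assuming $crawler->getLinks() returns arrays shaped like the one above; the helper name normalizeLinkInfo and the default values are my own, not part of Arachnid:

```php
<?php
// Sketch: give every crawled entry the same set of keys by merging it over
// a defaults array. normalizeLinkInfo is a hypothetical helper, not part of
// Arachnid; the default values are assumptions.
function normalizeLinkInfo(array $links)
{
    $defaults = array(
        'original_urls' => array(),
        'links_text'    => array(),
        'absolute_url'  => null,
        'external_link' => null,
        'visited'       => false,
        'frequency'     => 0,
        'depth'         => null,
        'status_code'   => null,
        'error_code'    => null,
        'error_message' => null,
    );

    $normalized = array();
    foreach ($links as $url => $info) {
        // Keys present in $info win; missing keys fall back to the defaults.
        $normalized[$url] = array_merge($defaults, $info);
    }

    return $normalized;
}
```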
I'm getting an Undefined index: external_link error in LinksCollection.php (line 51). I'm suspecting it may be because the site I'm indexing has some "javascript:void(0)" links on buttons and so forth that are tied to jQuery events, etc. Wondering if you might have any insight or ideas. Any help would be greatly appreciated. Thanks.
Hi, I need authentication against LDAP via HTTP Auth, and it gives me a 401 status code.
How can I do this using 'CookieJar'? As in the comment:
http://zrashwani.com/simple-web-spider-php-goutte/#comment-92
It gives me:
Array
(
    [http://somehost] => Array
        (
            [links_text] => Array
                (
                    [0] => BASE_URL
                )
            [absolute_url] => http://somehost
            [frequency] => 1
            [visited] =>
            [external_link] =>
            [original_urls] => Array
                (
                    [0] => http://somehost
                )
            [status_code] => 401
        )
)
E.g. the page http://toastytech.com/evil/ with $linkDepth = 2 gives a lot of incorrect URLs. You may say that this webpage is very old and no one writes relative URLs like "../yourUrlPath" anymore, but I think this should still be fixed :)
"/evil/../links/index.html" => array:14 [
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://toastytech.com/evil/../links/index.html"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://toastytech.com/evil/"
    "depth" => 1
    "status_code" => 200
    "title" => "Nathan's Links"
    "meta_keywords" => ""
    "meta_description" => ""
    "h1_count" => 1
    "h1_contents" => array:1 [ …1]
]
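Normalizing the dot-segments before storing a link would collapse paths like "/evil/../links/index.html" into "/links/index.html". A rough sketch along the lines of RFC 3986 dot-segment removal; normalizePath is an illustrative helper, not an existing Arachnid method:

```php
<?php
// Sketch of dot-segment removal in the spirit of RFC 3986 section 5.2.4.
// normalizePath is a hypothetical helper; a complete implementation would
// also need to preserve trailing slashes and handle query strings.
function normalizePath($path)
{
    $output = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($output);                       // step up one level
        } elseif ($segment !== '.' && $segment !== '') {
            $output[] = $segment;                     // keep real segments
        }
    }

    return '/' . implode('/', $output);
}
```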
Is there a method to get the anchor?
Hello, so the error_code is hardcoded to always return a 404, but in real life we are often dealing with a 403, a 500, etc. It would be nice to see a bit more info; I know this is not difficult to check. :)
For example, the method could look something like this:
function check_http_code($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);          // include headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the response instead of printing it
    curl_exec($ch);
    $headers = curl_getinfo($ch);                 // transfer metadata, incl. http_code
    curl_close($ch);

    return $headers['http_code'];                 // the real status code: 403, 500, ...
}