php-spider's Introduction

PHP-Spider Features

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as robots.txt and domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication. See example.
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy

This spider does not support JavaScript.
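
As a sketch of the custom prefetch filter support listed above: the `PreFetchFilterInterface` name, the `match()` signature, and its true-means-skip semantics are assumptions based on the library's conventions, so check the filters bundled with php-spider for the real contract.

```php
<?php
// Hypothetical prefetch filter: skip any URI whose path contains
// "/private/". Interface and method names are assumptions; compare
// with the filters shipped in src/Filter before relying on this.
use VDB\Spider\Filter\PreFetchFilterInterface;
use VDB\Spider\Uri\DiscoveredUri;

class PrivatePathFilter implements PreFetchFilterInterface
{
    public function match(DiscoveredUri $uri): bool
    {
        // Returning true means "matched by the filter", i.e. do not fetch.
        return str_contains((string) $uri->getPath(), '/private/');
    }
}

// Filters are attached to the DiscovererSet, as with the bundled ones:
// $spider->getDiscovererSet()->addFilter(new PrivatePathFilter());
```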

Installation

The easiest way to install PHP-Spider is with composer. Find it on Packagist.

$ composer require vdb/php-spider

Usage

This is a very simple example. This code can be found in example/example_simple.php. For a more complete, real-world example with logging, caching and filters, see example/example_complex.php.

Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To set the spider up to keep processing, see the link checker example. It uses a custom request handler that configures the default Guzzle request handler not to fail on 4XX and 5XX responses.
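That approach can be sketched as follows. This is a hedged sketch, not the repo's actual code: the `GuzzleRequestHandler`, `setClient()` and `setRequestHandler()` names are assumptions about the library's API, so consult the link checker example for the real wiring.

```php
<?php
// Sketch: configure Guzzle so 4XX/5XX responses are returned normally
// instead of throwing, letting the spider continue past broken pages.
// Class and method names below are assumptions; see the link checker
// example shipped with php-spider for the canonical version.
use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;
use VDB\Spider\RequestHandler\GuzzleRequestHandler;

$guzzle = new Client([
    RequestOptions::HTTP_ERRORS => false, // do not throw on 4XX/5XX
]);

$requestHandler = new GuzzleRequestHandler();
$requestHandler->setClient($guzzle);
$spider->getDownloader()->setRequestHandler($requestHandler);
```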

First create the spider

$spider = new Spider('http://www.dmoz.org');

Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>

$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));

Set some sane options for this example. In this case, we only get the first 10 items from the start page.

$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

Add a listener to collect stats from the Spider and the QueueManager. There are more components that dispatch events you can use.

$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

Execute the crawl

$spider->crawl();

When crawling is done, we could get some info about the crawl

echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());

Finally we could do some processing on the downloaded resources. In this example, we will echo the title of all resources

echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}
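
Putting the snippets above together, the whole walkthrough looks roughly like this. The `use` statements are assumptions about the namespace layout; example/example_simple.php in the repository is the authoritative version.

```php
<?php
// Hedged, self-contained version of the walkthrough above. The
// namespace imports are assumptions; consult example/example_simple.php.
require 'vendor/autoload.php';

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\StatsHandler;

$spider = new Spider('http://www.dmoz.org');

// Discover only links inside the catalogs div
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));

// Keep the crawl small for this example
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Collect statistics from both dispatchers
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

$spider->crawl();

echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());

echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}
```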

Contributing

Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request. The Symfony documentation contains an excellent guide on how to do that properly: Submitting a Patch.

There are a few requirements for a Pull Request to be accepted:

  • Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides;
  • Prove that the code works with unit tests and that coverage remains 100%;

Note: An easy way to check if your code conforms to PHP-Spider is by running the script bin/static-analysis, which is part of this repo. This will run the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.

Note: To run PHPUnit with coverage, and to check that coverage == 100%, you can run bin/coverage-enforce.

Support

For things like reporting bugs and requesting features it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)

License

PHP-Spider is licensed under the MIT license.

php-spider's People

Contributors

dependabot[bot], dmitrysidorenkoshim, eddiejaoude, greatwitenorth, mvdbos, peter17, readmecritic, scrutinizer-auto-fixer, soeren-helbig, spekulatius

php-spider's Issues

Cannot retrieve href attribute of source page

I am trying to retrieve all href attributes of the top categories on the start page, like 'Arts', 'Business', 'Computers' and so on, but I cannot get it to work.

<aside class="arts" xpath="1">
        <div id="home-cat-arts" class="category arts mobile" onclick="window.location.href='/Arts/'">
            <h2 class="top-cat"><a href="/Arts/">Arts</a></h2>
            <h3 class="sub-cat"><a href="/Arts/Movies/">Movies</a>, 
                                <a href="/Arts/Television/">Television</a>, 
                                <a href="/Arts/Music/">Music</a>...</h3>
        </div>
</aside>

I tried the following:

// Create Spider
$spider = new Spider('http://dmoztools.net');

// Add a URI discoverer. Without it, the spider does nothing.
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//section[@id='category-section']//h2[@class='top-cat']"));

// Set some sane options for this example. In this case, we only get the first 10 items from the start page.
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Let's add something to enable us to stop the script
$spider->getDispatcher()->addListener(
    SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
    function (Event $event) {
        echo "\nCrawl aborted by user.\n";
        exit();
    }
);

// Add a listener to collect stats to the Spider and the QueueMananger.
// There are more components that dispatch events you can use.
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

// Execute crawl
$spider->crawl();

// Build a report
echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED:    " . count($statsHandler->getPersisted());

// Finally we could do some processing on the downloaded resources
echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath("//a/@href")->text();
}

A little help would be very much appreciated!

Crawler Stopping on Page with 500 error

Hey @mvdbos! Love this package, and have used this in a couple places. Thanks for killer work on this.

We have a crawler that is trying to crawl a site. However, one of the linked pages on the site has a 500 error, which then stops the crawler. Is there a way to make sure it continues to crawl even if a page has a 5xx-level error?

Our Code

        // Create Spider
        $spider = new Spider($url);

        // Add a URI discoverer. Without it, the spider does nothing.
        // In this case, we want <a> tags and the canonical link
        $spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//a|//link[@rel=\"canonical\"]/a"));
        $spider->getDiscovererSet()->addFilter(new AllowedHostsFilter([$url], true));

        // Set limits
        $spider->getDiscovererSet()->maxDepth = 25;
        $spider->getQueueManager()->maxQueueSize = 1000;

        // Let's add something to enable us to stop the script
        $spider->getDispatcher()->addListener(
            SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
            function (GenericEvent $event) {
                exit;
            }
        );

        // Execute crawl - dies here on the error
        $spider->crawl();

Abandoned package (guzzle/guzzle)

Hello,
When I was installing the package with composer, it appears that guzzle/guzzle is abandoned. I mean the package moved to a new name: guzzlehttp/guzzle
I think it's better to copy-paste the install log so you can see what's going on; a picture is worth 10K words.

l4p1n@l4p1n:/var/www/default/sites/mirror-o-matic$ composer require vdb/php-spider
Using version ^0.2.0 for vdb/php-spider
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)

# [Some installing fu]

symfony/event-dispatcher suggests installing symfony/dependency-injection ()
symfony/event-dispatcher suggests installing symfony/http-kernel ()
guzzle/guzzle suggests installing guzzlehttp/guzzle (Guzzle 5 has moved to a new package name. The package you have installed, Guzzle 3, is deprecated.)
Package guzzle/guzzle is abandoned, you should avoid using it. Use guzzlehttp/guzzle instead.
Writing lock file
Generating autoload files
l4p1n@l4p1n:/var/www/default/sites/mirror-o-matic$ 

(Note: Writing this post at 2AM and I could have made some stupid misspells or not been clear enough)

Is it possible to skip creation of the results files and just report if the links are valid?

Hi.

I'm wondering if it's possible to use the link checker example to just check for valid links, and maybe store them in a JSON, or CSV file instead of creating binary files and index.html files inside the results folder?

Should I try to create my own persistence handler for this?

Basically, I'd just like to crawl my site to check whether there are any 404 pages. I'm not necessarily interested in whether any of the links on a page return 404; I just need to check that all my own pages are healthy.

Install by Composer problem unsolved

Hello;

I've tried to install php-spider with Composer. I added "minimum-stability": "dev", as it was needed for VDB/uri, but problem emerged:

` Your requirements could not be resolved to an installable set of packages.

Problem 1
- Conclusion: don't install symfony/symfony 2.6.x-dev
- Conclusion: don't install symfony/symfony 2.5.x-dev
- Conclusion: don't install symfony/symfony v2.5.3
- Conclusion: don't install symfony/symfony v2.5.2
- Conclusion: don't install symfony/symfony v2.5.1
- Conclusion: don't install symfony/symfony v2.5.0
- Conclusion: don't install symfony/symfony v2.5.0-RC1
- Conclusion: don't install symfony/symfony v2.5.0-BETA2
- Conclusion: don't install symfony/symfony v2.5.0-BETA1
- Conclusion: remove symfony/symfony 2.4.x-dev
- Conclusion: don't install symfony/symfony 2.4.x-dev
- Conclusion: don't install symfony/symfony v2.4.8
- Conclusion: don't install symfony/symfony v2.4.7
- Conclusion: don't install symfony/symfony v2.4.6
- Conclusion: don't install symfony/symfony v2.4.5
- Conclusion: don't install symfony/symfony v2.4.4
- Conclusion: don't install symfony/symfony v2.4.3
- Conclusion: don't install symfony/symfony v2.4.2
- Conclusion: don't install symfony/symfony v2.4.1
- Conclusion: don't install symfony/symfony v2.4.0
- Installation request for vdb/php-spider dev-master -> satisfiable by vdb/php-spider[dev-master].
- Conclusion: don't install symfony/symfony v2.4.0-RC1
- Conclusion: don't install symfony/symfony v2.4.0-BETA2
- vdb/php-spider dev-master requires symfony/finder 2.2.*@dev -> satisfiable by symfony/symfony[2.2.x-dev], symfony/finder[2.2.x-dev, v2.2.0, v2.2.1, v2.2.10, v2.2.11, v2.2.2, v2.2.3, v2.2.4, v2.2.5, v2.2.6, v2.2.7, v2.2.8, v2.2.9].
- Can only install one of: symfony/symfony[v2.4.0-BETA1, 2.2.x-dev].
- don't install symfony/finder 2.2.x-dev|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.0|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.1|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.10|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.11|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.2|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.3|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.4|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.5|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.6|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.7|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.8|don't install symfony/symfony v2.4.0-BETA1
- don't install symfony/finder v2.2.9|don't install symfony/symfony v2.4.0-BETA1
- Installation request for symfony/symfony ~2.4 -> satisfiable by symfony/symfony[2.4.x-dev, 2.5.x-dev, 2.6.x-dev, v2.4.0, v2.4.0-BETA1, v2.4.0-BETA2, v2.4.0-RC1, v2.4.1, v2.4.2, v2.4.3, v2.4.4, v2.4.5, v2.4.6, v2.4.7, v2.4.8, v2.5.0, v2.5.0-BETA1, v2.5.0-BETA2, v2.5.0-RC1, v2.5.1, v2.5.2, v2.5.3]. `

What should I do to install php-spider then? How do I load it manually? And can you fix this problem?

Robots.txt filtering?

Hello @mvdbos,

I hope you are doing well!

I was wondering what your approach (if any) is to using the spider with robots.txt pattern for filtering?

The UriFilter seems to support only Allow rules and no Disallow rules, and hence wouldn't cover the most common use case of robots files.

I just thought I'd ask whether you have ideas on this before I build something myself.

Thank you in advance,

Peter

Limit spider to supplied domain

Is there a way to limit the spider so that it only stays on the supplied domain, i.e. to list any links to other domains but not follow them?

URL with Port number issue

If the url contains a port number, i.e. mydomain.com:8080, it appears this gets dropped? Is this correct or is there config for this?

question: distributed spidering

will eventually support distributed spidering with a central queue

Do you have any roadmap for when you will implement this? Have you thought about how that would work? I would like to help if you need it.

best,

Add prefetch filter to cache downloads with a max age

With this prefetch filter in place, skip fetching resources that are already downloaded and younger than max age. This requires that downloads are not segmented per spider id. A simple option is to set the same spider id for runs where you want to use the cache.
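
A hedged sketch of what such a filter could look like, assuming a prefetch filter exposes a `match()` method that returns true when a URI should be skipped, and that downloads are named via `urlencode()` of the URI as discussed elsewhere on this page (all of these are assumptions, not the final design):

```php
<?php
// Hypothetical max-age cache filter: skip URIs whose downloaded file
// already exists and is younger than $maxAge seconds. The interface
// name, match() semantics and urlencode() file naming are assumptions.
use VDB\Spider\Filter\PreFetchFilterInterface;
use VDB\Spider\Uri\DiscoveredUri;

class MaxAgeCacheFilter implements PreFetchFilterInterface
{
    public function __construct(
        private string $downloadDir,
        private int $maxAge // maximum cache age in seconds
    ) {
    }

    public function match(DiscoveredUri $uri): bool
    {
        $file = $this->downloadDir . '/' . urlencode($uri->toString());

        // true = filter out: a sufficiently fresh copy is already on disk
        return is_file($file) && (time() - filemtime($file)) < $this->maxAge;
    }
}
```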

get source url

First of all, thanks for creating the php-spider script (almost) everything I need for my project is in it.

Is it possible to get the source of the spider where the relevant URLs were found?

For example, if I index 500 URLs and some of these URLs return an incorrect status code (404, 403 or 500), it is currently difficult to find out on which page the incorrect URL was discovered.

Thank you
Constan

Limit parse uri with mask

How do I make a filter to crawl URIs by mask?
For example: I need all pages under domain.com/category1/ but no pages under domain.com/about/. How do I build this filter?
I tried this:

$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//a[@href^='http://domain.com/category1']"));

but get php error...

install problems

I'm no PhD in PHP, but composer is having issues installing this on my machine. Very interested in seeing what you have put together with your spider...

I don't have curl installed as I'm just testing this on my local Windows 7 machine running xampp, but when I run the command in composer

php composer.phar require vdb/php-spider
Please provide a version constraint for the way/generators requirement: 0.1.*

I'm told the

vdb/uri dev-master
can't be found.

I've tried installing this from a clone, but no luck...

any thoughts?

Update
Just re-shifted some things and now I'm getting this error in the example_simple.php

Fatal error: Call to undefined function VDB\Spider\pcntl_signal() in C:\xampp\htdocs\test\src\VDB\Spider\Spider.php on line 98

Why are so many methods private?

I am trying to build my own spider on top of php-spider and ran into trouble extending the default services, because they use private methods (for instance DiscovererSet). Is there some special reason for this? Could we make them "protected"?

Fatal error: Uncaught TypeError: Example\GuzzleTimerMiddleware::onResponse(): Argument #3 ($response) must be of type GuzzleHttp\Promise\FulfilledPromise, GuzzleHttp\Promise\RejectedPromise given

Hi - great work,

I tried to crawl my own website and got the following errors (I renamed the domain); interestingly, other domains worked fine (e.g., example.com), although

Warning: file_get_contents(https://example.com/robots.txt): Failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found
 in D:\adstxt\php-spider\src\Filter\Prefetch\RobotsTxtDisallowFilter.php on line 45

but I think the error is not caught ...

Crawling.
Fatal error: Uncaught TypeError: Example\GuzzleTimerMiddleware::onResponse(): Argument #3 ($response) must be of type GuzzleHttp\Promise\FulfilledPromise, GuzzleHttp\Promise\RejectedPromise given, called in C:\php-spider\vendor\guzzlehttp\guzzle\src\Middleware.php on line 144 and defined in C:\php-spider\example\lib\Example\GuzzleTimerMiddleware.php on line 33

TypeError: Example\GuzzleTimerMiddleware::onResponse(): Argument #3 ($response) must be of type GuzzleHttp\Promise\FulfilledPromise, GuzzleHttp\Promise\RejectedPromise given, called in C:\php-spider\vendor\guzzlehttp\guzzle\src\Middleware.php on line 144 in C:\php-spider\example\lib\Example\GuzzleTimerMiddleware.php on line 33

Call Stack:
    0.0008     602328   1. {main}() C:\php-spider\example\example_complex.php:0
    0.1850    2424688   2. VDB\Spider\Spider->crawl() C:\php-spider\example\example_complex.php:102
    0.1864    2441312   3. VDB\Spider\Spider->doCrawl() C:\php-spider\src\Spider.php:101
    0.1905    2580408   4. VDB\Spider\Downloader\Downloader->download($uri = class VDB\Spider\Uri\DiscoveredUri { protected VDB\Uri\UriInterface|string $decorated = class VDB\Uri\Http { private string ${VDB\Uri\Uri}uri = 'https://myprivatewebsite.com'; protected VDB\Uri\Uri $baseUri = *uninitialized*; private string ${VDB\Uri\Uri}remaining = ''; private ?string ${VDB\Uri\Uri}composedURI = 'https://myprivatewebsite.com/'; protected ?string $authority = 'myprivatewebsite.com'; protected ?string $userInfo = NULL; protected ?string $scheme = 'https'; protected ?string $host = 'myprivatewebsite.com'; protected ?int $port = NULL; protected ?string $path = '/'; protected ?string $query = NULL; protected ?string $fragment = NULL; protected ?string $username = NULL; protected ?string $password = NULL }; private int $depthFound = 0 }) C:\php-spider\src\Spider.php:177
    0.1905    2580408   5. VDB\Spider\Downloader\Downloader->fetchResource($uri = class VDB\Spider\Uri\DiscoveredUri { protected VDB\Uri\UriInterface|string $decorated = class VDB\Uri\Http { private string ${VDB\Uri\Uri}uri = 'https://myprivatewebsite.com'; protected VDB\Uri\Uri $baseUri = *uninitialized*; private string ${VDB\Uri\Uri}remaining = ''; private ?string ${VDB\Uri\Uri}composedURI = 'https://myprivatewebsite.com/'; protected ?string $authority = 'myprivatewebsite.com'; protected ?string $userInfo = NULL; p

Thanks
Robert

suitable as link checker?

Is this suitable to use as a base for developing a link checker?

I've given it a quick go, but can't find an easy way to get the response code for each link found.

I'd be wanting to create a report of 404's, 500's etc that need attention. 200's for the sitemap. 301's and 302's that maybe need fixing...

PHP Fatal error: Class 'VDB\\Spider\\Spider' not found

I know I must be doing something stupid, but I'm getting this error after installing php-spider 2 in my document root using composer.json and then running example_simple.php from the root directory. The vendor directory gets created, but in the subfolder vendor/vdb I only have a folder called "uri".

I tried copying the "Spider" folder from the /src/VDB/ directory to vendor/vdb and creating a folder VDB/ and copying the "Spider" folder there. Still get the same error.

I'm using CentOS 6 with PHP 5.5 and didn't see any errors when I ran "composer install" in the same directory as composer.json.

What am I doing wrong here?

Thanks

Confusing separate event dispatchers

Currently, the default queue manager and downloader create their own separate instances of an event dispatcher. This is very confusing when you try to add listeners to $spider->getDispatcher() but it only receives a subset of all the events. Of course you could:

$spider->getDispatcher()->addListener(…);
$spider->getDownloader()->getDispatcher()->addListener(…);
$spider->getQueueManager()->getDispatcher()->addListener(…);

… but then you have to look up which events go to which subsystem.

I have an idea for a simple fix for this. :) PR to come!

Some feature questions

Hi,
I have some questions regarding the features of this crawler, which are not covered by the documentation.

  1. Does php-spider support JavaScript (content and URLs generated via JavaScript)?
  2. Does php-spider follow robots.txt files?
  3. Is php-spider able to leverage a sitemap?
  4. Is it possible to crawl sites that require authentication?

support PHP7?

Hello, can you tell me whether the program supports PHP 7?

Crawl multiple pages

Is it possible to crawl all local site links, i.e. basically crawl every page on a website, provided they are linked correctly?

FilePersistenceHandler descendants fail with too long file paths

I have a problem with file persisting when the full path is too long.

I'm running on Windows, where there is a limit of 260 chars per full file path [source].

There is also a limit on other platforms: 255 chars per file name (not the full path, which is nearly unlimited).

Here is more explanation of limits per platform/file system: https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits


The current implementations of \VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler and \VDB\Spider\PersistenceHandler\FileRawResponsePersistenceHandler make the file name via urlencode() of the original full URI, and it can be longer than the FS limits.


What about replacing urlencode() with md5(), sha1() or a custom function that makes a short file name?

I can prepare a PR with it if you agree.
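
For illustration, the hashing idea boils down to something like this (purely a sketch; a real persistence handler would also need to keep the extension or an index mapping hashes back to URIs):

```php
<?php
// Derive a filesystem-safe, fixed-length file name from an arbitrary
// URI: sha1() always yields 40 hex characters, comfortably below the
// 255-character-per-name limit discussed above.
function shortFileName(string $uri): string
{
    return sha1($uri) . '.html';
}

echo shortFileName('https://example.com/very/long/path?with=query&params=1');
// Always 45 characters, no matter how long the URI is.
```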

Upgrading to support Symfony 6.

Hello,

Do you have plans to upgrade the package to support "symfony/finder": "^6.0"?

It would be awesome if so.
Because I have to upgrade one package to support the latest Laravel version, but can't do so due to
"symfony/finder": "^3.0.0||^4.0.0||^5.0.0"
in your composer.json.

Thank you in advance

Authentication

Hi,
Does it support webform authentication before crawling? Many websites require you to log in before letting you access their data.
Thanks

Laravel 5.3: class not found

I'm trying to use this package in a Laravel 5.3 app. I installed it using composer.

When I try to instantiate a new spider I get class not found exception.

My composer

    "require": {
        "php": ">=5.6.4",
        "laravel/framework": "5.3.*",
        "vdb/php-spider": "^0.2.0",
        "barryvdh/laravel-debugbar": "^2.3"
    },

In my controller I have the following:

use \VDB\Spider;

class HomeController extends Controller
{
    /**
     * Show the application dashboard.
     *
     * @return \Illuminate\Http\Response
     */
    public function index()
    {
        $spider = new \VDB\Spider('http://www.dmoz.org');
        return view('home');
    }
}

Can you help me?

thanks

Info

Hi, how to implement a search results page with your spider?
There is some example?

Thx.

Follow only internal redirects

Hello @mvdbos

I haven't found time to look into the robots.txt filter discussed in the other issue. Sorry! I stumbled on a new question you might be able to shine some light on:

I'm trying to filter out URLs that have been redirected externally. I'm keen to implement a PostFetchFilter to keep it all within the spider. I was wondering if it possible to get the final URL (after redirects) in a PostFetchFilter? It seems like only the original URL is part of the Resource.

Appreciate any ideas on how you would approach this.

Cheers,
Peter

Wrong Event Arguments

@mvdbos i think your "Simplify build and clean up use statements" commit introduced some evil changes. :)

use Symfony\Contracts\EventDispatcher\EventDispatcherInterface;

Symfony\Contracts\EventDispatcher\EventDispatcherInterface does not exist in Symfony 3. Why not use the default Symfony\Component\EventDispatcher\EventDispatcherInterface?

$this->getDispatcher()->dispatch($event, $eventName);

=> Wrong argument order (event name has to be the first argument)

$this->getDispatcher()->dispatch(

=> Wrong argument order (event name has to be the first argument)

$this->getDispatcher()->dispatch($event, $eventName);

=> Wrong argument order (event name has to be the first argument)

"EventDispatcher::dispatch() must be an object, string given"

Hello @mvdbos,

I'm about to use php-spider as part of a Laravel 7 project and got this error when starting a crawl:

Argument 1 passed to Symfony\Component\EventDispatcher\EventDispatcher::dispatch() must be an object, string given, called in /var/www/dev.project.com/vendor/vdb/php-spider/src/VDB/Spider/QueueManager/InMemoryQueueManager.php on line 88

I started researching it more and I guessed that it comes from a breaking change introduced with the upgrade of symfony/event-dispatcher from v4.4.5 to v5.0.5.

I've checked a bit more and found out that my version of php-spider is actually reverted from v0.4.2 down to v0.2. It's because v0.2 didn't require event-dispatcher and was therefore matching my set of requirements.

I've looked closer at the error and found that switching the parameters on lines 87 & 88 in the InMemoryQueueManager class fixed it. I've prepared a PR and, while writing this issue, found the old issue about this: #61, haha. I could have solved it quicker :)

It would be great if you could let me know whether the PR works for you or needs further tweaks.

Just in case - Interesting versions / packages:

PHP 7.2.24

laravel/framework                     v7.2.2     
guzzle/guzzle                         v3.8.1       
symfony/event-dispatcher              v5.0.5            
symfony/event-dispatcher-contracts    v2.0.1     
vdb/php-spider                        v0.2              
vdb/uri                               v0.2              

Cheers,
Peter

Is php-spider an abandoned repo?

Hi @mvdbos,

Because it's a little bit complicated to get in contact with you, I'll try it with an issue. :)

php-spider is still the greatest web spider out there, but the latest release uses Guzzle 3, which reached EOL months ago. Is there any chance of a new release any time soon?

It would be great to hear from you, and hopefully you get a chance to reply.

thanks

Cannot install in Laravel 4 project

Trying to install with composer in a Laravel 4 environment returns:

$composer require vdb/php-spider:dev-master

./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

Problem 1
- Conclusion: remove symfony/translation 2.3.x-dev
- Installation request for vdb/php-spider dev-master -> satisfiable by vdb/php-spider[dev-master].
- Installation request for symfony/finder == 2.3.9999999.9999999-dev -> satisfiable by symfony/finder[2.3.x-dev], symfony/symfony[2.3.x-dev].
- Conclusion: don't install symfony/translation 2.3.x-dev
- Can only install one of: symfony/symfony[2.4.x-dev, 2.3.x-dev].
- don't install symfony/symfony 2.3.x-dev|remove symfony/class-loader 2.4.x-dev
- don't install symfony/class-loader 2.4.x-dev|don't install symfony/symfony 2.3.x-dev
- Installation request for symfony/translation == 2.3.9999999.9999999-dev -> satisfiable by symfony/symfony[2.3.x-dev], symfony/translation[2.3.x-dev].
- Installation request for symfony/class-loader == 2.4.9999999.9999999-dev -> satisfiable by symfony/class-loader[2.4.x-dev], symfony/symfony[2.4.x-dev].

Installation failed, reverting ./composer.json to its original content.

Naturally, I cannot remove components.
