swader / diffbot-php-client Goto Github PK

View Code? Open in Web Editor NEW

53.0 9.0 20.0 361 KB

[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library

License: MIT License

PHP 100.00%

diffbot php crawling crawl scrape scraping scraper scraped-data machine-learning nlp

diffbot-php-client's Introduction

Deprecated / Being Phased Out - please prefer using API calls directly, they have been simplified for max usability!

Diffbot PHP API Wrapper

This package is a slightly overengineered Diffbot API wrapper. It uses PSR-7 and PHP-HTTP friendly client implementations to make API calls. To learn more about Diffbot see here and their homepage. Right now it only supports Analyze, Product, Image, Discussion, Crawl, Search, and Article APIs, but can also accommodate Custom APIs. Video and Bulk API support coming soon.

Full documentation available here.

Requirements

Minimum PHP 5.6 is required. PHP 7.0 is recommended.

This package uses some non-stable packages, so you must set your project's minimum stability to something like beta or dev in composer.json:

"minimum-stability": "dev",
"prefer-stable": true

If you don't the installation procedure below will fail.

Install

The library depends on an implementation of the client-implementation virtual package. If you don't know what this means, simply requiring the Guzzle6 adapter will do:

composer require php-http/guzzle6-adapter

This adapter satisfies the requirement for client-implementation (see above) and will make it possible to install the client with:

composer require swader/diffbot-php-client

Usage - simple

Simplest possible use case:

$diffbot = new Diffbot('my_token');
$url = 'http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/';
$articleApi = $diffbot->createArticleAPI($url);

echo $articleApi->call()->author; // prints out "Bruno Skvorc"

That's it, this is all you need to get started.

Usage - advanced

Full API reference manual in progress, but the instructions below should do for now - the library was designed with brutal UX simplicity in mind.

Setup

To begin, always create a Diffbot instance. A Diffbot instance will spawn API instances. To get your token, sign up at http://diffbot.com

$diffbot = new Diffbot('my_token');

Pick API

Then, pick an API.

Currently available automatic APIs are:

product (crawls products and their reviews, if available)
article (crawls news posts, blogs, etc, with comments if available)
image (fetches information about images - useful for 500px, Flickr etc). The Image API can return several images - depending on how many are on the page being crawled.
discussion (fetches discussion / review / comment threads - can be embedded in the Product or Article return data, too, if those contain any comments or discussions)
analyze (combines all the above in that it automatically determines the right API for the URL and applies it)

Video is coming soon. See below for instructions on Crawlbot, Search and Bulk API.

There is also a Custom API like this one - unless otherwise configured, they return instances of the Wildcard entity)

All APIs can also be tested on http://diffbot.com

The API you picked can be spawned through the main Diffbot instance:

$api = $diffbot->createArticleAPI($url);

API configuration

All APIs have some optional fields you can pass with parameters. For example, to extract the 'meta' values of the page alongside the normal data, call setMeta:

$api->setMeta(true);

Some APIs have other flags that don't qualify as fields. For example, the Article API can be told to ignore Discussions (aka to not extract comments). This can speed up the fetching, because by default, it does look for them. The configuration methods all have the same format, though, so to accomplish this, just use setDiscussion:

$api->setDiscussion(false);

All config methods are chainable:

$api->setMeta(true)->setDiscussion(false);

For an overview of all the config fields and the values each API returns, see here.

Calling

All API instances have the call method which returns a collection of results. The collection is iterable:

$url = 'http://smittenkitchen.com/blog/2012/01/buckwheat-baby-with-salted-caramel-syrup/';
$imageApi = $diffbot->createImageAPI($url);
/** @var Image $imageEntity */
foreach ($imageApi->call() as $imageEntity) {
    echo 'Image dimensions: ' . $imageEntity->getHeight() . ' x ' . $imageEntity->getWidth() . '<br>';
}

/* Output:
Image dimensions: 333 x 500
Image dimensions: 333 x 500
Image dimensions: 334 x 500
Image dimensions: 333 x 500
Image dimensions: 333 x 500
Image dimensions: 333 x 500
Image dimensions: 333 x 500
Image dimensions: 333 x 500
Image dimensions: 333 x 500
*/

In cases where only one entity is returned, like Article or Product, iterating works all the same, it just iterates through the one single element. The return data is always a collection!

However, for brevity, you can access properties directly on the collection, too.

$articleApi = $diffbot->createArticleAPI('http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/');
echo $articleApi->call()->author;
// or $articleApi->call()->getAuthor();

In this case, the collection applies the property call to the first element which, coincidentally, is also the only element. If you use this approach on the image collection above, the same thing happens - but the call is only applied to the first image entity in the collection.

Just the URL, please

If you just want the final generated URL (for example, to paste into Postman Client or to test in the browser and get pure JSON), use buildUrl:

$url = $articleApi->buildUrl();

You can continue regular API usage afterwards, which makes this very useful for logging, etc.

Pure response

You can extract the pure, full Guzzle Response object from the returned data and then manipulate it as desired (maybe parsing it as JSON and processing it further on your own):

$articleApi = $diffbot->createArticleAPI('http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/');
$guzzleResponse = $articleApi->call()->getResponse();

Individual entities do not have access to the response - to fetch it, always fetch from their parent collection (the object that the call() method returns).

Discussion and Post

The Discussion API returns some data about the discussion and contains another collection of Posts. A Post entity corresponds to a single review / comment / forum post, and is very similar in structure to the Article entity.

You can iterate through the posts as usual:

$url = 'http://community.sitepoint.com/t/php7-resource-recap/174325/';
$discussion = $diffbot->createDiscussionAPI($url)->call();
/** @var Post $post */
foreach($discussion->getPosts() as $post) {
    echo 'Author: '.$post->getAuthor().'<br>';
}

/*
Output:

Author: swader
Author: TaylorRen
Author: s_molinari
Author: s_molinari
Author: swader
Author: s_molinari
Author: swader
Author: s_molinari
Author: swader
Author: s_molinari
Author: TomB
Author: s_molinari
Author: TomB
Author: Wolf_22
Author: swader
Author: swader
Author: s_molinari
*/

An Article or Product entity can contain a Discussion entity. Access it via getDiscussion on an Article or Product entity and use as usual (see above).

Custom API

Used just like all others. There are only two differences:

When creating a Custom API call, you need to pass in the API name
It always returns Wildcard entities which are basically just value objects containing the returned data. They have __call and __get magic methods defined so their properties remain just as accessible as the other Entities', but without autocomplete.

The following is a usage example of my own custom API for author profiles at SitePoint:

$diffbot = new Diffbot('my_token');
$customApi = $diffbot->createCustomAPI('http://sitepoint.com/author/bskvorc', 'authorFolioNew');

$return = $customApi->call();

foreach ($return as $wildcard) {
    dump($wildcard->getAuthor()); // Bruno Skvorc
    dump($wildcard->author); // Bruno Skvorc
}

Of course, you can easily extend the basic Custom API class and make your own, as well as add your own Entities that perfectly correspond to the returned data. This will all be covered in a tutorial in the near future.

Crawlbot and Bulk API

Basic Crawlbot support has been added to the library. To find out more about Crawlbot and what, how and why it does what it does, see here. I also recommend reading the Crawlbot API docs and the Crawlbot support topics just so you can dive right in without being too confused by the code below.

In a nutshell, the Crawlbot crawls a set of seed URLs for links (even if a subdomain is passed to it as seed URL, it still looks through the entire main domain and all other subdomains it can find) and then processes all the pages it can find using the API you define (or opting for Analyze API by default).

List of all crawl / bulk jobs

A joint list of all your crawl / bulk jobs can be fetched via:

$diffbot = new Diffbot('my_token');
$jobs = $diffbot->crawl()->call();

This returns a collection of all crawl and bulk jobs. Each type is represented by its own class: JobCrawl and JobBulk. It's important to note that Jobs only contain the information about the job - not the data. To get the data of a job, use the downloadUrl method to get the URL to the dataset:

$url = $job->downloadUrl("json");

Crawl jobs: Creating a Crawl Job

See inline comments for step by step explanation

// Create new diffbot as usual
$diffbot = new Diffbot('my_token');

// The crawlbot needs to be told which API to use to process crawled pages. This is optional - if omitted, it will be told to use the Analyze API with mode set to auto.
// The "crawl" url is a flag to tell APIs to prepare for consumption with Crawlbot, letting them know they won't be used directly.
$url = 'crawl';
$articleApi = $diffbot->createArticleAPI($url)->setDiscussion(false);

// Make a new crawl job. Optionally, pass in API instance
$crawl = $diffbot->crawl('sitepoint_01', $articleApi);

// Set seeds - seeds are URLs to crawl. By default, passing a subdomain into the crawl will also crawl other subdomains on main domain, including www.
$crawl->setSeeds(['http://sitepoint.com']);

// Call as usual - an EntityIterator collection of results is returned. In the case of a job's creation, only one job entity will always be returned.
$job = $crawl->call();

// See JobCrawl class to find out which getters are available 
dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result

Crawl jobs: Inspecting an existing Crawl Job

To get data about a job (this will be the data it was configured with - its flags - and not the results!), use the exact same approach as if creating a new one, only without the API and seeds:

$diffbot = new Diffbot('my_token');

$crawl = $diffbot->crawl('sitepoint_01');

$job = $crawl->call();

dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result

Crawl jobs: Modifying an existing Crawl Job

While there is no way to alter a crawl job's configuration post creation, you can still do some operations on it.

Provided you fetched a $crawl instance as in the above section on inspecting, you can do the following:

// Force start of a new crawl round manually
$crawl->roundStart();

// Pause or unpause (0) a job
$crawl->pause();
$crawl->pause(0)

// Restart removes all crawled data but keeps the job (and settings)
$crawl->restart();

// Delete a job and all related data
$crawl->delete();

Note that it is not necessary to issue a call() after these methods.

If you would like to extract the generated API call URL for these instant-call actions, pass in the parameter false, like so:

$crawl->delete(false);

You can then save the URL for your convenience and call call when ready to execute (if at all).

$url = $crawl->buildUrl();
$url->call();

Search API

The Search API is used to quickly search across data obtained through Bulk or Crawl API.

$diffbot = new Diffbot('my_token');
$search = $diffbot->search('author:"Miles Johnson" AND type:article')->call();


foreach ($search as $article) {
    echo $article->getTitle();
}

Use Search APIs setCol method to target a specific collection only - otherwise, all your token's collections are searched.

Testing

Just run PHPUnit in the root folder of the cloned project. Some calls do require an internet connection (see tests/Factory/EntityTest).

phpunit

Adding Entity tests

I'll pay $10 for every new set of 5 Entity tests, submissions verified set per set - offer valid until I feel like there's enough use cases covered. (a.k.a. don't submit 1500 of them at once, I can't pay that in one go).

If you would like to contribute by adding Entity tests, I suggest following this procedure:

Pick an API you would like to contribute a test for. E.g., Product API.

In a scratchpad like index.php, build the URL:

$diffbot = new Diffbot('my_token');
$url = $diffbot
    ->createProductAPI('http://someurl.com')
    ->setMeta(true)
    ->...(insert other config methods here as desired)...
    ->buildUrl();
echo $url;

Grab the URL and paste it into a REST client like Postman or into your browser. You'll get Diffbot's response back. Keep it open for reference.
Download this response into a JSON file. Preferably into tests/Mocks/Products/[date]/somefilename.json, like the other tests are. This is easily accomplished by executing curl "[url] > somefilename.json" in the Terminal/Command Line.
Go into the appropriate tests folder. In this case, tests/Entity and open ProductTest.php. Notice how the file is added into the batch of files to be tested against. Every provider has it referenced, along with the value the method being tested should produce. Slowly go through every test method and add your file. Use the values in the JSON you got in step 3 to get the values.
Run phpunit tests/Entity/ProductTest.php to test just this file (much faster than entire suite). If OK, send PR :)

If you'd like to create your own Test classes, too, that's fine, no need to extend the ones that are included with the project. Apply the whole process just rather than extending the existing ProductTest class make a new one.

Adding other tests

Other tests don't have specific instructions, contribute as you see fit. Just try to minimize actual remote calls - we're not testing the API itself (a.k.a. Diffbot), we're testing this library. If the library parses values accurately from an inaccurate API response because, for example, Diffbot is currently bugged, that's fine - the library works!

Contributing

Please see CONTRIBUTING for details and TODO for ideas.

Credits

Bruno Skvorc

License

The MIT License (MIT). Please see License File for more information.

diffbot-php-client's People

Contributors

Stargazers

Watchers

diffbot-php-client's Issues

PHP Parse error: syntax error, unexpected '.' in vendor/swader/diffbot-php-client/src/Entity/EntityIterator.php on line 87

$rv = $zeroth->$name(...$args);

Bad dates in articles cause exceptions when using \Carbon

When using the getDate() method from an Article, if the field has an unparsable date and Carbon is being used, it throws an exception:

PHP Fatal error:  Uncaught Exception: DateTime::__construct(): Failed to parse time string (Robin Murray / / 28 · 03 · 2017
1508454066 0) at position 0 (R): The timezone could not be found in the database in lib/vendor/nesbot/carbon/src/Carbon/Carbon.php:291
1508454066 Stack trace:
1508454066 #0 lib/vendor/nesbot/carbon/src/Carbon/Carbon.php(291): DateTime->__construct('Robin Murray / ...', Object(DateTimeZone))
1508454066 #1 lib/vendor/swader/diffbot-php-client/src/Entity/Article.php(68): Carbon\Carbon->__construct('Robin Murray / ...', 'GMT')
1508454066 #2 /path/to/my/code.php(95): Swader\Diffbot\Entity\Article->getDate()

I have patched our code to catch the exception but feel this should be handled more gracefully within Article.php

An example URL where this came from is http://www.clashmusic.com/features/sorcerer-the-perpetual-transition-of-jordan-rakei where you can see the byline is ROBIN MURRAY FEATURES 20 · 10 · 2017 (perhaps an opportunity to improve the article processing algo too?)

Thank you!

Consider alternative solution to "crawl" setting in API instance for Crawlbot

Currently, to prepare an API for Crawlbot consumption and to satisfy the required name param condition in the constructor, one has to pass in the string 'crawl'. This feels clumsy. Make URL optional and implicitly considered crawlbot-friendly?

Question: What's the 'correct' way to access all results of a CrawlJob?

From reading the docs, it looks like loading the json via the downloadUrl() method on the Crawl job is the only way to do it, however as that'll not give any getters/setters/objects (because it's processing the raw JSON data) it smells...wrong.

Is there a better way of doing this?

Related, as the crawl job updates as new pages are discovered, is there a way of just downloading the new dataset - data since the last query (so there's no reprocessing of data) - or is that left as an exercise for the reader?

Thanks!

PHP Notices generated on Article\getDate() if no date has been identified

When using getDate() on an Article entity, if no date has been identified, the following error is generated:
PHP Notice: Undefined index: date in /path/to/lib/vendor/swader/diffbot-php-client/src/Entity/Article.php on line 65

This is:

    public function getDate()
    {
        return (class_exists('\Carbon\Carbon')) ?
            new \Carbon\Carbon($this->data['date'], 'GMT') :
            $this->data['date'];
    }

Refactoring to this resolves it:

    public function getDate()
    {
        $date = isset($this->data['date']) ? $this->data['date'] : null;
        return (class_exists('\Carbon\Carbon')) ?
            new \Carbon\Carbon($date, 'GMT') : $date;
    }

Can you apply if it's in the style you want, else do similar?

Thanks,

Guzzle 6 via Httplug

Guzzle 6 is out, but knowing its typical release cycle, 7 will be with us before long.

Plan: replace concrete implementation with Httplug (http://php-http.readthedocs.org/en/latest/)

Diffbot client not closing TCP connections until client exits

When making a call to create or update a job, the client is not closing TCP connections after the request to the api endpoint.

Running the following code demonstrates the issue:

    $diffbot = new Diffbot("xxxxxxxxx");
    $job = $diffbot->crawl("jonathan_test");
    $job->setSeeds(["http://www.example.com"])->setMaxToCrawl(100)
        ->setMaxToProcess(100)->setMaxRounds(1)
        ->setOnlyProcessIfNew(1)->setMaxHops(3);

    $api
        = $diffbot->createArticleAPI('crawl')->setMeta(true)->setDiscussion(false)
        ->setQuerystring(true)
    ;

    $job->setApi($api);
    $x = $job->call();
sleep(100);

Now the socket remains open until the process quits:

$ netstat -an|  grep 443                                                                                                       
tcp        0      0 192.168.22.214:50844    35.192.184.37:443       TIME_WAIT

While not too bad for a single socket, if you create a lot of diffbot objects using new Diffbot(), you can quickly run out of open files on the system as the sockets aren't closed even when the object falls out of scope.

Add Streaming to Crawlbot

A crawlbot's job constantly enhances the resultset with new data. Ergo, when fetching the resultset, it's always different unless the job is done or paused.

Since the library depends on Guzzle, perhaps consider using streams?

To keep in mind: stopping the stream if job is paused. Would require re-checking job status? Open for discussion.

Custom HTTP Headers

Diffbot allows Custom HTTP Headers like X-Forward-Cookie.

How can I set these using this system?

My current code is just:
$api = $diffbot->createProductAPI($_POST["siteURL"]);

Issues on Returning a Product where `text` Does Not Exist

As an example scanning this url: https://www.bhphotovideo.com/c/product/1365551-REG/synology_ds418play_diskstation_4_bay.html with the Product API does not return the text field. Resulting in errors in attempting to process the response.

{
  "request": {
    "options": [
      "_=1513203180811",
      "callback=jQuery111107118896645815873_1513203180810",
      "format=jsonp"
    ],
    "pageUrl": "https://www.bhphotovideo.com/c/product/1365551-REG/synology_ds418play_diskstation_4_bay.html",
    "api": "product",
    "version": 3
  },
  "objects": [
    {
      "images": [
        {
          "xpath": "/html[1]/body[1]/div[1]/div[2]/div[2]/div[1]/div[3]/div[1]/div[1]/a[1]/img[1]",
          "naturalHeight": 500,
          "width": 392,
          "diffbotUri": "image|3|483071054",
          "title": "Synology DS418play Diskstation 4 Bay Nas",
          "url": "https://static.bhphoto.com/images/images500x500/synology_ds418play_diskstation_4_bay_1507130659000_1365551.jpg",
          "naturalWidth": 500,
          "primary": true,
          "height": 392
        }
      ],
      "offerPrice": "$429.99",
      "productId": "DS418PLAY",
      "diffbotUri": "product|3|-1255355555",
      "mpn": "DS418PLAY",
      "multipleProducts": true,
      "availability": true,
      "type": "product",
      "title": "Synology DS418play Diskstation 4 Bay Nas",
      "offerPriceDetails": {
        "symbol": "$",
        "amount": 429.99,
        "text": "$429.99"
      },
      "breadcrumb": [
        {
          "link": "https://www.bhphotovideo.com",
          "name": "Home"
        },
        {
          "link": "https://www.bhphotovideo.com/c/browse/Computers/ci/9581/N/4294542559",
          "name": "Computers"
        },
        {
          "link": "https://www.bhphotovideo.com/c/browse/Drives-Storage/ci/13216/N/4294542392",
          "name": "Drives & Storage"
        },
        {
          "link": "https://www.bhphotovideo.com/c/browse/Network-Attached-Storage-NAS-/ci/26927/N/3832759815",
          "name": "Network Attached Storage (NAS)"
        },
        {
          "link": "https://www.bhphotovideo.com/c/buy/NAS-Enclosures/ci/26903/N/3832759809",
          "name": "NAS Enclosures"
        },
        {
          "link": "https://www.bhphotovideo.com/c/product/1365551-REG/synology_ds418play_diskstation_4_bay.html",
          "name": "Synology DiskStation DS418play"
        },
        {
          "link": "https://www.bhphotovideo.com/c/product/1365551-REG/tzftqtxwebbdfaedfe.html",
          "name": "wyfeweadbyrxbefatwsewxtwwsdybwsyef"
        }
      ],
      "humanLanguage": "en",
      "pageUrl": "https://www.bhphotovideo.com/c/product/1365551-REG/synology_ds418play_diskstation_4_bay.html",
      "category": "Computers",
      "sku": "SYDS418PLAY",
      "brand": "Synology"
    }
  ],
  "url": "https://www.bhphotovideo.com/c/product/1365551-REG/synology_ds418play_diskstation_4_bay.html"
}

[Critical] Setting custom entity factories doesn't work

Due to a lapse in logic, the setCustomEntity method on the Diffbot class is always called without an argument if the setHttpClient isn't called beforehand.

Change

        if (!$this->getHttpClient()) {
            $this->setHttpClient();
            $this->setEntityFactory();
        }

        if (!$this->getHttpClient()) {
            $this->setHttpClient();
        }
        if (!$this->getEntityFactory()) {
            $this->setEntityFactory();
        }

everywhere for a fix.

The Bulk API now enforces minimum 50 links

This should be taken into account in the client.

Callback setter not there

It doesn't seem like we have a callback setter in the various API classes. Is it needed in this case?

Todo: investigate.

Loose comparison of string to bool

see https://scrutinizer-ci.com/g/Swader/diffbot-php-client/issues/master/files/src/Api/Crawl.php?selectedSeverities%5B0%5D=10&orderField=path&order=asc#inspectioncomment-834565583

Pecl jsonc bug causes notices

When using PHP with the "drop in replacement" jsonc from pecl, conversion of numbers over 64bits to strings fails because of this and this.

Details on situation here.

Looking for solutions on how to approach this, because Search API often returns unsigned 64bit ints (hashes), meaning they go over the BIGINT size for signed and thus cause an overflow.

Edit: additional clarification on this bug here.

Bizarre issue with Diffbot using guzzlehttp

I've created a Crawl API job which has a few hundred results. I'm trying to get the results using type:article (so $bot->search("type:article") with setNum to "all") and it's throwing an exception:

PHP Warning:  curl_multi_exec(): Unable to create temporary file, Check permissions in temporary files directory. in /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php on line 106

Warning: curl_multi_exec(): Unable to create temporary file, Check permissions in temporary files directory. in /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php on line 106
PHP Fatal error:  Uncaught GuzzleHttp\Exception\RequestException: cURL error 23: Failed writing body (2749 != 16384) (see http://curl.haxx.se/libcurl/c/libcurl-errors.html) in /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:187
Stack trace:
#0 /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(150): GuzzleHttp\Handler\CurlFactory::createRejection(Object(GuzzleHttp\Handler\EasyHandle), Array)
#1 /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(103): GuzzleHttp\Handler\CurlFactory::finishError(Object(GuzzleHttp\Handler\CurlMultiHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))
#2 /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php(179): GuzzleHttp\Handler\CurlFactory::finish(Object(GuzzleHttp\Handler\CurlMultiHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))
#3 /home/tullettj/websites/c in /home/tullettj/websites/core-code/lib/vendor/php-http/guzzle6-adapter/src/Promise.php on line 127

Fatal error: Uncaught GuzzleHttp\Exception\RequestException: cURL error 23: Failed writing body (2749 != 16384) (see http://curl.haxx.se/libcurl/c/libcurl-errors.html) in /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:187
Stack trace:
#0 /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(150): GuzzleHttp\Handler\CurlFactory::createRejection(Object(GuzzleHttp\Handler\EasyHandle), Array)
#1 /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(103): GuzzleHttp\Handler\CurlFactory::finishError(Object(GuzzleHttp\Handler\CurlMultiHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))
#2 /home/tullettj/websites/core-code/lib/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php(179): GuzzleHttp\Handler\CurlFactory::finish(Object(GuzzleHttp\Handler\CurlMultiHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))
#3 /home/tullettj/websites/c in /home/tullettj/websites/core-code/lib/vendor/php-http/guzzle6-adapter/src/Promise.php on line 127

So I've played with the setNum values and 60 seems to be the magic number. If I query for 60 or less, it's fine, however if I go for 61 or above, it throws this exception.

Have you seen this before, @Swader? It's a bit of a head scratcher (I have ~2Gb free in the temporary files directory)

Thanks!

Questions about the right way to get subarray like images/tags

Hi, I`m using the php library. Normally I use the field array as:

 $fields = array("author", "title", "siteName", "date", "icon", "images","html","tags"); // fields to be returned
 $json_diffr = $d->article($myurl, $fields);

But the returned json didn`t contain the scores or url as the website shown.
I tried to use like this

$tag=array("scores","labels","url");
$fields=array("author","title",$tag);

but the json still not returning the scores for tags.

EntityIterator bugged on Unset

When unsetting some elements in the EntityIterator, the next iteration over it will be bugged in that it will only go through some of the elements. Example code for reproduction (dump is from Symfony VarDumper, use var_dump if you don't have it installed):

$a = range(1, 30);
$r = new Response(200);
$ei = new EntityIterator($a, $r);

dump('First run');
foreach ($ei as $i) {
    dump($i);
}

dump('Second run');
foreach ($ei as $i) {
    dump($i);
}

dump($ei);

foreach ($ei as $index => $element) {
    if ($element % 3 === 0) {
        dump("Unsetting " . $element . ' at index ' . $index);
        unset($ei[$index]);
    }
}

dump($ei);

dump('Third run');
foreach ($ei as $i) {
    dump($i);
}

The first and second loop display all 30 elements. The third loop unsets multiples of 3. The fourth loop then only loops as far as number 23, but no further.

setTimeout not being used

Looks like $api-> setTimeout($timeout) is never sent to the http client. Any info in this?

Implement Webhook for Crawlbot / Bulk API completion ping

A Crawlbot / Bulk API job has the ability to ping an endpoint with details once complete. (see NotifyWebhook here)

Think about implementing this somehow.

Not a priority - hooks can be custom implemented easily and there's also Zapier integration.

Image API field changes

Image API fields seem to have been changed without notice.

The fields height and width are now no longer returned by default, and are optional fields that require setters. They have been renamed to displayHeight and displayWidth.

Todo: implement setters and getters, remove old ones (or keep old ones for BC? Not many users, so maybe remove?)

Implement Video API support

Currently, video API can be consumed via Custom API. A true implementation should happen sooner or later. The API is currently still in beta, so this is not a priority.

Tags sometimes missing

Sometimes the article API doesn't return the tags field (bug in Diffbot). The getTags method should thus accommodate for this by returning null.

Number of results to return from search API

Hi.
I'm trying to use the search api to get all results from a crawl robot. In diffbots documentation I see that it's possible to use the num option (num=all) but how do I do that using the diffbot-php-client? I tried:
$search = $diffbot->search('type:product AND num:all')->setCol($crawlName)->call();
and
$search = $diffbot->search('type:product')->setCol($crawlName)->num('all')->call();
but no luck.

Date enhancements

The getDate and getEstimatedDate methods could use an upgrade.

The plan is to implement a DateObject with a __toString method to keep backwards compatibility, but to also add in some helper methods (perhaps via Carbon) that turn the Diffbot-returned string into something more useful in the current context. For example:

echo $article->getDate(); // "Wed, 18 Dec 2013 00:00:00 GMT" - as usual!
echo $article->getDate()->year; // 2013

Carbon's setToStringFormat will need to be used to set the format to Diffbot's default, to maintain BC, but other than that, Carbon can be plugged into the entity directly.

Wondering if we should make Carbon a recommendation only, and then crash the SDK if it's not installed an a Carbon method is used, or just force people to use Carbon...?

Broken EntityIterator on zero results

This loop gets stuck forever when there's 0 results.

Related: Issue #14

EntityIterator - no support for missing keys

When manipulating returned data to, for example, unset some of the entities (due to Search API returning duplicates, for example), the EntityIterator uses next() to advance the cursor by one, but that cursor may not exist due to prior manipulation of the EntityIterator's data. So iterating directly on this changed EI will cause problems.

Solutions:

add skipping of non existent cursors or
extract data with getData and iterate over that

Generate API Docs

The classes are rather well documented, so generating an API documentation would be beneficial to understanding the various methods available.

Consider:

phpDocumentor
apiGen
phpDox
Sami
ReadTheDocs?
Readme.io?

Weigh pros/cons, decide, implement.

Problem installing guzzle6-adapter

Dear diffbot community:

I'm trying to install diffbot using composer,
1- Downloaded the latest version (2.0.1)
2- Rechecked the lines
"prefer-stable": true,
"minimum-stability": "dev"
in the composer.json

3- Executed the following
composer require php-http/guzzle6-adapter

Every thing seems to be ok until I get the following at the end of the installation
Package guzzle/guzzle is abandoned, you should avoid using it. Use guzzlehttp/guzzle instead.
Warning: Version check failed: Could not determine Puli version. "puli -V" returned:

(Tried minimum-stability as dev, beta, stable >>> not resolved)

So that I can't install diffbot-php-client

Fatal error: Puli Factory class does not exist

I'm able to install this client by composer; however, when I attempt to call the CrawlBot API I get an error. Here is my code:

require '/vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');
$crawl = $diffbot->crawl('SomeSite');
$job = $crawl->call();

var_dump($job->getDownloadUrl("json"));

Here is the error that is returned:

PHP Fatal error:  Uncaught RuntimeException: Puli Factory class does not exist in /app/vendor/php-http/discovery/src/ClassDiscovery.php:38
Stack trace:
#0 /app/vendor/php-http/discovery/src/ClassDiscovery.php(79): Http\Discovery\ClassDiscovery::getPuliFactory()
#1 /app/vendor/php-http/discovery/src/ClassDiscovery.php(99): Http\Discovery\ClassDiscovery::getPuliDiscovery()
#2 /app/vendor/php-http/discovery/src/HttpClientDiscovery.php(21): Http\Discovery\ClassDiscovery::findOneByType('Http\\Client\\Htt...')
#3 /app/vendor/swader/diffbot-php-client/src/Diffbot.php(102): Http\Discovery\HttpClientDiscovery::find()
#4 /app/vendor/swader/diffbot-php-client/src/Diffbot.php(268): Swader\Diffbot\Diffbot->setHttpClient()
#5 /private/var/www/sov/shell/get_diffbot_data.php(17): Swader\Diffbot\Diffbot->crawl('SomeSiteName')
#6 {main}
  thrown in /app/vendor/php-http/discovery/src/ClassDiscovery.php on line 38

My composer.json contains:

{
    "minimum-stability": "dev",
    "prefer-stable": true,
    "require": {
        "php-http/guzzle6-adapter": "~1.0",
        "php-http/message": "^1.2",
        "php-http/discovery": "~0.8.0",
        "puli/cli": "~1.0",
        "php-http/plugins": "~1.0",
        "puli/composer-plugin":"^1.0",
        "swader/diffbot-php-client": "^2.1@dev"
    }
}

I'm wondering if there is an issue upstream somewhere. Ideas?

Unable to set custom queryString parameters

Hello,

I am looking for a way to set the Diffbot Querystring parameter to include additional fields. I see a number of setX methods in the API (setMeta, setQuerystring) but they only take booleans for a fixed field to be added.

As an example, I'm wanting to include fields like diffbotUri,pageUrl,title (and quite a few more - so I'm not sure that creating setX methods for each of them is appropriate).

Have I missed something within the API?

For reference, http://support.diffbot.com/crawlbot/all-about-the-querystring-parameter/

Thanks very much.

No more searching across all collections

Searching across all collections has been deprecated - the col parameter is now obligatory with the Search API, and should be reflected as such in the code.

[Article API] Author URL

In the Article Entity, looks like we forgot about the authorUrl getter.

Norender

Implement the ability to set norender in order to speed up processing time.

http://support.diffbot.com/automatic-apis/improving-api-performance-and-response-times/

Implement Bulk Processing Support

Bulk API support is missing right now. Should be implemented as prerequisite for Search API.

metaTags insight

Look into metaTags.

They seem to be a default field with Article API, as long as there are entries such as <meta property="article:tag" content="Artificial Intelligence" />. These then get translated into an array of objects like:

{
          "name": "Artificial Intelligence"
        }

If this is default now, add a getter into all entities (into the trait?).

Tags as objects?

Diffbot's derived tags are interesting enough to potentially warrant their own class and accessors for common use.

Todo: implement the TagCollection and Tag Entity, assume there will eventually be something like a Tags API. Keep tags array accessible as backwards compatibility measure.

Implement Crawlbot

Crawlbot support is missing right now, should be added as prerequisite for Search API

Custom Pagination

The pagination side of Diffbot is buggy at best. It will often fail to recognize articles that are multi-page and will not merge them. What's more, it tops out at 20 pages, so anything longer will get ignored.

The feature suggestion for the client is as follows:

Add a new method to the Article API: paginateBy. This method takes 2 arguments: $identifier and $maxPages. The former is a way to identify the nextPage link element on the page. This element would auto-processed to find out all the next pages programmatically. The latter is the max number of pages to concat.

This method would, in order:

Make an Article API request to the original URL.
Find the nextPage element and process it to find out the pattern to which to attach incrementing numbers, thus generating next pages.
Make an additional Article API request to each page, up to $maxPages number of pages
Concatenate the HTML content of all pages.
Send the merged HTML content as a POST request to the Article API, for a final analysis of the entire post.

Alternatively, in order to save Article API requests and use up only one, the client could just Guzzle the raw HTML of all the articles, extract the content HTML, merge that and send it as POST. This, however, is less reliable, as Diffbot is much better at figuring out what is content on the page, and what isn't (headers, ads, comments, etc.).

Maybe make it a switch of some kind, and additional setter?

composer require swader/diffbot-php-client fails even after composer require php-http/guzzle6-adapter

Running composer require php-http/guzzle6-adapter works. Adds an entry "php-http/guzzle6-adapter": "^0.1.0", into composer.json

Running composer require swader/diffbot-php-client fails with message :

Using version ^1.2 for swader/diffbot-php-client
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Installation request for swader/diffbot-php-client ^1.2 -> satisfiable by swader/diffbot-php-client[1.2].
    - swader/diffbot-php-client 1.2 requires php-http/client-implementation ^1.0 -> no matching package found.

Potential causes:
 - A typo in the package name
 - The package is not available in a stable-enough version according to your minimum-stability setting
   see <https://groups.google.com/d/topic/composer-dev/_g3ASeIFlrc/discussion> for more details.

Read <https://getcomposer.org/doc/articles/troubleshooting.md> for further common problems.

Installation failed, reverting ./composer.json to its original content.

PHP Fatal error

I have installed diffbot using composer and it created the autoload files as normal.

My code is simply:

require_once "../../vendor/autoload.php";
$diffbot = new Diffbot('KEY');

BUT I get the error:
PHP Fatal error: Class 'Diffbot' not found

I am running PHP Version 5.6.12

Accessing getters on EntityIterator directly sometimes buggy

When accessing methods on the EntityIterator directly, one should always get the same results as if accessed directly on the first subelement of the Iterator. However, this was sometimes buggy - e.g. when properties were dynamically generated via certain methods. In such cases, methods should be tried first.

Offending code is __call and __get magic metods inside EntityIterator.

POSTing content

Look into adding POSTing functionality into the mix.

New Article API fields

New Article API fields seem to have cropped up:

siteName
publisherRegion
publisherCountry
estimatedDate

Todo: add into Article Entity as getters.

Custom API managing

Custom API can now be made programmatically, and includes some nifty features, like evaling some custom JS at render time (useful for clicking ad links in interstitials etc).

https://www.diffbot.com/dev/docs/custom/managing/

Should be added.

$ composer require swader/diffbot-php-client
Using version ^1.2 for swader/diffbot-php-client
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - swader/diffbot-php-client 1.2.1 requires php-http/utils dev-master -> satisfiable by php-http/utils[dev-master].
    - php-http/utils dev-master requires php-http/httplug 1.0.0-beta -> satisfiable by php-http/httplug[v1.0.0-beta].
    - Conclusion: don't install php-http/httplug v1.0.0-beta
    - Can only install one of: php-http/httplug[v1.0.0-alpha3, v1.0.0].
    - Can only install one of: php-http/httplug[v1.0.0-alpha3, v1.0.0].
    - Can only install one of: php-http/httplug[v1.0.0-alpha3, v1.0.0].
    - php-http/utils v0.1.0 requires php-http/httplug 1.0.0-alpha3 -> satisfiable by php-http/httplug[v1.0.0-alpha3].
    - swader/diffbot-php-client 1.2 requires php-http/utils ^0.1.0@dev -> satisfiable by php-http/utils[v0.1.0].
    - Installation request for swader/diffbot-php-client ^1.2 -> satisfiable by swader/diffbot-php-client[1.2, 1.2.1].
    - Installation request for php-http/httplug == 1.0.0.0 -> satisfiable by php-http/httplug[v1.0.0].

What's the recommended course of action here? My 'minimum-requirements' are 'dev'.