PHP Goose - Article Extractor

Intro

PHP Goose is a port of Goose, which was originally developed in Java and later converted to Scala by GravityLabs. Portions have also been ported from the Python port, python-goose. Its mission is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.

The goal is to get the purest possible extraction from the beginning of the article, to serve Flipboard/Pulse-type applications that need to show the first snippet of a web article along with an image.

Goose will try to extract the following information:

  • Main text of an article
  • Main image of article
  • Any YouTube/Vimeo movies embedded in article
  • Meta Description
  • Meta tags
  • Publish Date

The PHP version was rewritten by:

  • Andrew Scott

Requirements

  • PHP 7.1 or later
  • PSR-4 compatible autoloader

The older 0.x versions with PHP 5.5+ support are still available under releases.

Install

This library is designed to be installed via Composer.

Add the dependency to your project's composer.json.

{
  "require": {
    "scotteh/php-goose": "^1.0"
  }
}

Download composer.phar.

curl -sS https://getcomposer.org/installer | php

Install the library.

php composer.phar install

Autoloading

This library requires an autoloader. If you aren't already using one, you can include Composer's autoloader.

require('vendor/autoload.php');

Usage

use \Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$videos = $article->getVideos();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();
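
For the snippet use case mentioned in the intro, the extracted text usually has to be shortened before display. The following is a minimal sketch of one way to do that, not part of PHP Goose: `makeSnippet()` and its `$maxLength` parameter are hypothetical names introduced here, and the code assumes the `mbstring` extension is available.

```php
<?php
declare(strict_types=1);

/**
 * Trim text to a short preview, cutting at a word boundary.
 * Hypothetical helper for illustration; not part of PHP Goose.
 */
function makeSnippet(string $text, int $maxLength = 200): string
{
    // Collapse runs of whitespace (including newlines) into single spaces.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');

    // Short enough already: return unchanged.
    if (mb_strlen($text) <= $maxLength) {
        return $text;
    }

    // Cut to the limit, then back up to the last space so no word is split.
    $cut = mb_substr($text, 0, $maxLength);
    $lastSpace = mb_strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace);
    }

    // Drop trailing punctuation before appending the ellipsis.
    return rtrim($cut, ' ,;:.') . '...';
}
```

With the variables from the usage example above, this could be called as `makeSnippet((string) $articleText)`; the cast is a precaution in case no article text was extracted.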

Configuration

All configuration options are optional. The default (fallback) values are shown below.

use \Goose\Client as GooseClient;

$goose = new GooseClient([
    // Language - Selects common word dictionary
    //   Supported languages (ISO 639-1):
    //     ar, cs, da, de, en, es, fi, fr, hu, id, it, ja,
    //     ko, nb, nl, no, pl, pt, ru, sv, vi, zh
    'language' => 'en',
    // Minimum image size (bytes)
    'image_min_bytes' => 4500,
    // Maximum image size (bytes)
    'image_max_bytes' => 5242880,
    // Minimum image width (pixels)
    'image_min_width' => 120,
    // Minimum image height (pixels)
    'image_min_height' => 120,
    // Fetch best image
    'image_fetch_best' => true,
    // Fetch all images
    'image_fetch_all' => false,
    // Guzzle configuration - All values are passed directly to Guzzle
    //   See: http://guzzle.readthedocs.io/en/stable/request-options.html
    'browser' => [
        'timeout' => 60,
        'connect_timeout' => 30
    ]
]);

Licensing

PHP Goose is licensed by Gravity.com under the Apache 2.0 license. See the LICENSE file for more details.

Contributors

anare, arraintxo, aurelie-vndl, cdubz, chanafdo, chubv, crscheid, dependabot-preview[bot], elmariachi111, fazers, jeroenseegers, lucascvs, mhugot, oliverhermanni, peter279k, psefranek, r-da, rbatukaev, samwilson, shaneiseminger, squallstar, sters, tfevens, tommm, treeleaf
