
A meta parser for extracting micro information out of web documents, currently supporting Microformats 1+2, HTML Microdata, RDFa Lite 1.1, JSON-LD and Link Types, written in PHP

Home Page: http://micrometa.jkphl.is

License: MIT License

PHP 85.82% HTML 14.18%

micrometa's Introduction

jkphl/micrometa


A meta parser for extracting micro information out of web documents, currently supporting Microformats 1+2, HTML Microdata, RDFa Lite 1.1, JSON-LD and Link Types

Documentation

Please find the project documentation in the doc directory. We recommend reading it via Read the Docs.

Installation

This library requires PHP 5.6 or later. I recommend using the latest available version of PHP as a matter of principle. It has no userland dependencies. It's installable and autoloadable via Composer as jkphl/micrometa.

composer require jkphl/micrometa

Alternatively, download a release or clone this repository, then require or include its autoload.php file.

Dependencies

Composer dependency graph

Quality

To run the unit tests at the command line, issue composer install and then phpunit at the package root. This requires Composer to be available as composer, and PHPUnit to be available as phpunit.

This library attempts to comply with PSR-1, PSR-2, and PSR-4. If you notice compliance oversights, please send a patch via pull request.

Contributing

Found a bug or have a feature request? Please have a look at the known issues first and open a new issue if necessary. Please see contributing and conduct for details.

Security

If you discover any security related issues, please email [email protected] instead of using the issue tracker.

Credits

License

Copyright © 2017 Joschi Kuphal / [email protected]. Licensed under the terms of the MIT license.

micrometa's People

Contributors

b4rtaz, blankse, chtipepere, jkphl, jspaetzel, lyrixx, madeitbelgium, rbairwell, rvanlaak, sarke, tomgillett


micrometa's Issues

JSON LD with `sameas` throws `InvalidArgumentException`

I'm honestly not sure if this is an issue with the underlying JSON LD Parser or not. But an exception is thrown for metadata that includes the sameAs property. Testing with some example code:

{
"@context": "http://schema.org",
"@type": "Website",
"name": "Example Website",
"url": "https://example.com",
"sameAs": [
  "https://facebook.com/Example",
  "https://twitter.com/Example"
]
}

Which appears to be valid:

https://www.dropbox.com/s/sfcs4ifc9lj8izr/Screenshot%202017-06-19%2010.34.02.png?dl=0

Parsing this snippet results in an InvalidArgumentException with the message Empty type list is not allowed. After digging a little bit this appears to be due to the existence of the sameAs property.

Unknown JSON-LD item

Hi

I'm looking to build a script that sees what data it can glean from any given url, microdata first, then content. Your parser seems perfect for that, but I've noticed a case where an error is thrown in certain situations.

I'm giving the following url:
http://www.currys.co.uk/gbuk/computing/laptops/laptops/lenovo-yoga-510-14-2-in-1-black-10146249-pdt.html

And I'm getting the following warning:

Warning: get_class() expects parameter 1 to be object, array given in C:\Users\danm\Documents\Websites\page-scraper-analyser\vendor\jkphl\micrometa\src\Jkphl\Micrometa\Parser\JsonLD.php on line 217
Unknown JSON-LD item: {"items":[{"id":"_:b0","types":["http:\/\/schema.org\/BreadcrumbList"]

Is it finding microdata but attempting to parse it as JSON-LD?

I've also noticed cases where no data is obtained though microdata is used on the page, is this indicative of poor configuration their end?

Thanks in advance

EDIT

Here's a list of urls with data that either isn't being returned, or is buggy:

I appreciate that some of these may be down to the implementation of the microdata on the pages themselves.

Parse ld+json wrong

Hi, I have JSON like this:

  {
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "Extra",
    "image": "https://www..jpg",
    "category": [
      "category",
    ],
    "description": "This Stun",
    "SKU": "11111",
    "Offers": {
      "@type": "Offer",
      "priceCurrency": "GBP",
      "price": "509.99",
      "itemcondition": "http://schema.org/NewCondition",
      "availability": "https://schema.org/PreOrder",
      "url": "https://www."
    },
  }

I try to get offers, but the immutableName is wrong and I get an OutOfBoundsException.

If I manually change Offers to offers, everything works.

I can't change the JSON because it is external, and I think something like str_replace("Offers", "offers", "json") is a bad idea because the JSON is embedded in an HTML page.

What else can I do?
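
One possible workaround, sketched under assumptions: decode the JSON-LD yourself, rename known mis-cased keys, and re-encode it before handing it to the parser. The function name normalizePropertyNames and the alias map are hypothetical and not part of micrometa; the map has to be maintained by hand for whatever spellings show up in practice.

```php
<?php
/**
 * Recursively rename keys of decoded JSON-LD data according to an alias map.
 * Hypothetical helper, not part of micrometa.
 *
 * @param mixed $data    Result of json_decode($json, true)
 * @param array $aliases Map of wrong spelling => expected spelling
 * @return mixed
 */
function normalizePropertyNames($data, array $aliases)
{
    if (!is_array($data)) {
        return $data; // scalars are returned unchanged
    }
    $result = [];
    foreach ($data as $key => $value) {
        // Only string keys that appear in the alias map are renamed
        if (is_string($key) && array_key_exists($key, $aliases)) {
            $key = $aliases[$key];
        }
        $result[$key] = normalizePropertyNames($value, $aliases);
    }
    return $result;
}

// Assumed alias map for the spellings seen in the report above
$aliases = ['Offers' => 'offers', 'SKU' => 'sku', 'itemcondition' => 'itemCondition'];

$decoded = json_decode('{"@type":"Product","SKU":"11111","Offers":{"@type":"Offer","price":"509.99"}}', true);
$fixed   = normalizePropertyNames($decoded, $aliases);
$json    = json_encode($fixed); // re-encoded JSON-LD with the corrected keys
```

An explicit map is used instead of lcfirst() because lcfirst("SKU") would yield "sKU", which is not the schema.org spelling either.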

Endless parsing

Hi!

The micrometa 2 JSON-LD parser never finishes if the @id and the http://schema.org/sameAs property hold the same URL.
I used the following script for testing:

<?php
require_once  'vendor/autoload.php';
use Jkphl\Micrometa\Ports\Parser;

$htmlSource_jsonld = '<!DOCTYPE html>
<html>
    <head>
        <script type="application/ld+json">
        {
          "@context": "http://schema.org",
          "@type": "organization",
          "@id": "http://www.website.de",
          "sameAs": "http://www.website.de"
        }
        </script>
    </head>
    <body></body>
</html>';

$time_start = microtime(true);

$objMicrometa = new Parser();
$result = $objMicrometa("http://www.website.de", $htmlSource_jsonld);

$time_end = microtime(true);
$time = $time_end - $time_start;
echo "\n $time Seconds Runtime";

JSON-LD parser does only find the first item

On 20.03.2017 at 13:59, Claas Kalwa wrote:

Hi Joschi,

I'm having trouble extracting multiple JSON-LD items with the
Micrometa V1 parser. It only detects the first item, regardless of
whether the items are grouped with @graph or appear separately in their
own script elements.

I have attached an example that, I think, should actually work.

Do you have any idea where the problem might be?

Example source:

<!DOCTYPE html>

<html>
    <head>
        <title>TODO supply a title</title>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">

	<script type="application/ld+json">
	{
	 "@context": "http://schema.org",
	 "@graph": [
	{
	  "name": "Google Inc.",
	  "@type": "LocalBusiness",
	  "address": {
	    "@type": "PostalAddress",
	    "addressCountry": "United States",
	    "streetAddress": "1600 Amphitheatre Parkway",
	    "addressLocality": "Mountain View",
	    "addressRegion": "CA",
	    "postOfficeBoxNumber": null,
	    "postalCode": "94043",
	    "telephone": "+1 650-253-0000",
	    "faxNumber": "+1 650-253-0001"
	  }
	},
	{
	  "name": "Google Ann Arbor",
	  "@type": "LocalBusiness",
	  "address": {
	    "@type": "PostalAddress",
	    "addressCountry": "United States",
	    "streetAddress": "201 S. Division St. Suite 500",
	    "addressLocality": "Ann Arbor",
	    "addressRegion": "MI",
	    "postOfficeBoxNumber": null,
	    "postalCode": "48104",
	    "telephone": "+1 734-332-6500",
	    "faxNumber": "+1 734-332-6501"
	  }
	}
	 ]
	}
	</script>

    </head>
    <body>
        <div>TODO write content</div>
        
    </body>
</html>

Recursive loop when ID is the same as URL

Example code:

<?php
include("vendor/autoload.php");
$jsonld = <<<EOF
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Website",
  "@id": "https://www.example.com/",
  "url": "https://www.example.com/"
}
</script>
EOF;

$parser = new \Jkphl\Micrometa\Ports\Parser();
$parser("https://www.example.com/", $jsonld);

If url is changed to omit the trailing slash, it works fine. I'm assuming this is related to issue #27. I'm attempting to put together a PR to fix it, but I'm ending up a little lost; any advice you can give as to the most likely cause would be appreciated.

Error in ItemList->getFirstItem()

Method getFirstItem() uses results of method getItems()

$items = $this->getItems(...$types);

and if all ok, then get element with index 0

return $items[0];

Here is the link to this line in code

But method getItems() uses function array_filter(), which preserves the array's keys. So, for example, if we have

ItemList $items [
    0 => type 'Breadcrumb'
    1 => type 'Breadcrumb'
    2 => type 'Product'
]

then after $items->getItems('Product') it will become

ItemList $items [
   2 => type 'Product'
]

and $items[0] will return null.

I suggest to use

public function getFirstItem(...$types)
    {
        $items = array_values($this->getItems(...$types));

or

public function getItems(...$types)
    {
        ...

        return array_values($this->items);
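
The key-preservation behaviour described above can be reproduced with plain arrays. This is a minimal standalone sketch using strings in place of item objects; it does not touch the library's ItemList class.

```php
<?php
// Stand-ins for three parsed items, identified by their type name
$items = ['Breadcrumb', 'Breadcrumb', 'Product'];

// array_filter() keeps the original keys, so the match stays at index 2
$filtered = array_filter($items, function ($type) {
    return $type === 'Product';
});

// array_values() reindexes from 0, which makes index 0 safe to read
$reindexed = array_values($filtered);

// reset() is an alternative that returns the first element whatever its key
$first = reset($filtered);
```

Here $filtered has the single key 2, so $filtered[0] is not set, while $reindexed[0] and $first both hold 'Product'.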

Class 'Guzzle\Http\Url' not found

Hi

When trying out the library, I got this message:

 [message] => Class 'Guzzle\Http\Url' not found
 [file] => vendor/jkphl/dom-factory/src/Domfactory/Infrastructure/Dom.php

The thing is, I have upgraded Guzzle to 6.2.3, so it should work. It looks like Guzzle no longer has that class.

Any idea?

Documentation errata

  • Add a hint to the live demo site
  • Promote the headlines > level 2 to show up on Read The Docs

Is there anyway to set proxy?

My server's IP is blocked by some sites.

curl_setopt($ch, CURLOPT_PROXY, 'http://proxy-server.tld:12345');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'mypass');

Is there any way to set these options?
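
Independent of whatever options the library itself exposes, one workaround is to fetch the page with PHP's cURL extension and pass the HTML to the parser as its second argument, as other reports on this page do. The sketch below assumes the cURL extension is loaded; proxyCurlOptions and fetchHtml are hypothetical helper names, and the proxy host and credentials are placeholders.

```php
<?php
/**
 * Build cURL options for fetching a page through a proxy.
 * Hypothetical helper; $proxy and $proxyAuth are supplied by the caller.
 */
function proxyCurlOptions($proxy, $proxyAuth = null)
{
    $options = [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_PROXY          => $proxy,
    ];
    if ($proxyAuth !== null) {
        $options[CURLOPT_PROXYUSERPWD] = $proxyAuth; // "user:password"
    }
    return $options;
}

/**
 * Fetch a URL with the given cURL options and return the response body.
 */
function fetchHtml($url, array $options)
{
    $handle = curl_init($url);
    curl_setopt_array($handle, $options);
    $html = curl_exec($handle);
    if ($html === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }
    return $html;
}

// Usage (not executed here, needs network access and the library):
// $html   = fetchHtml($url, proxyCurlOptions('http://proxy-server.tld:12345', 'user:pass'));
// $parser = new \Jkphl\Micrometa\Ports\Parser();
// $items  = $parser($url, $html);
```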

Option to allow HTML in values

There are some cases when allowing HTML in values is expected.

Example input:

<div itemprop="description">line1<br>line2<br>line3<p>line4</p></div>

Output:

line1line2line3line4

Expected output:

line1<br>line2<br>line3<p>line4</p>
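
The difference between the two outputs matches the difference between a DOM node's textContent and its serialised children. The sketch below uses PHP's DOM extension directly to show how the inner HTML could be recovered; itempropInnerHtml is a hypothetical helper and says nothing about how micrometa extracts values internally.

```php
<?php
/**
 * Return the inner HTML of the first element carrying the given itemprop,
 * or null if there is none. Hypothetical helper, not part of micrometa.
 * Assumes $itemprop contains no double quotes.
 */
function itempropInnerHtml($html, $itemprop)
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $node  = $xpath->query('//*[@itemprop="' . $itemprop . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child); // serialise each child, keeping its tags
    }
    return $inner;
}

$source = '<div itemprop="description">line1<br>line2<br>line3<p>line4</p></div>';
$inner  = itempropInnerHtml($source, 'description');
```

Reading $node->textContent instead would drop the br and p tags, which is the behaviour reported above.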

Example from documentation is not working

Hey guys, I tried to follow the documentation and failed to get a working example without having to take a look at the demo file.

So, here is how it appears in the documentation:

use Jkphl\Micrometa\Ports\Parser;
$micrometa = new Parser();
$items = $micrometa('http://example.com');

This will fail, since the Parser class is nowhere to be found.

Unable to get "name" from ListItem -> item

Hi, I can't get the name property from ListItem -> item

Please tell me what can I do.

$items = $micrometa($url, '<script type="application/ld+json">
{
 "@context": "http://schema.org",
 "@type": "BreadcrumbList",
 "itemListElement":
 [
  {
   "@type": "ListItem",
   "position": 1,
   "item":
   {
    "@id": "https://example.com/dresses",
    "name": "Dresses"
    }
  },
  {
   "@type": "ListItem",
  "position": 2,
  "item":
   {
     "@id": "https://example.com/dresses/real",
     "name": "Real Dresses"
   }
  }
 ]
}
</script>');
var_dump($items); exit;
  ["dom":protected]=>
  object(DOMDocument)#1998 (35) {
    ["doctype"]=>
    NULL
    ["implementation"]=>
    string(22) "(object value omitted)"
    ["documentElement"]=>
    string(22) "(object value omitted)"
    ["actualEncoding"]=>
    NULL
    ["encoding"]=>
    NULL
    ["xmlEncoding"]=>
    NULL
    ["standalone"]=>
    bool(true)
    ["xmlStandalone"]=>
    bool(true)
    ["version"]=>
    string(3) "1.0"
    ["xmlVersion"]=>
    string(3) "1.0"
    ["strictErrorChecking"]=>
    bool(true)
    ["documentURI"]=>
    string(9) "/var/www/"
    ["config"]=>
    NULL
    ["formatOutput"]=>
    bool(false)
    ["validateOnParse"]=>
    bool(false)
    ["resolveExternals"]=>
    bool(false)
    ["preserveWhiteSpace"]=>
    bool(true)
    ["recover"]=>
    bool(false)
    ["substituteEntities"]=>
    bool(false)
    ["nodeName"]=>
    string(9) "#document"
    ["nodeValue"]=>
    NULL
    ["nodeType"]=>
    int(9)
    ["parentNode"]=>
    NULL
    ["childNodes"]=>
    string(22) "(object value omitted)"
    ["firstChild"]=>
    string(22) "(object value omitted)"
    ["lastChild"]=>
    string(22) "(object value omitted)"
    ["previousSibling"]=>
    NULL
    ["nextSibling"]=>
    NULL
    ["attributes"]=>
    NULL
    ["ownerDocument"]=>
    NULL
    ["namespaceURI"]=>
    NULL
    ["prefix"]=>
    string(0) ""
    ["localName"]=>
    NULL
    ["baseURI"]=>
    string(9) "/var/www/"
    ["textContent"]=>
    string(315) " { "@context": "http://schema.org", "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "item": { "@id": "https://example.com/dresses", "name": "Dresses" } }, { "@type": "ListItem", "position": 2, "item": { "@id": "https://example.com/dresses/real", "name": "Real Dresses" } } ] } "
  }
  ["links":protected]=>
  NULL
  ["items":protected]=>
  array(1) {
    [0]=>
    object(Jkphl\Micrometa\Ports\Item\Item)#2429 (3) {
      ["item":protected]=>
      object(Jkphl\Micrometa\Application\Item\Item)#4104 (8) {
        ["format":protected]=>
        int(4)
        ["value":protected]=>
        NULL
        ["children":protected]=>
        array(0) {
        }
        ["propertyListFactory":protected]=>
        object(Jkphl\Micrometa\Application\Factory\PropertyListFactory)#3988 (0) {
        }
        ["type":protected]=>
        array(1) {
          [0]=>
          object(Jkphl\Micrometa\Domain\Item\Iri)#2514 (2) {
            ["immutableProfile":protected]=>
            string(18) "http://schema.org/"
            ["immutableName":protected]=>
            string(14) "BreadcrumbList"
          }
        }
        ["properties":protected]=>
        object(Jkphl\Micrometa\Application\Item\PropertyList)#4105 (6) {
          ["aliases":protected]=>
          array(1) {
            ["http://schema.org/itemListElement"]=>
            array(1) {
              [0]=>
              string(15) "itemListElement"
            }
          }
          ["aliasFactory":protected]=>
          object(Jkphl\Micrometa\Application\Factory\AliasFactory)#4103 (0) {
          }
          ["values":protected]=>
          array(1) {
            [0]=>
            array(2) {
              [0]=>
              object(Jkphl\Micrometa\Application\Item\Item)#3986 (8) {
                ["format":protected]=>
                int(4)
                ["value":protected]=>
                NULL
                ["children":protected]=>
                array(0) {
                }
                ["propertyListFactory":protected]=>
                object(Jkphl\Micrometa\Application\Factory\PropertyListFactory)#3988 (0) {
                }
                ["type":protected]=>
                array(1) {
                  [0]=>
                  object(Jkphl\Micrometa\Domain\Item\Iri)#4109 (2) {
                    ["immutableProfile":protected]=>
                    string(18) "http://schema.org/"
                    ["immutableName":protected]=>
                    string(8) "ListItem"
                  }
                }
                ["properties":protected]=>
                object(Jkphl\Micrometa\Application\Item\PropertyList)#4108 (6) {
                  ["aliases":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    array(1) {
                      [0]=>
                      string(4) "item"
                    }
                    ["http://schema.org/position"]=>
                    array(1) {
                      [0]=>
                      string(8) "position"
                    }
                  }
                  ["aliasFactory":protected]=>
                  object(Jkphl\Micrometa\Application\Factory\AliasFactory)#4094 (0) {
                  }
                  ["values":protected]=>
                  array(2) {
                    [0]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#3998 (2) {
                        ["value":protected]=>
                        string(27) "https://example.com/dresses"
                        ["language":protected]=>
                        NULL
                      }
                    }
                    [1]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#4006 (2) {
                        ["value":protected]=>
                        string(1) "1"
                        ["language":protected]=>
                        NULL
                      }
                    }
                  }
                  ["names":protected]=>
                  array(2) {
                    [0]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4093 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(4) "item"
                    }
                    [1]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4098 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(8) "position"
                    }
                  }
                  ["nameToCursor":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    int(0)
                    ["http://schema.org/position"]=>
                    int(1)
                  }
                  ["cursor":protected]=>
                  int(0)
                }
                ["itemId":protected]=>
                string(4) "_:b1"
                ["itemLanguage":protected]=>
                NULL
              }
              [1]=>
              object(Jkphl\Micrometa\Application\Item\Item)#4100 (8) {
                ["format":protected]=>
                int(4)
                ["value":protected]=>
                NULL
                ["children":protected]=>
                array(0) {
                }
                ["propertyListFactory":protected]=>
                object(Jkphl\Micrometa\Application\Factory\PropertyListFactory)#3988 (0) {
                }
                ["type":protected]=>
                array(1) {
                  [0]=>
                  object(Jkphl\Micrometa\Domain\Item\Iri)#4097 (2) {
                    ["immutableProfile":protected]=>
                    string(18) "http://schema.org/"
                    ["immutableName":protected]=>
                    string(8) "ListItem"
                  }
                }
                ["properties":protected]=>
                object(Jkphl\Micrometa\Application\Item\PropertyList)#4101 (6) {
                  ["aliases":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    array(1) {
                      [0]=>
                      string(4) "item"
                    }
                    ["http://schema.org/position"]=>
                    array(1) {
                      [0]=>
                      string(8) "position"
                    }
                  }
                  ["aliasFactory":protected]=>
                  object(Jkphl\Micrometa\Application\Factory\AliasFactory)#4095 (0) {
                  }
                  ["values":protected]=>
                  array(2) {
                    [0]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#4102 (2) {
                        ["value":protected]=>
                        string(32) "https://example.com/dresses/real"
                        ["language":protected]=>
                        NULL
                      }
                    }
                    [1]=>
                    array(1) {
                      [0]=>
                      object(Jkphl\Micrometa\Application\Value\StringValue)#4096 (2) {
                        ["value":protected]=>
                        string(1) "2"
                        ["language":protected]=>
                        NULL
                      }
                    }
                  }
                  ["names":protected]=>
                  array(2) {
                    [0]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4078 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(4) "item"
                    }
                    [1]=>
                    object(Jkphl\Micrometa\Domain\Item\Iri)#4076 (2) {
                      ["immutableProfile":protected]=>
                      string(18) "http://schema.org/"
                      ["immutableName":protected]=>
                      string(8) "position"
                    }
                  }
                  ["nameToCursor":protected]=>
                  array(2) {
                    ["http://schema.org/item"]=>
                    int(0)
                    ["http://schema.org/position"]=>
                    int(1)
                  }
                  ["cursor":protected]=>
                  int(0)
                }
                ["itemId":protected]=>
                string(4) "_:b2"
                ["itemLanguage":protected]=>
                NULL
              }
            }
          }
          ["names":protected]=>
          array(1) {
            [0]=>
            object(Jkphl\Micrometa\Domain\Item\Iri)#2526 (2) {
              ["immutableProfile":protected]=>
              string(18) "http://schema.org/"
              ["immutableName":protected]=>
              string(15) "itemListElement"
            }
          }
          ["nameToCursor":protected]=>
          array(1) {
            ["http://schema.org/itemListElement"]=>
            int(0)
          }
          ["cursor":protected]=>
          int(0)
        }
        ["itemId":protected]=>
        string(4) "_:b0"
        ["itemLanguage":protected]=>
        NULL
      }
      ["items":protected]=>
      array(0) {
      }
      ["pointer":protected]=>
      int(0)
    }
  }
  ["pointer":protected]=>
  int(0)
}

"Empty type list is not allowed" when rel=""

I'd like to preface this by saying that I'm not sure whether this actually warrants a change in the library. I just wanted to report this (and my workaround) in case others run into it, and in case there is anything that might be worthwhile adding to the library.

I've encountered a website that contains a lot of good structured data, but unfortunately also contains several links with an empty rel attribute in the footer, unrelated to the structured data type I was trying to retrieve.

Example:

<a href="/some/link" rel="">Some Link</a>

This causes Jkphl\Micrometa\Domain\Exceptions\InvalidArgumentException: Empty type list is not allowed to be thrown which prevents me from grabbing any of the actual data I was looking for which was already successfully retrieved.

From what I can tell, an empty rel attribute isn't strictly invalid, even if unusual.

My workaround is to retrieve the HTML manually, and remove all empty rel attributes before passing it to Micrometa.

Example:

$html = file_get_contents($url);
$html = preg_replace('/rel=["\']{2}/', '', $html);
$items = $parser($url, $html);
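
A slightly stricter variant of the workaround above, as a sketch: the pattern requires whitespace before rel, so attributes such as data-rel="" are left alone, and it requires the opening and closing quote to match. stripEmptyRel is a hypothetical helper name.

```php
<?php
/**
 * Remove rel attributes whose value is empty or whitespace only.
 * Hypothetical helper, not part of micrometa.
 */
function stripEmptyRel($html)
{
    // \s+ before "rel" avoids matching data-rel; \1 forces matching quotes
    return preg_replace('/\s+rel\s*=\s*(["\'])\s*\1/i', '', $html);
}

$cleaned = stripEmptyRel('<a href="/some/link" rel="">Some Link</a>');
```

Like the original workaround, this is a regular expression run over raw HTML, so it can also touch text that merely looks like an attribute, for example inside a script block.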

Not sure if it's worth adding anything to the library to ignore these empty rel attributes? I'd be happy to come up with a PR if so.

Warm regards.

Inaccessible properties with capital letters

Hi

Not sure if this applies to all properties, as I'm just testing out the library, but I can not access a startDate property of an event microdata type.

Reproducible as:

$url = 'http://www.residentadvisor.net/events.aspx?ai=174'; // just some random event site
$parser = new \Jkphl\Micrometa($url);
$item = $parser->item('http://data-vocabulary.org/Event');
print_r($item);
var_dump($item->startDate);

Outputs:

Jkphl\Micrometa\Parser\Microdata\Item Object
(
[_url:protected] => Jkphl\Utility\Url Object
    (
        [_url:protected] => http://www.residentadvisor.net/events.aspx?ai=174
        [_parts:protected] => Array
            (
                [scheme] => http
                [host] => www.residentadvisor.net
                [path] => /events.aspx
                [query] => Array
                    (
                        [ai] => 174
                    )

            )

    )

[types] => Array
    (
        [0] => http://data-vocabulary.org/Event
    )

[id] => 
[value] => 
[_properties:protected] => stdClass Object
    (
        [startDate] => Array
            (
                [0] => 2014-05-16T00:00
            )

        [summary] => Array
            (
                [0] => 360 Degrees: Osunlade at Bird
            )

        [url] => Array
            (
                [0] => http://www.residentadvisor.net/event.aspx?569206=
            )

    )

)
NULL

In Jkphl\Micrometa\Item::__get any uppercase characters are exchanged as follows:

`startDate` -> `start-date`

If I remove the responsible line, it works just fine.

Is the property stored correctly as startDate or is it intended to be stored as start-date?
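
To illustrate why the lookup returns NULL, here is a hypothetical re-creation of the conversion described above. camelToDashed is an assumption based on the startDate -> start-date example and is not the library's actual code.

```php
<?php
/**
 * Convert a camelCase name to a dashed lowercase name.
 * Hypothetical re-creation of the behaviour described in the report.
 */
function camelToDashed($name)
{
    return strtolower(preg_replace('/([a-z0-9])([A-Z])/', '$1-$2', $name));
}

// Properties as stored by the parser (taken from the print_r output above)
$properties = ['startDate' => ['2014-05-16T00:00']];

$requested = 'startDate';
$converted = camelToDashed($requested);

// The converted key does not exist, so the lookup yields null
$viaConverted = isset($properties[$converted]) ? $properties[$converted] : null;

// The unconverted key matches the stored property
$viaOriginal = isset($properties[$requested]) ? $properties[$requested] : null;
```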

Edit: as for the versions I'm using:

    "mf2/mf2": "dev-master",
    "euskadi31/microdata": "dev-master",
    "jkphl/micrometa": "dev-master",

Microdata Does not parse content="" on item property fields

Problem

According to http://schema.org/Review and https://search.google.com/structured-data/testing-tool/u/0/ , the following schema:

<div itemscope itemtype="http://schema.org/Offer">
    <!--price is 1000, a number, with locale-specific thousands separator
    and decimal mark, and the $ character is marked up with the
    machine-readable code "USD" -->
    <span itemprop="priceCurrency" content="USD">$</span><span
        itemprop="price" content="1000.00">1,000.00</span>
    <link itemprop="availability" href="http://schema.org/InStock" />In stock
</div>

is valid and "priceCurrency" should have the value of "USD" (and price should have "1000.00").
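
The expected behaviour can be sketched with PHP's DOM extension: prefer the content attribute, then href, and only then fall back to the element's text. propertyValue and extractItemprops are hypothetical helpers that ignore item nesting; they illustrate the precedence only and are not the library's Microdata parser.

```php
<?php
/**
 * Read a property value from an element: "content" first, then "href",
 * then the trimmed text content. Hypothetical helper.
 */
function propertyValue(DOMElement $element)
{
    foreach (['content', 'href'] as $attribute) {
        if ($element->hasAttribute($attribute)) {
            return $element->getAttribute($attribute);
        }
    }
    return trim($element->textContent);
}

/**
 * Collect itemprop name => value pairs from an HTML string (flat, no nesting).
 */
function extractItemprops($html)
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    $xpath  = new DOMXPath($dom);
    foreach ($xpath->query('//*[@itemprop]') as $element) {
        $values[$element->getAttribute('itemprop')] = propertyValue($element);
    }
    return $values;
}

$offer = '<div itemscope itemtype="http://schema.org/Offer">'
    . '<span itemprop="priceCurrency" content="USD">$</span>'
    . '<span itemprop="price" content="1000.00">1,000.00</span>'
    . '</div>';
$values = extractItemprops($offer);
```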

Schema.org type serialization invalid when http and https are mixed

Issue #24 lays out the argument that the http and https namespaces are semantically different. This is something we can work around, as the library's implementation is flexible enough for that.

A consequence is that calling item->toObject leads to an invalid type: an IRI such as http://schema.org/https://schema.org/WebPage can be seen as invalid.

What about adding several helpers for sanitizing / handling http & https on the profile equally?

Schema.org has been redirecting http to https for a while, so both can be seen as identical. Handling them equally would resolve the resulting inconsistency.

Uncaught exception 'Jkphl\Domfactory\Ports\RuntimeException' with message 'cURL error 60: SSL certificate problem: unable to get local issuer certificate

Hi,

I have this error:
Uncaught exception 'Jkphl\Domfactory\Ports\RuntimeException' with message 'cURL error 60: SSL certificate problem: unable to get local issuer certificate in lib\vendor\jkphl\dom-factory\src\Domfactory\Infrastructure\Dom.php:72

Any idea? Is it possible to set something like $client->setOption(CURLOPT_CAINFO, $certKeysPath.'/cacert.pem');?

Do not assume `ItemInterface` is a collection by extending traversable `ItemListInterface`

The latest version of PHPStan checks generics. In other words, for traversables it also statically analyses the types of their items.

ItemInterface extends ItemListInterface and is therefore traversable. This is incorrect, because it can also be a single item.

What would be needed, and what would break, if ItemInterface no longer extended ItemListInterface?

Assume `isPartOf` item as `CreativeWork` when instance of `ValueInterface`

Our implementation traverses the properties of an ItemInterface to look for isPartOf.

The value of isPartOf should always be an instance of CreativeWork, as stated in the specification: https://schema.org/isPartOf

As we do not live in an ideal world, for many URLs the value of isPartOf is actually just a string. More technically, isPartOf is returned as a ValueInterface instead of an ItemInterface.

How to fix?

As the expected value should always be an instance of CreativeWork, the parser should not return values of isPartOf as ValueInterface, but should convert them to an ItemInterface of type CreativeWork.

In support of this fix, the Google Structured Data test tool also treats these values as described above.

Another approach: what about adding an isPartOf method to ItemInterface with the return type ?ItemInterface? Or is that JSON-LD specific?

Test data

Bump monolog up to ^2

Could you update your monolog dependency up to ^2 ? This issue forces us to use another library :(

References broken between separate script tags

I ran into this problem with the New York Times. They have the NewsArticle and the NewsMediaOrganization in separate tags, and because of the way Jkphl\Micrometa\Infrastructure\Parser\JsonLD parses them separately with ML\JsonLD\JsonLD, the latter is not able to resolve the references across tags, (for example in this case the publication).

My solution was to rewrite Jkphl\Micrometa\Infrastructure\Parser\JsonLD::parseDom() to collect the schema in one root node, and send that off to ML\JsonLD\JsonLD once.
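
The collection step could look roughly like the sketch below, which gathers every application/ld+json block of a page into one list so that it can be processed in a single pass. collectJsonLd is a hypothetical helper built on PHP's DOM extension; it is not the rewritten parseDom() mentioned above, and whether a given JSON-LD processor then resolves the cross-references is not verified here.

```php
<?php
/**
 * Collect all JSON-LD documents of an HTML page into a single list.
 * Hypothetical helper; invalid JSON blocks are skipped silently.
 */
function collectJsonLd($html)
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $documents = [];
    $xpath     = new DOMXPath($dom);
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $decoded = json_decode($script->textContent, true);
        if (!is_array($decoded)) {
            continue; // not valid JSON, or a bare scalar
        }
        // A script block may itself hold a list of node objects
        $isList = $decoded !== [] && array_keys($decoded) === range(0, count($decoded) - 1);
        $documents = array_merge($documents, $isList ? $decoded : [$decoded]);
    }
    return $documents;
}

// Usage: $combined = json_encode(collectJsonLd($html));
```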

I don't have time to do a PR tonight, but I will eventually.

Installation with composer

Something is weird about the composer/autoloading integration. Instead of installing the dependencies into my vendor folder, it adds a new vendor folder and then is not compatible with composer's autoload.

However, if I remove the lines searching and including the autoload in Micrometa.php, it works fine with including the class.

(Environment: OS X 10.10.3, PHP 5.5.15, Composer version 1.0-dev)

Poorly coded Microdata attributes

Hi!

Is there any way to deal with poorly coded Microdata attributes, specifically ones that mess up the Vocabulary URI?

For example, fetching "https://www.belibe.it/zoccoli-professionali-dian-eva.html" while looking for Microdata I got a "BreadcrumbList" with some "ListItem" inside (yes, it's ok) but I also got a "https://schema.org/Product" (not a "Product") with "//schema.org/Offer" inside "offer" property. It's clear that they are not using just one Vocabulary URI but three: "http://schema.org", "https://schema.org" and "//schema.org".

How could I get all three as just one?
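
One way to treat the three spellings as one is to normalise type IRIs before comparing them. The sketch below is an assumption about how that could be done outside the library; normalizeSchemaOrgIri is a hypothetical helper, and the choice of http://schema.org/ as the canonical form is arbitrary.

```php
<?php
/**
 * Map http://, https:// and protocol-relative schema.org IRIs to one form.
 * Hypothetical helper, not part of micrometa. Other IRIs are returned unchanged.
 */
function normalizeSchemaOrgIri($iri)
{
    return preg_replace(
        '#^(?:https?:)?//(?:www\.)?schema\.org/#i',
        'http://schema.org/',
        $iri
    );
}

$types = [
    'http://schema.org/BreadcrumbList',
    'https://schema.org/Product',
    '//schema.org/Offer',
];
$normalized = array_map('normalizeSchemaOrgIri', $types);
```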

Thanks!

Author is parsed as _:b1

When parsing the following URL, the author attribute is parsed as _:b1 where it is expected to be empty:

https://eu.usatoday.com/story/news/politics/elections/2016/07/18/donald-trump-hispanics-hillary-clinton/87241102/

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "author": {},
  "dateModified": "0001-01-01T00:00:00Z",
  "datePublished": "0001-01-01T00:00:00Z",
  "image": {},
  "mainEntityOfPage": {},
  "publisher": {
            "@type": "Organization",
    "logo": {},
    "name": "USA TODAY"
  }
}
</script>


Getting type by schema.org hierarchy?

Is there a way to select items based on their schema.org inheritance?

One example from the docs:

$event = $item->getFirstItem(
    new Iri('http://microformats.org/profile/', 'h-event'),
    new Iri('http://schema.org/', 'Event')
);

Now say the schema.org Event in question is actually listed as a Festival or BusinessEvent (or whatever). They satisfy all the properties of the Event parent type, but they are not matched by micrometa.

Is there a way to do this without having to list all the possible sub-types?
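
Outside the library, one option is a small lookup of parent types that is walked upwards until the wanted ancestor is found. The sketch below assumes a hand-maintained excerpt of the schema.org hierarchy; the $parents map and isSubtypeOf are hypothetical, and the map would have to be filled from the schema.org vocabulary for real use.

```php
<?php
// Hypothetical, hand-maintained excerpt of the hierarchy: type => parent type
$parents = [
    'Festival'      => 'Event',
    'BusinessEvent' => 'Event',
    'Event'         => 'Thing',
];

/**
 * Check whether $type equals $ancestor or inherits from it.
 * The $seen guard stops the walk if the map accidentally contains a cycle.
 */
function isSubtypeOf($type, $ancestor, array $parents)
{
    $seen = [];
    while ($type !== null && !isset($seen[$type])) {
        if ($type === $ancestor) {
            return true;
        }
        $seen[$type] = true;
        $type = isset($parents[$type]) ? $parents[$type] : null;
    }
    return false;
}

// All known type names that count as an Event, including Event itself
$eventTypes = array_values(array_filter(array_keys($parents), function ($type) use ($parents) {
    return isSubtypeOf($type, 'Event', $parents);
}));
```

The resulting names could then be turned into Iri objects, as in the example from the docs, and passed to getFirstItem() together.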

ML\JsonLD\Exception\JsonLdException: Loading https://schema.org failed

I was trying to parse the data from this link

https://www.lazada.co.id/products/dispenser-mini-air-minum-anak-aneka-karakter-hello-kitty-helo-kity-hk-doraemon-doremon-i363397334-s382716523.html

But then it throws JsonLdException.

When I try it in Google's schema tool, it works fine.

This is my setup:

$options = [ 
    'client'  => [
        'timeout' => 30, 
        'curl' => [
            CURLOPT_PROXY => 'proxy-host',
            CURLOPT_PROXYUSERPWD => 'proxy-pass',
        ],  
    ],  
    'request' => [
        'verify'  => false,
        'headers' => [
            'Cache-Control' => 'no-cache, no-store, must-revalidate',
            'Pragma'        => 'no-cache',
            'User-Agent'    => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
            'Expires'       => 'Thu, 19 Nov 1981 08:52:00 GMT',
        ]   
    ],  
];

$micrometa = new Parser();
$micrometa($url, null, FORMAT::ALL, $options);

Is there anything I'm missing?
