paquettg / php-html-parser


An HTML DOM parser. It allows you to manipulate HTML. Find tags on an HTML page with selectors just like jQuery.

License: MIT License

PHP 30.71% HTML 69.29%

php-html-parser's Issues

'Illegal string offset 'value''

I'm using this package to read some HTML and replace all anchor tags with some other href (to be more specific, I have coded a function getRedirectUrl that follows all redirects and gets the final URL after all redirections, but this is not relevant for the issue I'm running into).

This is the code I've come up with:

$dom = new Dom;
$dom->load($content);

$links = $dom->find('a');

foreach($links as $link)
{
    $finalUrl = $this->getRedirectUrl($link->getAttribute('href'));
    $tag = $link->getTag();
    $tag->setAttribute('href', $finalUrl);
}

return trim(strip_tags($dom->root->innerHtml(), '<p><a><img>'));

$content contains very simple HTML, something like:

<div>
Actual content

<a href="http://bit.ly/1jIDoCy">This is a link</a>

<a href="http://bit.ly/1N0ZM5s">This is another one</a>

Parse this.
</div>

The final $dom->root->innerHtml() fails with:

local.ERROR: exception 'ErrorException' with message 'Illegal string offset 'value'' in /home/vagrant/Code/marketer/vendor/paquettg/php-html-parser/src/PHPHtmlParser/Dom/Tag.php:161

Am I doing something wrong? I'd appreciate any help; I've spent the past day trying to fix it with no success.

Can the find feature use regex to search for things?

As the title states, I am wondering whether the find feature can use regex to search for things and, if so, how.

Thanks,
Mooror

Resolving timed out after

Hi,

Do you have any idea how to fix the below error?

Fatal error: Uncaught exception 'PHPHtmlParser\Exceptions\CurlException' with message 'Error retrieving "http://google.com" (Resolving timed out after 5521 milliseconds)

Code is here:
$dom = new Dom;
$dom->loadFromUrl('http://google.com');
$html = $dom->outerHtml;

I got the same on dev-master and 1.6.4 versions.

Thanks.

Parser corrupts my input by clearing the contents of divs and replacing them with iv> for some reason

This is the input I feed to php-html-parser:

matchstats2301484-b0f319a362db1f67e36f4702a9970e53.txt

$content = file_get_contents('file.html');
$dom = new Dom();
$container = $dom->loadStr($content, []);
echo $container->innerHtml;

gives

Any clue as to what causes this? I've tried all options, but nothing changes.

Space before closing > parsed incorrectly

Hi,

If you have a space before the closing > in a tag, this library incorrectly assumes the following text is a set of attributes for the tag. For example:

<a href="http://www.example.com" >This is text</a>

The a node will then have the following attributes: href, This, is, text.

Returns only one result when there are several

Hi!
I don't know if I'm doing something wrong or if there is a bug in the code, but I would like to fetch items using a selector, and php-html-parser returns only one result when there are several.

    //$body is the content of this page : http://www.novaplanet.com/radionova/cetaitquoicetitre
    //I can't use  $dom->load('http://www.novaplanet.com/radionova/cetaitquoicetitre') here; I need to use a string.
    $dom = new Dom;
    $dom->loadStr($body, []);
    $track_nodes = $dom->find('.cestquoicetitre_results .resultat');

I get only the first result.
Can someone help here ?

Load HTML Incompletely

Here's my code.

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->load('http://example.com');
$html = $dom->outerHtml;
echo $html;

When I view the source, the HTML isn't loaded completely. Only about half of the HTML page is loaded.
For some reason I can't expose the real URL here.

[Proposal] Option to disable cleanup

There are some cases in which it might be useful to use the DOM parser to check the presence of certain <style> or <script> tags. In those cases it should be possible to pass an option to skip some of the cleaning steps that remove those tags. This option could be either as broad as "skipHtmlCleanup" or specific such as "keepScriptsInDom" or "keepStylesInDom".

error when fetch from url

My code:

        $dom = new Dom;
        $dom->load('http://google.com');

I got this:

    PHP Fatal error:  Call to undefined function PHPHtmlParser\curl_init() in /home/howtomakeaturn/projects/veblen/vendor/paquettg/php-html-parser/src/PHPHtmlParser/Curl.php on line 17

Am I missing anything?

Or should I install something?

thanks!
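
The error says that curl_init() is undefined, which means PHP's cURL extension is not loaded for the PHP binary running the script. A small check using only built-in functions (the variable name is illustrative) might look like this:

```php
<?php
// Reports whether the cURL extension is available to this PHP binary.
$curlAvailable = extension_loaded('curl') && function_exists('curl_init');

echo $curlAvailable
    ? "cURL extension is loaded\n"
    : "cURL extension is missing; install or enable it for this PHP binary\n";
```

How the extension is installed depends on the platform and PHP build, so the exact package name is left open here.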

[Proposal] Add removeAttribute and removeAttributes methods to Tag class

It would be very useful if it were possible to remove an attribute (or a list of attributes) from a tag. Ideally we can add two new methods to both the AbstractNode and the Tag classes. The first method, removeAttribute, just removes a single attribute. The second method, removeAttributes, removes all attributes from a tag except the ones passed to the method.

preserveLineBreaks option doesn't work

Setting the preserveLineBreaks option to TRUE doesn't seem to work.

Collection protected

This is the part of the code where the problem occurred:
https://gist.github.com/rachid804/1c0c23fc7b2398660f4e

I can access div#listing-image-frame, but when I try to get any of its child tags I get this:
PHPHtmlParser\Dom\Collection Object ( [collection:protected] => Array ( ) )

TextNode class has an encoding bug

public function __construct($text)
{
    // remove double spaces
    $text = preg_replace('/\s+/', ' ', $text);

The line

$text = preg_replace('/\s+/', ' ', $text);

has an encoding bug. With that line deleted, everything works fine.

test case:

    $dom = new \PHPHtmlParser\Dom;
    $dom->load(
    ' <div class="content">'.
    '            一哥们开车巨慢，早上上班十几公里路能开四十多分钟，每天起很早为了不迟到，经常见一骑三轮车的保洁老大爷晃晃悠悠超他的车对他说:小伙子又早起练车呀，快点开吧，交警一会就上班了。。。'.
    '        </div>'
    );
    $content = $dom->find('div.content');
    var_dump($content[0]->innerHtml);exit;

Maybe there is a better solution.
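
One possible explanation, offered as an assumption rather than a confirmed diagnosis: without the u modifier, PCRE works on single bytes, so on some locale and PCRE combinations \s may match a byte that belongs to a multi-byte UTF-8 character. A minimal sketch of a UTF-8-aware variant, using only PHP's built-in preg_replace (the helper name collapseWhitespace is hypothetical and not part of the library):

```php
<?php
// Hypothetical helper, not part of php-html-parser.
// The "u" modifier makes PCRE treat pattern and subject as UTF-8,
// so \s can only match whole characters, never a stray byte.
function collapseWhitespace(string $text): string
{
    $result = preg_replace('/\s+/u', ' ', $text);

    // preg_replace returns null on failure (for example, invalid UTF-8 input);
    // fall back to the untouched text in that case.
    return $result ?? $text;
}

$input  = "  一哥们开车巨慢，\n  早上上班十几公里路  ";
$output = collapseWhitespace($input);

var_dump($output === " 一哥们开车巨慢， 早上上班十几公里路 "); // bool(true)
```

Each run of whitespace becomes a single space while the Chinese text is left byte-for-byte intact.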

Incorrect behaviour on parsing html

I was trying to parse the HTML content of cnn.com news pages, and when I got the body tag, using either find() or getElementByTag(), half the content was gone. I wrote the parsed content to a file and realized that some tags, such as <article>, end up outside the <body> or <html> tag, something like this:

<html>
  <head>...</head>
  <body>...</body>
  <article>...</article>
  <div>...</div>
</html>
<div>...</div>

php code:

<?php

$dom = new PHPHtmlParser\Dom();
$url = 'http://edition.cnn.com/2015/11/19/tennis/world-tour-finals-federer-nishikori/index.html';
$dom->load($url);
file_put_contents('test.html', (string) $dom);

[Request] Please add in itinance's getChildren() feature

Hey there. First off, I want to thank you so much for continuing this amazing project. It is a godsend for me and my projects. Keep it up. Second, I would like to request that you add itinance's getChildren() feature found here: Link . I think it would be a great feature to add.

P.s. I can create a pull request if needed.
Thanks,
Mooror

select element that has specific text

Hi!
How can I select an element that has a specific text?

And a suggestion:
In CSS or jQuery, when I want to select an element with two classes I use the syntax .class1.class2
but in your script it doesn't work. I think you chose this syntax for it instead: .class1+.class2
Please fix this. Thank you.

Incorrect encoding conversion when using loadFromUrl

When you attempt to load an html page from a URL using loadFromUrl the encoding is incorrect.

I have not found this problem when attempting to open the same html page but as a file on the local server.

[Request] Replace child node

Currently we are able to remove a specific child and add a new child at the tail of the parent, but we can't replace a specific tag node.

sunra/php-simple-html-dom-parser can do this.

$child->outertext = '<new />';

Here is my thought:

$child = $parent->find('child')[0];
$newChild = new Tag('new');
$parent->replaceChild($child->id(), $newChild);

Thanks!

[Proposal] Fix error in PHPHtmlParser\Dom\TextNode class

We need to add two more functions, innerHtml and outerHtml, because the parent class calls them but this class doesn't define them.

Can not get all content in body tag

$this->dom = new Dom;
$this->dom->loadFromUrl($url);

$this->dom->find('body')->innerHtml;

But it only gets part of the content in the body tag. I don't understand the reason. Could you help me? Thank you.

This is my URL: http://casiovietnam.net/dong-ho-dien-tu-casio-f91wg9sdf-dong-co-dien

How do you delete nodes?

Hello!

I'm trying to do some easy html manipulation and I can't seem to figure it out.

According to the docs for sunra/php-simple-html-dom-parser, which I figure should work here too (correct me if I'm wrong), you should be able to do this.

// Remove a element, set it's outertext as an empty string 
$e->outertext = '';

But in my testcase, that doesn't seem to work.
This is the relevant part of my code

$dom = new Dom;
$dom->loadStr( $pageMarkup, [] );
$menu = $dom->find( '.inPageMenu' );
$menu->outertext = '';
$html = $dom->outerHtml;

After that, the item with the class is still present in $html.

Am I going at this backwards or am I missing something?
Thanks in advance!

Replace child broken

Fatal error: Method PHPHtmlParser\Dom::__toString() must not throw an exception, caught Error: Cannot use object of type PHPHtmlParser\Dom\HtmlNode as array

Here is some example code to reproduce

$dom = new Dom;
$dom->load($content);
$images = $dom->find('img');

$newimages = [];

foreach ($images as $image) {
    $tag = new Tag('amp-img');

    $src = $image->getAttribute('src') ?: $image->getAttribute('data-src');
    $tag->setAttribute('src', $src);

    $html = new HtmlNode($tag);

    $image->getParent()->replaceChild($image->id(), $html);
}

return (string)$dom;

How to check if it's unable to find to prevent the error page

Hi
I'm using this simple code to fetch a field from an external URL:

$dom = new Dom;
$dom->loadFromUrl($dom_address);
$time = $dom->getElementsByClass('exampleclass')->getAttribute('data-datetime');

The problem is that when it can't find an element with that class, for example because the website is not working properly (it goes offline or displays a 404 page), I get a PHP error (it's fine when it CAN find it). How can I check whether the element was found, so that I can simply set $time = 'N/A'; and prevent the page error?

Thanks

Script cleaner can break down html

If I encounter something like this:

    <p>.....</p>
    <script>
        some code ....
        document.write("<script src='some script'><\/script>")
        some code ....
    </script>
    <p>....</p>

the cleaner removes much of the HTML body.

It can be fixed by changing the code from:

    $str = preg_replace("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is", '', $str);

to:

    $str = preg_replace("'<\s*script[^>]*[^/]>(.*?)<[^\/]*/\s*script\s*>'is", '', $str);
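
To make the difference visible, the sketch below replays the two patterns quoted in this report on a small sample. It uses only PHP's built-in preg_replace and is not the library's actual cleaner; the sample markup, the `~` delimiter, and the helper scriptTagsBalanced are illustrative assumptions.

```php
<?php
// Replays the two patterns quoted in the issue on a small sample.
// Only PHP's built-in preg_replace is used; this is not the library's cleaner.
$html = <<<'HTML'
<p>before</p>
<script>
var a = 1;
document.write("<script src='x.js'><\/script>")
var b = 2;
</script>
<p>after</p>
HTML;

$original  = '~<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>~is';
$suggested = '~<\s*script[^>]*[^/]>(.*?)<[^\/]*/\s*script\s*>~is';

$withOriginal  = preg_replace($original, '', $html);
$withSuggested = preg_replace($suggested, '', $html);

// Illustrative helper: true when the number of <script ...> openers
// equals the number of </script> closers.
function scriptTagsBalanced(string $html): bool
{
    return preg_match_all('~<script\b~i', $html) === preg_match_all('~</script\s*>~i', $html);
}

var_dump(scriptTagsBalanced($withOriginal));  // bool(false): an opening <script> is left dangling
var_dump(scriptTagsBalanced($withSuggested)); // bool(true)
```

With the original pattern the match starts at the inner, escaped script tag and ends at the real closing tag, leaving the outer opening tag unclosed. The suggested pattern accepts the escaped `<\/script>` as a closer, although its unbounded `[^\/]*` may match more loosely than intended on other inputs.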

wrapping HtmlNode with another one

Is there any option to wrap existing nodes with new ones or some way to change node attributes?

install without composer?

is it possible to install this without composer? I am testing this on Wamp Server on Windows and planning to use in shared hosting so is there any possible installation without using composer?

bug in loadFromUrl

Bug in your code or documentation. https://github.com/paquettg/php-html-parser/blob/master/src/PHPHtmlParser/Dom.php#L151
$dom->loadFromUrl('http://google.com', new Connector);
The documentation says the optional second parameter is an implementation of CurlInterface, but in your code the second parameter is options.

Problem with a specific website

Hey folks, I'm trying to use the parser on one website (code below), but it seems the parser cannot get the whole site's data. I tried both file_get_contents and $dom->load, and it still doesn't work at all.

require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$races = ['Karu', 'Elmo'];

$dom = new Dom;
$dom->loadFromUrl('www.nttgameonline.com/knight/en/ranking/clan/0/Karu');
$html = $dom->innerHtml;

$contents = $dom->find('#server');
echo count($contents); // It should print 4

foreach ($contents as $content)
{     
    echo ($content->plaintext);   
}

setAttribute does not work without an array as value

On home page of this repo, I see an example of setting an attribute:

$tag->setAttribute('class', 'foo');

But the above code does not work and throws the error below:

Illegal string offset 'value'

But If I use an array for second parameter then it works fine:

$tag->setAttribute('class', array('value'=>'foo', 'doubleQuote'=>true));

Not getting

I use Linux Mint. Can I get a video of it? It would be very helpful to understand by watching a video.

Add a method to TextNode that can reset the node's text.

Just as the title says. I think it would be useful!

index-related selector ?

Hi,
It seems that index-related selectors do not work?

I need to have a selector to get the second matched item, like

$node = $html->find('a:eq(1)')

I can't use

$nodes = $html->find('a');
$node = $nodes[1];

in my script, I need to get it using a selector.

Any way to achieve this ?

Thanks.

<code> tag is not recognized

This is the HTML:

<strong>hello</strong>
<code class="language-php">$foo = "bar";</code>

The parser only recognizes <strong> according to the output:

there are 1 nodes

strong

$dom = new Dom;
$dom->load('<strong>hello</strong><code class="language-php">$foo = "bar";</code>');
$nodes = $dom->find('*');
$total = count($nodes);
echo "there are {$total} nodes";
/** @var Dom\AbstractNode $node*/
foreach ($nodes as $node) {
    $tag = $node->getTag();
    echo "<br>- {$tag->name()}";
}

Any idea why <code> is ignored? I have tested a lot of tags and only this one is not recognized. Thank you in advance.

Multiple class selector does not work

Using 1.7.0, selectors of the style "a.b.c" do not correctly find elements with multiple classes.

index.php:

<?php
require_once("vendor/autoload.php");
$dom = new \PHPHtmlParser\Dom();
$dom->loadFromFile("fake.html");
$A = $dom->find("a.b");
var_dump(count($A));
$B = $dom->find("a.b.c");
var_dump(count($B));

fake.html:

<a class="b">alpha</a>
<a class="b c">bravo</a>

output:

$ php -f index.php
int(2)
int(0)
$

I wouldn't necessarily consider this a bug, except the documentation says that "any CSS selector" can be used, which is not the case here. I also tested with 1.6.4 and 1.6.9, with the same results.

Thanks!
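
Until compound class selectors are supported, one workaround is to match a single class and filter the results on the raw class attribute. The sketch below shows only the filtering step with plain PHP; the helper name hasAllClasses is hypothetical and not part of the library.

```php
<?php
// Hypothetical helper: true when $classAttr (the raw value of a class
// attribute) contains every class name listed in $required.
function hasAllClasses(?string $classAttr, array $required): bool
{
    // Split on any run of whitespace; drop empty pieces.
    $present = preg_split('/\s+/', trim((string) $classAttr), -1, PREG_SPLIT_NO_EMPTY);

    return array_diff($required, $present) === [];
}

var_dump(hasAllClasses('b', ['b', 'c']));   // bool(false)
var_dump(hasAllClasses('b c', ['b', 'c'])); // bool(true)
```

With the library, this could be applied to each node returned by $dom->find("a.b") by passing in $node->getAttribute('class'); getAttribute appears in other reports on this page, but its return value for a missing attribute is an assumption here, which is why the helper also accepts null.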

[Request] Add licensing information

I'm currently working on a project and want to include php-html-parser as a dependency. I found the license in composer.json, but this should really be easier to find. Can this be added to the repo as a LICENSE-file?

HTML attributes with double quotes in them break parsing

Given this HTML:

<a title="This is a "test" of double quotes" href="http://www.example.com">Hello</a>

When passed into Dom::load(), the parser ends up correctly finding the element, but misparses the attributes and body text. The attributes (from var_dump($element->getAttributes())) appear like so:

array(1) { ["title"]=> string(10) "This is a " }

and the body appears like so (from var_dump($element->text())):

string(58) "est" of double quotes" href="http://www.example.com">Hello"

I realize that putting double quotes inside an attribute is nonconformant HTML, but ideally PHPHtmlParser should be tolerant of such things and parse the element anyway, much in the way web browsers do. While it may be impossible to accurately determine what the intended title attribute's correct value is, it should be possible to ensure that the element text does not include content from before the > marker.

Add ability to wrap existing nodes in a dom

[Request] Wildcard symbol?

I was trying to get all the elements,

here's what I tried:

$dom->load('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
$a   = $dom->find('*');

var_dump($a);
exit;

It returns 1(int), it seems like php-html-parser doesn't support the wildcard symbol?

Attribute without value creates extra '>' on tag

If I load the following,

<div class="content">
    <div class="grid-container" ui-view>
        <!-- the main content appears here -->
    </div>
</div>

then when I render the HTML I get (note the extra > in the inner div),

<div class="content"> 
    <div class="grid-container" ui-view>> </div> 
</div>

However, if I move the ui-view attribute before the class attribute or add a value to it, then it is rendered correctly.

Losing Line Breaks when getting ->innerHtml

Looks like html is losing line breaks when getting $dom->find( '.content', 0 )->innerHtml, I'm still digging in to see why.

Cannot add attribute

There is no way to add a new attribute to a tag. For example, I wanted to add ng-app='appName' to an HTML tag

Using the setAttribute method results in,

[ErrorException]
  Illegal string offset 'value'

Ideally, the setAttribute method should create a new attribute if it doesn't exist yet.

Unable to find element using child selector '>'

Firstly, let me say thanks for maintaining such a great library; it has been exceptionally useful in my project. But on to the issue.

Using the following code:

$dom->find('div > ul');

Results in an empty set, despite the html being valid. It seems the find() function does not support child selectors. I added in a unit test to SelectorTest.php to confirm the results.

The test code:

    public function testFindClassWithChildSelector() {

        $root   = new HtmlNode(new Tag('root'));
        $parent = new HtmlNode(new Tag('div'));
        $child1 = new HtmlNode(new Tag('ul'));  
        $root->addChild($parent);
        $parent->addChild($child1);

        $selector = new Selector('div > ul');
        $this->assertEquals(1, count($selector->find($root)));
    }

The results:

I'm going to see if I can't add the functionality myself. However, given your familiarity with the library you may be able to make a quick change to fix this.

<tr> as HTMLNode

I have selected a table by finding it by class. I then need to loop over each row and examine the cells within it; however, <tr> returns as a TextNode, so I am unable to do selects/foreach on the table cells.

Is there any way to make <tr> return as an HTMLNode?

Issue with UTF-8

There is something wrong due to the conversion made to the text.

   $dom = new Dom;
   $dom->loadFromUrl($page_url, [ 'enforceEncoding'=> 'UTF-8']);

The text is sometimes decoded as it should be, and sometimes the decoding destroys my UTF-8 characters (the source is UTF-8 and I make no changes to the results returned by the ->text(TRUE) function).

Not sure about a bug because I traced your code and it should not apply conversion due to the forced UTF8.

Stripping style and script tags in clean function

Hey mate, great module, just started using it, just wondering why i can't pull out any script tags, and i can see they're getting stripped in your clean() function.

This could be just what you want, but i'm actually wanting to parse these out! Ah well could just fork it ay.

Thanks again.
Cheers
Rob

Update to 1.6.9 breaks HTML entities

When I updated to 1.6.9, it appears the way HTML entities are treated in parsed documents has changed, and it incorrectly handles some characters. The only entity I see so far that converts incorrectly is &#105; (lower case "i"). When requesting a node's text with this entity in it, it returns "\n5;" instead (newline, "5", semi-colon). As soon as I reverted back to 1.6.8, it fixed it.

Is it possible to disable all entity handling entirely, since I can easily do that myself with html_entity_decode() if I need to?
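
For reference, decoding on the caller's side with html_entity_decode() could look like the sketch below. It assumes the entity in question is &#105;, the numeric reference for lower-case "i"; the second half reproduces the reported "\n5;" under a guessed cause (the prefix &#10 being replaced on its own), which is an illustration and not a confirmed analysis of the library.

```php
<?php
// Decoding with PHP's built-in function: &#105; is the numeric reference for "i".
$encoded = 'f&#105;sh';
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Assumption for illustration only: if a decoder replaced the prefix "&#10"
// (line feed without its semicolon) before seeing the full reference,
// the result would match the "\n5;" described in the report.
$prefixDecoded = str_replace('&#10', "\n", $encoded);

var_dump($decoded);                      // string(4) "fish"
var_dump($prefixDecoded === "f\n5;sh");  // bool(true)
```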

[Proposal] Replace child node with node array

Currently we are able to replace a tag node with another one(#52), but we can't replace a tag node with several nodes.

sunra/php-simple-html-dom-parser can do this.

$child->outertext = $child->innertext;

Here is my thought:

$child = $parent->find('child')[0];
$children = $child->getChildren();
$parent->replaceChild($child->id(), $children);

Thanks!

Class 'stringEncode\Encode' not found

Fatal error: Class 'stringEncode\Encode' not found in /var/www/clients/client19/web83/web/libs/PHPHtmlParser/Dom.php on line 593

File name is longer than the maximum allowed path length on this platform (4096)

I am trying to load a string:

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->load('A HUGE HTML STRING ');
$a = $dom->find('a')[0];
echo $a->text; // "click here"

and I get the error:

is_file(): File name is longer than the maximum allowed path length on this platform (4096):

I think you should create a function, or give access to loadStr, that works without checking whether the input is a file:

    public function load($str, $options = [])
    {
        // check if it's a file
        if (is_file($str))
        {
            return $this->loadFromFile($str, $options);
        }
        // check if it's a url
        if (preg_match("/^https?:\/\//i",$str))
        {
            return $this->loadFromUrl($str, $options);
        }

        return $this->loadStr($str, $options);
    }
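
One way to avoid the warning while keeping the file check is to skip is_file() for strings too long to be a path. This is a sketch under that assumption, using only PHP built-ins; the helper name looksLikeFile is hypothetical and not the library's code.

```php
<?php
// Hypothetical guard, not the library's code: only ask the filesystem
// when the string is short enough to be a path on this platform.
function looksLikeFile(string $str): bool
{
    // PHP_MAXPATHLEN is the longest path PHP accepts (4096 in the report above).
    if (strlen($str) >= PHP_MAXPATHLEN) {
        return false;
    }

    return is_file($str);
}

$hugeHtml = '<div>' . str_repeat('<p>text</p>', 1000) . '</div>';
var_dump(looksLikeFile($hugeHtml)); // bool(false), without calling is_file()
```

Other reports on this page call $dom->loadStr($content, []) directly, which would bypass the file check altogether if that method is accessible in the installed version.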
