php-html-parser's People

Contributors

andreyshade, billythekid, cybrox, gufran, halityurttas, hmerritt, johncoles, leonk, ljvicente, lukasros, mallardduck, oleg-andreyev, onlinesid, paquettg, parisholley, phh, rafpaf, rajataimur7, rhrebecek, rikvdh, rnewton, scrutinizer-auto-fixer, ssfinney, sunra, tfedor, thenotsoft, thiagotalma, upperfoot, vipercz, yavork

php-html-parser's Issues

'Illegal string offset 'value''

I'm using this package to read some HTML and replace all anchor tags with some other href (to be more specific, I have coded a function getRedirectUrl that follows all redirects and gets the final URL after all redirections, but this is not relevant for the issue I'm running into).

This is the code I've come up with:

$dom = new Dom;
$dom->load($content);

$links = $dom->find('a');

foreach($links as $link)
{
    $finalUrl = $this->getRedirectUrl($link->getAttribute('href'));
    $tag = $link->getTag();
    $tag->setAttribute('href', $finalUrl);
}

return trim(strip_tags($dom->root->innerHtml(), '<p><a><img>'));

$content contains very simple HTML, something like:

<div>
Actual content

<a href="http://bit.ly/1jIDoCy">This is a link</a>

<a href="http://bit.ly/1N0ZM5s">This is another one</a>

Parse this.
</div>

The final $dom->root->innerHtml() fails with:

local.ERROR: exception 'ErrorException' with message 'Illegal string offset 'value'' in /home/vagrant/Code/marketer/vendor/paquettg/php-html-parser/src/PHPHtmlParser/Dom/Tag.php:161

Am I doing something wrong? I'd appreciate any help. I've spent the past day trying to fix it with no success.

Resolving timed out after

Hi,

Do you have any idea how to fix the error below?

Fatal error: Uncaught exception 'PHPHtmlParser\Exceptions\CurlException' with message 'Error retrieving "http://google.com" (Resolving timed out after 5521 milliseconds)

Code is here:
$dom = new Dom;
$dom->loadFromUrl('http://google.com');
$html = $dom->outerHtml;

I got the same on dev-master and 1.6.4 versions.

Thanks.

Space before closing > parsed incorrectly

Hi,

If you have a space before the closing > in a tag, this library incorrectly assumes the following text is a set of attributes for the tag. For example:

<a href="http://www.example.com" >This is text</a>

The a node will then have the following attributes: href, This, is, text.

Returns only one result while there are several

Hi!
I don't know if I'm doing something wrong or if there is a bug in the code, but I would like to fetch items using a selector, and php-html-parser returns only one result while there are several.

    //$body is the content of this page : http://www.novaplanet.com/radionova/cetaitquoicetitre
    //I can't use  $dom->load('http://www.novaplanet.com/radionova/cetaitquoicetitre') here; I need to use a string.
    $dom = new Dom;
    $dom->loadStr($body, []);
    $track_nodes = $dom->find('.cestquoicetitre_results .resultat');

I get only the first result.
Can someone help here?

Load HTML Incompletely

Here's my code.

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->load('http://example.com');
$html = $dom->outerHtml;
echo $html;

When I view the source, the HTML isn't loaded completely. Only about half of the HTML page is loaded.
For some reason I can't expose the real URL here.

[Proposal] Option to disable cleanup

There are some cases in which it might be useful to use the DOM parser to check the presence of certain <style> or <script> tags. In those cases it should be possible to pass an option to skip some of the cleaning steps that remove those tags. This option could be either as broad as "skipHtmlCleanup" or specific such as "keepScriptsInDom" or "keepStylesInDom".

Error when fetching from URL

My code:

        $dom = new Dom;
        $dom->load('http://google.com');

I got this:

    PHP Fatal error:  Call to undefined function PHPHtmlParser\curl_init() in /home/howtomakeaturn/projects/veblen/vendor/paquettg/php-html-parser/src/PHPHtmlParser/Curl.php on line 17

Am I missing anything?

Or should I install something?

Thanks!
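The error message suggests that PHP's cURL extension is not loaded, since curl_init() is provided by that extension rather than by php-html-parser. A minimal check, assuming nothing about the library itself; curlAvailable() is a hypothetical helper name, and the package name in the comment is an assumption that varies by distribution:

```php
<?php
// Hypothetical helper: report whether the cURL extension, which provides
// curl_init(), is available to this PHP installation.
function curlAvailable(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curlAvailable()) {
    echo "cURL is available\n";
} else {
    // On Debian/Ubuntu-style systems the extension is usually packaged
    // separately (for example as php-curl); this is an assumption about
    // your setup, so check your distribution's documentation.
    echo "cURL extension is missing; install or enable it, then restart PHP\n";
}
```

Running this with the same PHP binary (CLI or web server) that runs your application matters, because the two can load different extension sets.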

[Proposal] Add removeAttribute and removeAttributes methods to Tag class

It would be very useful if it were possible to remove an attribute (or a list of attributes) from a tag. Ideally we can add two new methods to both the AbstractNode and the Tag classes. The first method, removeAttribute, removes a single attribute. The second method, removeAttributes, removes all attributes from a tag except the ones passed to the method.

Class TextNode has an encoding bug

public function __construct($text)
{
    // remove double spaces
    $text = preg_replace('/\s+/', ' ', $text);

$text = preg_replace('/\s+/', ' ', $text);

This line has an encoding bug. With it deleted, everything works fine.

Test case:

    $dom = new \PHPHtmlParser\Dom;
    $dom->load(
    ' <div class="content">'.
    '            一哥们开车巨慢,早上上班十几公里路能开四十多分钟,每天起很早为了不迟到,经常见一骑三轮车的保洁老大爷晃晃悠悠超他的车对他说:小伙子又早起练车呀,快点开吧,交警一会就上班了。。。'.
    '        </div>'
    );
    $content = $dom->find('div.content');
    var_dump($content[0]->innerHtml);exit;

Maybe there is a better solution.
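One possible alternative to deleting the line is the PCRE u modifier. The likely cause (an assumption, not confirmed in this report) is that without u the pattern \s is matched byte by byte and, depending on locale, may treat a byte inside a multibyte character as whitespace. A minimal sketch; collapseWhitespace() is a hypothetical name, not part of the library:

```php
<?php
// Hypothetical helper: collapse runs of whitespace while treating the input
// as UTF-8. The /u modifier makes PCRE work on code points, not single bytes.
function collapseWhitespace(string $text): string
{
    $result = preg_replace('/\s+/u', ' ', $text);
    // preg_replace() returns null on failure (for example, on invalid UTF-8
    // input); fall back to the untouched text in that case.
    return $result ?? $text;
}

$sample = "  一哥们开车巨慢,\n\t早上上班  ";
$clean = collapseWhitespace($sample);

echo $clean, "\n";
// An empty pattern with /u only matches when the subject is valid UTF-8.
var_dump(preg_match('//u', $clean) === 1);
```

The fallback to the original text is a design choice for the sketch: it keeps malformed input intact instead of returning null to the caller.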

Incorrect behaviour on parsing html

I was trying to parse the HTML content of cnn.com news pages, and when I get the body tag using either find() or getElementByTag(), half the content is gone. I wrote the parsed content to a file and realized that some tags, like <article>, end up outside the <body> or <html> tag, something like this:

<html>
  <head>...</head>
  <body>...</body>
  <article>...</article>
  <div>...</div>
</html>
<div>...</div>

PHP code:

<?php

$dom = new PHPHtmlParser\Dom();
$url = 'http://edition.cnn.com/2015/11/19/tennis/world-tour-finals-federer-nishikori/index.html';
$dom->load($url);
file_put_contents('test.html', (string) $dom);

[Request] Please add in itinance's getChildren() feature

Hey there. First off, I want to thank you so much for continuing this amazing project. It is a godsend for me and my projects. Keep it up. Second, I would like to request that you add itinance's getChildren() feature, found here: Link. I think it would be a great feature to add.

P.S. I can create a pull request if needed.
Thanks,
Mooror

Select an element that has specific text

Hi!
How can I select an element that has a specific text?

And a suggestion:
In CSS or jQuery, when I want to select an element with two classes I use the syntax .class1.class2, but in your script it doesn't work. I think you chose this syntax for it: .class1+.class2
Please fix this. Thank you.

[Request] Replace child node

Currently we are able to remove a specific child and add a new child at the tail of the parent, but we can't replace a specific tag node.

sunra/php-simple-html-dom-parser can do this.

$child->outertext = '<new />';

Here is my thought:

$child = $parent->find('child')[0];
$newChild = new Tag('new');
$parent->replaceChild($child->id(), $newChild);

Thanks!

Can not get all content in body tag

$this->dom = new Dom;
$this->dom->loadFromUrl($url);

$this->dom->find('body')->innerHtml;

But it only gets part of the content in the body tag. I don't understand why. Could you help me? Thank you.

This is my URL: http://casiovietnam.net/dong-ho-dien-tu-casio-f91wg9sdf-dong-co-dien

How do you delete nodes?

Hello!

I'm trying to do some easy html manipulation and I can't seem to figure it out.

According to the docs for sunra/php-simple-html-dom-parser, which I figure should work here (correct me if I'm wrong), you should be able to do this:

// Remove an element: set its outertext to an empty string
$e->outertext = '';

But in my test case, that doesn't seem to work.
This is the relevant part of my code:

$dom = new Dom;
$dom->loadStr( $pageMarkup, [] );
$menu = $dom->find( '.inPageMenu' );
$menu->outertext = '';
$html = $dom->outerHtml;

After that, the item with the class is still present in $html.

Am I going at this backwards or am I missing something?
Thanks in advance!

Replace child broken

Fatal error: Method PHPHtmlParser\Dom::__toString() must not throw an exception, caught Error: Cannot use object of type PHPHtmlParser\Dom\HtmlNode as array

Here is some example code to reproduce:

$dom = new Dom;
$dom->load($content);
$images = $dom->find('img');

$newimages = [];

foreach ($images as $image) {
    $tag = new Tag('amp-img');

    $src = $image->getAttribute('src') ?: $image->getAttribute('data-src');
    $tag->setAttribute('src', $src);

    $html = new HtmlNode($tag);

    $image->getParent()->replaceChild($image->id(), $html);
}

return (string)$dom;

How to check whether an element was found, to prevent the error page

Hi,
I'm using this simple code to fetch a field from an external URL:

$dom = new Dom;
$dom->loadFromUrl($dom_address);
$time = $dom->getElementsByClass('exampleclass')->getAttribute('data-datetime');

The problem is that when it can't find the element with that class (for example, when the website is not working properly, goes offline, or displays a 404 page), I get a PHP error. It's fine when it can find it. How can I check whether the element was found, so I can simply set $time = 'N/A'; and prevent the error page?

Thanks
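One generic way to get the 'N/A' fallback is to wrap the lookup in a try/catch. This is a minimal sketch that makes no assumption about what the library returns or throws for a missing element; valueOrDefault() and the stand-in closures are hypothetical and exist only for illustration:

```php
<?php
// Hypothetical helper: run a lookup and fall back to a default when the
// lookup throws or yields nothing usable.
function valueOrDefault(callable $lookup, string $default): string
{
    try {
        $value = $lookup();
    } catch (\Throwable $e) {
        // Covers exceptions and PHP 7+ errors such as calling a method on null.
        return $default;
    }
    return (is_string($value) && $value !== '') ? $value : $default;
}

// Stand-in lookups for illustration. In the reported code, the closure would
// wrap the loadFromUrl() and getAttribute() calls from the question.
$found = valueOrDefault(function () {
    return '2015-11-19T10:00';
}, 'N/A');

$missing = valueOrDefault(function () {
    throw new \RuntimeException('element not found');
}, 'N/A');

echo $found, "\n";   // prints "2015-11-19T10:00"
echo $missing, "\n"; // prints "N/A"
```

Plain PHP warnings and notices are not thrown, so they are caught here only if an error handler converts them into exceptions, as some frameworks do.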

Script cleaner can break HTML

If the parser meets something like this:

    <p>.....</p>
    <script>
        some code ....
        document.write("<script src='some script'><\/script>")
        some code ....
    </script>
    <p>....</p>

the cleaner removes much of the HTML body.

It can be fixed by changing this code:

    $str = preg_replace("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is", '', $str);

to

$str = preg_replace("'<\s*script[^>]*[^/]>(.*?)<[^\/]*/\s*script\s*>'is", '', $str);
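The difference between the two patterns can be checked in isolation. The sketch below applies each pattern to a made-up sample modelled on this report and counts opening and closing script tags afterwards. It exercises only the single pattern quoted here, which by itself does not match an attribute-less <script> tag; any other cleanup step the library performs is outside this sketch.

```php
<?php
// Sample input modelled on the report: a script that writes another script tag.
$html = <<<'HTML'
<p>before</p>
<script>
    document.write("<script src='some script'><\/script>")
</script>
<p>after</p>
HTML;

// Pattern quoted in this report as the current one.
$current = "'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is";
// Proposed pattern: also accepts an escaped closing tag such as <\/script>.
$proposed = "'<\s*script[^>]*[^/]>(.*?)<[^\/]*/\s*script\s*>'is";

$afterCurrent = preg_replace($current, '', $html);
$afterProposed = preg_replace($proposed, '', $html);

// Count how many opening and closing script tags survive each pattern.
printf(
    "current:  %d opening, %d closing\n",
    substr_count($afterCurrent, '<script'),
    substr_count($afterCurrent, '</script>')
);
printf(
    "proposed: %d opening, %d closing\n",
    substr_count($afterProposed, '<script'),
    substr_count($afterProposed, '</script>')
);
```

With the current pattern, the match starts at the inner tag and runs to the outer closing tag, so the outer <script> is left unclosed. With the proposed pattern, only the inner pair is removed and the outer tags stay balanced.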

Install without Composer?

Is it possible to install this without Composer? I am testing this on WampServer on Windows and plan to use it on shared hosting, so is there any way to install it without using Composer?

Problem with a specific website

Hey folks, I'm trying to use the parser on one of the websites (code below), but it seems the parser cannot get the whole website's data. I tried both file_get_contents and $dom->load, and it still doesn't work at all.

require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$races = ['Karu', 'Elmo'];

$dom = new Dom;
$dom->loadFromUrl('www.nttgameonline.com/knight/en/ranking/clan/0/Karu');
$html = $dom->innerHtml;

$contents = $dom->find('#server');
echo count($contents); // It should print 4

foreach ($contents as $content)
{     
    echo ($content->plaintext);   
}

setAttribute does not work without an array as value

On the home page of this repo, I see an example of setting an attribute:

$tag->setAttribute('class', 'foo');

But the above code does not work and throws the error below:

Illegal string offset 'value'

But if I use an array for the second parameter, it works fine:

$tag->setAttribute('class', array('value'=>'foo', 'doubleQuote'=>true));

Not getting

I use Linux Mint. Can I get a video of it? It would be very helpful to understand by watching a video.

Index-related selector?

Hi,
It seems that index-related selectors do not work?

I need to have a selector to get the second matched item, like

$node = $html->find('a:eq(1)')

I can't use

$nodes = $html->find('a');
$node = $nodes[1];

in my script; I need to get it using a selector.

Is there any way to achieve this?

Thanks.

<code> tag is not recognized

This is the HTML:

<strong>hello</strong>
<code class="language-php">$foo = "bar";</code>

The parser only recognizes <strong> according to the output:

there are 1 nodes

  • strong

$dom = new Dom;
$dom->load('<strong>hello</strong><code class="language-php">$foo = "bar";</code>');
$nodes = $dom->find('*');
$total = count($nodes);
echo "there are {$total} nodes";
/** @var Dom\AbstractNode $node*/
foreach ($nodes as $node) {
    $tag = $node->getTag();
    echo "<br>- {$tag->name()}";
}

Any idea why <code> is ignored? I have tested a lot of tags and only this one is not recognized. Thank you in advance.

Multiple class selector does not work

Using 1.7.0, selectors of the style "a.b.c" do not correctly find elements with multiple classes.

index.php:

<?php
require_once("vendor/autoload.php");
$dom = new \PHPHtmlParser\Dom();
$dom->loadFromFile("fake.html");
$A = $dom->find("a.b");
var_dump(count($A));
$B = $dom->find("a.b.c");
var_dump(count($B));

fake.html:

<a class="b">alpha</a>
<a class="b c">bravo</a>

output:

$ php -f index.php
int(2)
int(0)
$

I wouldn't necessarily consider this a bug, except the documentation says that "any CSS selector" can be used, which is not the case here. I also tested with 1.6.4 and 1.6.9, with the same results.

Thanks!

[Request] Add licensing information

I'm currently working on a project and want to include php-html-parser as a dependency. I found the license in composer.json, but this should really be easier to find. Can this be added to the repo as a LICENSE file?

HTML attributes with double quotes in them break parsing

Given this HTML:

<a title="This is a "test" of double quotes" href="http://www.example.com">Hello</a>

When passed into Dom::load(), the parser ends up correctly finding the element, but misparses the attributes and body text. The attributes (from var_dump($element->getAttributes())) appear like so:

array(1) { ["title"]=> string(10) "This is a " }

and the body appears like so (from var_dump($element->text())):

string(58) "est" of double quotes" href="http://www.example.com">Hello"

I realize that putting double quotes inside an attribute is nonconformant HTML, but ideally PHPHtmlParser should be tolerant of such things and parse the element anyway, much in the way web browsers do. While it may be impossible to accurately determine what the intended title attribute's correct value is, it should be possible to ensure that the element text does not include content from before the > marker.

[Request] Wildcard symbol?

I was trying to get all the elements.

Here's what I tried:

$dom->load('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
$a   = $dom->find('*');

var_dump($a);
exit;

It returns 1 (int). It seems like php-html-parser doesn't support the wildcard symbol?

Attribute without value creates extra '>' on tag

If I load the following,

<div class="content">
    <div class="grid-container" ui-view>
        <!-- the main content appears here -->
    </div>
</div>

then when I render the HTML I get (note the extra > in the inner div),

<div class="content"> 
    <div class="grid-container" ui-view>> </div> 
</div>

However, if I move the ui-view attribute before the class attribute or add a value to it, then it is rendered correctly.

Cannot add attribute

There is no way to add a new attribute to a tag. For example, I wanted to add ng-app='appName' to an HTML tag

Using the setAttribute method results in,

[ErrorException]
  Illegal string offset 'value'

Ideally, the setAttribute method should create a new attribute if it doesn't exist yet.

Unable to find element using child selector '>'

Firstly, let me say thanks for maintaining such a great library; it has been exceptionally useful in my project. But on to the issue.

Using the following code:

$dom->find('div > ul');

This results in an empty set, despite the HTML being valid. It seems the find() function does not support child selectors. I added a unit test to SelectorTest.php to confirm the results.

The test code:

    public function testFindClassWithChildSelector() {

        $root   = new HtmlNode(new Tag('root'));
        $parent = new HtmlNode(new Tag('div'));
        $child1 = new HtmlNode(new Tag('ul'));  
        $root->addChild($parent);
        $parent->addChild($child1);

        $selector = new Selector('div > ul');
        $this->assertEquals(1, count($selector->find($root)));
    }

The results:
(screenshot: php-html-parser-test-failed)

I'm going to see if I can't add the functionality myself. However, given your familiarity with the library you may be able to make a quick change to fix this.

<tr> as HTMLNode

I have selected a table by finding it by class. I then need to loop over each row and examine the cells within it; however, <tr> is returned as a TextNode, so I am unable to do selects or a foreach on the table cells.

Is there any way to make <tr> return as an HTMLNode?

Issue with UTF-8

Something goes wrong in the conversion applied to the text.

   $dom = new Dom;
   $dom->loadFromUrl($page_url, [ 'enforceEncoding'=> 'UTF-8']);

The text is sometimes decoded as it should be, and sometimes the decoding kills my UTF-8 characters (the source is UTF-8 and I don't make any change to the results grabbed from the ->text(TRUE) function).

I'm not sure this is a bug, because I traced your code and it should not apply any conversion given the enforced UTF-8.

Stripping style and script tags in clean function

Hey mate, great module. I just started using it and was wondering why I can't pull out any script tags; I can see they're getting stripped in your clean() function.

This could be just what you want, but I actually want to parse these out! Ah well, I could just fork it.

Thanks again.
Cheers
Rob

Update to 1.6.9 breaks HTML entities

When I updated to 1.6.9, it appears the way HTML entities are treated in parsed documents has changed, and it incorrectly handles some characters. The only entity I see so far that converts incorrectly is &#x69; (lower case "i"). When requesting a node's text with this entity in it, it returns "\n5;" instead (newline, "5", semi-colon). As soon as I reverted back to 1.6.8, it fixed it.

Is it possible to disable all entity handling entirely, since I can easily do that myself with html_entity_decode() if I need to?
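For reference, decoding by hand with html_entity_decode(), as mentioned above, could look like the following. The sample string is made up for illustration, and the flags shown are one reasonable choice rather than what the library uses:

```php
<?php
// Decode entities manually with PHP's built-in function.
// &#x69; is the hexadecimal entity for a lower-case "i".
$raw = 'T&#x69;tle &amp; more';
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n"; // prints "Title & more"
```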

[Proposal] Replace child node with node array

Currently we are able to replace a tag node with another one (#52), but we can't replace a tag node with several nodes.

sunra/php-simple-html-dom-parser can do this.

$child->outertext = $child->innertext;

Here is my thought:

$child = $parent->find('child')[0];
$children = $child->getChildren();
$parent->replaceChild($child->id(), $children);

Thanks!

File name is longer than the maximum allowed path length on this platform (4096)

I am trying to load a string:

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->load('A HUGE HTML STRING ');
$a = $dom->find('a')[0];
echo $a->text; // "click here"

and I get the error:

is_file(): File name is longer than the maximum allowed path length on this platform (4096):

I think you should create a function, or give access to loadStr, without checking whether the string is a file or not:

    public function load($str, $options = [])
    {
        // check if it's a file
        if (is_file($str))
        {
            return $this->loadFromFile($str, $options);
        }
        // check if it's a url
        if (preg_match("/^https?:\/\//i",$str))
        {
            return $this->loadFromUrl($str, $options);
        }

        return $this->loadStr($str, $options);
    }
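Another option is to guard the is_file() call so that strings too long to be a path never reach it. A minimal sketch using the PHP_MAXPATHLEN constant; looksLikeFile() is a hypothetical helper, not part of the library:

```php
<?php
// Hypothetical helper: only ask the filesystem when the string could
// plausibly be a path on this platform.
function looksLikeFile(string $str): bool
{
    // PHP_MAXPATHLEN is the longest path PHP accepts on this platform.
    // Longer strings cannot be file names, so skip is_file() and its warning.
    if (strlen($str) >= PHP_MAXPATHLEN) {
        return false;
    }
    return is_file($str);
}

// 11,000 bytes of markup, well beyond the 4096 limit quoted in the error.
$hugeHtml = str_repeat('<p>text</p>', 1000);
var_dump(looksLikeFile($hugeHtml)); // bool(false)
```

In the load() method quoted above, such a check would stand in for the bare is_file($str) test, leaving the URL check and the loadStr() fallback unchanged.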
