ressio / pharse
Fastest PHP HTML Parser
Is it possible to add PHPDoc comments to every method?
In NetBeans, this would be useful for autocompletion.
Original issue reported on code.google.com by [email protected]
on 25 Apr 2013 at 8:15
There is a typo in packed version of ganon.php on line 1860:
protected function filter_containts($text) {
should be
protected function filter_contains($text) {
Original issue reported on code.google.com by [email protected]
on 11 Nov 2010 at 12:15
Hi!
First of all, thank you for your ganon library. It is great code (very
useful for me).
I am writing some web scrapers with ganon, and I have memory leak
problems with long-running scripts.
In the end, Linux kills my script because it exhausts all the available memory
(even the swap partition) and leaves
my server running very slowly.
I spent a few days investigating the cause of these memory leaks, and these are my
complete conclusions:
First, a little test to show the memory leak with ganon:
<?php
include('lib/ganon.php');
set_time_limit(3600); //For slow servers...
function ParseTest($TheHtml){
//Parse several times to check that memory is released
//without leaving the function scope:
for ($f = 1; $f <= 20; $f++){
$test_html = new HTML_Parser_HTML5($TheHtml);
$span=$test_html->root->select('span[class="IsThis"]',0);
//Test if the select works...
if (!$span) echo 'Select Error...';
}//for f
}//ParseTest
echo '<pre>';
echo 'Php Version:'.phpversion().'<br><br>';
//Build an html for testing
$test_string=str_repeat('<div><span class="NOIsThis">Foo</span></div><div><span class="IsThis">Bar</span></div>',40);
//Loop for testing memory consumption
for ($i = 1; $i <= 20; $i++){
ParseTest($test_string);
echo sprintf( '>>>>>>>>>> Iteration: %4s, Memory Usage: %8s <br>',
$i,number_format(memory_get_usage()) );
}
echo '</pre>';
?>
If I run the test with the original ganon (Ganon single file PHP5, rev. #72),
the test script stops because it consumes
all the memory available to PHP (I think in my case it is 128 MB).
This is the output of the test:
Php Version:5.2.14
>>>>>>>>>> Iteration: 1, Memory Usage: 9,277,760
>>>>>>>>>> Iteration: 2, Memory Usage: 18,021,192
>>>>>>>>>> Iteration: 3, Memory Usage: 26,567,912
>>>>>>>>>> Iteration: 4, Memory Usage: 35,508,408
>>>>>>>>>> Iteration: 5, Memory Usage: 44,055,968
>>>>>>>>>> Iteration: 6, Memory Usage: 52,602,616
>>>>>>>>>> Iteration: 7, Memory Usage: 61,935,256
>>>>>>>>>> Iteration: 8, Memory Usage: 70,482,696
>>>>>>>>>> Iteration: 9, Memory Usage: 79,028,872
>>>>>>>>>> Iteration: 10, Memory Usage: 87,575,696
>>>>>>>>>> Iteration: 11, Memory Usage: 96,122,120
>>>>>>>>>> Iteration: 12, Memory Usage: 104,669,872
>>>>>>>>>> Iteration: 13, Memory Usage: 113,216,320
>>>>>>>>>> Iteration: 14, Memory Usage: 123,336,072
>>>>>>>>>> Iteration: 15, Memory Usage: 131,883,464
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to
allocate 3441 bytes) in \lib\ganon.php on line 247
Well, as I said, I spent a few days investigating this problem, and I found
two things that cause these
memory leaks:
- You must destroy the base class of any extended class (when necessary) by
calling parent::__destruct();
- The callback functions created with create_function cannot be destroyed, and ganon creates a lot
of these callback functions.
Reference: the comments in the PHP manual at http://php.net/manual/en/function.create-function.php
It is always better not to use autogenerated code.
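The two causes listed above can be illustrated with a minimal sketch. It is not the patch from nml_ganon.php: the classes `BaseNode` and `ElementNode` and the methods `clear()` and `hasText()` are hypothetical names chosen for this example. It shows a subclass destructor chaining to `parent::__destruct()` so that the base-class cleanup runs, and a named static method passed as a callback in place of a function generated at runtime.

```php
<?php
// Hypothetical classes for illustration; they are not ganon's own classes.
class BaseNode {
    public $parent = null;
    public $children = array();

    public function addChild(BaseNode $child) {
        $child->parent = $this;
        $this->children[] = $child;
        return $child;
    }

    // Break the parent <-> child cycle so memory can be reclaimed even on
    // PHP versions without a cycle collector (before 5.3).
    public function clear() {
        foreach ($this->children as $child) {
            $child->parent = null;
        }
        $this->children = array();
    }

    public function __destruct() {
        $this->clear();
    }
}

class ElementNode extends BaseNode {
    public $text;

    public function __construct($text = '') {
        $this->text = $text;
    }

    public function __destruct() {
        $this->text = null;
        // Without this call the base-class cleanup never runs.
        parent::__destruct();
    }

    // A named method reused as a callback; nothing is generated at runtime.
    public static function hasText(ElementNode $node) {
        return $node->text !== '';
    }
}

$root = new ElementNode('root');
$root->addChild(new ElementNode('Foo'));
$root->addChild(new ElementNode(''));

$with_text = array_filter($root->children, array('ElementNode', 'hasText'));
echo count($with_text), "\n"; // 1

// Explicit cleanup for long-running scripts, then drop the last reference.
$root->clear();
unset($root);
```

Calling `clear()` explicitly matters mostly on old PHP versions, where an object that is still part of a reference cycle is never destroyed automatically.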
I made all these modifications in ganon.php, creating nml_ganon.php. Using
it, this is the result of the previous
test:
Php Version:5.2.14
>>>>>>>>>> Iteration: 1, Memory Usage: 600,712
>>>>>>>>>> Iteration: 2, Memory Usage: 600,712
>>>>>>>>>> Iteration: 3, Memory Usage: 600,712
>>>>>>>>>> Iteration: 4, Memory Usage: 600,712
>>>>>>>>>> Iteration: 5, Memory Usage: 600,712
>>>>>>>>>> Iteration: 6, Memory Usage: 600,720
>>>>>>>>>> Iteration: 7, Memory Usage: 600,720
>>>>>>>>>> Iteration: 8, Memory Usage: 600,720
>>>>>>>>>> Iteration: 9, Memory Usage: 600,720
>>>>>>>>>> Iteration: 10, Memory Usage: 600,720
>>>>>>>>>> Iteration: 11, Memory Usage: 600,720
>>>>>>>>>> Iteration: 12, Memory Usage: 600,720
>>>>>>>>>> Iteration: 13, Memory Usage: 600,720
>>>>>>>>>> Iteration: 14, Memory Usage: 600,720
>>>>>>>>>> Iteration: 15, Memory Usage: 600,720
>>>>>>>>>> Iteration: 16, Memory Usage: 600,720
>>>>>>>>>> Iteration: 17, Memory Usage: 600,720
>>>>>>>>>> Iteration: 18, Memory Usage: 600,720
>>>>>>>>>> Iteration: 19, Memory Usage: 600,720
>>>>>>>>>> Iteration: 20, Memory Usage: 600,720
OK, no memory leaks.
NOTE:
I have changed only one of the callback functions for now, but the code has other
create_function calls in
the getChildrenByAttribute function of the HTML_Node class, so it is not complete
yet (maybe in a few days I will
finish this).
I don't know if I can attach a file here (I will try). In any case, I have made
the file accessible on one of my servers, at:
http://trucomania.org/inaki/nml_ganon_rev72.zip
I hope you consider this for your next revision. I will change the rest of the
callbacks when I find time.
Thanks again for your great library!
Original issue reported on code.google.com by [email protected]
on 20 Sep 2012 at 9:42
What will reproduce the problem?
Fatal error: Call to protected method HTML_Node::filter_element() from context
'HTML_Formatter' in *.* on line 2761
Which version are you using?
ganon.php rev#59
Original issue reported on code.google.com by [email protected]
on 14 Feb 2012 at 9:27
ganon.php (html_parser) doesn't work with URLs like this:
http://www.google.it/language_tools
Maximum execution time of 30 seconds exceeded in ganon.php on line 247
PHP Version 5.3.8
Latest version of ganon.php
Original issue reported on code.google.com by [email protected]
on 1 Dec 2011 at 5:29
What will reproduce the problem?
<?php
include 'ganon.php';
$html = '<html><head><body><div class="special-post">This is a special post</div></body></html>';
$dom = str_get_dom($html);
$special = $dom('.special');
echo $special[0]->getPlainText();
What is the expected output? What do you see instead?
Exception, $special[0] shouldn't be set because the document doesn't have any
element with the class "special". I get the string "This is a special post"
instead.
Which version are you using?
r78
Please provide any additional information below.
I am testing the library and I have found this big bug at the first test. I
like the idea and the way that ganon works, I hope it will get a fix.
Thanks for your work
Original issue reported on code.google.com by [email protected]
on 29 Oct 2012 at 12:24
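The class-selector behaviour reported above (`.special` matching `class="special-post"`) is what one gets when a class name is tested with a word-boundary regular expression, because `\b` treats the hyphen as a boundary. Whether ganon's selector code does exactly this is an assumption here, not something confirmed from its source. A minimal sketch of token-based matching, with the hypothetical helper name `has_class`, looks like this:

```php
<?php
// Hypothetical helper, not part of ganon: true only when $class is one of
// the whitespace-separated tokens in the element's class attribute.
function has_class($class_attribute, $class) {
    $tokens = preg_split('/\s+/', trim($class_attribute), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($class, $tokens, true);
}

var_dump(has_class('special-post', 'special')); // bool(false)
var_dump(has_class('text test', 'text'));       // bool(true)
var_dump(has_class('NOIsThis', 'IsThis'));      // bool(false)

// For comparison, a word-boundary regex accepts the hyphenated name:
var_dump(preg_match('`\bspecial\b`', 'special-post') === 1); // bool(true)
```

Splitting the attribute into tokens follows how CSS defines class selectors: the class attribute is a space-separated list, and a selector matches only a whole entry of that list.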
function setIndex($index) {
if ($this->parent) {
if ($index > $this->index()) {
--$index;
}
$this->parent->deleteChild($this, true);
$this->parent->addChild($this, $index);
}
}
Original issue reported on code.google.com by [email protected]
on 17 May 2013 at 12:08
The following code produces an empty array for '$spans':
$test_string = '<div><span class="text test">Foo</span></div><div><span class="text test">Bar</span></div>';
$test_html = str_get_dom($test_string);
$spans = $test_html('.text');
$results = '.text: ' . count($spans);
Original issue reported on code.google.com by [email protected]
on 29 Apr 2011 at 10:41
What will reproduce the problem?
Running it with 5.4 :)
Which version are you using?
Latest Rev from Feb 16 with PHP 5.4
Fatal error: 'break' operator with non-constant operand is no longer supported
in C:\work\php\someproject\libs\ganon.php on line 1609
Original issue reported on code.google.com by [email protected]
on 12 Mar 2012 at 12:55
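The fatal error above comes from a language change: from PHP 5.4 on, the operand of `break` and `continue` must be a literal number, so `break $depth;` is no longer accepted. The following is only a sketch of the usual rewrite, using a hypothetical function `find_cell`; it does not reproduce the code at line 1609 of ganon.php.

```php
<?php
// Find the first cell equal to $needle in a two-level array.
// The loop depth is written as a literal, which every PHP 5+ version accepts.
function find_cell(array $rows, $needle) {
    $found = null;
    foreach ($rows as $r => $row) {
        foreach ($row as $c => $value) {
            if ($value === $needle) {
                $found = array($r, $c);
                break 2; // literal operand: leaves both loops at once
            }
        }
    }
    return $found;
}

print_r(find_cell(array(array(1, 2), array(3, 4)), 3)); // row 1, column 0
```

When the number of levels really is only known at run time, a flag variable checked in each enclosing loop is the common replacement.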
What will reproduce the problem?
<tag attr="value">
<tag attr='value'>
both get output as:
<tag attr="value">
when reconstructing the HTML.
What is the expected output? What do you see instead?
Expected output is to preserve the type of quotes used, single ' or double ".
This is important with inline/embedded javascript in attributes.
Which version are you using?
Ganon single file PHP5 (rev. #78)
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 19 Mar 2013 at 6:56
What will reproduce the problem?
Trying to get nodes inside html tags if document uses html5
If 'file.html' starts with the HTML5 tag.
<!DOCTYPE html>
...
</html>
$html_node = $html('html', 0);
echo gettype($html_node); // RETURNS NULL
However if the doc is declared with
<html>
...
</html>
it works as intended
What is the expected output? What do you see instead?
Which version are you using?
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 5 Dec 2012 at 8:56
An example:
$test_string = '<div><span class="NOIsThis">Foo</span></div><div><span class="IsThis">Bar</span></div>';
$test_html = str_get_dom($test_string);
$spans = $test_html->select('span.IsThis');
echo 'Spans with class IsThis (should be one):'.count($spans);
echo "\r\n";
echo 'This should print Bar: '.$test_html->select('span.IsThis',0)->getPlainText();
I want to select the span with class "IsThis", but the query returns the first
span (with class "NOIsThis").
I think this is wrong, don't you?
Original issue reported on code.google.com by [email protected]
on 10 Sep 2012 at 3:01
What will reproduce the problem?
Turn on error_reporting(E_ALL);
What is the expected output? What do you see instead?
It shouldn't kick up any warnings, but it kicks these up:
Notice: Uninitialized string offset: 1 in /mnt/..../ganon.php on line 2086
Notice: Uninitialized string offset: 1 in /mnt/..../ganon.php on line 2086
Which version are you using?
* Ganon single file version - PHP5+ version
* Generated on 24 Mar 2012
Please provide any additional information below.
That's it! Thanks for the great library. :-)
Original issue reported on code.google.com by [email protected]
on 10 Sep 2012 at 4:09
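An "Uninitialized string offset" notice like the one above is raised whenever code reads `$str[$i]` for a position past the end of the string. What line 2086 of ganon.php actually does is not shown in the report, so the following is only a general sketch of the guard, with a hypothetical helper name `char_at`:

```php
<?php
// Hypothetical helper: read one character without raising
// "Uninitialized string offset" when $pos is past the end of $str.
function char_at($str, $pos, $default = '') {
    return isset($str[$pos]) ? $str[$pos] : $default;
}

var_dump(char_at('a', 0)); // string(1) "a"
var_dump(char_at('a', 1)); // string(0) ""
```

`isset()` on a string offset returns false for out-of-range positions instead of emitting a notice, which makes it a cheap check inside a parsing loop.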
The following code echos nothing. It should echo some prices.
$html = file_get_dom('http://www.libertysilver.se/kopa/guldtackor');
foreach($html->select('div.productBox') as $product){
echo $product->select('span.productUnitSellPrice span', 0)->getPlainText() . "<br>";
}
It seems like whenever I have a problem with ganon, it's related to SPAN tags.
I'm using the latest version of ganon.
Thanks for your help :)
Original issue reported on code.google.com by [email protected]
on 17 Jun 2012 at 3:40
What will reproduce the problem?
$html = file_get_dom('http://www.nhl.com/ice/schedulebyseason.htm');
What is the expected output? What do you see instead?
After it took more than 30 seconds and triggered a fatal error many times, I
set `set_time_limit(0);`. It has now been running for about 15 minutes.
"Fatal error: Maximum execution time of 30 seconds exceeded in
C:\xampp\htdocs\hockey\ganon.php on line 238"
Which version are you using?
Ganon single file PHP5 (rev. #78)
PHP 5.4.7
Please provide any additional information below.
It worked with the examples provided, 'code.google.com'
Original issue reported on code.google.com by [email protected]
on 8 Mar 2013 at 1:24
PHP versions lower than 5.3 (and even higher versions, if you believe
the comments at php.net) do not default to UTF-8, but to ISO-8859-1, when
using the html_entity_decode(...) function. This creates problems when using
getPlainText(), because it does not take the encoding into account.
What will reproduce the problem?
Just parse something in an encoding other than the default of *YOUR* html_entity_decode(...)
function and it should be easy to see the problems.
What is the expected output? What do you see instead?
The expected output is correctly converted HTML entities.
I get an empty string, like " " => ""
but I would expect to see, " " => " "
Which version are you using?
Ganon single file PHP5 (rev. #78)
Please provide any additional information below.
It can be easily resolved by changing the return statement of getPlainText() from
return preg_replace('`\s+`', ' ', html_entity_decode($this->toString(true, true, true), ENT_QUOTES));
to
return preg_replace('`\s+`', ' ', html_entity_decode($this->toString(true, true, true), ENT_QUOTES, $this->getEncoding()));
Original issue reported on code.google.com by [email protected]
on 19 Jan 2013 at 1:57
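The effect of the third argument of `html_entity_decode()` described above can be checked in isolation. This is a small sketch that only uses the PHP built-in and prints the resulting bytes as hex; the sample string is an assumption chosen for illustration.

```php
<?php
// The same entities decode to different byte sequences depending on the
// target charset passed as the third argument.
$entities = '&nbsp;&eacute;';

$utf8   = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');
$latin1 = html_entity_decode($entities, ENT_QUOTES, 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c2a0c3a9
echo bin2hex($latin1), "\n"; // a0e9
```

Passing the charset explicitly, as the proposed fix does with `$this->getEncoding()`, keeps the decoded bytes consistent with the rest of the document regardless of the PHP version's default.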
Hi
These functions in ganon.php:
function getChildrenByID($id, $recursive = true) {
return getChildrenByAttribute('id', $id, 'equals', 'total', $recursive);
}
function getChildrenByClass($class, $recursive = true) {
return getChildrenByAttribute('class', $id, 'equals', 'total', $recursive);
}
function getChildrenByName($name, $recursive = true) {
return getChildrenByAttribute('name', $name, 'equals', 'total', $recursive);
}
return an error, because the call to getChildrenByAttribute is not made on
the class instance.
You must make the calls as:
function getChildrenByID($id, $recursive = true) {
return $this->getChildrenByAttribute('id', $id, 'equals', 'total', $recursive);
}
function getChildrenByClass($class, $recursive = true) {
return $this->getChildrenByAttribute('class', $class, 'equals', 'total', $recursive);
}
function getChildrenByName($name, $recursive = true) {
return $this->getChildrenByAttribute('name', $name, 'equals', 'total', $recursive);
}
for them to work.
Alternatively, you can drop the create_function approach and do a normal select
instead:
function getChildrenByAttribute($attribute, $value, $mode = 'equals', $compare = 'total', $recursive = true) {
return $this->select( sprintf('[%s="%s"]',$attribute,$value) );
}
function getChildrenByTag($tag, $compare = 'total', $recursive = true) {
return $this->select( $tag );
}
function getChildrenByID($id, $recursive = true) {
return $this->select( sprintf('[id="%s"]',$id) );
}
function getChildrenByClass($class, $recursive = true) {
return $this->select( sprintf('[class="%s"]',$class) );
}
function getChildrenByName($name, $recursive = true) {
return $this->select( sprintf('[name="%s"]',$name) );
}
Original issue reported on code.google.com by [email protected]
on 20 Sep 2012 at 11:37
What steps will reproduce the problem?
1. Running the youtube-sample on a host with PHP Version 5.2.6-1+lenny8
What is the expected output? What do you see instead?
I get an error:
Parse error: syntax error, unexpected T_PAAMAYIM_NEKUDOTAYIM in
/var/www/test/ganon.php on line 1053
What version of the product are you using? On what operating system?
Ganon Rev28, Debian Lenny
Original issue reported on code.google.com by [email protected]
on 20 Jul 2010 at 12:48
What will reproduce the problem?
- Using the function removeClass();
What is the expected output? What do you see instead?
- Fatal error: Call to undefined function reg_replace() on line 1509
Which version are you using?
- Generated on 20 Oct 2012
Please provide any additional information below.
It should be 'preg_replace'.
Original issue reported on code.google.com by [email protected]
on 31 Jul 2013 at 12:53
Thank you for this code! It is great!
Working with it, I found a little issue:
When I want to get an element without children:
$p = str_get_dom($html);
$b = $p('*',0);
//Iterate over childnodes
for ($i = 1; $i < $b->childCount(); $i++) {
$b->deleteChild($i);
}
I get this:
Notice: Undefined offset: 0 in
/Applications/XAMPP/xamppfiles/htdocs/lubith/v2/version/2.0.0/library/html/ganon
/ganon.php on line 1302
Fatal error: Call to a member function delete() on a non-object in
/Applications/XAMPP/xamppfiles/htdocs/lubith/v2/version/2.0.0/library/html/ganon
/ganon.php on line 1302
I changed line 1302 from
$this->children[$child]->delete();
to
if(isset($this->children[$child])) $this->children[$child]->delete();
Now it is working.
Original issue reported on code.google.com by [email protected]
on 26 Mar 2013 at 3:03
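Independently of the `isset()` guard proposed above, deleting children by index inside a forward loop skips entries, because the remaining indexes shift after each removal. The sketch below uses a plain array as a stand-in for a node's child list (an assumption; ganon's own `deleteChild()` is not called here) to show the difference between forward and backward deletion.

```php
<?php
// Plain array standing in for a node's child list.
$children = array('a', 'b', 'c', 'd');

// Forward deletion: after each removal the next element slides into the
// slot just visited, so every second element survives.
$forward = $children;
for ($i = 0; $i < count($forward); $i++) {
    array_splice($forward, $i, 1);
}
print_r($forward); // 'b' and 'd' remain

// Backward deletion: removing from the end keeps the lower indexes valid.
$backward = $children;
for ($i = count($backward) - 1; $i >= 0; $i--) {
    array_splice($backward, $i, 1);
}
print_r($backward); // empty array
```

The same reasoning suggests iterating from the last index down to 0 when removing all children of a node one by one.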
CODE: $html = file_get_dom('http://www.wikieasy.it');
ERROR: Fatal error: Maximum execution time of 30 seconds exceeded in
C:\Inetpub\wwwroot\uno\ganon.php on line 238
VERSION: Latest version for PHP5
It gives me this error with different sites, such as:
http://www.univpm.it
Thanks
Original issue reported on code.google.com by [email protected]
on 10 Apr 2013 at 2:57
How can I find all text nodes that are not inside element nodes?
For example:
I have this text.
<strong>Hallo!</strong> What <strong>are</strong> <strong>you doing</strong>?
And I want to find only the words "What" and "?". Is it possible?
Original issue reported on code.google.com by [email protected]
on 1 Aug 2013 at 2:09
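One way to get only the text that sits directly between the elements is to select the text nodes that are direct children of the surrounding container. The sketch below deliberately uses PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) instead of ganon, since the equivalent ganon call is not documented in this thread; the wrapper `<div id="wrap">` is an assumption added so the fragment has a single parent.

```php
<?php
// Uses the DOM extension rather than ganon (a swapped-in technique).
$html = '<div id="wrap"><strong>Hallo!</strong> What <strong>are</strong> '
      . '<strong>you doing</strong>?</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$texts = array();
// text() selects only text nodes that are direct children of the wrapper,
// so the contents of the <strong> elements are not included.
foreach ($xpath->query('//div[@id="wrap"]/text()') as $node) {
    $value = trim($node->nodeValue);
    if ($value !== '') {
        $texts[] = $value; // skip whitespace-only nodes
    }
}

print_r($texts); // "What" and "?"
```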
What will reproduce the problem?
Wrap an element, that isn't last among its siblings, with another element.
What is the expected output? What do you see instead?
Expected: Element is wrapped in another element; nothing else.
Actual: Element is wrapped in another element; element and new parent are set
as last child of original parent element.
Which version are you using?
Rev. 72, PHP 5.3.6
Original issue reported on code.google.com by [email protected]
on 27 Sep 2012 at 2:30
How can I determine whether a method is callable or not?
Original issue reported on code.google.com by [email protected]
on 14 Aug 2013 at 1:51
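For the question above, PHP's built-in `is_callable()` and `method_exists()` answer slightly different things: the first respects visibility from the calling scope, the second only checks that the method is defined. A minimal sketch with a hypothetical class `Example`:

```php
<?php
class Example {
    public function visible() {}
    protected function hidden() {}
}

$obj = new Example();

var_dump(is_callable(array($obj, 'visible'))); // bool(true)
var_dump(is_callable(array($obj, 'hidden')));  // bool(false): protected, called from outside
var_dump(method_exists($obj, 'hidden'));       // bool(true): defined, but not callable here
var_dump(is_callable(array($obj, 'missing'))); // bool(false)
```

Note that for a class defining the magic `__call()` method, `is_callable()` reports true for any method name, so `method_exists()` is the stricter check in that case.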
What will reproduce the problem?
$node->getInnerText() returns funny characters in place of html entities.
$node->html() returns entities correctly.
Which version are you using?
Ganon file generated on 20 Oct 2012
Original issue reported on code.google.com by [email protected]
on 12 Dec 2012 at 10:07
Hello,
I'm curious whether the following behavior is supposed to work, or if not is
there some kind of workaround? Thanks!
What will reproduce the problem?
$html = str_get_dom('<div id="a"></div>');
$html('#a', 0)->setInnerText('<div id="b"></div>');
$html('#b', 0)->setInnerText('hello');
echo $html;
What is the expected output?
<div id="a"><div id="b">hello</div></div>
What do you see instead?
<div id="a"><div id="b"></div></div>
Original issue reported on code.google.com by [email protected]
on 13 Apr 2011 at 3:18
$rt="<td>my name somebody</td>";
$html= str_get_dom($rt);
foreach($html('input[class]') as $element) {
echo $element->class;
}
Line number 2 shows an error:
Fatal error: Function name must be a string in
/home/content/18/7124318/html/rkys/geL.php
Original issue reported on code.google.com by [email protected]
on 28 Mar 2012 at 8:10
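"Function name must be a string" is the error that PHP versions before 5.3 typically raise when an object is called like a function, because the `__invoke()` magic method only exists from PHP 5.3 on. That the reporter was running such a version is an assumption. The sketch below uses a hypothetical class `QueryableNode` (not ganon's `HTML_Node`) to show the two equivalent call styles; on older PHP versions only the explicit method call works.

```php
<?php
// Hypothetical stand-in for a parser node; not ganon's HTML_Node.
class QueryableNode {
    private $matches;

    public function __construct(array $matches) {
        $this->matches = $matches;
    }

    // Explicit method call: available on every PHP 5 version.
    public function select($query) {
        return isset($this->matches[$query]) ? $this->matches[$query] : array();
    }

    // __invoke() (PHP 5.3+) is what makes $node('query') possible.
    public function __invoke($query) {
        return $this->select($query);
    }
}

$node = new QueryableNode(array('input[class]' => array('first', 'second')));

$via_method = $node->select('input[class]');
$via_invoke = $node('input[class]'); // requires PHP 5.3 or later

var_dump($via_method === $via_invoke); // bool(true)
```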
PHP Notice: Undefined variable: tag_ns in blah/vendor/ganon.php on line 1177
PHP Notice: Undefined variable: tag_ns in blah/vendor/ganon.php on line 1160
Replacing `$tag_ns` with `$_->tag_ns` solved this for me.
Original issue reported on code.google.com by pushkov.alexander.110
on 15 Mar 2013 at 1:01
Got an infinite loop in function parse(), line 1186, after this call in my script:
$postImages = $html->find('img');
The input was:
<p> </p> <h2 style="text-transform: uppercase; font-weight: normal;
font-size: 17px; color: rgb(241, 214, 143); font-family: 'Trebuchet Ms',
Verdana, Arial, sans-serif; line-height: 17px;"> <em>CEDRO</em></h2> <p>
<em><span style="color: rgb(241, 214, 143); font-family: 'Trebuchet Ms',
Verdana, Arial, sans-serif; line-height: 17px; background-color: rgb(43, 63,
28);">O Cedro é uma das madeiras mais conhecidas, mas pouca gente
já viu a árvore em sí. Ele serviu de suporte para uma das
primeiras manifestações artísticas brasileiras: o Barroco
Guarani.</span></em></p> <p> <br /> <p> <br /> <p> <br /> <h2
style="text-transform: uppercase; font-weight: normal; font-size: 17px; color:
rgb(241, 214, 143); font-family: 'Trebuchet Ms', Verdana, Arial, sans-serif;
line-height: 17px;"> <img
src="http://www.umpedeque.com.br/site_umpedeque/public/img/arvores/cedro_inteiro
.jpg" data_ratio="0.76" style="" data_width="380" data_height="500" /></h2>
</p> </p> </p>
Formatted:
<p> </p>
<h2 style="text-transform: uppercase; font-weight: normal; font-size: 17px;
color: rgb(241, 214, 143); font-family: 'Trebuchet Ms', Verdana, Arial,
sans-serif; line-height: 17px;"> <em>CEDRO</em></h2>
<p>
<em>
<span style="color: rgb(241, 214, 143); font-family: 'Trebuchet Ms', Verdana, Arial, sans-serif; line-height: 17px; background-color: rgb(43, 63, 28);">O Cedro é uma das madeiras mais conhecidas, mas pouca gente já viu a árvore em sí. Ele serviu de suporte para uma das primeiras manifestações artísticas brasileiras: o Barroco Guarani.</span>
</em>
</p>
<p>
<br />
<p>
<br />
<p>
<br />
<h2 style="text-transform: uppercase; font-weight: normal; font-size: 17px; color: rgb(241, 214, 143); font-family: 'Trebuchet Ms', Verdana, Arial, sans-serif; line-height: 17px;"> <img src="http://www.umpedeque.com.br/site_umpedeque/public/img/arvores/cedro_inteiro.jpg" data_ratio="0.76" style="" data_width="380" data_height="500" /></h2>
</p>
</p>
</p>
Original issue reported on code.google.com by [email protected]
on 18 Aug 2013 at 9:11
Setting the disabled attribute of an input tag as follows
$node->disabled = 'disabled';
results in the output: <input disabled="">
NOTE: the value is not inserted.
Whereas if you do the following:
$node->disabled = 1;
the resulting output is: <input disabled="1">
NOTE: the value is inserted.
The problem is the '$this->attributes[$a] !== $a' check at line 337 within the
HTML_Node::toString_attributes() method in gan_node_html.php
This check stops an attribute name being the same as the attribute's value, but
in the case of disabled="disabled" this is required.
This happens in rev72
Original issue reported on code.google.com by [email protected]
on 20 Jul 2012 at 4:04
What will reproduce the problem?
Using ganon in PHP 5.4
What is the expected output? What do you see instead?
No warnings - instead warnings are shown for lines 1160 and 1177 where $tag_ns
is used
Which version are you using?
Ganon single file PHP5 (rev. #78)
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 25 Oct 2013 at 4:31
Hi @all!
So far, I've used phpQuery in my projects. Now I have seen Ganon, and I want to
use it in the future. But I have trouble understanding how Ganon works.
This is what I have done in phpQuery: I want to load some HTML code into my
template and change some attributes.
$index = phpQuery::newDocumentHTML('HTML-Code of the entire page');
$content = phpQuery::newDocumentHTML('Some HTML-Code who has to be in index');
phpQuery::selectDocument($index);
pq('#content')->append($content);
pq('#content a')->attr("href", "chmod")->text("Next");
die ($index);
And now I've tried to do this with Ganon:
$index = str_get_dom('HTML-Code of the entire page');
$content = str_get_dom('Some HTML-Code who has to be in index');
$index->select('#content', 0)->setInnerText($content);
And here this error occurs: "Fatal error: Cannot use object of type HTML_Node as
array"
Could anybody help me with the correct code to do this: load some
HTML code into my template and change some attributes?
This would be great :)
Regards, Steff
Original issue reported on code.google.com by [email protected]
on 19 May 2013 at 2:54
Hi!
I found a little bug in gan_node_html.php at function getNamespace and getTag.
"if ($tag_ns === null) {"
replace to
"if ($this->tag_ns === null) {"
Original issue reported on code.google.com by [email protected]
on 1 Nov 2012 at 11:39
This is example code
from this wiki URL: http://code.google.com/p/ganon/wiki/AccesElements
Please let me know what $node is and how to define it.
Thanks in advance.
// To use a CSS selector query on a node, you simply use the node as a function.
// The result will be stored in an array (of nodes).
$match_array = $node('.myclass');
// To iterate the result, you can use foreach
foreach($match_array as $element) {
echo $element, "<br>\n";
}
// The above can be shortened to the following
foreach($node('.myclass') as $element) {
echo $element, "<br>\n";
}
// Because $element is also a node, you can also perform a query on that node
// and nest queries
foreach($node('.myclass') as $element) {
foreach($element('.myotherclass') as $new_element) {
echo $new_element, "<br>\n";
}
}
// If you know which element of the array you
// are going to need, you can pass an index to the function
$a = $node('a', 2);
// A negative index will start counting from the end of the array
$a = $node('a', -1);
Original issue reported on code.google.com by [email protected]
on 4 Jan 2013 at 7:26
Some sites serve content differently, or not at all *cough*Facebook*, depending
on the user_agent string passed in the headers. As the function is now, the
user would have to hard-code these extra parameters in. If you change the
file_get_dom function to the following:
function file_get_dom($file, $return_root = true, $use_include_path = false, $context = null) {
$f = file_get_contents($file, $use_include_path, $context);
return (($f === false) ? false : str_get_dom($f, $return_root));
}
You could set this on a case-by-case basis like so:
$opts = array('http' => array('method' => 'GET', 'header' => "Accept-language: en\r\n", 'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16'));
$context = stream_context_create($opts);
$tmp_discussion_html = file_get_dom($some_url, true, false, $context);
Original issue reported on code.google.com by [email protected]
on 29 Apr 2011 at 10:45
The parser seems to be inserting a <br> tag any time I have a span followed by
a line break.
Ex: When presented with the following HTML
<span>
1
</span>
I expect the following (when using $html->select('body')):
<span>
1
</span>
Instead all <span> with breaks are replaced with:
<span><br/>
1<br/>
</span><br/>
Using version. 72
Original issue reported on code.google.com by [email protected]
on 16 Jul 2012 at 5:44
What will reproduce the problem?
Grabbing the DOM of some sites just doesn't seem to work. Here's one that
fails for me: http://www.hisradio.com
What is the expected output? What do you see instead?
I expect it to grab the DOM, like when I use http://www.google.com
Which version are you using?
Latest, using php 5.3
Please provide any additional information below.
I'm thinking there are server settings that disallow php access, possibly in a
robots.txt file or something along those lines. Am I missing something?
Original issue reported on code.google.com by [email protected]
on 31 Aug 2013 at 3:22
What will reproduce the problem?
Calling getPlainText() on an element that isn't found.
What is the expected output? What do you see instead?
Would love this to return blank or false or something instead of a fatal error
Which version are you using?
rev72
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 5 Jun 2012 at 5:04
Hi (again)
This is just a suggestion for improvement.
I am writing some scrapers to get data from several websites, and I'm concerned about
the possibility of
changes in the structure of the sites that I'm scraping.
My scrapers do systematic (and unattended) work, so I always need to check
whether all the tags that I
expect are in the web page, and log the result for later analysis.
With this in mind, I can never chain several operations (select,
getPlainText, etc.) because if any of
the selects returns null, the script crashes with the error:
Fatal error: Call to a member function getPlainText() on a non-object in ...
Sometimes I call select just to test whether a node is present (for example,
to test whether the div with id
"LastMinuteOffer" is present).
In this case, I don't chain calls, I just do:
$t1=$html->select('div#LastMinuteOffer',0);
if ($t1){
//There is a last-minute offer...
}
But sometimes I just want to get the text of a particular node, so in some
cases I chain several
calls into one, something like this:
$MovieTitle=$html->select('h3.title a.title',0)->getPlainText();
In this case, if the select fails, it returns null, so getPlainText() triggers
the error:
Fatal error: Call to a member function getPlainText() on a non-object in ...
and the script fails.
This circumstance forces me not to chain anything and to test everything,
with nasty code like this:
$t1=$html->select('h3.title a.title',0);
if (!$t1) {$TheError='Fail in Movie Title'; return false; }
$MovieTitle=$t1->getPlainText();
I have written a new function to improve my code; perhaps someone else is
interested in it:
select_imperative
With this function, I can chain as much as I want without danger of errors, and I
can catch the exception if any of the
selects fails.
I can do something like:
try {
$MovieTitle=$html->select_imperative('h3.title a.title',0)->getPlainText();
} catch(Exception $e) {
$TheError='Fail in Movie Title: '.$e->getMessage()."\n";
return false; //Return with error
}
return true; //Return All ok
Or I can group all the error handling into one block:
try {
$MovieTitle=$html->select_imperative('h3.title a.title',0)->getPlainText();
$Author=$html->select_imperative('span.author',0)->getPlainText();
$Date=$html->select_imperative('span.date',0)->getPlainText();
$Format=$html->select_imperative('span.format',0)->getPlainText();
} catch(Exception $e) {
$TheError='Error scrapping Movie: '.$e->getMessage();
return false; //Return with error
}
return true; //Return All ok
With this I reduce my code a lot.
In the class HTML_Node:
function select_imperative($query = '*', $index = false, $recursive = true, $check_self = false) {
if ( ($rv=$this->select($query,$index,$recursive, $check_self)) == null){
throw new Exception('Null query in select: '.$query);
} else return $rv;
}
and, in the class HTML_Parser:
function select_imperative($query = '*', $index = false, $recursive = true, $check_self = false) {
return $this->root->select_imperative($query, $index, $recursive, $check_self);
}
Regards!
Original issue reported on code.google.com by [email protected]
on 21 Sep 2012 at 6:26
I'm scraping a web page in ISO-8859-1, but my scripts work in UTF-8 (PHP
code, MySQL databases, etc.), so if I get the text of a node, getPlainText()
returns the text in ISO-8859-1 (the charset of the loaded HTML) and I can't make
equality comparisons in my code.
I solved this (for this particular case) by converting to UTF-8 in the
getPlainText implementation:
function getPlainText() {
return preg_replace('`\s+`', ' ', utf8_encode( html_entity_decode($this->toString(true, true, true), ENT_QUOTES) ));
}
But I'm thinking: what about automatic detection of the loaded HTML
encoding and an option to set the charset for the result strings of
getPlainText()?
It's just an idea O:)
Original issue reported on code.google.com by [email protected]
on 6 Sep 2012 at 7:50
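The idea above (detect the page charset, then convert the extracted text) can be sketched without touching ganon. The helper names `detect_meta_charset` and `to_utf8` are hypothetical, the detection only looks at a `<meta>` tag and ignores HTTP headers, and `iconv()` is used in place of `utf8_encode()` so that source charsets other than ISO-8859-1 are handled as well.

```php
<?php
// Hypothetical helper: read the charset declared in a <meta> tag.
function detect_meta_charset($html, $default = 'ISO-8859-1') {
    if (preg_match('`<meta[^>]+charset\s*=\s*["\']?([\w-]+)`i', $html, $m)) {
        return $m[1];
    }
    return $default;
}

// Hypothetical helper: convert text from the page charset to UTF-8.
function to_utf8($text, $source_charset) {
    if (strcasecmp($source_charset, 'UTF-8') === 0) {
        return $text; // already in the target charset
    }
    $converted = iconv($source_charset, 'UTF-8', $text);
    return ($converted === false) ? $text : $converted;
}

$page = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
      . "<p>caf\xE9</p>";

$charset = detect_meta_charset($page);
echo $charset, "\n";                              // iso-8859-1
echo bin2hex(to_utf8("caf\xE9", $charset)), "\n"; // 636166c3a9
```

In a scraper, `to_utf8()` would be applied to the string returned by `getPlainText()`, so the library itself does not need to be modified.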
What will reproduce the problem?
$result = str_get_dom('<input name="name"/>');
echo $result->toString(true, true, 1);
What is the expected output? What do you see instead?
Expected output is:
<input name="name"/>
Actual output is:
<input name />
Which version are you using?
not sure but it says at the top of file:
* Ganon single file version - PHP5+ version
* Generated on 20 Oct 2012
Original issue reported on code.google.com by [email protected]
on 17 Apr 2013 at 9:43
It is not exactly an issue, but I'd like to use this beautiful parser with
cURL. How can I use this function with your class?
function get_web_page( $url )
{
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle compressed
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
}
Original issue reported on code.google.com by [email protected]
on 24 Oct 2012 at 10:34
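Since `get_web_page()` above returns the downloaded HTML in `$header['content']`, one way to combine it with the parser is to pass that string to `str_get_dom()`. The following sketch wraps that step in a hypothetical glue function `parse_fetched_page`; the parser is passed in as a callable so that the example runs without cURL or ganon, and the stand-in data at the end is made up for illustration.

```php
<?php
// Hypothetical glue function: $page is the array returned by get_web_page(),
// $parser is any callable that accepts an HTML string (for ganon that would
// be 'str_get_dom').
function parse_fetched_page(array $page, $parser) {
    if ($page['errno'] !== 0 || !is_string($page['content'])) {
        return false; // transfer failed; details are in $page['errmsg']
    }
    return call_user_func($parser, $page['content']);
}

// Intended usage with ganon loaded (not executed here):
// $dom = parse_fetched_page(get_web_page('http://example.com/'), 'str_get_dom');

// Stand-in data so the sketch runs on its own.
$fake_page = array('errno' => 0, 'errmsg' => '', 'content' => '<p>Hi</p>');
echo parse_fetched_page($fake_page, 'strtoupper'), "\n"; // <P>HI</P>
```

Checking `errno` before parsing avoids handing `false` to the parser when `curl_exec()` fails.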
In gan_node_html.php line:
$class = reg_replace('`\b'.preg_quote($c).'\b`si', '', $class);
should be:
$class = preg_replace('`\b'.preg_quote($c).'\b`si', '', $class);
Original issue reported on code.google.com by [email protected]
on 16 Jun 2013 at 12:41
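Separately from the typo, note that `\b` treats a hyphen as a word boundary, so a pattern of this shape would also match `btn` inside `btn-primary`. A token-based alternative is sketched below; `remove_class()` is a hypothetical helper, not ganon's implementation:

```php
<?php
// Hypothetical helper: remove one class name from a class attribute value.
function remove_class($classAttr, $name) {
    $kept = array();
    foreach (preg_split('`\s+`', trim($classAttr)) as $c) {
        // Compare whole tokens, case-insensitively (mirrors the `i` flag above).
        if ($c !== '' && strcasecmp($c, $name) !== 0) {
            $kept[] = $c;
        }
    }
    return implode(' ', $kept);
}
```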
What will reproduce the problem?
$html = str_get_dom("<div>&nbsp;&gt;&lt;</div>");
echo $html;
What is the expected output? What do you see instead?
I expect the raw source code to be output, the same as what I put in. Instead,
I get:
<div>�><</div>
Basically, it's the same as what I put in, but run through
html_entity_decode(). Is there some way to get raw html?
Which version are you using?
PHP 5.3.3-7, ganon rev #69
Original issue reported on code.google.com by [email protected]
on 1 Mar 2012 at 4:35
The nth-child() selector expects 1 for the first element, but ganon starts
counting from 0.
Original issue reported on code.google.com by [email protected]
on 15 Mar 2013 at 1:40
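For reference, CSS counts siblings from 1 in `:nth-child(an+b)`. A minimal sketch of the matching rule for a parser that stores children at 0-based positions; `matches_nth_child()` is a hypothetical name, not a ganon function:

```php
<?php
// Hypothetical helper: does the child stored at 0-based position $index
// match :nth-child(an+b)?
function matches_nth_child($index, $a, $b) {
    $pos = $index + 1; // CSS positions start at 1
    if ($a === 0) {
        return $pos === $b; // plain :nth-child(b)
    }
    // A non-negative integer n with a*n + b == pos must exist.
    $n = ($pos - $b) / $a;
    return $n >= 0 && $n == floor($n);
}
```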
The following functions are meant to return all children in the DOM object.
However, when there is text between the nested tags, a child is sometimes
missed. For example, <div>Hello<span>world</span></div> will miss the span data.
Also, the ability to dump the DOM into a JSON object, as done below, would be
a nice feature.
function get_all_children($el) {
    $output = array();
    $row = array(
        'name' => $el->getTag(),
        'raw' => $el->getInnerText()
    );
    for ($i = 0; $i < $el->childCount(); $i++) {
        $row['children'] = get_all_children($el->getChild($i));
    }
    foreach($el->attributes as $attr => $value) {
        $row['attribs'] = array(
            $attr => $value
        );
    }
    array_push($output, $row);
    return $output;
}
function get_dom_array($html, $selector) {
    $output = array();
    foreach($html($selector) as $el) {
        $row = array(
            'name' => $el->getTag(),
            'raw' => $el->getInnerText()
        );
        for ($i = 0; $i < $el->childCount(); $i++) {
            $row['children'] = get_all_children($el->getChild($i));
        }
        foreach($el->attributes as $attr => $value) {
            $row['attribs'] = array(
                $attr => $value
            );
        }
        array_push($output, $row);
    }
    return $output;
}
$html = str_get_dom('<html><body><div>Hello World</div></body></html>');
$dom_array = get_dom_array($html, 'div');
echo json_encode($dom_array);
Original issue reported on code.google.com by [email protected]
on 18 Oct 2012 at 1:47
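One possible cause lies in the functions themselves: `$row['children']` and `$row['attribs']` are assigned inside the loops, so each iteration replaces the value stored by the previous one. The sketch below appends instead. `StubNode` is a hypothetical stand-in exposing only the members used above, and `#text` is an arbitrary placeholder tag for text nodes:

```php
<?php
// Hypothetical stand-in for a parser node (not ganon's class).
class StubNode {
    public $attributes;
    private $tag;
    private $text;
    private $children;

    public function __construct($tag, $text, array $children = array(), array $attributes = array()) {
        $this->tag = $tag;
        $this->text = $text;
        $this->children = $children;
        $this->attributes = $attributes;
    }
    public function getTag() { return $this->tag; }
    public function getInnerText() { return $this->text; }
    public function childCount() { return count($this->children); }
    public function getChild($i) { return $this->children[$i]; }
}

// Convert a node and all of its descendants into a nested array.
function node_to_array($el) {
    $row = array(
        'name' => $el->getTag(),
        'raw' => $el->getInnerText(),
        'attribs' => array(),
        'children' => array(),
    );
    for ($i = 0; $i < $el->childCount(); $i++) {
        $row['children'][] = node_to_array($el->getChild($i)); // append, don't overwrite
    }
    if (is_array($el->attributes)) {
        foreach ($el->attributes as $attr => $value) {
            $row['attribs'][$attr] = $value; // keep every attribute
        }
    }
    return $row;
}

// <div class="a">Hello<span>world</span></div>, built by hand.
$div = new StubNode('div', 'Hello<span>world</span>', array(
    new StubNode('#text', 'Hello'),
    new StubNode('span', 'world', array(new StubNode('#text', 'world'))),
), array('class' => 'a'));

$tree = node_to_array($div);
echo json_encode($tree), "\n";
```

Both the text node and the span end up in `children`, so the nested structure survives json_encode().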
What will reproduce the problem?
If you insert an element (such as a new input field) and then run a select
that should match it (such as "input"), the new child doesn't show up in the
select results.
What is the expected output? What do you see instead?
The inserted item should be included in future results.
Which version are you using?
* Ganon single file version - PHP5+ version
* Generated on 24 Mar 2012
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 11 Sep 2012 at 9:02
What will reproduce the problem?
What is the expected output? What do you see instead?
Which version are you using?
Generated on 20 Oct 2012
Please provide any additional information below.
array(4) {
["type"]=>
int(1)
["message"]=>
string(81) "Allowed memory size of 134217728 bytes exhausted (tried to allocate 114078 bytes)"
["file"]=>
string(69) "libs/pquery/ganon.php"
["line"]=>
int(238)
}
Original issue reported on code.google.com by [email protected]
on 31 Jul 2013 at 3:08
What will reproduce the problem?
include( 'ganon.php' );
$html = str_get_dom( '<html><body>foo bar<p>foobar</p><?php echo "foobar";
?></body></html>' );
echo $html;
What is the expected output? What do you see instead?
Expected:
<html><body>foo bar<p>foobar</p><?php echo "foobar"; ?></body></html>
Got:
PHP Warning: Invalid argument supplied for foreach()
<html><body>foo bar<p>foobar</p><?php echo "foobar"; ?></body></html>
Which version are you using?
rev78
Please provide any additional information below.
Easy fix; in function toString_attributes( ) surround:
foreach($this->attributes as $a => $v) {
$s .= ' '.$a.(((!$this->attribute_shorttag) || ($this->attributes[$a] !== $a)) ? '="'.htmlspecialchars($this->attributes[$a], ENT_QUOTES,$
}
with:
if(is_array($this->attributes)){
...
}
Original issue reported on code.google.com by [email protected]
on 2 Sep 2013 at 10:52
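The guard can be illustrated in isolation. This is a simplified, hypothetical stand-in for toString_attributes(), without the short-tag handling of the real method:

```php
<?php
// Hypothetical, simplified attribute renderer (not ganon's actual code).
function render_attributes($attributes) {
    $s = '';
    // Nodes without an attribute array (e.g. a <?php ... ?> block) are skipped
    // instead of triggering "Invalid argument supplied for foreach()".
    if (is_array($attributes)) {
        foreach ($attributes as $a => $v) {
            $s .= ' ' . $a . '="' . htmlspecialchars($v, ENT_QUOTES) . '"';
        }
    }
    return $s;
}
```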
I've been using your excellent DOM parser for a project of mine recently, and
came across this bug:
In the latest version (r55), consider the following to be part of the HTML
input:
<b>0</b> zero
<b>1</b> one
<b>2</b> two
The output generated is then:
<b></b> zero
<b>1</b> one
<b>2</b> two
This is because of the following line in parse_text():
if ($this->status['text']) {
which needs to be
if ($this->status['text'] !== "") {
Because the string "0" evaluates to false in a boolean context, the text
content of the first <b> tag never gets saved.
Original issue reported on code.google.com by [email protected]
on 30 Mar 2011 at 3:53
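The difference between the two checks can be reproduced outside the parser. A minimal sketch, with illustrative variable names:

```php
<?php
$text = "0";

// Loose truthiness check: the string "0" is falsy in PHP.
$loose = $text ? 'kept' : 'dropped';

// Strict comparison against the empty string keeps "0".
$strict = ($text !== "") ? 'kept' : 'dropped';

echo $loose, ' / ', $strict, "\n"; // prints "dropped / kept"
```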