This project has been unmaintained since 2016 and should not be used.
The original README is available for historical reference.
An HTML5 parsing library in pure C99
License: Apache License 2.0
As well as tracy-e's bindings, I've also written an alternative binding for Objective-C: https://github.com/programmingthomas/ObjectiveGumbo (we were both working on them at the same time). Whilst covering the basic features of Gumbo, mine also aims to provide utility functions for selecting elements in the tree by tag, class and ID (I eventually intend to add full jQuery-like selectors).
I'd recommend using something like cffi instead, as this allows compatibility with both CPython and PyPy, is quicker with PyPy than ctypes, and allows one to rely upon the API rather than the ABI when only API stability is guaranteed.
Is there anything a contributor could do to help make the error reporting API public? Any tips on issues I might come across if I just go ahead and try? I'd like to use Gumbo as the basis for a lint tool, so it seems helping with this would be a good start.
I've seen that the template element has not been implemented yet. So my question is: Do you have plans to integrate it? ... and if so: When?
Thanks!
Since Ubuntu 6.10, the default system shell, /bin/sh, has been changed from bash to dash. This change makes autogen.sh complain "[: unexpected operator".
One possible solution for Ubuntu users is to run sudo dpkg-reconfigure dash and choose <No>. This will make bash the default system shell again (check with ls /bin/sh -al).
Maybe we need to mention this in the Installation section of README.md.
(I only tested on Ubuntu 12.04 LTS.)
Hi,
First of all, thank you very much for this html parser source code! It is very useful.
I guess this parser is not intended to be robust on Windows (yet?), judging from some comments in the source code. Nevertheless, after some warnings it compiles and parses a simple HTML file, with lots of errors (due to an ill-formed input file; under Windows, UTF-8 files tend to be saved with a BOM signature).
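One way a caller could work around the BOM problem described above is to skip the three BOM bytes before handing the buffer to the parser. This is a minimal sketch; skip_utf8_bom is a hypothetical helper, not part of Gumbo's API, and the thread does not establish whether any Gumbo version already does this internally.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Gumbo): returns a pointer past a leading
 * UTF-8 byte order mark (EF BB BF) if one is present, and shrinks *length
 * to match. The buffer is returned unchanged when there is no BOM. */
const char *skip_utf8_bom(const char *buffer, size_t *length) {
    static const unsigned char bom[] = {0xEF, 0xBB, 0xBF};
    if (*length >= sizeof(bom) && memcmp(buffer, bom, sizeof(bom)) == 0) {
        *length -= sizeof(bom);
        return buffer + sizeof(bom);
    }
    return buffer;
}
```

The adjusted pointer and length can then be passed to gumbo_parse_with_options, which takes an explicit buffer length.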
So I've decided to try to make it work under Windows (Visual Studio 2013), and while trying to figure out the reasons behind the errors, I've spotted this piece of code that doesn't seem to work as you expect (under VC of course), in print_error() in error.c:
int bytes_written = vsnprintf(output->data + output->length,
                              remaining_capacity, format, args);
if (bytes_written > remaining_capacity) {
  gumbo_string_buffer_reserve(
      parser, output->capacity + bytes_written, output);
  remaining_capacity = output->capacity - output->length;
  bytes_written = vsnprintf(output->data + output->length,
                            remaining_capacity, format, args);
}
According to the MSDN documentation, vsnprintf returns a negative value if the number of bytes it wants to write is greater than the number it can actually write into the provided buffer. This differs from the standard, whose behavior this piece of code appears to assume. In addition, the function doesn't account for an error returned by vsnprintf, in which case a negative value is returned by the standard implementation too.
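A sketch of an append routine that tolerates both return conventions might look like the following. StrBuf and the strbuf_* functions are hypothetical stand-ins for Gumbo's string buffer, not its real API. The sketch treats a negative result as "grow and retry" rather than as a byte count, compares with "less than" so that room for the terminating NUL is counted, and uses va_copy because a va_list cannot safely be reused after vsnprintf has consumed it.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable buffer, standing in for GumboStringBuffer. */
typedef struct {
    char *data;
    size_t length;
    size_t capacity;
} StrBuf;

/* Appends formatted text to buf. Returns 0 on success, -1 on failure.
 * Retries are capped so that a genuine formatting error, which also yields
 * a negative result, cannot grow the buffer forever. */
int strbuf_vappendf(StrBuf *buf, const char *format, va_list args) {
    for (int attempt = 0; attempt < 16; ++attempt) {
        size_t remaining = buf->capacity - buf->length;
        int written = -1;
        if (remaining > 0) {
            va_list copy;
            va_copy(copy, args); /* keep args intact for a possible retry */
            written = vsnprintf(buf->data + buf->length, remaining, format, copy);
            va_end(copy);
        }
        if (written >= 0 && (size_t)written < remaining) {
            buf->length += (size_t)written; /* text and NUL both fit */
            return 0;
        }
        /* C99: written is the length needed. Negative result: just double. */
        size_t needed = written >= 0
                            ? buf->length + (size_t)written + 1
                            : (buf->capacity ? buf->capacity * 2 : 64);
        char *grown = realloc(buf->data, needed);
        if (grown == NULL) return -1;
        buf->data = grown;
        buf->capacity = needed;
    }
    return -1;
}

/* Variadic convenience wrapper around strbuf_vappendf. */
int strbuf_appendf(StrBuf *buf, const char *format, ...) {
    va_list args;
    va_start(args, format);
    int rc = strbuf_vappendf(buf, format, args);
    va_end(args);
    return rc;
}
```

Starting from a zero-initialized StrBuf, repeated strbuf_appendf calls grow the buffer as needed; the caller frees buf.data when done.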
If you plan to support Windows compilers, I will keep posting similar issues whenever I find them. Otherwise, once I get it working on at least a simple HTML file, I might submit a pull request, if you prefer.
Thanks,
Fabio
It seems Gumbo doesn't parse custom tags.
For example:
<background></background>
Gumbo will return an empty tagName and nodeName.
Is this expected?
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>bug</title>
</head>
<body>
<math id="bug">
  <![CDATA[x<y]]>
</math>
</body>
</html>
run result
bug.element.children.type = GUMBO_NODE_TEXT
Why is it not GUMBO_NODE_CDATA?
gumbo-parser cannot be installed on Mac OS X 10.8.
Running ./configure reports "C++ compiler cannot create executables".
Hi,
<!--This is a comment-->
<h1>hello world!</h1>
After parsing the above HTML, the GUMBO_NODE_COMMENT has no newline character at the end of node->v.text.original_text, and there is no GUMBO_NODE_WHITESPACE node between the comment and the h1 tag.
https://github.com/google/gumbo-parser/blob/master/src/gumbo.h#L369
says:
what does generatic mean? Perhaps you meant "generic" instead?
Namespaced attributes are not correctly handled:
#include <stdio.h>
#include <gumbo.h>

int
main(int argc, char **argv) {
  const char *html = "<html xml:lang=\"en\"></html>";
  GumboOutput *output;
  GumboElement *html_element;
  GumboAttribute *attribute;

  output = gumbo_parse(html);
  html_element = &output->root->v.element;
  attribute = html_element->attributes.data[0];
  printf("document %s\n", html);
  printf("attribute name %s\n", attribute->name);
  printf("attribute namespace %d\n", attribute->attr_namespace);
  gumbo_destroy_output(&kGumboDefaultOptions, output);
  return 0;
}
Result:
document <html xml:lang="en"></html>
attribute name xml:lang
attribute namespace 0
Expected:
document <html xml:lang="en"></html>
attribute name lang
attribute namespace 2
Currently benchmark.cc calls clock_gettime with CLOCK_PROCESS_CPUTIME_ID as the first parameter.
This is not available on Mac OS X (or Windows, AFAIK), and results in a failure to compile.
I'm considering sending a pull request, but am not sure exactly how to fix this.
Using #ifdefs to handle different platforms is something I tend to avoid at all costs.
Considering C++11 is an option here, using std::chrono::high_resolution_clock could help, but that would measure wall time and not CPU time.
std::clock could be a cross-platform way to implement this, but seems to have less resolution than clock_gettime with CLOCK_PROCESS_CPUTIME_ID.
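For illustration, the std::clock option could be sketched as below using the ISO C clock() function, which std::clock wraps. cpu_seconds and sum_workload are hypothetical names, not code from benchmark.cc. One caveat to check before relying on it: Microsoft's C runtime documents clock() as measuring elapsed wall-clock time rather than CPU time, so the result is not strictly comparable across platforms.

```c
#include <time.h>

/* Hypothetical helper: runs fn(arg) `iterations` times and returns the
 * processor time consumed, in seconds, as reported by ISO C clock().
 * Returns -1.0 if the processor time is unavailable. */
double cpu_seconds(void (*fn)(void *), void *arg, int iterations) {
    clock_t start = clock();
    if (start == (clock_t)-1) return -1.0;
    for (int i = 0; i < iterations; ++i) fn(arg);
    clock_t end = clock();
    if (end == (clock_t)-1) return -1.0;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Example workload: adds the integers 0..999 into the counter behind arg. */
void sum_workload(void *arg) {
    volatile unsigned long *total = arg;
    for (unsigned long i = 0; i < 1000UL; ++i) *total += i;
}
```

The resolution is limited to 1/CLOCKS_PER_SEC, which matches the concern raised above; running many iterations per measurement is the usual way to compensate.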
In the gumbo.h file,
struct _GumboNode {
  /** The type of node that this is. */
  GumboNodeType type;
  GumboNode* parent;
  size_t index_within_parent;
  GumboParseFlags parse_flags;
  /** The actual node data. */
  union {
    GumboDocument document;  // For GUMBO_NODE_DOCUMENT.
    GumboElement element;    // For GUMBO_NODE_ELEMENT.
    GumboText text;          // For everything else.
  } v;
};
the union's name 'v' is ambiguous. How about a more descriptive name (e.g. data)?
This tag was recently added to HTML. It's got some...interesting parsing rules. Not sure what to link to, but at the least there's an "in template" insertion mode: http://www.whatwg.org/specs/web-apps/current-work/#parsing-main-intemplate
Hello! Thanks for your library. How can I get the inline JavaScript code (the example is below) from a script element?
Hey guys!
I've got some node bindings going here:
https://github.com/karlwestin/node-gumbo-parser
https://npmjs.org/package/gumbo-parser
if you like it, i'd love to have a link from the readme
Thanks!
Hi, just cloned gumbo-parser and ran the tests with gtests according to the README. There was a malloc error in one of the tests:
[----------] 28 tests from GumboTokenizerTest
[ RUN ] GumboTokenizerTest.HtmlTagIncludesAllTags
gumbo_test(68406) malloc: *** error for object 0x7f9f92410007: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
/bin/sh: line 1: 68406 Abort trap: 6 ${dir}$tst
FAIL: gumbo_test
==================================
1 of 1 test failed
Please report to [email protected]
==================================
This is on OS X 10.8.
The GumboElement documentation seems to imply not, and I can't think of a reason to do so; but I'd like to be sure, as it's relevant for my method of handling unknown tags in my wrapper.
I would like to point out that identifiers like "_GumboTokenizerError" and "_GumboParser" do not fit the naming conventions of the C language standard, which reserves identifiers that begin with an underscore followed by an uppercase letter.
Would you consider renaming them?
In the parser, there is a comment:
// NOTE(jdtang): Gumbo handles only UTF-8, so the encoding clause of the
// spec doesn't apply. If clients want to handle meta-tag re-encoding, they
// should specifically look for that string in the document and re-encode it
// before passing to Gumbo.
This UTF-8 only limitation should be documented. However, it is also true that to look for that string the client would need to implement a fair bit of the parser, to ensure they don't pick up something that looks like a meta element in a comment or attribute value or RAWTEXT element, etc.
Ideally, support would be added for other encodings (if only optionally, though perhaps the default Windows-1252 should be built in). The parser would probably take an encoding argument for when a higher level provides encoding metadata (be it HTTP's Content-Type or a database of purely UTF-8 content), and would do all the encoding detection itself when none is specified. Would there be any interest in adding such behaviour?
I only need the information in the head element, but the parser has to go through the whole document before I can retrieve it. It would be really useful to have a method that parses only the head.
On line 138 of error.c in gumbo_add_error() the test should read:
if (max_errors >= 0 && parser->_output->errors.length >= max_errors) {
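The proposed condition can be read as the negation of a small predicate, sketched here in isolation. may_add_error is a hypothetical name, not a Gumbo function; the sketch assumes, as the proposed test implies, that a negative max_errors means "no limit". Checking the sign first keeps the comparison from depending on signed-to-unsigned conversion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: returns true if another error may be recorded.
 * A negative max_errors is treated as "no limit" (an assumption drawn from
 * the proposed check); otherwise at most max_errors errors are kept. */
bool may_add_error(int max_errors, size_t recorded) {
    return max_errors < 0 || recorded < (size_t)max_errors;
}
```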
It makes it easier to ensure everyone has the right version and that the pull has been vetted.
In reality, what you're doing with html5lib is walking the tree, and it's entirely possible to use this as the hook into the html5lib testsuite (see html5lib/tests/test_treewalker.py in [html5lib/html5lib-python]). For the sake of providing API compatibility with html5lib.parse, html5lib should have some treewalker-to-treebuilder API (this is on my mental to-do list and I will shortly open an issue over in that project).
The original_text field of text nodes is incorrect:
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <gumbo.h>

static void copy_string_piece(char *buf, size_t sz, GumboStringPiece piece);

int
main(int argc, char **argv) {
  const char *html = "<html foo=\"42\"><body>foo</body></html>";
  char buf[128];
  GumboOutput *output;
  GumboNode *node;
  GumboElement *html_element, *body_element;
  GumboText *text;

  output = gumbo_parse(html);
  html_element = &output->root->v.element;
  node = html_element->children.data[1];
  assert(node->type == GUMBO_NODE_ELEMENT);
  body_element = &node->v.element;
  node = body_element->children.data[0];
  assert(node->type == GUMBO_NODE_TEXT);
  text = &node->v.text;
  copy_string_piece(buf, sizeof(buf), text->original_text);
  printf("text %s\n", text->text);
  printf("original text %s\n", buf);
  gumbo_destroy_output(&kGumboDefaultOptions, output);
  return 0;
}

static void
copy_string_piece(char *buf, size_t sz, GumboStringPiece piece) {
  assert(piece.length < sz);
  memcpy(buf, piece.data, piece.length);
  buf[piece.length] = '\0';
}
Result:
text foo
original text foo</body></html>
Expected:
text foo
original text foo
Hi,
It would be really helpful to get the algorithm [1] implemented in gumbo. The parser doesn't deal with anything besides UTF-8, but still has most bits in place to implement the algo.
The C library installation doesn't seem to give any errors. However, .libs/libgumbo.so does not exist.
Line 11 in 2db1796
Because of this I am getting the following error when I try to import gumbo in python3. Any suggestions? I am on Mac OSX 10.9 with gcc --version:
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.1.0
Thread model: posix
import gumbo
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/gumbo-0.9.1-py3.3.egg/gumbo/gumboc.py", line 32, in <module>
    os.path.dirname(__file__), '..', '..', '.libs', 'libgumbo.so'))
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ctypes/__init__.py", line 431, in LoadLibrary
    return self._dlltype(name)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/gumbo-0.9.1-py3.3.egg/gumbo/../../.libs/libgumbo.so, 6): image not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/gumbo-0.9.1-py3.3.egg/gumbo/__init__.py", line 33, in <module>
    from gumbo.gumboc import *
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/gumbo-0.9.1-py3.3.egg/gumbo/gumboc.py", line 36, in <module>
    os.path.dirname(__file__), 'libgumbo.so'))
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ctypes/__init__.py", line 431, in LoadLibrary
    return self._dlltype(name)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/gumbo-0.9.1-py3.3.egg/gumbo/libgumbo.so, 6): image not found
Gumbo's readme contains the following scary warning under "Non-Goals":
Security. Gumbo was initially designed for a product that worked with trusted input files only. We're working to harden this and make sure that it behaves as expected even on malicious input, but for now, Gumbo should only be run on trusted input or within a sandbox.
I was wondering if you could clarify this. Is the implication that Gumbo may be vulnerable to buffer overflows or similar attacks? The readme also says Gumbo was tested on billions of pages from Google's index, which seems to imply that it at least handled that untrusted input well.
In other words, how paranoid should I be about this? What steps would be involved in hardening Gumbo, and how might contributors help?
Hi,
Is there a helper that returns the tag name [+ length] for an element?
At the moment it looks like I have to either use original_tag or create a map from GumboTag to tag name. original_tag wouldn't work for inserted elements, and the map wouldn't work for unknown tags.
There seems to be a bug in libgumbo that allows duplicate attributes into the parse tree. Running the html5lib tree-construction tests via the lua-gumbo test runner gives the following output:
============================================================================
test/html5lib-tests/tree-construction/isindex.dat:16: Test 2 failed
============================================================================
Input:
<isindex name="A" action="B" prompt="C" foo="D">
Expected:
| <html>
| <head>
| <body>
| <form>
| action="B"
| <hr>
| <label>
| "C"
| <input>
| foo="D"
| name="isindex"
| <hr>
Received:
| <html>
| <head>
| <body>
| <form>
| action="B"
| <hr>
| <label>
| "C"
| <input>
| foo="D"
| name="A"
| name="isindex"
| <hr>
============================================================================
test/html5lib-tests/tree-construction/tests19.dat:137: Test 12 failed
============================================================================
Input:
<!doctype html><isindex name="foo">
Expected:
| <!DOCTYPE html>
| <html>
| <head>
| <body>
| <form>
| <hr>
| <label>
| "This is a searchable index. Enter search keywords: "
| <input>
| name="isindex"
| <hr>
Received:
| <!DOCTYPE html>
| <html>
| <head>
| <body>
| <form>
| <hr>
| <label>
| "This is a searchable index. Enter search keywords: "
| <input>
| name="foo"
| name="isindex"
| <hr>
============================================================================
test/html5lib-tests/tree-construction/tests2.dat:519: Test 43 failed
============================================================================
Input:
<isindex test=x name=x>
Expected:
| <html>
| <head>
| <body>
| <form>
| <hr>
| <label>
| "This is a searchable index. Enter search keywords: "
| <input>
| name="isindex"
| test="x"
| <hr>
Received:
| <html>
| <head>
| <body>
| <form>
| <hr>
| <label>
| "This is a searchable index. Enter search keywords: "
| <input>
| name="isindex"
| name="x"
| test="x"
| <hr>
Ran 1335 tests in 0.10s
Passed: 1217
Failed: 3
Skipped: 115
If I run the same tests via html5lib_adapter_test.py, everything passes, but it seems to be because the duplicate name attributes shown above are lost in the conversion to a Python dict.
Question: Is an implementation of the DOM library (http://www.w3.org/TR/DOM-Level-1/) a goal for this project? If so, would pull requests for an implementation of DOM be welcome, or is this something you want to handle internally?
I'm curious - does anyone know of good analytics/tracking features on GitHub? Right now, I can see the number of watchers/stars/forks and their activity, but I have no idea about bounce rate, search terms, what potential users are looking for, whether/why they give up, what fraction of lurkers end up contributing, etc. Does GitHub provide answers for any of these? Are there third-party solutions I could use?
I saw a couple references to products that require putting a beacon in the README.md file, but I don't know how reputable they are. Would folks get uncomfortable with tracking beacons like that?
I have a use case where I only need to tokenize an HTML document up to a certain point, not build a full parse tree. For this I'd like to propose the addition of a public tokenizer API to Gumbo.
Here is a first cut of what such an API could look like:
typedef struct {
  /* encapsulates a struct GumboInternalParser */
  void* _parser;
} GumboTokenizer;

GumboTokenizer* gumbo_tokenizer_init(
    const char *buffer, size_t buffer_length);
/* a _with_options variant can be provided here */
bool gumbo_tokenizer_lex(
    GumboTokenizer* tokenizer, GumboToken* output);
void gumbo_tokenizer_destroy(GumboTokenizer* tokenizer);
void gumbo_tokenizer_token_destroy(
    GumboTokenizer* tokenizer, GumboToken* token);
Example usage:
GumboTokenizer *tokenizer = gumbo_tokenizer_init(text, text_size);
while (1) {
  GumboToken token;
  gumbo_tokenizer_lex(tokenizer, &token);
  if (token.type == GUMBO_TOKEN_EOF)
    break;
  /* do stuff with token */
  gumbo_tokenizer_token_destroy(tokenizer, &token);
}
gumbo_tokenizer_destroy(tokenizer);
The API would also require making the GumboTokenType, GumboTokenDocType, GumboTokenStartTag and GumboToken types public.
I'm not entirely happy with having to drag GumboTokenizer around, but the only alternative would seem to be a callback-based API, which is a pain to use.
What do you think? I'm happy to submit a pull request after I get some review from other users.
I know it may be a stupid question, but I'm still looking for a C library to parse HTML4 files. I googled and found that Gumbo may be the most up-to-date C library for the job, so I have to ask.
When following the instructions, 'sudo make install' fails due to a space in my working directory's path. (Running OS X 10.8.4)
The result of parsing the following fragment with Gumbo is not compatible with the specification.
<span><b></span></p>
It is parsed into the following DOM tree.
<span> <b> <p> <b> <b>
It should be parsed as the following DOM tree.
<span> <b> <p> <b>
It is caused by the "reconstruct_active_formatting_elements" call at line 2545 of parser.c.
if (!has_an_element_in_button_scope(parser, GUMBO_TAG_P)) {
  add_parse_error(parser, token);
  reconstruct_active_formatting_elements(parser);
  insert_element_of_tag_type(
      parser, GUMBO_TAG_P, GUMBO_INSERTION_CONVERTED_FROM_END_TAG);
  state->_reprocess_current_token = true;
  return false;
}
There is no corresponding reconstruction in the specification.
Does Gumbo provide APIs to support DOM manipulation, such as createElement, insertBefore, removeAttribute and so on?
There are various known bugs in what was current as of 0.95; if it is literally that, then by passing them you're violating the spec. It'd be nicer to use a git submodule for html5lib-tests, as then it's clear which revision of the testsuite you're currently using.
It seems the only way to get a tag name from an element where element->tag == GUMBO_TAG_UNKNOWN is via the original_tag field, but this also includes delimiters and attributes. For example: <invalid attr=value>text</invalid> produces <invalid attr=value>.
Is this by design? Is there any reliable/correct way to get just the tag name?
(Note: issue #24 seems related)
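As an illustration of what a client-side workaround could look like, the sketch below narrows a raw tag string to just its name. Piece and tag_name_from_raw are hypothetical, standing in for GumboStringPiece and for whatever helper the library might offer. Later Gumbo headers declare a similar function, gumbo_tag_from_original_text, but whether it exists in the version discussed here is not stated in the thread.

```c
#include <stddef.h>

/* Hypothetical stand-in for GumboStringPiece: a pointer plus a length,
 * not NUL-terminated. */
typedef struct {
    const char *data;
    size_t length;
} Piece;

/* Hypothetical helper: narrows a raw tag such as "<invalid attr=value>" or
 * "</invalid>" to its name. The result shares the input buffer and keeps
 * the original case. Returns an empty piece if no name is found. */
Piece tag_name_from_raw(Piece raw) {
    Piece name = {raw.data, 0};
    size_t i = 0;
    if (raw.data == NULL) return name;
    if (i < raw.length && raw.data[i] == '<') ++i; /* opening delimiter */
    if (i < raw.length && raw.data[i] == '/') ++i; /* end-tag slash */
    name.data = raw.data + i;
    while (i < raw.length) {
        char c = raw.data[i];
        /* The name ends at whitespace, a self-closing slash, or '>'. */
        if (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r' ||
            c == '/' || c == '>') {
            break;
        }
        ++i;
        ++name.length;
    }
    return name;
}
```

Because the result is a slice of the original text, it has to be copied and NUL-terminated before being used as a C string.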
Including generated files in the repository leads to unwieldy diffs, making actual changes to source files hard to read (and find, amongst all the generated file diffs!), as well as bloating the size of the repository. Generated files also lead to excessive merge conflicts (even if the source file's merge conflict can be automatically resolved, the generated one might not be).
I have written Objective-C bindings for Gumbo: https://github.com/tracy-e/OCGumbo
You can add the link to readme if you like, THX!
I have been trying to use gumbo parser with BeautifulSoup adapter. I tried the following:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Hello</p>")
>>> soup.select("p")
[<p>Hello</p>]
>>> from gumbo.soup_adapter import parse as soup_parse
>>> soup = soup_parse("<p>Hello</p>")
>>> print soup
<html><head></head><body><p>Hello</p></body></html>
>>> soup.find("p")
<p>Hello</p>
>>> soup.select("p")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not callable
Apparently, select operation is not supported.
Given that, per #14 (comment), a CLA is needed on file, this seems worthwhile to have. A link to the file gets thrown up when submitting a PR then. See https://github.com/blog/1184-contributing-guidelines.
Hi,
parsing http://news.bbc.co.uk/ gives a parse tree in which the <body> element has a start_pos.offset of 29409. The correct position (ignoring IE conditional comment madness) should be 32442. What is interesting is that the <body> element has a parse_flags value of 9, which would indicate that it was automatically inserted by the parser at some point.
I'm just getting started with Gumbo so I haven't attempted to debug this, but this should be enough information to reproduce the problem.
Hope this helps,
Martin
From a glance at it I found the following 2 issues:
a.) Starting in parser.cc line 910 we have something like
TEST_F(GumboParserTest, ComplicatedSelect) {
  Parse("<select><div class=foo></div><optgroup><option>Option"
        "</option><input></optgroup></select>");
  GumboNode* body;
  GetAndAssertBody(root_, &body);
  ASSERT_EQ(2, GetChildCount(body));
  /* ... */
}
According to the HTML5 spec the div inside the select should be ignored (or am I missing something?).
b.) There is also a small typo in line 1114 (again parser.cc): I think leaving the table tag unclosed (it's just "</table" instead of "</table>") is not intentional. This does not affect the test, but it confused me at first.
Otherwise thanks for putting this up!
The clean_text binary help page is displaying get_title instead of clean_text.
Please provide PHP bindings for Gumbo, because PHP is one of the most popular languages and still lacks a good HTML5 parser (except html5lib).
There is a tutorial on how to make PHP extensions, but unfortunately I am not a C developer so I can't do it myself. Thanks in advance!
When I compile and run this test code:
#include <stdio.h>
#include <gumbo.h>

#define input "<h1 foo=one foo=two foo=three bar=four>Test</h1>"

int main(void) {
  /* Error handling omitted for brevity */
  GumboOutput *output = gumbo_parse(input);
  GumboNode *body = (GumboNode *)output->root->v.element.children.data[1];
  GumboNode *h1 = (GumboNode *)body->v.element.children.data[0];
  GumboVector *attrs = &h1->v.element.attributes;
  unsigned int i;
  printf("Input:\n\n %s\n\nParsed attributes:\n\n", input);
  for (i = 0; i < attrs->length; i++) {
    GumboAttribute *attr = (GumboAttribute *)attrs->data[i];
    /* %u matches the unsigned loop counter */
    printf(" #%u: name=%-8s value=%s\n", i, attr->name, attr->value);
  }
  printf("\n");
  gumbo_destroy_output(&kGumboDefaultOptions, output);
  return 0;
}
I get the following output:
Input:
<h1 foo=one foo=two foo=three bar=four>Test</h1>
Parsed attributes:
#0: name=foo value=one
#1: name=twofoo value=three
#2: name=bar value=four
I might be overlooking some finer details, but this doesn't seem right to me. Is it a bug? It produces the same output even if I quote the attribute values.
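For reference, the HTML tokenization rules treat a repeated attribute name on the same tag as a parse error and drop the later attribute, so the expected result for the input above would be foo=one and bar=four. The sketch below shows that "first occurrence wins" rule on a plain array. Attr and drop_duplicate_attrs are hypothetical and are not Gumbo's data structures or its actual fix.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute record; Gumbo's GumboAttribute has more fields. */
typedef struct {
    const char *name;
    const char *value;
} Attr;

/* Hypothetical helper: removes later duplicates in place, keeping the first
 * occurrence of each name, and returns the new count. Quadratic, which is
 * acceptable for the handful of attributes a tag usually carries. */
size_t drop_duplicate_attrs(Attr *attrs, size_t count) {
    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        int seen = 0;
        for (size_t j = 0; j < kept; ++j) {
            if (strcmp(attrs[j].name, attrs[i].name) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen) attrs[kept++] = attrs[i];
    }
    return kept;
}
```

The comparison is case-sensitive on the assumption that names were already lowercased during tokenization.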