juliaweb / gumbo.jl
Julia wrapper around Google's gumbo C library for parsing HTML
License: Other
This is what I get when I run Gumbo v0.8.0 on Julia v1.6.2 on an Intel Mac:
julia> doc = parsehtml(read("index.html", String))
HTML Document:
<!DOCTYPE html>
And this is what I get when running Gumbo v0.8.0 on Julia 1.7.0 on an M1 Mac:
julia> doc = parsehtml(read("index.html", String))
ERROR: UndefVarError: libgumbo not defined
Stacktrace:
[1] parsehtml(input::String; strict::Bool, preserve_whitespace::Bool)
@ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
[2] parsehtml(input::String)
@ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
[3] top-level scope
@ REPL[3]:1
Ex:
julia> parsehtml("<template v-slot:avatar><q-icn name='moo' /></template>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML>
<head></head>
<body></body>
</HTML>
vs
julia> parsehtml("<templatee v-slot:avatar><q-icn name='moo' /></templatee>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML>
<head></head>
<body>
<templatee v-slot:avatar="">
<q-icn name="moo"></q-icn>
</templatee>
</body>
</HTML>
Gumbo: 0.8.0
julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, icelake-client)
Thanks
See also Discourse thread. I strongly suspect this is based on the original Gumbo library's behavior (not having tested that), or maybe even is specified as part of the HTML5 parsing algorithm. In the latter case, I guess I'll just have to deal with it; in the former, if it's a bug, perhaps Gumbo.jl could still work around it somehow?
Anyway: The issue is that whitespace that is wrapped in tags disappears, contrary to how things are rendered in a browser, for example. In the following, I'm just using nodeText from Cascadia to extract the text; that may not be the best way to do it (and might even be related to the issue, though the whitespace does seem gone in the parsed HTML, too):
julia> using Gumbo, Cascadia
julia> x = parsehtml("<em>foo</em> bar<em> </em>baz")
HTML Document:
<!DOCTYPE >
<HTML>
<head></head>
<body>
<em>
foo
</em>
bar
<em></em>
baz
</body>
</HTML>
julia> nodeText(x.root)
"foo barbaz"
Here I would have wished for "foo bar baz", which is what a browser would display. The whitespace is not stripped if there's some non-whitespace in there:
julia> nodeText(parsehtml("foo<em> bar </em>baz").root)
"foo bar baz"
(Of course, using em on whitespace doesn't make much sense; I've just come across it in the wild, and am losing spaces when scraping certain pages, needing to figure out a workaround that isn't too hacky.)
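One possible workaround, going by the preserve_whitespace keyword that appears in the parsehtml signature in the stack traces on this page: ask the parser to keep whitespace-only text nodes and join the text nodes yourself. This is only a sketch. all_text is a hypothetical helper (not part of Gumbo or Cascadia), and I have not checked how Cascadia's nodeText behaves on the result.

```julia
using Gumbo
using AbstractTrees  # for PreOrderDFS

# Hypothetical helper: concatenate every text node in document order.
all_text(node) = join(n.text for n in PreOrderDFS(node) if n isa HTMLText)

# Keep whitespace-only text nodes such as the one inside <em> </em>.
doc = parsehtml("<em>foo</em> bar<em> </em>baz"; preserve_whitespace=true)

println(repr(all_text(doc.root)))
```

Note that on real pages this also keeps the whitespace between block-level tags, so the result may need some normalizing afterwards.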
Hi!
Trying to install Gumbo. Got this build error. Calling Pkg.build directly gives same error:
INFO: Attempting to Create directory /Users/vishalgupta/.julia/v0.3/Gumbo/deps/src/gumbo-1.0
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
========================================================[ ERROR: Gumbo ]========================================================
failed process: Process(tar xzf /Users/vishalgupta/.julia/v0.3/Gumbo/deps/downloads/gumbo-1.0.tar.gz --directory=/Users/vishalgupta/.julia/v0.3/Gumbo/deps/src
, ProcessExited(1)) [1]
while loading /Users/vishalgupta/.julia/v0.3/Gumbo/deps/build.jl, in expression starting on line 19
========================================================[ BUILD ERRORS ]========================================================
WARNING: Gumbo had build errors.
Pkg.build(pkg)
deps/build.jl
script
Any help appreciated.
I get the following error when adding Gumbo via Pkg.add("Gumbo") on a recent (today) installation of Julia from git
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
===============================================[ ERROR: Gumbo ]===============================================
LoadError: failed process: Process(tar xzf /home/ibanez/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz --directory=/home/ibanez/.julia/v0.4/Gumbo/deps/src
, ProcessExited(2)) [2]
while loading /home/ibanez/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19
===============================================[ BUILD ERRORS ]===============================================
WARNING: Gumbo had build errors.
Pkg.build("Gumbo")
deps/build.jl
script
Version 0.4.0-dev+3600 (2015-02-25 15:26 UTC)
Commit f96c23c* (43 days old master)
x86_64-w64-mingw32
julia> url="https://www.youtube.com/channel/UCdZwMpK-iWqCos46xPscDeg"
"https://www.youtube.com/channel/UCdZwMpK-iWqCos46xPscDeg"
julia>
julia> using HTTPClient.HTTPC
julia> using URIParser
julia> using Gumbo
julia> function customize_curl(curl)
cc = LibCURL.curl_easy_setopt(curl, LibCURL.CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0")
if cc != LibCURL.CURLE_OK
error ("CURLOPT_USERAGENT failed: " * LibCURL.bytestring(curl_easy_strerror(cc)))
end
end
customize_curl (generic function with 1 method)
julia>
julia> r = get(url,RequestOptions(
request_timeout=8.0,
callback=customize_curl))
ERROR: "Error executing request : Problem with the SSL CA cert (path? access rights?)"
in exec_as_multi at C:\Users\SAMSUNG2.julia\v0.4\HTTPClient\src\HTTPC.jl:702
in get at C:\Users\SAMSUNG2.julia\v0.4\HTTPClient\src\HTTPC.jl:371
julia>
Hi,
Installed Gumbo as in the readme, but when I try to run it I get error:
julia> using Gumbo
julia> parsehtml("<h1> Hi</h1>")
ERROR: error compiling parsehtml: error compiling __parsehtml#1__: error compiling document_from_gumbo: box: expected bits type as first argument
I'm trying to read an HTML file and change a specific value based on a Julia script that is running. I figured that I would be able to modify the contents of the HTML file with this package, but I'm a little unsure of how to go about it.
Ideally, I'd be able to use the tree traversal to locate the specific element I want to modify. After parsing the file and trying this a few times, though, it seems that my modification to the element isn't stored, as if I were only changing something that refers to the document rather than the document itself.
My second option is to rely on the absolute position of the element in the file (which isn't hard right now, as the file is still small), and this seems like it should work.
However, once I get past that step, I'm unsure how to save the modified document back to the original file.
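For what it's worth, here is a minimal sketch of one way this could work, assuming the nodes yielded by the traversal are the same mutable objects the document holds (which is how HTMLElement and HTMLText appear to behave in the Gumbo versions I know). The id "status" and the file name are made up for the example.

```julia
using Gumbo
using AbstractTrees  # for PreOrderDFS

doc = parsehtml("""<html><body><span id="status">old</span></body></html>""")

# Find the element by its id attribute and replace the text inside it.
# The node is edited in place, so the change is visible through `doc`.
for node in PreOrderDFS(doc.root)
    if node isa HTMLElement && get(attrs(node), "id", "") == "status"
        for child in node.children
            child isa HTMLText && (child.text = "new")
        end
    end
end

# Serialize the modified tree back to a string.
html_out = sprint(print, doc)

# "index.html" is a placeholder path; uncomment to overwrite the file.
# write("index.html", html_out)
```

The serialized output is Gumbo's own rendering of the tree, so the formatting (and anything the parser normalized) will differ from the original file.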
julia> using Gumbo
julia> using AbstractTrees
julia> using StringEncodings
julia> getpage(url) = parsehtml(String(read(download(url))))
getpage (generic function with 1 method)
ERROR: automatic download failed
What to do ?
julia> text_only(doc::HTMLDocument) = text_only(doc.root)
text_only (generic function with 2 methods)
julia> text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")
text_only (generic function with 2 methods)
julia> get_page_text(url) = text_only(getpage(url))
get_page_text (generic function with 1 method)
julia> doc=parsehtml(decode(read(download("http://gazeta.pl")), "iso-8859-2"))
ERROR: automatic download failed (error: 2148270088): http://gazeta.pl
Stacktrace:
[1] download(::String, ::String) at .\interactiveutil.jl:598
[2] download(::String) at .\interactiveutil.jl:632
Paul
When I parse the site rp.pl (charset=Windows-1250), every national character is lost, for example:
"t�umaczy"
"ustawie�"
The correct text is:
"tłumaczy "
"ustawień"
What should I do? How can I read this site correctly with this line?
doc=parsehtml(String(read(download(url))))
Paul
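A sketch of what may help here: String(bytes) interprets the bytes as UTF-8, so a Windows-1250 page needs to be decoded first, for example with decode from StringEncodings (the same call used with "iso-8859-2" elsewhere on this page). The byte vector below is a stand-in for a downloaded page, and I am assuming the page really is Windows-1250 as its charset claims.

```julia
using Gumbo
using StringEncodings  # provides decode

# Stand-in for downloaded bytes: "<p>tłumaczy</p>" encoded as
# Windows-1250, where "ł" is the single byte 0xB3.
raw = UInt8[codeunits("<p>t")..., 0xB3, codeunits("umaczy</p>")...]

# Convert the bytes to a proper Julia (UTF-8) string before parsing.
html = decode(raw, "windows-1250")
doc = parsehtml(html)

println(html)
```

For a real page the bytes would come from read(download(url)) in place of raw.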
I'm running into an issue installing Gumbo on a relatively clean Debian installation in Docker.
I was trying to install Genie, which has Gumbo as a dependency:
ProcessExited(77)) [77]
while loading /root/.julia/v0.6/Gumbo/deps/build.jl, in expression starting on line 19
Are there any environment variables that need to be set prior to use?
tks
M
Hi there.
I use this in Alpine Linux, but I got this error:
julia> parsehtml("<h1> Hello, world! </h1>")
ERROR: could not load library "libgumbo.so.1"
Error loading shared library libgumbo.so.1: No such file or directory
Stacktrace:
[1] parsehtml(input::String; strict::Bool, preserve_whitespace::Bool)
@ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
[2] parsehtml(input::String)
@ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
[3] top-level scope
@ REPL[7]:1
julia>
What's wrong ? How could I use it in Alpine Linux?
Most likely this is not an issue in the Julia wrapper, but I'm wondering if you have any idea on how to solve this? Thank you!
I'm trying to parse polymer web components, but Gumbo chokes on them. Can it be "taught" how to handle extra elements?
genie> Gumbo.parsehtml("""<px-spinner size="100"></px-spinner>""")
genie> HTML Document:
<!DOCTYPE >
<HTML>
<head></head>
<body>
<px-spinner size="100" size="100"></px-spinner size="100">
</body>
</HTML>
'''
julia> url="http://rp.pl"
"http://rp.pl"
julia> doc=parsehtml(String(read(download(url))));
julia> for elem in PreOrderDFS(doc.root) println(tag(elem)) end
HTML
head
script
ERROR: MethodError: no method matching tag(::HTMLText)
Closest candidates are:
tag(::HTMLElement{T}) where T at C:\Users\PC.julia\packages\Gumbo\OhZJu\src\manipulation.jl:6
Stacktrace:
[1] top-level scope at .\REPL[86]:1 [inlined]
[2] top-level scope at .\none:0
'''
Thx, Paul
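The traversal yields text nodes as well as elements, and tag is only defined for elements, which matches the MethodError above. A minimal sketch that skips everything that is not an HTMLElement (the input string is made up for the example):

```julia
using Gumbo
using AbstractTrees  # for PreOrderDFS

doc = parsehtml("<h1>Hi</h1><p>some <b>bold</b> text</p>")

# Only elements have a tag; HTMLText nodes are filtered out.
tags = [tag(n) for n in PreOrderDFS(doc.root) if n isa HTMLElement]

foreach(println, tags)
```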
I'm assuming that it was just an out-of-memory error, but Julia still crashed with a segfault, so I'm filing an issue.
The stacktrace is rather long, so I have only included the start and end.
julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4
julia> b = map(x->parsehtml(String(x.body)), a)
241-element Array{HTMLDocument,1}:
signal (11): Segmentation fault
in expression starting at none:0
page_metadata at /buildworker/worker/package_linux64/build/src/gc.h:448 [inlined]
gc_setmark_pool at /buildworker/worker/package_linux64/build/src/gc.c:751 [inlined]
gc_setmark at /buildworker/worker/package_linux64/build/src/gc.c:758 [inlined]
gc_mark_loop at /buildworker/worker/package_linux64/build/src/gc.c:2572
_jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:2902
jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:3108
maybe_collect at /buildworker/worker/package_linux64/build/src/gc.c:827 [inlined]
jl_gc_pool_alloc at /buildworker/worker/package_linux64/build/src/gc.c:1142
iterate at ./compiler/ssair/ir.jl:393
iterate at ./compiler/ssair/ir.jl:385 [inlined]
replace_code_newstyle! at ./compiler/ssair/legacy.jl:71
optimize at ./compiler/optimize.jl:220
typeinf at ./compiler/typeinfer.jl:33
unknown function (ip: (nil))
Allocations: 86180299 (Pool: 86152106; Big: 28193); GC: 96
fish: '/home/sippycups/julia-1.5.2/bin…' terminated by signal SIGSEGV (Address boundary error)
Is there a possibility that we have LibXML2 bindings similar to this:
https://github.com/sevenval/gumbo-libxml
The conversion is very simple. We just need a node_convert function similar to this to convert Gumbo nodes to EzXML nodes, and also a parsehtml5 function which uses that node_convert internally, similar to this.
Then we can override the original parsehtml function to use parsehtml5.
This can fix JuliaIO/EzXML.jl#146
Now that 0.5 is out of RC (
The following simple example runs fine using Julia 0.6.4.
using AbstractTrees
using Gumbo
for el in PreOrderDFS(parsehtml("""
<h1>
Foo
</h1>
"""))
@show el
end
However, on Julia 0.7.0-beta2.0 it goes down in flames.
ERROR: LoadError: TypeError: in Tuple, in parameter, expected Type, got HTMLText
Stacktrace:
[1] has_non_default_iterate(::HTMLText) at ./essentials.jl:833
[2] isiterable(::HTMLText) at ./essentials.jl:864
[3] children at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:26 [inlined]
[4] children at ~/.julia/packages/AbstractTrees/gbHm/src/traits.jl:36 [inlined]
[5] children at ~/.julia/packages/AbstractTrees/gbHm/src/traits.jl:38 [inlined]
[6] childstates at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:333 [inlined]
[7] childstates(::HTMLElement{:HTML}, ::HTMLText) at ~/.julia/packages/AbstractTrees/gbHm/src/implicitstacks.jl:35
[8] childstates(::AbstractTrees.ImplicitChildStates{HTMLElement{:HTML},AbstractTrees.ImplicitNodeStack{Any,Int64}}) at ~/.julia/packages/AbstractTrees/gbHm/src/implicitstacks.jl:41
[9] iterate(::AbstractTrees.ImplicitChildStates{HTMLElement{:HTML},AbstractTrees.ImplicitNodeStack{Any,Int64}}, ::Int64) at ~/.julia/packages/AbstractTrees/gbHm/src/implicitstacks.jl:46 (repeats 2 times)
[10] isempty at ./essentials.jl:721 [inlined]
[11] stepstate(::PreOrderDFS{HTMLElement{:HTML}}, ::AbstractTrees.ImplicitNodeStack{Any,Int64}) at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:446
[12] iterate(::PreOrderDFS{HTMLElement{:HTML}}, ::AbstractTrees.ImplicitNodeStack{Any,Int64}) at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:468
[13] top-level scope at gumbobug.jl:9 [inlined]
[14] top-level scope at ./<missing>:0
[15] include at ./boot.jl:317 [inlined]
[16] include_relative(::Module, ::String) at ./loading.jl:1034
[17] include(::Module, ::String) at ./sysimg.jl:29
[18] exec_options(::Base.JLOptions) at ./client.jl:234
[19] _start() at ./client.jl:427
in expression starting at gumbobug.jl:4
I am sure this is due to some changes that came with 0.7, but I am still working on catching up with the release so I am not entirely sure what is going wrong.
I got the following error message when executing:
Pkg.add("Gumbo")
Here the error:
INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo
WARNING: deprecated syntax "[a=>b, ...]" at /home/ronie/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /home/ronie/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================
LoadError: failed process: Process(`curl -f -o /home/ronie/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /home/ronie/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19
================================================================================
================================[ BUILD ERRORS ]================================
WARNING: Gumbo had build errors.
- packages with build errors remain installed in /home/ronie/.julia/v0.4
- build the package(s) and all dependencies with `Pkg.build("Gumbo")`
- build a single package by running its `deps/build.jl` script
================================================================================
INFO: Package database updated
Is the URL really dead or am I doing something wrong? =\
Hello,
I love what you did with this package.
Google's Gumbo Parser was archived on 16 February - will this affect this project?
Is there a roadmap of where this repository is going?
P. S. I am using this project in my work and am willing to contribute/help.
Something is wrong with Gumbo on Julia 0.4.0-dev; on 0.3.6 it works on my machine.
(The file example.html is in the required directory.)
Version 0.4.0-dev+3600 (2015-02-25 15:26 UTC)
Commit f96c23c* (5 days old master)
x86_64-w64-mingw32
julia> using Base.Test
julia> using Gumbo
julia> let
doc = open("example.html") do example
example |> readall |> parsehtml
end
io = IOBuffer()
print(io, doc)
seek(io, 0)
newdoc = io |> readall |> parsehtml
@test newdoc == doc
end
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in anonymous at no file:6
....................... while...
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
For some reason Gumbo automatically closes <p> elements.
Ex:
julia> output = """
<p class="repo-description">
<blockquote class="search_headline">
<p>
Foo
</p>
</blockquote>
</p>
"""
julia> "<p class=\"repo-description\">\n <blockquote class=\"search_headline\">\n <p>\n Foo\n </p>\n </blockquote>\n</p>\n"
julia> Gumbo.parsehtml(output)
julia> HTML Document:
<!DOCTYPE >
Gumbo.HTMLElement{:HTML}:
<HTML>
<head></head>
<body>
<p class="repo-description"></p>
<blockquote class="search_headline">
<p>
Foo
</p>
</blockquote>
<p></p>
</body>
</HTML>
"""
You can see in the above example how it closes <p class="repo-description"> and creates an empty <p></p> where the closing tag was.
Other elements are fine:
julia> output = """
<div class="repo-description">
<blockquote class="search_headline">
<p>
Foo
</p>
</blockquote>
</div>
"""
julia> "<div class=\"repo-description\">\n <blockquote class=\"search_headline\">\n <p>\n Foo\n </p>\n </blockquote>\n</div>\n"
julia> Gumbo.parsehtml(output)
julia> HTML Document:
<!DOCTYPE >
Gumbo.HTMLElement{:HTML}:
<HTML>
<head></head>
<body>
<div class="repo-description">
<blockquote class="search_headline">
<p>
Foo
</p>
</blockquote>
</div>
</body>
</HTML>
Thank you
What is wrong? I expect a list of all tags...
Version 0.4.0-dev+3600 (2015-02-25 15:26 UTC)
Commit f96c23c* (4 days old master)
x86_64-w64-mingw32
julia> using HTTPClient.HTTPC
julia> using Gumbo
julia> r=HTTPC.post("http://requestb.in/api/v1/bins", "")
HTTP Code :200
RequestTime :14.804
Headers :
Connection : keep-alive
Via : 1.1 vegur
Content-Length : 84
Date : Mon, 02 Mar 2015 15:14:06 GMT
Content-Type : application/json
Access-Control-Allow-Origin : *
Server : gunicorn/18.0
Length of body : 84
julia> r.body
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=84, maxsize=Inf, ptr=85,
mark=-1)
julia> s=bytestring(r.body)
"{"color": [100, 140, 100], "name": "195qmnk1", "request_count": 0, "private": false}"
julia> doc=parsehtml(s)
HTML Document:
Gumbo.HTMLElement{:HTML}:
{"color": [100, 140, 100], "name": "195qmnk1", "request_count": 0, "private": false}
julia> for elem in preorder(doc.root)
println(tag(elem))
end
HTML
head
body
ERROR: MethodError: `tag` has no method matching tag(::Gumbo.HTMLText)
in anonymous at no file:2
julia>
Paul
Hello,
I have an HTML file with a table and would like to convert it to a Julia DataFrame.
I was looking for a function similar to Python Pandas' read_html function (which directly outputs a list of DataFrames).
Unfortunately I don't see a similar function in the Julia ecosystem.
In the Gumbo docs I was looking for an example that iterates over the rows and columns of each table.
Here is a basic HTML source file with 2 tables:
<!DOCTYPE >
<HTML>
<head></head>
<body>
<h1>First table</h1>
<table>
<tbody>
<tr>
<th>
A
</th>
<th>
B
</th>
</tr>
<tr>
<td>
1
</td>
<td>
1.1
</td>
</tr>
<tr>
<td>
2
</td>
<td>
2.1
</td>
</tr>
</tbody>
</table>
<h1>Second table</h1>
<table>
<tbody>
<tr>
<th>
AA
</th>
<th>
BB
</th>
</tr>
<tr>
<td>
10
</td>
<td>
10.1
</td>
</tr>
<tr>
<td>
20
</td>
<td>
20.1
</td>
</tr>
</tbody>
</table>
</body>
</HTML>
I'm not sure if such an example should be part of Gumbo or Cascadia or even EzXML.jl.
Anyway, none of these projects shows an example with HTML tables, so there is probably room for doc improvement.
Kind regards
PS : related SO post https://stackoverflow.com/questions/42915962/extracting-and-constructing-tables-from-html-files-using-julia
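In case it helps, here is a rough sketch of iterating over tables, rows and cells with Gumbo and AbstractTrees only. The helpers elements, cell_text and table_rows are hypothetical names made up for this example, the HTML is a compact version of the first table above plus a shortened second one, and nested tables are not handled.

```julia
using Gumbo
using AbstractTrees  # for PreOrderDFS

html = """
<h1>First table</h1>
<table><tr><th>A</th><th>B</th></tr>
<tr><td>1</td><td>1.1</td></tr>
<tr><td>2</td><td>2.1</td></tr></table>
<h1>Second table</h1>
<table><tr><th>AA</th><th>BB</th></tr>
<tr><td>10</td><td>10.1</td></tr></table>
"""

# All descendant elements of `node` with tag `T` (e.g. :table, :tr).
elements(node, T) = [n for n in PreOrderDFS(node) if n isa HTMLElement{T}]

# Text content of a cell, with surrounding whitespace removed.
cell_text(cell) = String(strip(join(n.text for n in PreOrderDFS(cell) if n isa HTMLText)))

# One Vector{String} per <tr>, header row included.
function table_rows(table)
    rows = Vector{Vector{String}}()
    for tr in elements(table, :tr)
        cells = [n for n in PreOrderDFS(tr)
                 if n isa Union{HTMLElement{:th}, HTMLElement{:td}}]
        push!(rows, [cell_text(c) for c in cells])
    end
    return rows
end

doc = parsehtml(html)
tables = [table_rows(t) for t in elements(doc.root, :table)]

foreach(println, tables[1])
```

Each table comes back as a vector of rows of strings; turning that into a DataFrame (first row as column names, the rest parsed as numbers) is left out here since it depends on the data.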
Duplicate issue; I'm not sure how I can delete it, though.
Hi,
Does Gumbo work for you on v0.6? I'm getting an error coming from AbstractTrees - I was wondering if you found a workaround? I already opened an issue with AbstractTrees.
ERROR: LoadError: invalid subtyping in definition of PostOrderDFS
Stacktrace:
[1] include_from_node1(::String) at ./loading.jl:569
[2] include(::String) at ./sysimg.jl:14
[3] anonymous at ./<missing>:2
while loading /Users/adrian/.julia/v0.6/AbstractTrees/src/AbstractTrees.jl, in expression starting on line 521
ERROR: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: Failed to precompile AbstractTrees to /Users/adrian/.julia/lib/v0.6/AbstractTrees.ji.
compilecache(::String) at ./loading.jl:703
_require(::Symbol) at ./loading.jl:490
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
include(::String) at ./sysimg.jl:14
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
eval(::Expr) at /Users/adrian/.julia/v0.6/Genie/src/Renderer.jl:1
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
macro expansion at ./distributed/macros.jl:99 [inlined]
anonymous at ./<missing>:?
include_from_node1(::String) at ./loading.jl:569
include(::String) at ./sysimg.jl:14
(::Base.Distributed.##135#136{Base.#include,Tuple{String},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Base.#include,Tuple{String},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:367
remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
(::Base.##504#506{Base.JLOptions})() at ./task.jl:335
while loading /Users/adrian/.julia/v0.6/Gumbo/src/manipulation.jl, in expression starting on line 3
while loading /Users/adrian/.julia/v0.6/Gumbo/src/Gumbo.jl, in expression starting on line 29
while loading /Users/adrian/.julia/v0.6/Flax/src/Flax.jl, in expression starting on line 6
while loading /Users/adrian/.julia/v0.6/Genie/src/Renderer.jl, in expression starting on line 10
while loading /Users/adrian/.julia/v0.6/Genie/src/Router.jl, in expression starting on line 4
while loading /Users/adrian/.julia/v0.6/Genie/src/AppServer.jl, in expression starting on line 6
while loading /Users/adrian/.julia/v0.6/Genie/src/Genie.jl, in expression starting on line 36
while loading /Users/adrian/Dropbox/Projects/_test_app3/genie.jl, in expression starting on line 28
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:340
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:367
remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
(::Base.##504#506{Base.JLOptions})() at ./task.jl:335
Stacktrace:
[1] sync_end() at ./task.jl:287
[2] macro expansion at ./task.jl:303 [inlined]
[3] process_options(::Base.JLOptions) at ./client.jl:279
[4] _start() at ./client.jl:371
Hi! I'm using Julia 0.5.0 because of a bug in the Package Manager in version 0.4.6. When I run Pkg.add("Gumbo"), the build process ends with this error:
LoadError: MethodError: no method matching keys(::Array{Pair{Symbol,Symbol},1})
Closest candidates are:
keys(::Associative{K,V})
while loading /home/yagox/.julia/v0.5/Gumbo/deps/build.jl, in expression starting on line 19
Thank you! :D
The installation for Gumbo.jl currently fails on travis CI because the server jamesporter.me is down.
https://travis-ci.org/gher-ulg/PhysOcean.jl/jobs/413749067#L498
Thank you for your help and this very useful package!
Dependency points to http://jamesporter.me/julia/gumbo-1.0.tar.gz, but this URL now returns a 404.
Hello,
Gumbo has a parsehtml function, but reading HTML from a file currently has to be done using
parsehtml(read(filename, String))
Maybe a readhtml function should be defined as
readhtml(filename, args... ; kwargs...) = parsehtml(read(filename, String), args...; kwargs...)
Another API idea (to mimic the read API) could be to have the following functions:
parse(HTMLDocument, input)
and
read(filename, HTMLDocument)
What is your opinion about such an API idea?
Kind regards
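To make the proposal concrete, this is how the first variant could be tried out in user code today. readhtml is the hypothetical function from this proposal, not something Gumbo provides, and the temporary file is only there to make the example self-contained.

```julia
using Gumbo

# Hypothetical convenience wrapper, as proposed above; not part of Gumbo.
readhtml(filename, args...; kwargs...) =
    parsehtml(read(filename, String), args...; kwargs...)

# Write a small file, then parse it straight from disk.
path = tempname() * ".html"
write(path, "<h1>Hello</h1>")
doc = readhtml(path; preserve_whitespace=true)
rm(path)

println(doc)
```

Keyword arguments such as preserve_whitespace are simply forwarded to parsehtml.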
Along with making pretty printing of inline elements all on one line.
I know this distinction is not technically valid anymore in HTML5, but it's still a thing people care about.
Version 0.6.0 (2017-06-19 13:05 UTC)
Official http://julialang.org/ release
x86_64-w64-mingw32
julia> using Gumbo
julia> url="http://wp.pl"
"http://wp.pl"
julia> getpage(url) = parsehtml(String(read(download(url))))
getpage (generic function with 1 method)
julia> text_only(doc::HTMLDocument) = text_only(doc.root)
text_only (generic function with 1 method)
julia> text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")
text_only (generic function with 2 methods)
julia> get_page_text(url) = text_only(getpage(url))
get_page_text (generic function with 1 method)
julia> doc=parsehtml(String(read(download(url))));
julia> text_only(doc.root[2])
ERROR: UndefVarError: Leaves not defined
Stacktrace:
[1] text_only(::Gumbo.HTMLElement{:body}) at .\REPL[5]:1
julia> typeof(doc)
Gumbo.HTMLDocument
Thx, Paul
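Leaves comes from AbstractTrees, not from Gumbo, so the session above would need using AbstractTrees as well. A minimal sketch under that assumption, on a made-up input; it reads the text field of HTMLText directly instead of calling text, to avoid relying on what a given Gumbo version exports.

```julia
using Gumbo
using AbstractTrees  # Leaves is defined here, not in Gumbo

text_only(doc::HTMLDocument) = text_only(doc.root)

# Leaves also yields childless elements (e.g. an empty <head>),
# so keep only the HTMLText leaves.
text_only(frag) = join([leaf.text for leaf in Leaves(frag) if leaf isa HTMLText], " ")

doc = parsehtml("<p>a<strong>b</strong>c</p>")
println(text_only(doc))
```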
In the current version of Gumbo, it seems to be impossible to use the children method without qualifying it. This seems perverse when Gumbo extends children from AbstractTrees. I don't yet have a satisfying solution to this.
julia> using Gumbo
julia> using AbstractTrees
julia> const ex = parsehtml("""
<html>
<head></head>
<body>
<p>a<strong>b</strong>c</p>
</body>
</html>
""");
julia> for n in children(ex.root)
println(n)
end
WARNING: both AbstractTrees and Gumbo export "children"; uses of it in module Main must be qualified
ERROR: UndefVarError: children not defined
in anonymous at ./<missing>:?
The whole display situation has ended up a bit of a mess.
I'd also like to think about display and MIME types, e.g. how should Gumbo types be shown in Jupyter notebooks.
Right now we silently discard these.
I have a BinaryBuilder setup for Gumbo that builds the library on all platforms. This should make it easier and more reliable to support.
The repo is here: https://github.com/aviks/GumboBuilder
@porterjamesj happy to add you to the repo.
One question: In the repo above, I'm building the current master from https://github.com/google/gumbo-parser. In your current build.jl, you are downloading a version you call 1.0 from a private server. However, Gumbo's current released version seems to be 0.10.1, released about three years ago. Which version of the code are you actually running?
I'll submit a PR with the build.jl changes that are needed, after I get some clarity on the question above.
You probably know already, but just in case - it's broken on v0.6 due to broken AbstractTrees.jl
Is it possible to remove the dependency?
================================[ ERROR: Gumbo ]================================
LoadError: MethodError: no method matching keys(::Array{Pair{Symbol,Symbol},1})
Closest candidates are:
keys(::Associative{K,V}) at dict.jl:118
while loading /home/travis/.julia/v0.5/Gumbo/deps/build.jl, in expression starting on line 19
================================================================================
Hi all,
I'm the original author of this package, but as is probably obvious to those of you who use it heavily, I don't really have the bandwidth or interest to maintain it anymore.
In practice, maintenance has been done by an ad-hoc group of JuliaWeb organization members for a while now. I'm opening this issue mostly to acknowledge this, formally hand-off maintenance, and so those interested can have a quick conversation to settle on a person or group of people who are going to maintain this package going forward.
cc'ing a few people who I feel like should be involved or might be interested in the outcome: @aviks @pfitzseb @essenciary
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Building Gumbo → `~/.julia/packages/Gumbo/HKeb2/deps/build.log`
┌ Error: Error building `Gumbo`:
│ ERROR: LoadError: MethodError: no method matching similar(::Dict{String,String})
│ Closest candidates are:
│ similar(!Matched::Array{T,1}) where T at array.jl:327
│ similar(!Matched::Array{T,2}) where T at array.jl:328
│ similar(!Matched::Array{T,1}, !Matched::Type) where T at array.jl:329
│ ...
│ Stacktrace:
│ [1] adjust_env(::Dict{String,String}) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/BinDeps.jl:388
│ [2] lower(::BinDeps.AutotoolsDependency, ::BinDeps.SynchronousStepCollection) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/BinDeps.jl:431
│ [3] |(::BinDeps.SynchronousStepCollection, ::BinDeps.AutotoolsDependency) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/BinDeps.jl:328
│ [4] generate_steps(::BinDeps.LibraryDependency, ::Autotools, ::Dict{Symbol,Any}) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:634
│ [5] satisfy!(::BinDeps.LibraryDependency, ::Array{DataType,1}) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:944
│ [6] satisfy!(::BinDeps.LibraryDependency) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:922
│ [7] top-level scope at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:977
│ [8] include at ./boot.jl:317 [inlined]
│ [9] include_relative(::Module, ::String) at ./loading.jl:1038
│ [10] include(::Module, ::String) at ./sysimg.jl:29
│ [11] include(::String) at ./client.jl:388
│ [12] top-level scope at none:0
│ in expression starting at /home/mrg/.julia/packages/Gumbo/HKeb2/deps/build.jl:19
└ @ Pkg.Operations ~/Desktop/juliamaster/usr/share/julia/stdlib/v1.0/Pkg/src/Operations.jl:1068
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca* (2018-08-08 20:58 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
How can I extract the contents of elements such as div?
I am after the whole "body" of each div.
julia> for elem in preorder(body)
           # println(elem)
           if typeof(elem) == HTMLElement{:div}
               push!(divy, elem)
           end
       end
julia> unique(divy)
289-element Array{Any,1}:
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Paul
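A possible way to get at the contents rather than just the element types is sketched below. It assumes a Gumbo.jl version that works with `PreOrderDFS` from AbstractTrees, and it reads the `text` field of `HTMLText` nodes directly. The helper name `inner_text` and the sample HTML are made up for illustration.

```julia
using Gumbo
using AbstractTrees  # provides PreOrderDFS

# Hypothetical helper: concatenate the text of every HTMLText node below `elem`.
function inner_text(elem)
    io = IOBuffer()
    for node in PreOrderDFS(elem)
        node isa HTMLText && print(io, node.text)
    end
    return String(take!(io))
end

doc = parsehtml("<div>first <b>bold</b></div><div>second</div>")

# Collect every <div> element, then extract the text contained in each one.
divs = [e for e in PreOrderDFS(doc.root) if e isa HTMLElement{:div}]
texts = inner_text.(divs)
```

Each entry of `divs` is a full `HTMLElement{:div}`, so `children(div)` and `attrs(div)` give access to its child nodes and attributes as well.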
The URL of this package does not match that stored in METADATA.jl.
cc: @porterjamesj
Hello!
I saw this strange behaviour, which can be useful sometimes but is also rather dangerous in many cases: tags that must not be self-closing, but are written as self-closing, stay open until the parent's closing tag, which changes the tree structure.
test = """<p>A simple <em>paragraph</em> with <br/> a <b>bad</b> <a href="ref"/>link <em>(which does not exist)</em>!</p>"""
doc = parsehtml(test, preserve_whitespace=true)
HTML Document:
<!DOCTYPE >
<HTML>
<head></head>
<body>
<p>
A simple
<em>
paragraph
</em>
with
<br></br>
a
<b>
bad
</b>
<a href="ref">
link
<em>
(which does not exist)
</em>
!
</a>
</p>
</body>
</HTML>
I think that being more conservative, and putting the closing tag immediately after the opening tag without enclosing the following text, would be safer.
I think this result (written by hand as an example) would have been more consistent:
…
with
<br></br>
a
<b>
bad
</b>
<a href="ref"></a>
link
<em>
(which does not exist)
</em>
!
…
Another example, more visible:
test = """<p>A simple <em>paragraph</em> with <br/> a <b/>bad bold and a bad <a href="ref"/>link <em>(which does not exist)</em>!</p>"""
doc = parsehtml(test, preserve_whitespace=true)
HTML Document:
<!DOCTYPE >
<HTML>
<head></head>
<body>
<p>
A simple
<em>
paragraph
</em>
with
<br></br>
a
<b>
bad bold and a bad
<a href="ref">
link
<em>
(which does not exist)
</em>
!
</a>
</b>
</p>
</body>
</HTML>
I don’t know if it is a bug or a feature, but in the latter case, maybe an argument to change this behaviour at will would be nice.
Thank you for your work, anyway!
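For context, the HTML5 parsing rules ignore the trailing `/` on elements that are not void elements, so `<a href="ref"/>` is treated as a plain opening tag; the output above is consistent with that. If the more conservative tree is wanted, one possible workaround is to rewrite such tags before parsing. The sketch below is only a heuristic: the names `VOID_TAGS`, `SELF_CLOSING` and `expand_self_closing` are hypothetical, and the regex does not handle attribute values that contain `<` or `>`.

```julia
# Void elements may legitimately be written as <br/>; leave those untouched.
const VOID_TAGS = Set(["area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"])

# Matches a tag written in self-closing form: <name attrs... />
const SELF_CLOSING = r"<([A-Za-z][A-Za-z0-9-]*)([^<>]*?)\s*/>"

# Called by `replace` with each matched substring.
function expand_match(s)
    m = match(SELF_CLOSING, s)
    name, rest = m.captures
    return lowercase(name) in VOID_TAGS ? s : "<$name$rest></$name>"
end

# Rewrite <x .../> as <x ...></x> for every non-void element.
expand_self_closing(html) = replace(html, SELF_CLOSING => expand_match)

fixed = expand_self_closing("""a <b/>bad <br/> <a href="ref"/>link""")
```

The rewritten string can then be passed to `parsehtml` as usual, so that the `<a>` and `<b>` elements end up empty instead of swallowing the text that follows them.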
How can I save an HTMLElement to disk?
Saving to JLD raises errors and makes Julia exit.
r = HTTPC.get(url,RequestOptions(
request_timeout=8.0,
callback=customize_curl))
page = bytestring(r.body)
doc = parsehtml(page)
body = getBody(doc) # getBody function from your page
julia> body
HTMLElement{:body}:
Hi,
I was trying to use Gumbo.jl to manipulate the DOM by changing the tags and content of nodes. However, I am confused by some of the behavior.
doc = parsehtml("""
<html>
<head>
<title>Title</title>
</head>
<body>
<span>this is a span 1</span>
<div>
<h1>this is a heading</h1>
<span>this is a span 2</span>
</div>
</body>
</html>
""");
Suppose I want to replace the first <span> node with an <abc> node:
julia> elem = doc.root[2][1]
HTMLElement{:span}:
<span>
this is a span 1
</span>
The following does not work:
julia> elem = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)
HTMLElement{:abc}:
<abc>
this is a span 1
</abc>
julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
<head>
<title>
Title
</title>
</head>
<body>
<span>
this is a span 1
</span>
<div>
<h1>
this is a heading
</h1>
<span>
this is a span 2
</span>
</div>
</body>
</HTML>
I have to assign the new node into the parent's children for it to work, which means I have to know the position of the node within the parent node.
elem.parent.children[1] = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)
julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
<head>
<title>
Title
</title>
</head>
<body>
<abc>
this is a span 1
</abc>
<div>
<h1>
this is a heading
</h1>
<span>
this is a span 2
</span>
</div>
</body>
</HTML>
Is this behavior intended? I thought the parent's children and the child node would both point to the same location.
Also, I see that the position within the parent is available as index_within_parent in the underlying struct. Would it be possible to store this on each node, in addition to the parent, children and attributes? With that information we could overcome the issue above.
struct Node{T}
gntype::Int32 # enum
parent::Ptr{Node}
index_within_parent::Csize_t
parse_flags::Int32 # enum
v::T
end
Please let me know your thoughts, or whether I am approaching this in entirely the wrong direction. Is there a more straightforward way to manipulate the nodes?
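The first attempt has no effect on the document because `elem = ...` only rebinds the local variable; the parent's `children` vector still holds the original node. The position does not have to be known in advance, since it can be looked up by identity. Below is a minimal sketch, assuming the mutable `children`, `parent` and `attributes` fields used above. The name `replace_tag!` is hypothetical and not part of Gumbo.jl.

```julia
using Gumbo

# Hypothetical helper: swap `elem` for a node with a different tag, in place.
function replace_tag!(elem::HTMLElement, newtag::Symbol)
    parent = elem.parent
    # Find the node's position by identity (===), not structural equality.
    idx = findfirst(c -> c === elem, parent.children)
    idx === nothing && error("element not found among its parent's children")
    new = HTMLElement{newtag}(elem.children, parent, elem.attributes)
    # Re-point the children at the new node so parent links stay consistent.
    for child in new.children
        child.parent = new
    end
    parent.children[idx] = new
    return new
end

doc = parsehtml("<body><span>this is a span 1</span><div></div></body>")
replace_tag!(doc.root[2][1], :abc)
```

After the call, `doc.root[2][1]` is the new `<abc>` element and the second child of the body is unchanged.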
What should I do? Is it possible to define new tags, and if so, how?
Win7 64
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.0.5 (2019-09-09)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
julia> using HTTP,Gumbo,AbstractTrees
julia> url="http://bbc.com";
julia> r=HTTP.request("GET", url; retries=4, cookies=true);
julia> doc=parsehtml(String(r.body));
julia> tag.(doc.root[:])
2-element Array{Symbol,1}:
:head
:body
julia>
julia> for elem in StatelessBFS(doc.root) println(tag(elem)) end
HTML
head
body
meta
meta
meta
title
script
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
link
link
link
link
link
link
meta
meta
link
script
script
script
script
script
script
script
link
script
script
style
script
script
script
script
script
script
script
script
script
script
script
script
script
script
script
link
script
script
script
link
script
script
noscript
div
div
div
div
script
script
div
header
div
div
script
script
script
script
div
script
script
script
script
script
script
ERROR: MethodError: no method matching tag(::HTMLText)
Closest candidates are:
tag(::HTMLElement{T}) where T at C:\Users\Julai1_0_5\.julia\packages\Gumbo\G7Qbw\src\manipulation.jl:6
Stacktrace:
[1] top-level scope at .\REPL[6]:1 [inlined]
[2] top-level scope at .\none:0
julia>
Thanks, Paul
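The error does not seem to be about unknown tags: the traversal also visits `HTMLText` nodes, and `tag` only has a method for `HTMLElement`. One way around it is to skip non-element nodes, as in this minimal sketch. It uses `PreOrderDFS` on a made-up document; the same `isa` guard should work inside the `StatelessBFS` loop above.

```julia
using Gumbo, AbstractTrees

doc = parsehtml("<p>Hello <b>world</b></p>")

# Only HTMLElement nodes carry a tag; text nodes are filtered out.
tags = [tag(node) for node in PreOrderDFS(doc.root) if node isa HTMLElement]

for t in tags
    println(t)
end
```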