
gumbo.jl's People

Contributors

andreasnoack, aviks, essenciary, femtocleaner[bot], hhaensel, iainnz, jiahao, juliatagbot, lorenzoh, markmont, pfitzseb, porterjamesj, scls19fr, ssfrr, staticfloat, stev47, tkelman


gumbo.jl's Issues

ERROR: UndefVarError: libgumbo not defined on MacBook M1

This is what I get when I run Gumbo v0.8.0 on Julia 1.6.2 on an Intel Mac:

julia> doc = parsehtml(read("index.html", String))
HTML Document:
<!DOCTYPE html>

And this is what I get when running Gumbo v0.8.0 on Julia 1.7.0 on an M1 Mac:

julia> doc = parsehtml(read("index.html", String))
ERROR: UndefVarError: libgumbo not defined
Stacktrace:
 [1] parsehtml(input::String; strict::Bool, preserve_whitespace::Bool)
   @ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
 [2] parsehtml(input::String)
   @ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
 [3] top-level scope
   @ REPL[3]:1

Gumbo strips away template tags

Ex:

julia> parsehtml("<template v-slot:avatar><q-icn name='moo' /></template>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML>
  <head></head>
  <body></body>
</HTML>

vs

julia> parsehtml("<templatee v-slot:avatar><q-icn name='moo' /></templatee>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML>
  <head></head>
  <body>
    <templatee v-slot:avatar="">
      <q-icn name="moo"></q-icn>
    </templatee>
  </body>
</HTML>

Gumbo: 0.8.0

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, icelake-client)

Thanks

Tagged whitespace disappears

See also Discourse thread. I strongly suspect this is based on the original Gumbo library's behavior (not having tested that), or maybe even is specified as part of the HTML5 parsing algorithm. In the latter case, I guess I'll just have to deal with it; in the former, if it's a bug, perhaps Gumbo.jl could still work around it somehow?

Anyway: The issue is that whitespace that is wrapped in tags disappears, contrary to how things are rendered in a browser, for example. In the following, I'm just using nodeText from Cascadia to extract the text; that may not be the best way to do it (and might even be related to the issue, though the whitespace does seem gone in the parsed HTML, too):

julia> using Gumbo, Cascadia

julia> x = parsehtml("<em>foo</em> bar<em> </em>baz")
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    <em>
      foo
    </em>
    bar
    <em></em>
    baz
  </body>
</HTML>

julia> nodeText(x.root)
"foo barbaz"

Here I would have wished for "foo bar baz", which is what a browser would display. The whitespace is not stripped if there's some non-whitespace in there:

julia> nodeText(parsehtml("foo<em> bar </em>baz").root)
"foo bar baz"

(Of course, using em on whitespace doesn't make much sense; I've just come across it in the wild, and am losing spaces when scraping certain pages, needing to figure out a workaround that isn't too hacky.)
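
The parsehtml method signature that appears in stack traces on this page lists a preserve_whitespace keyword. The following is a minimal sketch of trying that keyword on the example above. Whether it restores the tagged space depends on the installed Gumbo version, so the printed result is something to check, not a guarantee. The helper name alltext is hypothetical.

```julia
using Gumbo, AbstractTrees

html = "<em>foo</em> bar<em> </em>baz"

# preserve_whitespace is a keyword of parsehtml (it appears in the method
# signature shown in stack traces); its effect on this input is an assumption
# to verify, not documented behavior.
doc = parsehtml(html, preserve_whitespace=true)

# Hypothetical helper: concatenate every text node in document order.
alltext(node) = join(n.text for n in PreOrderDFS(node) if n isa HTMLText)

# Print with repr so that any whitespace is visible.
println(repr(alltext(doc.root)))
```

If the space is still missing, the loss is happening before Gumbo.jl builds its tree, and a workaround would have to operate on the raw HTML string instead.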

Build Error on OSX

Hi!

I'm trying to install Gumbo and got this build error. Calling Pkg.build directly gives the same error:

INFO: Attempting to Create directory /Users/vishalgupta/.julia/v0.3/Gumbo/deps/src/gumbo-1.0
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
========================================================[ ERROR: Gumbo ]========================================================

failed process: Process(tar xzf /Users/vishalgupta/.julia/v0.3/Gumbo/deps/downloads/gumbo-1.0.tar.gz --directory=/Users/vishalgupta/.julia/v0.3/Gumbo/deps/src, ProcessExited(1)) [1]
while loading /Users/vishalgupta/.julia/v0.3/Gumbo/deps/build.jl, in expression starting on line 19

========================================================[ BUILD ERRORS ]========================================================

WARNING: Gumbo had build errors.

  • packages with build errors remain installed in /Users/vishalgupta/.julia/v0.3
  • build a package and all its dependencies with Pkg.build(pkg)
  • build a single package by running its deps/build.jl script

Any help appreciated.

Failed to add package

I get the following error when adding Gumbo via Pkg.add("Gumbo") on a fresh installation of Julia built from git today:

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
===============================================[ ERROR: Gumbo ]===============================================

LoadError: failed process: Process(tar xzf /home/ibanez/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz --directory=/home/ibanez/.julia/v0.4/Gumbo/deps/src, ProcessExited(2)) [2]
while loading /home/ibanez/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19

===============================================[ BUILD ERRORS ]===============================================

WARNING: Gumbo had build errors.

  • packages with build errors remain installed in /home/ibanez/.julia/v0.4
  • build the package(s) and all dependencies with Pkg.build("Gumbo")
  • build a single package by running its deps/build.jl script

How does it work with HTTPS?

Julia Version 0.4.0-dev+3600 (2015-02-25 15:26 UTC)
Commit f96c23c* (43 days old master)
x86_64-w64-mingw32

julia> url="https://www.youtube.com/channel/UCdZwMpK-iWqCos46xPscDeg"
"https://www.youtube.com/channel/UCdZwMpK-iWqCos46xPscDeg"

julia>

julia> using HTTPClient.HTTPC

julia> using URIParser

julia> using Gumbo

julia> function customize_curl(curl)
           cc = LibCURL.curl_easy_setopt(curl, LibCURL.CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0")
           if cc != LibCURL.CURLE_OK
               error("CURLOPT_USERAGENT failed: " * LibCURL.bytestring(curl_easy_strerror(cc)))
           end
       end
customize_curl (generic function with 1 method)

julia>

julia> r = get(url,RequestOptions(
request_timeout=8.0,
callback=customize_curl))
ERROR: "Error executing request : Problem with the SSL CA cert (path? access rights?)"
in exec_as_multi at C:\Users\SAMSUNG2.julia\v0.4\HTTPClient\src\HTTPC.jl:702
in get at C:\Users\SAMSUNG2.julia\v0.4\HTTPClient\src\HTTPC.jl:371

julia>

error compiling __parsehtml#

Hi,

I installed Gumbo as described in the README, but when I try to run it I get this error:

julia> using Gumbo
julia> parsehtml("<h1> Hi</h1>")
ERROR: error compiling parsehtml: error compiling __parsehtml#1__: error compiling document_from_gumbo: box: expected bits type as first argument

I also got a few LD warnings during the build.

Implement pretty printing of elements and documents

I'm trying to read an HTML file and change a specific value from a running Julia script. I figured I could modify the contents of the HTML file with this package, but I'm unsure how to go about it.
Ideally, I'd use tree traversal to locate the specific element I want to modify, but after parsing the file and trying this a few times, it seems that my modification isn't stored: what I modify appears to be only a reference derived from the document, not the document itself.
My second option is to use the absolute position of the element in the file (which isn't hard to find right now, as the file is still small), and this seems like it should work.
However, once I get past that step, I'm unsure how to save the modified document back to the original file.
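
The workflow described above (locate an element, change it, write the document back) can be sketched as follows. This is a minimal sketch, not an official recipe: the helper find_by_id, the sample markup, and the temporary file are all hypothetical. It relies on attrs and setattr! from Gumbo and on print(io, doc) for serialization. Nodes returned by the traversal are the same objects that the document holds, so a change made through setattr! is visible in doc.

```julia
using Gumbo, AbstractTrees

doc = parsehtml("""<html><body><p id="status" class="old">value</p></body></html>""")

# Hypothetical helper: return the first element whose id attribute matches.
function find_by_id(root, id)
    for node in PreOrderDFS(root)
        if node isa HTMLElement && get(attrs(node), "id", "") == id
            return node
        end
    end
    return nothing
end

el = find_by_id(doc.root, "status")
setattr!(el, "class", "new")   # mutates the node that lives inside doc

# Serialize the modified document; a temporary path is used for illustration.
path = tempname() * ".html"
open(path, "w") do io
    print(io, doc)
end
```

Writing to the original path instead of a temporary one overwrites the source file. Note that the output is Gumbo's serialization of the tree, not the original text with a single edit applied.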

What to do? ERROR: automatic download failed (error: 2148270088): http://gazeta.pl

julia> using Gumbo

julia> using AbstractTrees

julia> using StringEncodings

julia> getpage(url) = parsehtml(String(read(download(url))))
getpage (generic function with 1 method)
ERROR: automatic download failed
What to do?
julia> text_only(doc::HTMLDocument) = text_only(doc.root)
text_only (generic function with 2 methods)

julia> text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")
text_only (generic function with 2 methods)

julia> get_page_text(url) = text_only(getpage(url))
get_page_text (generic function with 1 method)

julia> doc=parsehtml(decode(read(download("http://gazeta.pl")), "iso-8859-2"))
ERROR: automatic download failed (error: 2148270088): http://gazeta.pl
Stacktrace:
[1] download(::String, ::String) at .\interactiveutil.jl:598
[2] download(::String) at .\interactiveutil.jl:632

Paul

What to do with charset=Windows-1250?

When I parse the site rp.pl (charset=Windows-1250), every national character is lost, for example:
"t�umaczy"
"ustawie�"
The correct forms are
"tłumaczy"
"ustawień"

What should I do? How can I read this site correctly with this line?
doc=parsehtml(String(read(download(url))))
Paul
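
One possible approach is to decode the downloaded bytes from the declared charset before handing them to parsehtml, using decode from StringEncodings (the same function another report on this page calls with "iso-8859-2"). The sketch below uses a hard-coded byte vector in place of a real download, and it assumes the iconv library behind StringEncodings accepts the name "WINDOWS-1250".

```julia
using StringEncodings

# Bytes of "tłumaczy" as a Windows-1250 server would send them (ł is 0xB3).
raw = UInt8[0x74, 0xB3, 0x75, 0x6D, 0x61, 0x63, 0x7A, 0x79]

# String(raw) would interpret the bytes as UTF-8, which is where the
# replacement characters come from. decode converts from the declared charset.
s = decode(raw, "WINDOWS-1250")
println(s)

# With a real page, the same idea would read (assumption: url points to a
# Windows-1250 page):
#   doc = parsehtml(decode(read(download(url)), "WINDOWS-1250"))
```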

Gumbo install on 0.6

I'm running into an issue installing Gumbo in a Docker Debian installation (a relatively clean install).

I was trying to install Genie, which has Gumbo as a dependency:

ProcessExited(77)) [77]
while loading /root/.julia/v0.6/Gumbo/deps/build.jl, in expression starting on line 19

Are there any environment variables that need to be set prior to use?

Thanks
M

Not working in Alpine Linux or julia musl.

Hi there.

I use this in Alpine Linux, but I got this error:

julia> parsehtml("<h1> Hello, world! </h1>")
ERROR: could not load library "libgumbo.so.1"
Error loading shared library libgumbo.so.1: No such file or directory
Stacktrace:
 [1] parsehtml(input::String; strict::Bool, preserve_whitespace::Bool)
   @ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
 [2] parsehtml(input::String)
   @ Gumbo ~/.julia/packages/Gumbo/aBmWO/src/conversion.jl:4
 [3] top-level scope
   @ REPL[7]:1

julia>

What's wrong? How can I use it in Alpine Linux?

Wrong parsing of non standard (polymer) tags

Most likely this is not an issue in the Julia wrapper, but I'm wondering if you have any idea on how to solve this? Thank you!

I'm trying to parse polymer web components, but Gumbo chokes on them. Can it be "taught" how to handle extra elements?

genie> Gumbo.parsehtml("""<px-spinner size="100"></px-spinner>""")
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    <px-spinner size="100" size="100"></px-spinner size="100">
  </body>
</HTML>

ERROR: MethodError: no method matching tag(::HTMLText)

julia> url="http://rp.pl"
"http://rp.pl"

julia> doc=parsehtml(String(read(download(url))));

julia> for elem in PreOrderDFS(doc.root) println(tag(elem)) end
HTML
head
script
ERROR: MethodError: no method matching tag(::HTMLText)
Closest candidates are:
tag(::HTMLElement{T}) where T at C:\Users\PC.julia\packages\Gumbo\OhZJu\src\manipulation.jl:6
Stacktrace:
[1] top-level scope at .\REPL[86]:1 [inlined]
[2] top-level scope at .\none:0

Thx, Paul
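
The traversal visits text nodes as well as elements, and tag is only defined for HTMLElement, which is what the MethodError reports. A minimal sketch of one way around it is to skip anything that is not an element. The sample markup here is made up for illustration.

```julia
using Gumbo, AbstractTrees

doc = parsehtml("<h1>Hello <em>world</em></h1>")

tags = Symbol[]
for elem in PreOrderDFS(doc.root)
    # HTMLText nodes have no tag, so only elements are collected.
    elem isa HTMLElement || continue
    push!(tags, tag(elem))
end

println(tags)
```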

Segfault from `map(x->parsehtml(String(x.body)), xs)`

I'm assuming it was just an out-of-memory error, but Julia still crashed with a segfault, so I'm filing an issue.
The stack trace is rather long, so I've included only the start and end.

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
julia> b = map(x->parsehtml(String(x.body)), a)
241-element Array{HTMLDocument,1}:

signal (11): Segmentation fault
in expression starting at none:0
page_metadata at /buildworker/worker/package_linux64/build/src/gc.h:448 [inlined]
gc_setmark_pool at /buildworker/worker/package_linux64/build/src/gc.c:751 [inlined]
gc_setmark at /buildworker/worker/package_linux64/build/src/gc.c:758 [inlined]
gc_mark_loop at /buildworker/worker/package_linux64/build/src/gc.c:2572
_jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:2902
jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:3108
maybe_collect at /buildworker/worker/package_linux64/build/src/gc.c:827 [inlined]
jl_gc_pool_alloc at /buildworker/worker/package_linux64/build/src/gc.c:1142
iterate at ./compiler/ssair/ir.jl:393
iterate at ./compiler/ssair/ir.jl:385 [inlined]
replace_code_newstyle! at ./compiler/ssair/legacy.jl:71
optimize at ./compiler/optimize.jl:220
typeinf at ./compiler/typeinfer.jl:33

unknown function (ip: (nil))
Allocations: 86180299 (Pool: 86152106; Big: 28193); GC: 96
fish: '/home/sippycups/julia-1.5.2/bin…' terminated by signal SIGSEGV (Address boundary error)

Tag new release?

Now that 0.5 is out of RC (😉) it seems like a good time to tag a release of Gumbo with the 0.5 deprecation warning fixes.

“TypeError: in Tuple, in parameter, expected Type, got HTMLText” when iterating on 0.7.0-beta2.0

The following simple example runs fine using Julia 0.6.4.

using AbstractTrees
using Gumbo

for el in PreOrderDFS(parsehtml("""
        <h1>
            Foo
        </h1>
        """))
    @show el
end

However, on Julia 0.7.0-beta2.0 it goes down in flames.

ERROR: LoadError: TypeError: in Tuple, in parameter, expected Type, got HTMLText
Stacktrace:
 [1] has_non_default_iterate(::HTMLText) at ./essentials.jl:833
 [2] isiterable(::HTMLText) at ./essentials.jl:864
 [3] children at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:26 [inlined]
 [4] children at ~/.julia/packages/AbstractTrees/gbHm/src/traits.jl:36 [inlined]
 [5] children at ~/.julia/packages/AbstractTrees/gbHm/src/traits.jl:38 [inlined]
 [6] childstates at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:333 [inlined]
 [7] childstates(::HTMLElement{:HTML}, ::HTMLText) at ~/.julia/packages/AbstractTrees/gbHm/src/implicitstacks.jl:35
 [8] childstates(::AbstractTrees.ImplicitChildStates{HTMLElement{:HTML},AbstractTrees.ImplicitNodeStack{Any,Int64}}) at ~/.julia/packages/AbstractTrees/gbHm/src/implicitstacks.jl:41
 [9] iterate(::AbstractTrees.ImplicitChildStates{HTMLElement{:HTML},AbstractTrees.ImplicitNodeStack{Any,Int64}}, ::Int64) at ~/.julia/packages/AbstractTrees/gbHm/src/implicitstacks.jl:46 (repeats 2 times)
 [10] isempty at ./essentials.jl:721 [inlined]
 [11] stepstate(::PreOrderDFS{HTMLElement{:HTML}}, ::AbstractTrees.ImplicitNodeStack{Any,Int64}) at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:446
 [12] iterate(::PreOrderDFS{HTMLElement{:HTML}}, ::AbstractTrees.ImplicitNodeStack{Any,Int64}) at ~/.julia/packages/AbstractTrees/gbHm/src/AbstractTrees.jl:468
 [13] top-level scope at gumbobug.jl:9 [inlined]
 [14] top-level scope at ./<missing>:0
 [15] include at ./boot.jl:317 [inlined]
 [16] include_relative(::Module, ::String) at ./loading.jl:1034
 [17] include(::Module, ::String) at ./sysimg.jl:29
 [18] exec_options(::Base.JLOptions) at ./client.jl:234
 [19] _start() at ./client.jl:427
in expression starting at gumbobug.jl:4

I am sure this is due to some changes that came with 0.7, but I am still working on catching up with the release so I am not entirely sure what is going wrong.

URL dead (again)?

I got the following error message when executing:

Pkg.add("Gumbo")

Here is the error:

INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo

WARNING: deprecated syntax "[a=>b, ...]" at /home/ronie/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /home/ronie/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================

LoadError: failed process: Process(`curl -f -o /home/ronie/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /home/ronie/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19

================================================================================

================================[ BUILD ERRORS ]================================

WARNING: Gumbo had build errors.

 - packages with build errors remain installed in /home/ronie/.julia/v0.4
 - build the package(s) and all dependencies with `Pkg.build("Gumbo")`
 - build a single package by running its `deps/build.jl` script

================================================================================
INFO: Package database updated

Is the URL really dead or am I doing something wrong? =\

Google's Gumbo Parser officially archived

Hello,

I love what you did with this package.

Google's Gumbo Parser was archived on 16 February - will this affect this project?

Is there a roadmap of where this repository is going?

P. S. I am using this project in my work and am willing to contribute/help.

Something wrong in Julia 0.4 with Gumbo

Something is wrong in Julia 0.4 with Gumbo; with 0.3.6 it is OK on my machine.
(The file example.html is in the required directory.)
Julia Version 0.4.0-dev+3600 (2015-02-25 15:26 UTC)
Commit f96c23c* (5 days old master)
x86_64-w64-mingw32

julia> using Base.Test

julia> using Gumbo

julia> let
doc = open("example.html") do example
example |> readall |> parsehtml
end
io = IOBuffer()
print(io, doc)
seek(io, 0)
newdoc = io |> readall |> parsehtml
@test newdoc == doc
end

in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in print_to_string at string.jl:23
in print at C:\Users\SAMSUNG2.julia\v0.4\Gumbo\src\io.jl:88
in anonymous at no file:6

....................... while...

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Printing indentation is incorrect

For some reason Gumbo automatically closes <p> elements.

Ex:

julia> output = """
       <p class="repo-description">
         <blockquote class="search_headline">
           <p>
             Foo
           </p>
         </blockquote>
       </p>
       """
"<p class=\"repo-description\">\n  <blockquote class=\"search_headline\">\n    <p>\n      Foo\n    </p>\n  </blockquote>\n</p>\n"

julia> Gumbo.parsehtml(output)
HTML Document:
<!DOCTYPE >
Gumbo.HTMLElement{:HTML}:
<HTML>
  <head></head>
  <body>
    <p class="repo-description"></p>
    <blockquote class="search_headline">
      <p>
      Foo
    </p>
    </blockquote>
    <p></p>
  </body>
</HTML>

You can see in the above example how it closes <p class="repo-description"> and creates an empty <p></p> where the closing tag was.

Other elements are fine:

julia> output = """
       <div class="repo-description">
         <blockquote class="search_headline">
           <p>
             Foo
           </p>
         </blockquote>
       </div>
       """
"<div class=\"repo-description\">\n  <blockquote class=\"search_headline\">\n    <p>\n      Foo\n    </p>\n  </blockquote>\n</div>\n"

julia> Gumbo.parsehtml(output)
HTML Document:
<!DOCTYPE >
Gumbo.HTMLElement{:HTML}:
<HTML>
  <head></head>
  <body>
    <div class="repo-description">
      <blockquote class="search_headline">
        <p>
      Foo
    </p>
      </blockquote>
    </div>
  </body>
</HTML>

Thank you

What's wrong? ERROR: MethodError: `tag` has no method matching tag(::Gumbo.HTMLText) in anonymous at no file:2

What's wrong? I expected a list of all tags...

Julia Version 0.4.0-dev+3600 (2015-02-25 15:26 UTC)
Commit f96c23c* (4 days old master)
x86_64-w64-mingw32

julia> using HTTPClient.HTTPC

julia> using Gumbo

julia> r=HTTPC.post("http://requestb.in/api/v1/bins", "")
HTTP Code :200
RequestTime :14.804
Headers :
Connection : keep-alive
Via : 1.1 vegur
Content-Length : 84
Date : Mon, 02 Mar 2015 15:14:06 GMT
Content-Type : application/json
Access-Control-Allow-Origin : *
Server : gunicorn/18.0
Length of body : 84

julia> r.body
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=84, maxsize=Inf, ptr=85,
mark=-1)

julia> s=bytestring(r.body)
"{"color": [100, 140, 100], "name": "195qmnk1", "request_count": 0, "private": false}"

julia> doc=parsehtml(s)
HTML Document:

Gumbo.HTMLElement{:HTML}:

{"color": [100, 140, 100], "name": "195qmnk1", "request_count": 0, "private": false}

julia> for elem in preorder(doc.root)
println(tag(elem))
end
HTML
head
body
ERROR: MethodError: tag has no method matching tag(::Gumbo.HTMLText)
in anonymous at no file:2

julia>
Paul

Convert HTML file with table(s) to DataFrame

Hello,

I have an HTML file with a table and would like to convert it to a Julia DataFrame.

I was looking for a function similar to pandas' read_html in Python (which directly outputs a list of DataFrames).

Unfortunately, I don't see a similar function in the Julia ecosystem.

In the Gumbo docs, I was looking for an example that iterates over the rows and columns of each table.

Here is a basic HTML source file with 2 tables:

<!DOCTYPE >
<HTML>
  <head></head>
  <body>

    <h1>First table</h1>
    <table>
      <tbody>
        <tr>
          <th>
            A
          </th>
          <th>
            B
          </th>
        </tr>
        <tr>
          <td>
            1
          </td>
          <td>
            1.1
          </td>
        </tr>
        <tr>
          <td>
            2
          </td>
          <td>
            2.1
          </td>
        </tr>
      </tbody>
    </table>

    <h1>Second table</h1>
    <table>
      <tbody>
        <tr>
          <th>
            AA
          </th>
          <th>
            BB
          </th>
        </tr>
        <tr>
          <td>
            10
          </td>
          <td>
            10.1
          </td>
        </tr>
        <tr>
          <td>
            20
          </td>
          <td>
            20.1
          </td>
        </tr>
      </tbody>
    </table>

  </body>
</HTML>

I'm not sure whether such an example should be part of Gumbo, Cascadia, or even EzXML.jl.

Anyway, none of these projects shows an example with HTML tables, so there is probably room for doc improvement.

Kind regards

PS: related Stack Overflow post: https://stackoverflow.com/questions/42915962/extracting-and-constructing-tables-from-html-files-using-julia
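
Iterating over the rows and cells of a table can be sketched with Gumbo and AbstractTrees alone. This is a minimal sketch under stated assumptions, not an existing Gumbo feature: the helper names elements, celltext and table_rows are hypothetical, the sample markup is a cut-down version of the first table above, and cells are returned as strings without any type conversion.

```julia
using Gumbo, AbstractTrees

html = """
<table><tbody>
  <tr><th>A</th><th>B</th></tr>
  <tr><td>1</td><td>1.1</td></tr>
  <tr><td>2</td><td>2.1</td></tr>
</tbody></table>
"""
doc = parsehtml(html)

# Hypothetical helper: all elements with tag `t` below `node`, in document order.
elements(node, t::Symbol) =
    [n for n in PreOrderDFS(node) if n isa HTMLElement && tag(n) == t]

# Hypothetical helper: text content of one cell, surrounding whitespace removed.
celltext(cell) =
    String(strip(join(n.text for n in PreOrderDFS(cell) if n isa HTMLText)))

# Hypothetical helper: one Vector{String} per <tr>; <th> and <td> are treated alike.
function table_rows(table)
    [[celltext(c) for c in tr.children
      if c isa HTMLElement && tag(c) in (:th, :td)]
     for tr in elements(table, :tr)]
end

tables = elements(doc.root, :table)
rows = table_rows(tables[1])
header, data = rows[1], rows[2:end]
println(header)
println(data)
```

The header and data vectors can then be handed to whatever tabular package is in use. Nested tables would need extra care, because elements collects every tr below the table, including those of inner tables.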

Broken on Julia v0.6

Hi,

Does Gumbo work for you on v0.6? I'm getting an error coming from AbstractTrees - I was wondering if you found a workaround? I already opened an issue with AbstractTrees.

ERROR: LoadError: invalid subtyping in definition of PostOrderDFS
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:569
 [2] include(::String) at ./sysimg.jl:14
 [3] anonymous at ./<missing>:2
while loading /Users/adrian/.julia/v0.6/AbstractTrees/src/AbstractTrees.jl, in expression starting on line 521
ERROR: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: Failed to precompile AbstractTrees to /Users/adrian/.julia/lib/v0.6/AbstractTrees.ji.
compilecache(::String) at ./loading.jl:703
_require(::Symbol) at ./loading.jl:490
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
include(::String) at ./sysimg.jl:14
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
eval(::Expr) at /Users/adrian/.julia/v0.6/Genie/src/Renderer.jl:1
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
include_from_node1(::String) at ./loading.jl:569
eval(::Module, ::Any) at ./boot.jl:235
_require(::Symbol) at ./loading.jl:483
require(::Symbol) at ./loading.jl:398
macro expansion at ./distributed/macros.jl:99 [inlined]
anonymous at ./<missing>:?
include_from_node1(::String) at ./loading.jl:569
include(::String) at ./sysimg.jl:14
(::Base.Distributed.##135#136{Base.#include,Tuple{String},Array{Any,1}})() at ./distributed/remotecall.jl:314
run_work_thunk(::Base.Distributed.##135#136{Base.#include,Tuple{String},Array{Any,1}}, ::Bool) at ./distributed/process_messages.jl:56
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:339
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:367
remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
(::Base.##504#506{Base.JLOptions})() at ./task.jl:335
while loading /Users/adrian/.julia/v0.6/Gumbo/src/manipulation.jl, in expression starting on line 3
while loading /Users/adrian/.julia/v0.6/Gumbo/src/Gumbo.jl, in expression starting on line 29
while loading /Users/adrian/.julia/v0.6/Flax/src/Flax.jl, in expression starting on line 6
while loading /Users/adrian/.julia/v0.6/Genie/src/Renderer.jl, in expression starting on line 10
while loading /Users/adrian/.julia/v0.6/Genie/src/Router.jl, in expression starting on line 4
while loading /Users/adrian/.julia/v0.6/Genie/src/AppServer.jl, in expression starting on line 6
while loading /Users/adrian/.julia/v0.6/Genie/src/Genie.jl, in expression starting on line 36
while loading /Users/adrian/Dropbox/Projects/_test_app3/genie.jl, in expression starting on line 28
#remotecall_fetch#140(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:340
remotecall_fetch(::Function, ::Base.Distributed.LocalProcess, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:339
#remotecall_fetch#144(::Array{Any,1}, ::Function, ::Function, ::Int64, ::String, ::Vararg{String,N} where N) at ./distributed/remotecall.jl:367
remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{Any,N} where N) at ./distributed/remotecall.jl:367
(::Base.##504#506{Base.JLOptions})() at ./task.jl:335
Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] process_options(::Base.JLOptions) at ./client.jl:279
 [4] _start() at ./client.jl:371

Build Error (MethodError: No matching keys)

Hi! I'm using Julia 0.5.0 because of a bug in the package manager in version 0.4.6. When I run Pkg.add("Gumbo"), the build process ends with this error:

LoadError: MethodError: no method matching keys(::Array{Pair{Symbol,Symbol},1})
Closest candidates are:
  keys(::Associative{K,V})
while loading /home/yagox/.julia/v0.5/Gumbo/deps/build.jl, in expression starting on line 19

Thank you! :D

URL dead?

Dependency points to http://jamesporter.me/julia/gumbo-1.0.tar.gz but this URL now returns a 404

readhtml function (or read(fname, HTMLDocument))

Hello,

Gumbo has a parsehtml function, but reading HTML from a file currently has to be done using

parsehtml(read(filename, String))

Maybe a readhtml function should be defined as

readhtml(filename, args... ; kwargs...) = parsehtml(read(filename, String), args...; kwargs...)

Another API idea (mimicking the read API) would be to provide the following functions:

parse(HTMLDocument, input)

and

read(filename, HTMLDocument)

What is your opinion of this API idea?

Kind regards
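
The readhtml proposal above can be tried locally without any change to Gumbo. In the sketch below, readhtml is the hypothetical helper from the proposal (it is not part of Gumbo's API), and the temporary file stands in for a real HTML file.

```julia
using Gumbo

# Hypothetical helper following the definition proposed above; not part of Gumbo.
readhtml(filename, args...; kwargs...) =
    parsehtml(read(filename, String), args...; kwargs...)

# Write a small file to a temporary path for illustration.
path = tempname() * ".html"
write(path, "<h1>Hi</h1>")

doc = readhtml(path)
println(typeof(doc))
```

Keyword arguments are forwarded unchanged, so any keyword that parsehtml accepts can be passed through readhtml as well.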

What's wrong? ERROR: UndefVarError: Leaves not defined

Version 0.6.0 (2017-06-19 13:05 UTC)
Official http://julialang.org/ release
x86_64-w64-mingw32

julia> using Gumbo

julia> url="http://wp.pl"
"http://wp.pl"

julia> getpage(url) = parsehtml(String(read(download(url))))
getpage (generic function with 1 method)

julia> text_only(doc::HTMLDocument) = text_only(doc.root)
text_only (generic function with 1 method)

julia> text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")
text_only (generic function with 2 methods)

julia> get_page_text(url) = text_only(getpage(url))
get_page_text (generic function with 1 method)

julia> doc=parsehtml(String(read(download(url))));

julia> text_only(doc.root[2])
ERROR: UndefVarError: Leaves not defined
Stacktrace:
[1] text_only(::Gumbo.HTMLElement{:body}) at .\REPL[5]:1

julia> typeof(doc)
Gumbo.HTMLDocument

Thx, Paul
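
Leaves comes from AbstractTrees, and the session above only loads Gumbo, which would explain the UndefVarError. A minimal sketch of the same helpers with AbstractTrees loaded follows; the sample markup replaces the downloaded page so that the example runs offline.

```julia
using Gumbo
using AbstractTrees   # provides Leaves

text_only(doc::HTMLDocument) = text_only(doc.root)

# Join the text of every leaf that is a text node; element leaves such as an
# empty <head> are skipped by the isa check.
text_only(frag) = join([text(leaf) for leaf in Leaves(frag) if leaf isa HTMLText], " ")

doc = parsehtml("<p>Hello</p><p>world</p>")
println(text_only(doc))
```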

both AbstractTrees and Gumbo export "children"

In the current version of Gumbo, it seems to be impossible to use the children method without qualifying it. This seems perverse when Gumbo extends children from AbstractTrees. I don't yet have a satisfying solution to this.

julia> using Gumbo

julia> using AbstractTrees

julia> const ex = parsehtml("""
                            <html>
                            <head></head>
                            <body>
                            <p>a<strong>b</strong>c</p>
                            </body>
                            </html>
                            """);

julia> for n in children(ex.root)
          println(n)
       end
WARNING: both AbstractTrees and Gumbo export "children"; uses of it in module Main must be qualified
ERROR: UndefVarError: children not defined
 in anonymous at ./<missing>:?

Cleanup show, print, etc.

The whole display situation has ended up a bit of a mess.

I'd also like to think about display and MIME types, e.g. how should Gumbo types be shown in Jupyter notebooks.

Gumbo Build

I have a BinaryBuilder setup for Gumbo that builds the library on all platforms. This should make it easier and more reliable to support.

The repo is here: https://github.com/aviks/GumboBuilder

@porterjamesj happy to add you to the repo.

One question: In the repo above, I'm building the current master from https://github.com/google/gumbo-parser. In your current build.jl, you are downloading a version you call 1.0 from a private server. However, Gumbo's current released version seems to be 0.10.1, released about three years ago. Which version of the code are you actually running?

I'll submit a PR with the build.jl changes that are needed, after I get some clarity on the question above.

Broken on v0.6

You probably know already, but just in case - it's broken on v0.6 due to broken AbstractTrees.jl
Is it possible to remove the dependency?

Fails to build with 0.5

================================[ ERROR: Gumbo ]================================
LoadError: MethodError: no method matching keys(::Array{Pair{Symbol,Symbol},1})
Closest candidates are:
  keys(::Associative{K,V}) at dict.jl:118
while loading /home/travis/.julia/v0.5/Gumbo/deps/build.jl, in expression starting on line 19
================================================================================

Maintainership

Hi all,

I'm the original author of this package, but as is probably obvious to those of you who use it heavily, I don't really have the bandwidth or interest to maintain it anymore.

In practice, maintenance has been done by an ad-hoc group of JuliaWeb organization members for a while now. I'm opening this issue mostly to acknowledge this, formally hand-off maintenance, and so those interested can have a quick conversation to settle on a person or group of people who are going to maintain this package going forward.

cc'ing a few people who I feel like should be involved or might be interested in the outcome: @aviks @pfitzseb @essenciary

1.0 error on build

  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Building Gumbo → `~/.julia/packages/Gumbo/HKeb2/deps/build.log`
┌ Error: Error building `Gumbo`: 
│ ERROR: LoadError: MethodError: no method matching similar(::Dict{String,String})
│ Closest candidates are:
│   similar(!Matched::Array{T,1}) where T at array.jl:327
│   similar(!Matched::Array{T,2}) where T at array.jl:328
│   similar(!Matched::Array{T,1}, !Matched::Type) where T at array.jl:329
│   ...
│ Stacktrace:
│  [1] adjust_env(::Dict{String,String}) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/BinDeps.jl:388
│  [2] lower(::BinDeps.AutotoolsDependency, ::BinDeps.SynchronousStepCollection) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/BinDeps.jl:431
│  [3] |(::BinDeps.SynchronousStepCollection, ::BinDeps.AutotoolsDependency) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/BinDeps.jl:328
│  [4] generate_steps(::BinDeps.LibraryDependency, ::Autotools, ::Dict{Symbol,Any}) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:634
│  [5] satisfy!(::BinDeps.LibraryDependency, ::Array{DataType,1}) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:944
│  [6] satisfy!(::BinDeps.LibraryDependency) at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:922
│  [7] top-level scope at /home/mrg/.julia/packages/BinDeps/Z6fwm/src/dependencies.jl:977
│  [8] include at ./boot.jl:317 [inlined]
│  [9] include_relative(::Module, ::String) at ./loading.jl:1038
│  [10] include(::Module, ::String) at ./sysimg.jl:29
│  [11] include(::String) at ./client.jl:388
│  [12] top-level scope at none:0
│ in expression starting at /home/mrg/.julia/packages/Gumbo/HKeb2/deps/build.jl:19
└ @ Pkg.Operations ~/Desktop/juliamaster/usr/share/julia/stdlib/v1.0/Pkg/src/Operations.jl:1068

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca* (2018-08-08 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

How to extract the contents of elements such as div

How can I extract the contents of elements such as div?
I am after the whole "body" of each div.

julia> for elem in preorder(body)
           # println(elem)
           if typeof(elem) == HTMLElement{:div} push!(divy, elem) end
       end

julia> unique(divy)
289-element Array{Any,1}:
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}
Gumbo.HTMLElement{:div}

Paul
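
A minimal sketch of one way to get at the contents, assuming AbstractTrees is installed alongside Gumbo. The helper name `div_text` and the sample HTML are made up for illustration. The idea is to collect the `div` elements with a tree traversal and then read each element's `children` field, or gather its text leaves.

```julia
using Gumbo
using AbstractTrees

# Illustrative input; replace with the parsed page.
doc = parsehtml("<body><div>first</div><p>skip</p><div>second</div></body>")

# Collect every <div> element anywhere in the tree.
divs = [n for n in PreOrderDFS(doc.root) if n isa HTMLElement{:div}]

# Hypothetical helper: join the text of all HTMLText leaves below an element.
div_text(d) = join((leaf.text for leaf in Leaves(d) if leaf isa HTMLText), " ")

for d in divs
    println(div_text(d))   # text content of the div
    println(d.children)    # the child nodes themselves
end
```

The REPL output in the report shows only the type names because the collected elements are printed in a compact form; the nodes themselves still carry their children.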

Strange behaviour with autoclosing tags

Hello!

I noticed some strange behaviour, which can occasionally be useful but is rather dangerous in many cases: when a tag that must not be self-closing is nevertheless written as self-closing (for example <a href="ref"/>), the element stays open until the parent's closing tag, which changes the tree structure.

test = """<p>A simple <em>paragraph</em> with <br/> a <b>bad</b> <a href="ref"/>link <em>(which does not exist)</em>!</p>"""
doc = parsehtml(test, preserve_whitespace=true)
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    <p>
      A simple
      <em>
        paragraph
      </em>
      with
      <br></br>
      a
      <b>
        bad
      </b>
      
      <a href="ref">
        link
        <em>
          (which does not exist)
        </em>
        !
      </a>
    </p>
  </body>
</HTML>

I think it would be safer to be more conservative and insert the closing tag immediately after the opening tag, without swallowing the text that follows.

I think this result (written by hand as an example) would have been more consistent:

…
      with
      <br></br>
      a
      <b>
        bad
      </b>
      
      <a href="ref"></a>
        link
        <em>
          (which does not exist)
        </em>
        !
…

Another example, more visible:

test = """<p>A simple <em>paragraph</em> with <br/> a <b/>bad bold and a bad <a href="ref"/>link <em>(which does not exist)</em>!</p>"""
doc = parsehtml(test, preserve_whitespace=true)
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    <p>
      A simple
      <em>
        paragraph
      </em>
      with
      <br></br>
      a
      <b>
        bad bold and a bad
        <a href="ref">
          link
          <em>
            (which does not exist)
          </em>
          !
        </a>
      </b>
    </p>
  </body>
</HTML>

I don’t know if it is a bug or a feature, but in the latter case, maybe an argument to change this behaviour at will would be nice.

Thank you for your work, anyway!

How to save to disk HTMLElement ?

How can I save an HTMLElement to disk?

Saving it to JLD raises errors and crashes Julia.

r = HTTPC.get(url,RequestOptions(
request_timeout=8.0,
callback=customize_curl))
page = bytestring(r.body)
doc = parsehtml(page)
body = getBody(doc) # function getBody from Your page
julia> body
HTMLElement{:body}:

/\*
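
One option that avoids JLD entirely is to write the element out as HTML text and parse it again when it is needed. This is only a sketch: it assumes that `print(io, elem)` emits the element's markup (which is how Gumbo displays elements), and the file name and sample HTML are made up for illustration.

```julia
using Gumbo

# Illustrative input; in the report this would be the parsed page.
doc = parsehtml("<html><body><p>hello</p></body></html>")
body = doc.root[2]

# Write the element's markup to a file in a temporary directory.
path = joinpath(mktempdir(), "body.html")
open(path, "w") do io
    print(io, body)
end

# Reading it back is just another parse of the saved markup.
restored = parsehtml(read(path, String))
println(tag(restored.root[2][1]))
```

Note that the round trip goes through the HTML parser, so the restored tree is a full document rather than the bare element that was saved.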

Information about node position in parent

Hi,
I was trying to use Gumbo.jl to manipulate the DOM by changing the tags and content of nodes. However, I am confused by some of the behaviour.

doc = parsehtml("""
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <span>this is a span 1</span>
        <div>
        <h1>this is a heading</h1>
        <span>this is a span 2</span>
        </div>
    </body>
</html>
""");

Suppose I want to replace the first <span> node with an <abc> node:

julia> elem = doc.root[2][1]
HTMLElement{:span}:
<span>
  this is a span 1
</span>

The following does not work:

julia> elem = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)
HTMLElement{:abc}:
<abc>
  this is a span 1
</abc>

julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
  <head>
    <title>
      Title
    </title>
  </head>
  <body>
    <span>
      this is a span 1
    </span>
    <div>
      <h1>
        this is a heading
      </h1>
      <span>
        this is a span 2
      </span>
    </div>
  </body>
</HTML>

I have to assign the new element into the parent's children for it to work, which means I have to know the position of the node within its parent.

elem.parent.children[1] = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)

julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
  <head>
    <title>
      Title
    </title>
  </head>
  <body>
    <abc>
      this is a span 1
    </abc>
    <div>
      <h1>
        this is a heading
      </h1>
      <span>
        this is a span 2
      </span>
    </div>
  </body>
</HTML>

Is this behavior intended? I thought the entry in the parent's children and the child node would refer to the same object.

I also see that the position within the parent is available as index_within_parent. Would it be possible to store this on each node, in addition to the parent, children and attributes? With that information the issue above could be worked around.

struct Node{T}
    gntype::Int32  # enum
    parent::Ptr{Node}
    index_within_parent::Csize_t
    parse_flags::Int32  # enum
    v::T
end

Please let me know your thoughts, or whether I am approaching this from entirely the wrong direction. Is there a more straightforward way to manipulate the nodes?
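
On why the first attempt has no effect: `elem = HTMLElement{:abc}(...)` only rebinds the local variable `elem` to a new object, while the parent's `children` vector still holds the original `<span>`. Until an index is stored on the node, the position can be recovered by searching the parent's children by identity. A minimal sketch, using only the fields shown in this report (`children`, `parent`, `attributes`) and a shortened version of the document:

```julia
using Gumbo

doc = parsehtml("<html><head><title>Title</title></head><body><span>this is a span 1</span><div><h1>this is a heading</h1></div></body></html>")
elem = doc.root[2][1]   # the first <span>

# Locate the node inside its parent by object identity (===).
idx = findfirst(c -> c === elem, elem.parent.children)

# Build the replacement and re-point the moved children at it.
replacement = HTMLElement{:abc}(elem.children, elem.parent, elem.attributes)
for child in replacement.children
    child.parent = replacement
end

# Swap it into the tree at the position that was found.
elem.parent.children[idx] = replacement
println(tag(doc.root[2][1]))
```

Re-pointing `child.parent` keeps upward navigation from the moved children consistent with the new tree.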

ERROR: MethodError: no method matching tag(::HTMLText)

What should I do? Is it possible to define new tags, and if so, how?
Windows 7, 64-bit


               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.5 (2019-09-09)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using HTTP,Gumbo,AbstractTrees

julia> url=url="http://bbc.com";

julia> r=HTTP.request("GET", url; retries=4, cookies=true);

julia> doc=parsehtml(String(r.body));

julia> tag.(doc.root[:])
2-element Array{Symbol,1}:
 :head
 :body

julia>

julia> for elem in StatelessBFS(doc.root) println(tag(elem)) end
HTML
head
body
meta
meta
meta
title
script
meta
meta
meta
meta
meta
meta
meta
meta
meta
meta
link
link
link
link
link
link
meta
meta
link
script
script
script
script
script
script
script
link
script
script
style
script
script
script
script
script
script
script
script
script
script
script
script
script
script
script
link
script
script
script
link
script
script
noscript
div
div
div
div
script
script
div
header
div
div
script
script
script
script
div
script
script
script
script
script
script
ERROR: MethodError: no method matching tag(::HTMLText)
Closest candidates are:
  tag(::HTMLElement{T}) where T at C:\Users\Julai1_0_5\.julia\packages\Gumbo\G7Qbw\src\manipulation.jl:6
Stacktrace:
 [1] top-level scope at .\REPL[6]:1 [inlined]
 [2] top-level scope at .\none:0

julia>

Thanks Paul
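
The traversal visits text nodes as well as elements, and as the error message says, `tag` only has a method for `HTMLElement`. Rather than defining new tags, one approach is to skip anything that is not an element. A minimal sketch, assuming AbstractTrees is installed; the short HTML string is made up and stands in for the downloaded page:

```julia
using Gumbo
using AbstractTrees

doc = parsehtml("<p>a<strong>b</strong>c</p>")

# Only HTMLElement nodes have a tag; HTMLText nodes are skipped.
tags = Symbol[]
for node in StatelessBFS(doc.root)
    if node isa HTMLElement
        push!(tags, tag(node))
    end
end
println(tags)
```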
