yorickpeterse / oga Goto Github PK

Oga is an XML/HTML parser written in Ruby.

License: Mozilla Public License 2.0

Ruby 91.83% HTML 1.25% C 0.06% Java 0.03% Shell 0.06% Ragel 6.78%

ruby xml html parser xml-parser html-parser

oga's Introduction

Oga

NOTE: my spare time is limited which means I am unable to dedicate a lot of time on Oga. If you're interested in contributing to FOSS, please take a look at the open issues and submit a pull request to address them where possible.

Oga is an XML/HTML parser written in Ruby. It provides an easy to use API for parsing, modifying and querying documents (using XPath expressions). Oga does not require system libraries such as libxml, making it easier and faster to install on various platforms. To achieve better performance Oga uses a small, native extension (C for MRI/Rubinius, Java for JRuby).

Oga provides an API that allows you to safely parse and query documents in a multi-threaded environment, without having to worry about your applications blowing up.

From Wikipedia:

Oga: A large two-person saw used for ripping large boards in the days before power saws. One person stood on a raised platform, with the board below him, and the other person stood underneath them.

The name is a pun on Nokogiri.

Versioning Policy

Oga uses the version format MAJOR.MINOR (e.g. 2.1). An increase of the MAJOR version indicates backwards incompatible changes were introduced. The MINOR version is only increased when changes are backwards compatible, regardless of whether those changes are bugfixes or new features. Up until version 1.0 the code should be considered unstable meaning it can change (and break) at any given moment.

APIs explicitly tagged as private (e.g. using Ruby's private keyword or YARD's @api private tag) are not covered by these rules.

Examples

Parsing a simple string of XML:

Oga.parse_xml('<people><person>Alice</person></people>')

Parsing XML using strict mode (disables automatic tag insertion):

Oga.parse_xml('<people>foo</people>', :strict => true) # works fine
Oga.parse_xml('<people>foo', :strict => true)          # throws an error

Parsing a simple string of HTML:

Oga.parse_html('<link rel="stylesheet" href="foo.css">')

Parsing an IO handle pointing to XML (this also works when using Oga.parse_html):

handle = File.open('path/to/file.xml')

Oga.parse_xml(handle)

Parsing an IO handle using the pull parser:

handle = File.open('path/to/file.xml')
parser = Oga::XML::PullParser.new(handle)

parser.parse do |node|
  parser.on(:text) do
    puts node.text
  end
end

Using an Enumerator to download and parse an XML document on the fly:

enum = Enumerator.new do |yielder|
  HTTPClient.get('http://some-website.com/some-big-file.xml') do |chunk|
    yielder << chunk
  end
end

document = Oga.parse_xml(enum)

Parse a string of XML using the SAX parser:

class ElementNames
  attr_reader :names

  def initialize
    @names = []
  end

  def on_element(namespace, name, attrs = {})
    @names << name
  end
end

handler = ElementNames.new

Oga.sax_parse_xml(handler, '<foo><bar></bar></foo>')

handler.names # => ["foo", "bar"]

Querying a document using XPath:

document = Oga.parse_xml <<-EOF
<people>
  <person id="1">
    <name>Alice</name>
    <age>28</name>
  </person>
</people>
EOF

# The "xpath" method returns an enumerable (Oga::XML::NodeSet) that you can
# iterate over.
document.xpath('people/person').each do |person|
  puts person.get('id') # => "1"

  # The "at_xpath" method returns a single node from a set, it's the same as
  # person.xpath('name').first.
  puts person.at_xpath('name').text # => "Alice"
end

Querying the same document using CSS:

document = Oga.parse_xml <<-EOF
<people>
  <person id="1">
    <name>Alice</name>
    <age>28</name>
  </person>
</people>
EOF

# The "css" method returns an enumerable (Oga::XML::NodeSet) that you can
# iterate over.
document.css('people person').each do |person|
  puts person.get('id') # => "1"

  # The "at_css" method returns a single node from a set, it's the same as
  # person.css('name').first.
  puts person.at_css('name').text # => "Alice"
end

Modifying a document and serializing it back to XML:

document = Oga.parse_xml('<people><person>Alice</person></people>')
name     = document.at_xpath('people/person[1]/text()')

name.text = 'Bob'

document.to_xml # => "<people><person>Bob</person></people>"

Querying a document using a namespace:

document = Oga.parse_xml('<root xmlns:x="foo"><x:div></x:div></root>')
div      = document.xpath('root/x:div').first

div.namespace # => Namespace(name: "x" uri: "foo")

Features

Support for parsing XML and HTML(5)
- DOM parsing
- Stream/pull parsing
- SAX parsing
Low memory footprint
High performance (taking into account most work happens in Ruby)
Support for XPath 1.0
CSS3 selector support
XML namespace support (registering, querying, etc)
Windows support

Requirements

Ruby	Required	Recommended
MRI	>= 2.3.0	>= 2.6.0
JRuby	>= 1.7	>= 1.7.12
Rubinius	Not supported
Maglev	Not supported
Topaz	Not supported
mruby	Not supported

Maglev and Topaz are not supported due to the lack of a C API (that I know of) and the lack of active development of these Ruby implementations. mruby is not supported because it's a very different implementation all together.

To install Oga on MRI or Rubinius you'll need to have a working compiler such as gcc or clang. Oga's C extension can be compiled with both. JRuby does not require a compiler as the native extension is compiled during the Gem building process and bundled inside the Gem itself.

Thread Safety

Oga does not use a unsynchronized global mutable state. As a result of this you can parse/create documents concurrently without any problems. Modifying documents concurrently can lead to bugs as these operations are not synchronized.

Some querying operations will cache data in instance variables, without synchronization. An example is Oga::XML::Element#namespace which will cache an element's namespace after the first call.

In general it's recommended to not use the same document in multiple threads at the same time.

Namespace Support

Oga fully supports parsing/registering XML namespaces as well as querying them using XPath. For example, take the following XML:

<root xmlns="http://example.com">
    <bar>bar</bar>
</root>

If one were to try and query the bar element (e.g. using XPath root/bar) they'd end up with an empty node set. This is due to <root> defining an alternative default namespace. Instead you can query this element using the following XPath:

*[local-name() = "root"]/*[local-name() = "bar"]

Alternatively, if you don't really care where the <bar> element is located you can use the following:

descendant::*[local-name() = "bar"]

And if you want to specify an explicit namespace URI, you can use this:

descendant::*[local-name() = "bar" and namespace-uri() = "http://example.com"]

Like Nokogiri, Oga provides a way to create "dynamic" namespaces. That is, Oga allows one to query the above document as following:

document = Oga.parse_xml('<root xmlns="http://example.com"><bar>bar</bar></root>')

document.xpath('x:root/x:bar', namespaces: {'x' => 'http://example.com'})

Moreover, because Oga assigns the name "xmlns" to default namespaces you can use this in your XPath queries:

document = Oga.parse_xml('<root xmlns="http://example.com"><bar>bar</bar></root>')

document.xpath('xmlns:root/xmlns:bar')

When using this you can still restrict the query to the correct namespace URI:

document.xpath('xmlns:root[namespace-uri() = "http://example.com"]/xmlns:bar')

HTML5 Support

Oga fully supports HTML5 including the omission of certain tags. For example, the following is parsed just fine:

<li>Hello
<li>World

This is effectively parsed into:

<li>Hello</li>
<li>World</li>

One exception Oga makes is that it does not automatically insert html, head and body tags. Automatically inserting these tags requires a distinction between documents and fragments as a user might not always want these tags to be inserted if left out. This complicates the user facing API as well as complicating the parsing internals of Oga. As a result I have decided that Oga does not insert these tags when left out.

A more in depth explanation can be found here: #98

Documentation

The documentation is best viewed on the documentation website.

{file:CONTRIBUTING Contributing}
{file:changelog Changelog}
{file:migrating_from_nokogiri Migrating From Nokogiri}
{Oga::XML::Parser XML Parser}
{Oga::XML::SaxParser XML SAX Parser}
{file:xml_namespaces XML Namespaces}

Why Another HTML/XML parser?

Currently there are a few existing parser out there, the most famous one being Nokogiri. Another parser that's becoming more popular these days is Ox. Ruby's standard library also comes with REXML.

The sad truth is that these existing libraries are problematic in their own ways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works because of the non concurrent nature of MRI, on JRuby it works because it's implemented as Java. Nokogiri also uses libxml2 which is a massive beast of a library, is not thread-safe and problematic to install on certain platforms (apparently). I don't want to compile libxml2 every time I install Nokogiri either.

To give an example about the issues with Nokogiri on Rubinius (or any other Ruby implementation that is not MRI or JRuby), take a look at these issues:

Some of these have been fixed, some have not. The core problem remains: Nokogiri acts in a way that there can be a large number of places where it might break due to throwing around void pointers and what not and expecting that things magically work. Note that I have nothing against the people running these projects, I just heavily, heavily dislike the resulting codebase one has to deal with today.

Ox looks very promising but it lacks a rather crucial feature: parsing HTML (without using a SAX API). It's also again a C extension making debugging more of a pain (at least for me).

I just want an XML/HTML parser that I can rely on stability wise and that is written in Ruby so I can actually debug it. In theory it should also make it easier for other Ruby developers to contribute.

License

All source code in this repository is subject to the terms of the Mozilla Public License, version 2.0 unless stated otherwise. A copy of this license can be found the file "LICENSE" or at https://www.mozilla.org/MPL/2.0/.

oga's People

Contributors

Stargazers

Watchers

Forkers

headius beatrichartz ttasanen krasnoukhov polamjag lightguard abotalov ltk lunalium tgxworld jeffreybaird altmetric deivid-rodriguez dfockler pinwheeler jarredholman sylg anton-ivan jkthorne jbr paulrajgithub dsisnero sferik scotchi kbarrette vinchu gittmaan pikachuexe mfssangadji lloeki zumoshi nakabonne iainbeeston djberg96 bearerpipelinetest iq-scm plurimath gemmaro tjgfernandes faisal

oga's Issues

Lexing/parsing of XML declaration tags

Oga should be able to lex/parse tags such as <?xml version="1.0" ?>. An easy way would be to simply treat everything between <?xml and ?> as plain text.

Example input:

<?xml version="1.0"?>
<atom:feed>
  <!-- Normally here would be source, title, author, id, etc ... -->

  <link rel="hub" href="http://myhub.example.com/endpoint" />
  <link rel="self" href="http://publisher.example.com/happycats.xml" />
  <updated>2008-08-11T02:15:01Z</updated>

  <!-- Example of a full entry. -->
  <entry>
    <title>Heathcliff</title>
    <link href="http://publisher.example.com/happycat25.xml" />
    <id>http://publisher.example.com/happycat25.xml</id>
    <updated>2008-08-11T02:15:01Z</updated>
    <content>
      What a happy cat. Full content goes here.
    </content>
  </entry>

  <!-- Example of an entity that isn't full/is truncated. This is implied
       by the lack of a <content> element and a <summary> element instead. -->
  <entry >
    <title>Heathcliff</title>
    <link href="http://publisher.example.com/happycat25.xml" />
    <id>http://publisher.example.com/happycat25.xml</id>
    <updated>2008-08-11T02:15:01Z</updated>
    <summary>
      What a happy cat!
    </summary>
  </entry>

  <!-- Meta-data only; implied by the lack of <content> and
       <summary> elements. -->
  <entry>
    <title>Garfield</title>
    <link rel="alternate" href="http://publisher.example.com/happycat24.xml" />
    <id>http://publisher.example.com/happycat25.xml</id>
    <updated>2008-08-11T02:15:01Z</updated>
  </entry>

  <!-- Context entry that's meta-data only and not new. -->
  <entry>
    <title>Nermal</title>
    <link rel="alternate" href="http://publisher.example.com/happycat23s.xml" />
    <id>http://publisher.example.com/happycat25.xml</id>
    <updated>2008-07-10T12:28:13Z</updated>
  </entry>

</atom:feed>

Comments surrounding elements mess up the lexer

The following code:

Oga::XML::Parser.new(<<-EOF).parse
<!--was-->
<term tid="t2" type="open" lemma="be" pos="V" morphofeat="VBD">
  <span>
    <target id="w2"/>
  </span>
</term>
<!--Foobar-->
EOF

Returns the following document tree:

Document(
  doctype: nil
  xml_declaration: nil
  children: [
    Comment(text: "was-->\n<term tid=\"t2\" type=\"open\" lemma=\"be\" pos=\"V\" morphofeat=\"VBD\">\n  <span>\n    <target id=\"w2\"/>\n  </span>\n</term>\n<!--Foobar")
    Text(text: "\n")
])

It seems that the second comment () messes things up. In the above case the lexer spits out the following:

[[:T_COMMENT, "was-->\n<term tid=\"t2\" type=\"open\" lemma=\"be\" pos=\"V\" morphofeat=\"VBD\">\n  <span>\n    <target id=\"w2\"/>\n  </span>\n</term>\n<!--Foobar", 1], [:T_TEXT, "\n", 1]]

nil exception, i'm not sure what's causing it

doc = <<EOS
  <a:parent xmlns:a='http://example.com/A' xmlns:b='http://example.com/B'>
    <a:child>
      <b:grandchild />
    </a:child>
  </a:parent>
EOS
oxml = Oga.parse_xml(doc)

oxml.to_xml

/Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/attribute.rb:85:in `to_xml': undefined method `name' for nil:NilClass (NoMethodError)
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/element.rb:219:in `block in to_xml'
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/element.rb:218:in `each'
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/element.rb:218:in `to_xml'
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/node_set.rb:59:in `block in each'
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/node_set.rb:59:in `each'
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/node_set.rb:59:in `each'
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/document.rb:61:in `map'
    from /Users/jrochkind/.gem/ruby/1.9.3/gems/oga-0.1.1/lib/oga/xml/document.rb:61:in `to_xml'

Parsing of comments

This depends on #4. Given the input  the AST should be something along the lines of:

s(:document,
  s(:comment, 'foo'))

descendant-or-self performance is really bad

When querying the document found in https://gist.github.com/YorickPeterse/a81067a5a77e9d8ece2c using expression //wf it seems to take a really long time to complete. On the other hand, KAF/text/wf is nearly instant.

SAX interface

This is probably a feature request. I'm maintainer of sax-machine, which provides neat mappings for XML to Ruby objects. Also, sax-machine is a core part of feedjira gem which is the best solution to parse RSS/Atom feeds in Ruby.

It would be really nice to have oga as a backend for sax-machine and feedjira. But current sax-machine implementation requires parsing library to have SAX-like interface.

Looking forward to hear your opinion on this!

Better APIs for modifying DOM trees

Right now Oga only provides basic APIs for modifying DOM documents. For example, you can inject elements into a node set. However, when you have a single element there is no easy way to place something before/after it. The following should be added:

API for adding elements before a specific element
API for adding elements after a specific element

Improve API documentation

Better documentation should be added for (at least):

namespace usage and registration, especially when manually assigning namespaces
modifying documents (the README example is pretty basic)
general examples (e.g. for the most common use cases)
~~error output/handling~~ not available yet in 0.2
manually creating DOM trees directly from Ruby
correct YARD @param tags where needed

Consider removing source XML/HTML from parser errors

Right now the parser includes source code in error messages. That is, when an error occurs up to 5 lines before and after the problematic line are included in the error message. While this makes the error messages a bit more useful it has a few problems:

It requires a line number to be present for the current token. If I decide to disable this by default (see #51) then we wouldn't show the source code by default anyway
Lines are trimmed at 80 characters per line to keep things tidy. However, if the problem occurs at column 184 you wouldn't see it in the error message
When the input is an IO/StringIO instance the input has to be re-read in order to get the raw lines out. This is annoying if the input source is actually an IO instance (or Enumerator) wrapping an HTTP request (or other network action)
The code for this is just stupid complex

Honestly I implemented this as a gimmick in the first place. Perhaps now I have found myself enough reasons to just nuke this whole thing.

Move node_type into the pull parser

The node_type method found on various XML nodes (e.g. XML::Element#node_type) should be moved into the XML::PullParser class. This is something used only by the pull-parser, it doesn't make sense to expose it publicly as a method.

Common XML memory exhaustion attacks

https://pypi.python.org/pypi/defusedxml something to keep in mind when designing Oga.

Channel-like parsing API

The article https://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog discusses an interesting topic: a pull-parsing API where you can define the node types to act on. This is similar to most Channel APIs where you specify a action to occur for a certain data type.

I'll have to look into up to what extend I'll include this in Oga but it is definitely something I want to be available one way or another. There are two options that I can think of:

Provide an API to run actions on node types (e.g. something like on(:h3) { ... }
Provide an API that can do the above but can also filter by attributes/nesting. This should probably be built on top of the above API (instead of being the same API).

Sample:

parser = Oga::PullParser.new('<p><a href="#">Foo</a></p>')

parser.on(:a, :href => '#') do |node|
  p node.text # => "Foo"
end

parser.iterate

Poorly formatted HTML stalls the parser.

Oga version 0.1.1

These:

Oga.parse_xml('</')

Oga.parse_html('</')

stall the parser completely.

Nokogiri Migration Guide

To ease the migration of Nokogiri to Oga a guide should be created outlining the various changes required to one's code to use Oga.

I've decided to do this instead of creating a Nokogiri compatible API. This is due to the latter being a rather fragile solution. That is, if some third-party Gem were to use Nokogiri and your own project the Oga compatibility layer then those two could conflict.

Said guide would have to at least discuss the following:

Method call changes used for creating/parsing documents
The methods to use as replacements for Nokogiri's #css and #xpath methods (if they happen to be named differently)
Types used by various methods (e.g. what the replacement is for Nokogiri::XML::Fragment)
Much more

Encode documents according to XML declaration tags / HTML charset tags

Currently Oga just uses Ruby's default encoding when parsing documents. XML documents however can specify the document encoding using XML declaration tags (e.g. <?xml encoding="..." ?>). In HTML documents a <meta charset="..." /> can also be used for this.

I'm not sure if Oga should automatically change the encoding or not. For example, what should happen if the encoding value is not recognized by Ruby? It could also have some performance drawbacks when parsing. First the document would be encoded in the default encoding, then it would be changed to whatever the document specifies.

In case I decided not to add this I should at least document the rationale behind that choice.

Setup JRuby extension

Currently the native-ext branch contains a C extension for the XML lexer. This same setup should be ported over to JRuby. Because I don't want to maintain the same Ragel grammar in 2 places I'll be generating JRuby/C code using this grammar. The following steps are involved:

Port over the C extension to a JRuby extension, keep the grammars separate for now. Once done, test everything to make sure it's working.
Port over the setup to some kind of "template" system that generates the C and JRuby code. This should remove the need for 2 identical Ragel grammars.

Basically the Ragel grammar would contain a set of placeholders that will be replaced with either C or Java code. The resulting file is then passed to Ragel and compiled.

output empty elements as empty elements?

Nokogiri will output an empty element as an empty element (no closing tag, xml empty element syntax)

nxml = Nokogiri::XML("<document><element /></document>")
nxml.to_xml 

# =>
# <?xml version=\"1.0\"?>
# <document>
#   <element/>
# </document>

Oga will output as an opening and closing tag, just containing no content.

oxml = Oga.parse_xml("<document><element /></document>")
oxml.to_xml

# =>
# <document><element></element></document>

Would it make sense for oga to do the same thing here? If noticing an element that has completely empty content (not even a text node), serialize as an XML empty element (no closing tag)?

Other discrepencies noticed while preparing this: nokogiri automatically always outputs the <?xml> declaration -- should oga?

Nokogiri also adds newlines and indents -- would it make sense for some kind of pretty-printing to be at least an optional feature in oga?

Using IO instances as lexer input

The lexer should be able to take an IO object as input. This would allow Oga to parse files as a stream without requiring users to first load the entire file into memory.

Querying elements with default namespaces is painful

When an element has a custom default namespace associated, querying said element becomes painful. Due to the namespace being present you have to specify a wildcard as the namespace name in order to retrieve it:

xml = '<foo xmlns="bar"></foo>'
doc = Oga.parse_xml(xml)

doc.xpath('foo') # => NodeSet()
doc.xpath('*:foo') # => NodeSet(Element(name: "foo" namespace: Namespace(name: "xmlns" uri: "bar") attributes: [Attribute(name: "xmlns" value: "bar")]))

To solve this the XPath evaluator should only test namespace names if a name is explicitly set.

cc @hannesfostie

Streaming APIs

Currently when one tries to lex a large XML file (say, 100MB) memory usage of the process will continue to grow until it gets OOM killed. The same would most likely happen to the parser and anything built on top of it.

In order to support lexing/parsing of large files the various APIs (or at least the lexer and parser) should support streaming of input. In case of the lexer this would mean that tokens are not stored in an Array but instead passed to a block of some sort. Rough sketch:

lexer = Oga::Lexer.new

lexer.stream('.....') do |token|
  # do something with the current token
end

Nokogiri Pain Points

When I started sharing the word of my work on Oga various developers remarked that they were very happy with a pure Ruby XML/HTML parser. I found this a bit surprising as I've always assumed people were generally happy enough with Nokogiri (at least before they started shipping libxml). To be more specific, I've not come across a lot of negative articles/resources about Nokogiri.

As a result of this I'll be using this issue to keep track of requests/suggestions/problems people currently have with Nokgiri and XML/HTML parsing in general. In particular I'd like to know what people dislike about Nokogiri to see if I can whip together something for that.

In other words, if there's something about Nokogiri that absolutely pisses you off please specify so in a comment below.

Lexing of comments

The lexer should support comments. These should be treated as special tokens instead of T_TEXT.

Given this input:

<!-- foo -->

The resulting tokens should be something along the lines of this:

[:T_COMMENT_START, '<!--']
[:T_SPACE, ' ']
[:T_TEXT, 'foo']
[:T_SPACE, ' ']
[:T_COMMENT_END, '-->']

Xml Parsing Architecture

Not really an issue but here's a good article on one of the fastest parsers which has similar ideas to vtd-xml

http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html

Top-level namespaces take precedence over more deeply nested ones

Code:

document = Oga.parse_xml <<-EOF
<root xmlns:x="1">
  <div xmlns:x="2">
    <x:text>Foo</x:text>
  </div>
</root>
EOF

document.at_xpath('root/div/x:text').available_namespaces

This should should the second namespace, but instead it returns the first one. This is due to the way available_namespaces merges namespaces together: it starts at the bottom, adds namespaces, then moves further up. This results in higher-level namespaces overwriting more deeply nested ones (instead of the other way around).

Unescape HTML entities in attributes (SAX)

This is a bit different to a #49, basically because it's related only to SAX parsing differences comparing to Nokogiri.

Consider following example:

require 'nokogiri'
require 'oga'

xml = '<link rel="alternate" type="text/html" href="http://example.com/?param1=1&amp;param2=2" />'

class NokogiriHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs)
    puts attrs.last.last
  end
end

class OgaHandler
  def on_element(namespace, name, attrs)
    puts attrs.last
  end
end

puts '--- Nokogiri (without replace_entities)'
parser = Nokogiri::XML::SAX::Parser.new(NokogiriHandler.new)
parser.parse(xml)

puts '--- Nokogiri (with replace_entities)'
parser = Nokogiri::XML::SAX::Parser.new(NokogiriHandler.new)
parser.parse(xml) do |ctx|
  ctx.replace_entities = true
end

puts '--- Oga'
Oga.sax_parse_xml(OgaHandler.new, xml)

The output is:

--- Nokogiri (without replace_entities)
http://example.com/?param1=1&#38;param2=2
--- Nokogiri (with replace_entities)
http://example.com/?param1=1&param2=2
--- Oga
http://example.com/?param1=1&amp;param2=2

As you Nokogiri provides an option to make HTML entities to be unescaped in resulting attributes. The question is this relevant to Oga and how it should be handled?

Optimize XML parser

The various XML parser benchmarks show a dramatic decrease in performance (= increase in execution time) compared to the XML lexer. While it is expected for execution timings to increase a bit I did not expect them to go from 500 ms to around 3,5 seconds. With an empty grammar (that is, no associated Ruby code) Racc would still take around 1,8 seconds to process 10MB of XML. This is not acceptable.

There are two possible causes here:

Racc is simply not fast enough
I'm a dumbass and messed things up somewhere down the line

While option 2 is certainly possible (it usually is the case) I'm slowly beginning to suspect Racc itself is problematic as well. To figure this out I should set up a few Racc benchmarks and see how it behaves outside of Oga. This includes some form of benchmarking for both the Ruby and C code (as most of Racc happens in C).

In case Racc is deemed to be the culprit there are 3 options to choose from:

Improve Racc itself where possible, the most logical solution
Fork Racc, fix it and try to get as many changes merged in upstream Racc
Ditch Racc completely and use something such as Lemon instead.

Option 1 is the most desirable as it would allow others to also easily benefit from the potential tweaks.

Option 2 should only be used in case the Racc maintainers refuse to merge changes or otherwise disagree with them.

Option 3 is probably going to be the most webscale option. Sadly this means the XML lexer and parser will have to be written in pure C and coupled to Ruby using FFI in order to support JRuby. This in turn means that somehow Ruby would have to start managing the memory of the FFI pointers and free them the moment they and their associated documents go out of scope.

Unable to install on JRuby 1.7.15

Found while working on pittmesh/kismet-gpxsml#2...

[colin@kid kismet-gpsxml (oga)]$ gem install oga
Building native extensions.  This could take a while...
ERROR:  Error installing oga:
        ERROR: Failed to build gem native extension.

    /Users/colin/.rvm/rubies/jruby-1.7.15/bin/jruby extconf.rb
NotImplementedError: C extension support is not enabled. Pass -Xcext.enabled=true to JRuby or set JRUBY_OPTS.

   (root) at /Users/colin/.rvm/rubies/jruby-1.7.15/lib/ruby/shared/mkmf.rb:8
  require at org/jruby/RubyKernel.java:1065
   (root) at /Users/colin/.rvm/rubies/jruby-1.7.15/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1
   (root) at extconf.rb:1


Gem files will remain installed in /Users/colin/.rvm/gems/jruby-1.7.15@gpsxml/gems/oga-0.1.1 for inspection.
Results logged to /Users/colin/.rvm/gems/jruby-1.7.15@gpsxml/gems/oga-0.1.1/ext/c/gem_make.out

[colin@kid kismet-gpsxml (oga)]$ ruby --version
jruby 1.7.15 (1.9.3p392) 2014-09-03 82b5cc3 on Java HotSpot(TM) 64-Bit Server VM 1.7.0_21-b12 +jit [darwin-x86_64]

Support Enumerator as an input source

The Enumerator class can be used to stream data without having to actually write a dedicated class for it. For example:

enum = Enumerator.new do |yielder| 
  HTTPClient.get('http://some-website.com/some-big-file.xml') do |chunk| 
    yielder << chunk
  end
end

document = Oga.parse_xml(enum)

Optimize checking of capitalized HTML void elements

Currently the lexer contains the following code to deal with capitalized HTML void element names: https://github.com/YorickPeterse/oga/blob/master/lib/oga/xml/lexer.rb#L325. This results in an extra String allocation for every open tag of an element (String#downcase calls String#dup internally).

The most likely solution here is to expand Oga::XML::HTML_VOID_ELEMENTS so that it contains all possible permutations of the element names. This has a bit of a "wtf" ring to it, but this will save quite the amount of allocations.

should namespace-uri() work with default namespaces?

I may be wrong or confused about what the right or intended behavior here is. But.

require 'oga'

doc = <<-EOS
  <root>
    <container xmlns="http://example.org/default">
      <element />
    </container>
  </root>
EOS

oxml = Oga.parse_xml(doc)
puts oxml.xpath("//*[namespace-uri()='http://example.org/default']").size

oga returns 0 elements there.

nokogiri/libxml returns 2 elements (the <container> and <element>) with the same document and xpath query.

oga namespace-uri() seems like it does not take account of default namespaces. Should it?

Lexing CDATA tags

Currently working on this but both the lexer and HTML parser should support CDATA tags. CDATA tags should be broken up into 3 components:

The start tag <![CDATA[
The body, which can be anything but the end tag
The end tag ]]>

Worth mentioning, the following is valid (and currently not lexed properly):

<![CDATA[]]]]>

This is also valid:

<![CDATA[]]]>

And so is this:

<![CDATA[foo]]]]>

It's worth noting that CDATA tags are not part of the HTML specification. However, since the lexer will be re-used for XML/SGML it should support it. Seeing how there are websites out there that slap CDATA into their HTML (in particular in xhtml documents as it is valid there) the HTML parser should also support it.

Unescape HTML entities

Not sure if this is a bug, but consider following example:

require 'nokogiri'
require 'oga'

xml = '<content>&lt;div&gt;OMG&lt;/div&gt;</content>'

puts '--- Nokogiri'
puts Nokogiri::XML(xml).children.first.text

puts '--- Oga'
puts Oga.parse_xml(xml).children.first.text

The output is:

--- Nokogiri
<div>OMG</div>
--- Oga
&lt;div&gt;OMG&lt;/div&gt;

It looks like Nokogiri does unescape HTML entities in a node text by default. The question is should Oga do the same?

Lexing doctypes

Both the lexer and the HTML parser should support the processing of doctypes. Currently both already have basic support for doctypes but this should be improved so that it's easier to extract information such as the doctype URL.

Lexing of regular tags

Currently there is very, very basic support for lexing HTML tags. The resulting tokens aren't exactly useful though as T_TEXT is used for pretty much everything.

Given the tag <p class="foo">Bar</p> the resulting tokens should probably be something along the lines of the following:

[:T_SMALLER, '<']
[:T_TAG, 'p']
[:T_SPACE, ' ']
[:T_TEXT, 'class']
[:T_EQUALS, '=']
[:T_DQUOTE, '"']
[:T_TEXT, 'foo']
[:T_DQUOTE, '"']
[:T_GREATER, '>']
...

I'm not really sure yet if attribute names/values should already be treated as such on lexer level or whether they should be treated as regular text. In both cases it doesn't really make things easier in the parser.

Predicate doesn't take effect when descendant is specified on the same node

I don't trust my understanding of XPath yet. Is this a bug? I expect the last set to be empty:

document = Oga.parse_xml('<a></a>')

document.xpath('a')                 # => NodeSet(Element(name: "a"))
document.xpath('a[@b]')             # => NodeSet()
document.xpath('child::a')          # => NodeSet(Element(name: "a"))
document.xpath('child::a[@b]')      # => NodeSet()
document.xpath('descendant::a')     # => NodeSet(Element(name: "a"))
document.xpath('descendant::a[@b]') # => NodeSet(Element(name: "a"))

Fully capitalized <BR> yields error.

Oga version 0.1.1

I'm not sure this behavior is expected, but I tried to parse some HTML which included <BR>.

Oga.parse_html('<BR>')

I got:

Racc::ParseError: Unexpected $end with value false on line 1:

I expected:

=> Document(
  children: NodeSet(Element(name: "BR"))
)

Notes: these work as expected.

Oga.parse_html('<br>')
Oga.parse_html('<br/>')
Oga.parse_html('<BR/>')

Syntax Error Handling

Oga should be capable of dealing with invalid XML/HTML up to a certain extend. To deal with this an input correction system would have to be implemented at the lowest level possible.

Initially I thought about implementing this between the lexer and the parser. The problem however is is that tokens from the lexer are emitted one by one instead as a whole. As a result of this you can not keep track of what context you're currently in without hindering performance. An alternative solution is to do this on parser (Racc) level.

Regardless of where it takes place this system should be capable of correcting most common mistakes. However, I don't want to sacrifice too much for the sake of forgiving invalid input. In other words, there has to be a balance between forgivingness and correctness.

Count newlines in C/Java or disable by default

Currently the lexer counts newlines in Ruby for T_TEXT nodes. This isn't exactly the most efficient way of doing things (sadly). For example, with this procedure in place the lexer can chew through 10MB of XML in ~520 ms. Removing this results in the lexer processing the same amount of data in ~420 ms.

Considering this is a feature only used when displaying parser errors it's a waste to always have this running. There are two solutions to this problem:

Move the counting of lines to C/Java. I'm not a huge fan of this as this will result in similar (but different) code being implemented in both C and Java to achieve the same result.
Disable counting of newlines by default, only enable if some sort of debug option is set.

In case option 2 is chosen the parser's on_error method has to be modified so that it doesn't display source code in error messages if there's no line number.

after_element doesn't receive meaningful arguments in the SAX API

In the SAX API the return value of on_element is reset, this in turn means that after_element only receives nil as an argument (#31 (comment)). Instead this method should receive at least the element name.

Clean up XML lexer

Before I continue working on XPath support the XML lexer should be cleaned up. Every Ragel action should only omit a single token. The actions for lexing elements should also be fixed as they are currently set up in a rather hacky way. As a result of this input such as > is ignored when it should be lexed as T_TEXT.

default namespaces lost in to_xml round-trip

doc = <<-EOS
  <root xmlns="http://example.org/default">
    <element />
  </root>
  EOS
oxml = Oga.parse_xml(doc)
puts oxml.to_xml

#puts: 
#  <root>
#    <element></element>
#  </root>

I think it ought to preserve the default namespace decleration on to_xml output?

Missing space serializing attributes

#!/usr/bin/ruby

require 'oga'

input = <<EOS
<doc>
  <thing x="3" y="5"/>
</doc>
EOS

doc = Oga.parse_xml(input)

node = doc.at_xpath("//thing")

node.attr("x").value = "45"

puts doc.to_xml

outputs:

<doc>
  <thing x="45"y="5"></thing>
</doc>

I don't think that's valid XML, it's definitely not very pretty XML, there should be a space after the quoted attribute value before the next attribute.

CSS Selector Handling

Similar to #10 Oga should be able to handle CSS selectors for #9 . These will be built on top of XPath selectors. That is, a given CSS selector will be compiled into its XPath equivalent.

For this the following has to be added:

document root xpath not recognized

xpath / should select the root node.

/ selects the document root (which is always the parent of the document element)
http://www.w3.org/TR/xpath/

Assume:

str = <<EOS
    <root>
        <div>
          <foo/>
        </div>
    </root>
EOS
doc = Oga.parse_xml(str)
doc.xpath("/")
# should return the <root> node, instead raises Racc::ParseError

Works as expected in nokogiri.

Lexing inline Javascript/CSS

Inline Javascript/CSS should be lexed as T_TEXT. Right now it seems to be processed as if it is XML.

Support for the HTML5 syntax

This will be a tricky one due the W3, after taking too many drugs, deciding to completely butcher the HTML syntax. For example, every void element (e.g. <link>) can omit the closing /> and instead use >. In other words, this is now valid:

<link href="...">

This will require either a dedicated lexer/parser or an extra flag. In case of the latter this flag would be used to indicate HTML5 parsing mode. This in turn will require a lookup table to figure out where to allow this. Sadly this will mean the lexer/parser is no longer context free.

DOM API

A DOM API should be introduced that acts similar to Nokogiri (in certain ways). The DOM API should be built on top of the AST returned by Oga::Parser. It should allow the querying of elements using XPath and CSS. The latter would be built on top of XPath.

Rough sketch of the DOM API:

document = Oga::DOM::HTML.new('<p class="foo">Hello</p>')

document.css('p.foo').length # => 1
document.xpath('//p[@class="foo"]').length # => 1

TODO list:

XPath handling: #10
CSS selector handling: #11
DOM API (shocking)
- Proper API for modifying the tree (unlike the crap Nokogiri offers), in particular adding new nodes has to work nicely
- Better system for storing next, previous parent and child nodes. Instead of storing nodes directly in to each other a node should only contain its current index in its containing node set. Using this index we can quickly figure out surrounding nodes without having to keep 1238719381 references around.
- Methods for querying the tree using XPath/CSS selectors.
- Pass the root document to each node (required for absolute XPath expressions)

Lexing XML attribute values with newlines triggers memory allocation errors

This only happens when an IO instance is used as the input. Simple repro:

Oga.parse_xml(StringIO.new("<foo bar='\n10'></foo>"))

This results in:

NoMemoryError: failed to allocate memory
from /home/yorickpeterse/Private/Projects/ruby/oga/lib/oga/xml/lexer.rb:139:in `advance_native'

Digging around, it seems that ts in the C code is set to NULL when processing attribute values (= strings). Not sure yet what the heck is going on.

XPath Handling

For #9 Oga needs to be able to parse XPath queries into an AST. Said AST then has to be consumed in order to figure out what nodes to pull from a DOM document.

For this the following would have to be added:

The AST consumer would be fed a DOM document and, based on the XPath query, returns a list of nodes.

Ox like errors?

I'm digging this already 👍

This is sort of a feature request, but when parsing XML with Ox There are some pretty useful errors.

invalid_div = "<div>Test"
result = Ox.parse(invalid_div)
# Ox::ParseError: invalid format, document not terminated at line 1, column 16 [parse.c:527]
# ...

Currently with Oga it looks like

invalid_div = "<div>Test"
result = Oga.parse_html(invalid_div)
# Racc::ParseError: Unexpected $end with value false on line 1:
# ...