kbrw / sweet_xml

License: MIT License

sweet_xml's Introduction

SweetXml

SweetXml is a thin wrapper around :xmerl. It allows you to convert a charlist or an xmlElement record, as defined in :xmerl, to an Elixir value such as a map, list, string, integer, float or any combination of these.

Installation

Add dependency to your project's mix.exs:

def deps do
  [{:sweet_xml, "~> 0.7.4"}]
end

SweetXml depends on :xmerl. On some Linux systems, you might need to install the package erlang-xmerl.

Examples

Given an XML document such as below:

<?xml version="1.0" encoding="UTF-8"?>
<game>
  <matchups>
    <matchup winner-id="1">
      <name>Match One</name>
      <teams>
        <team>
          <id>1</id>
          <name>Team One</name>
        </team>
        <team>
          <id>2</id>
          <name>Team Two</name>
        </team>
      </teams>
    </matchup>
    <matchup winner-id="2">
      <name>Match Two</name>
      <teams>
        <team>
          <id>2</id>
          <name>Team Two</name>
        </team>
        <team>
          <id>3</id>
          <name>Team Three</name>
        </team>
      </teams>
    </matchup>
    <matchup winner-id="1">
      <name>Match Three</name>
      <teams>
        <team>
          <id>1</id>
          <name>Team One</name>
        </team>
        <team>
          <id>3</id>
          <name>Team Three</name>
        </team>
      </teams>
    </matchup>
  </matchups>
</game>

We can do the following:

import SweetXml
doc = "..." # as above

Get the name of the first match:

result = doc |> xpath(~x"//matchup/name/text()") # `sigil_x` for (x)path
assert result == 'Match One'

Get the XML record of the name of the first match:

result = doc |> xpath(~x"//matchup/name"e) # `e` is the modifier for (e)ntity
assert result == {:xmlElement, :name, :name, [], {:xmlNamespace, [], []},
        [matchup: 2, matchups: 2, game: 1], 2, [],
        [{:xmlText, [name: 2, matchup: 2, matchups: 2, game: 1], 1, [],
          'Match One', :text}], [],
        ...}

Get the full list of matchup names:

result = doc |> xpath(~x"//matchup/name/text()"l) # `l` stands for (l)ist
assert result == ['Match One', 'Match Two', 'Match Three']

Get a list of winner-id attribute values:

result = doc |> xpath(~x"//matchup/@winner-id"l)
assert result == ['1', '2', '1']

Get a list of matchups with different map structure:

result = doc |> xpath(
  ~x"//matchups/matchup"l,
  name: ~x"./name/text()",
  winner: [
    ~x".//team/id[.=ancestor::matchup/@winner-id]/..",
    name: ~x"./name/text()"
  ]
)
assert result == [
  %{name: 'Match One', winner: %{name: 'Team One'}},
  %{name: 'Match Two', winner: %{name: 'Team Two'}},
  %{name: 'Match Three', winner: %{name: 'Team One'}}
]

Or directly return a mapping of your liking:

result = doc |> xmap(
  matchups: [
    ~x"//matchups/matchup"l,
    name: ~x"./name/text()",
    winner: [
      ~x".//team/id[.=ancestor::matchup/@winner-id]/..",
      name: ~x"./name/text()"
    ]
  ],
  last_matchup: [
    ~x"//matchups/matchup[last()]",
    name: ~x"./name/text()",
    winner: [
      ~x".//team/id[.=ancestor::matchup/@winner-id]/..",
      name: ~x"./name/text()"
    ]
  ]
)
assert result == %{
  matchups: [
    %{name: 'Match One', winner: %{name: 'Team One'}},
    %{name: 'Match Two', winner: %{name: 'Team Two'}},
    %{name: 'Match Three', winner: %{name: 'Team One'}}
  ],
  last_matchup: %{name: 'Match Three', winner: %{name: 'Team One'}}
}

The ~x Sigil

Warning: because we use xmerl internally, only XPath 1.0 paths are handled.

In the above examples, we used the expression ~x"//some/path" to define the path. The sigil lets us specify more precisely what is returned.

~x"//some/path"

Without any modifiers, xpath/2 returns the value of the entity if the entity is of type xmlText, xmlAttribute, xmlPI or xmlComment, as defined in :xmerl.

~x"//some/path"e

'e' stands for (e)ntity. This forces xpath/2 to return the entity, on which you can further chain your xpath/2 calls.

~x"//some/path"l

'l' stands for (l)ist. This forces xpath/2 to return a list. Without l, xpath/2 only returns the first element of the match.

~x"//some/path"k

'k' stands for (k)eyword. This forces xpath/2 to return a Keyword instead of a Map.

~x"//some/path"el

A mix of the above: a list of entities.

~x"//some/path"s

's' stands for (s)tring. This forces xpath/2 to return the value as a string instead of a charlist.

~x"//some/path"S

'S' stands for soft (S)tring. This forces xpath/2 to return the value as a string instead of a charlist, but if the node content is incompatible with a string, the value is set to "".

~x"//some/path"o

'o' stands for (o)ptional. This allows the path to not exist, in which case nil is returned.

~x"//some/path"sl

A string list.

~x"//some/path"i

'i' stands for (i)nteger. This forces xpath/2 to return the value as an integer instead of a charlist.

~x"//some/path"I

'I' stands for soft (I)nteger. This forces xpath/2 to return the value as an integer instead of a charlist, but if the node content is incompatible with an integer, the value is set to 0.

~x"//some/path"f

'f' stands for (f)loat. This forces xpath/2 to return the value as a float instead of a charlist.

~x"//some/path"F

'F' stands for soft (F)loat. This forces xpath/2 to return the value as a float instead of a charlist, but if the node content is incompatible with a float, the value is set to 0.0.

~x"//some/path"il

An integer list.
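As a minimal sketch of how a few of these modifiers change the return type, the standalone script below runs them against a small document. The document and variable names are made up for illustration, and Mix.install/1 assumes Elixir 1.12 or later.

```elixir
# Standalone script sketch; the <order> document is hypothetical.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

doc = "<order><id>42</id><item>pen</item><item>ink</item></order>"

# No modifier: the text value comes back as a charlist.
id_raw = xpath(doc, ~x"//id/text()")

# `i` casts the value to an integer.
id_int = xpath(doc, ~x"//id/text()"i)

# `s` and `l` combined: a list of strings, one per matching node.
items = xpath(doc, ~x"//item/text()"sl)

# `o` marks the path as optional; a missing path gives nil.
note = xpath(doc, ~x"//note/text()"o)

IO.inspect({id_raw, id_int, items, note})
```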

If you use the optional modifier o together with a soft cast modifier (uppercase), the value is set to nil when it cannot be cast. For instance, ~x"//some/path/text()"Fo returns nil if the text is not a number.
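The difference between a strict cast, a soft cast, and a soft cast combined with o can be sketched as follows. The <m> document is an assumption made for this illustration only.

```elixir
# Standalone script sketch; the <m> document is hypothetical.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

doc = "<m><ok>1.5</ok><bad>n/a</bad></m>"

# Strict float cast on numeric text.
strict = xpath(doc, ~x"//ok/text()"f)

# Soft float cast on non-numeric text falls back to 0.0.
soft = xpath(doc, ~x"//bad/text()"F)

# Soft float cast plus optional falls back to nil instead.
soft_optional = xpath(doc, ~x"//bad/text()"Fo)

IO.inspect({strict, soft, soft_optional})
```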

Also in the examples section, we always import SweetXml first. This makes sigil_x available in the current scope. Without it, instead of using ~x, you can use the %SweetXpath struct:

assert ~x"//some/path"e == %SweetXpath{path: '//some/path', is_value: false, is_list: false, cast_to: false}

Note that the path is given as a charlist.

Namespace support

Given an XML document such as the one below:

<?xml version="1.0" encoding="UTF-8"?>
<game xmlns="http://example.com/fantasy-league" xmlns:ns1="http://example.com/baseball-stats">
  <matchups>
    <matchup winner-id="1">
      <name>Match One</name>
      <teams>
        <team>
          <id>1</id>
          <name>Team One</name>
          <ns1:runs>5</ns1:runs>
        </team>
        <team>
          <id>2</id>
          <name>Team Two</name>
          <ns1:runs>2</ns1:runs>
        </team>
      </teams>
    </matchup>
  </matchups>
</game>

We can do the following:

import SweetXml
xml_str = "..." # as above
doc = parse(xml_str, namespace_conformant: true)

Note the fact that we explicitly parse the XML with the namespace_conformant: true option. This is needed to allow nodes to be identified in a prefix independent way.

We can use namespace prefixes of our preference, regardless of what prefix is used in the document:

result = doc
  |> xpath(~x"//ff:matchup/ff:name/text()"
           |> add_namespace("ff", "http://example.com/fantasy-league"))

assert result == 'Match One'

We can specify multiple namespace prefixes:

result = doc
  |> xpath(~x"//ff:matchup//bb:runs/text()"
           |> add_namespace("ff", "http://example.com/fantasy-league")
           |> add_namespace("bb", "http://example.com/baseball-stats"))

assert result == '5'

From Chaining to Nesting

Here's a brief explanation of how nesting came about.

Chaining

Both xpath and xmap can take an :xmerl XML record as the first argument. Therefore you can chain calls to these functions like below:

doc
|> xpath(~x"//li"l)
# Parentheses are required when piping into a function call
|> Enum.map(fn li_node ->
  %{
    name: li_node |> xpath(~x"./name/text()"),
    age: li_node |> xpath(~x"./age/text()")
  }
end)

Mapping to a structure

Since the previous example is such a common use case, SweetXml lets you simply do the following:

doc
|> xpath(
  ~x"//li"l,
  name: ~x"./name/text()",
  age: ~x"./age/text()"
)

Nesting

Sometimes what you want is more complex than that, so SweetXml also allows nesting:

doc
|> xpath(
  ~x"//li"l,
  name: [
    ~x"./name",
    first: ~x"./first/text()",
    last: ~x"./last/text()"
  ],
  age: ~x"./age/text()"
)

Transform By

Sometimes we need to transform a value into the form we need. SweetXml supports that via transform_by/2:

doc = "<li><name><first>john</first><last>doe</last></name><age>30</age></li>"

result = doc |> xpath(
  ~x"//li"l,
  name: [
    ~x"./name",
    first: ~x"./first/text()"s |> transform_by(&String.capitalize/1),
    last: ~x"./last/text()"s |> transform_by(&String.capitalize/1)
  ],
  age: ~x"./age/text()"i
)

^result = [%{age: 30, name: %{first: "John", last: "Doe"}}]

The same can be used to break parsing code into reusable functions that can be used in nesting:

doc = "<li><name><first>john</first><last>doe</last></name><age>30</age></li>"

parse_name = fn xpath_node ->
  xpath_node |> xmap(
    first: ~x"./first/text()"s |> transform_by(&String.capitalize/1),
    last: ~x"./last/text()"s |> transform_by(&String.capitalize/1)
  )
end

result = doc |> xpath(
  ~x"//li"l,
  name: ~x"./name" |> transform_by(parse_name),
  age: ~x"./age/text()"i
)

^result = [%{age: 30, name: %{first: "John", last: "Doe"}}]

For more examples, please take a look at the tests and help.

Streaming

SweetXml now also supports streaming in various forms. Here's a sample XML doc. Notice that certain XML tags span multiple lines:

<?xml version="1.0" encoding="UTF-8"?>
<html>
  <head>
    <title>XML Parsing</title>
    <head><title>Nested Head</title></head>
  </head>
  <body>
    <p>Neato €</p><ul>
      <li class="first star" data-index="1">
        First</li><li class="second">Second
      </li><li
            class="third">Third</li>
    </ul>
    <div>
      <ul>
        <li>Forth</li>
      </ul>
    </div>
    <special_match_key>first star</special_match_key>
  </body>
</html>

Working with `File.stream!/1`

Working with streams is exactly the same as working with binaries:

File.stream!("file_above.xml") |> xpath(...)

`SweetXml` element streaming

Once you have a file stream, you may prefer not to load the entire document, in order to save memory:

file_stream = File.stream!("file_above.xml")

result = file_stream
|> stream_tags([:li, :special_match_key])
|> Stream.map(fn
    {_, doc} ->
      xpath(doc, ~x"./text()")
  end)
|> Enum.to_list

assert result == ['\n        First', 'Second\n      ', 'Third', 'Forth', 'first star']

Warning: for large documents, you may want to use the discard option to avoid a memory leak.

result = file_stream
|> stream_tags([:li, :special_match_key], discard: [:li, :special_match_key])

Security

Whenever you have to deal with some XML that was not generated by your system (untrusted document), it is highly recommended that you separate the parsing step from the mapping step, in order to be able to prevent some default behavior through options. You can check the doc for SweetXml.parse/2 for more details. The current recommendations are:

doc |> parse(dtd: :none) |> xpath(spec, subspec)
enum |> stream_tags(tags, dtd: :none)

Copyright and License

SweetXml source code is licensed under the MIT License.

CONTRIBUTING

Hi, and thank you for wanting to contribute. Please refer to the centralized information available at: https://github.com/kbrw#contributing


sweet_xml's Issues

xpath returns charlist instead of integer

Well, this is pretty weird... Is there any reason I get '0' instead of 0 ?

iex(33)> count = "<posts count=\"0\" offset=\"0\"/>" |> xpath(~x"//posts/@count"l)
['0']
iex(34)> List.first count
'0'
iex(35)> i List.first count
Term
  '0'
Data type
  List
Description
  This is a list of integers that is printed as a sequence of characters
  delimited by single quotes because all the integers in it represent valid
  ASCII characters. Conventionally, such lists of integers are referred to as
  "charlists" (more precisely, a charlist is a list of Unicode codepoints,
  and ASCII is a subset of Unicode).
Raw representation
  [48]
Reference modules
  List
iex(36)>
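Following the modifier list in the README above, an integer is only returned when the i modifier is given; without it the value stays a charlist. A minimal sketch using the document from this issue:

```elixir
# Standalone script sketch reusing the <posts> element from the issue.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

doc = "<posts count=\"0\" offset=\"0\"/>"

# `i` casts the attribute value to an integer.
count = xpath(doc, ~x"//posts/@count"i)

# `il` does the same for every match and returns a list.
counts = xpath(doc, ~x"//posts/@count"il)

IO.inspect({count, counts})
```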

Unexpected behaviour for XPath `s` modifier

Version 0.6.6

Using the README example:

iex(3)> result = doc |> xpath(~x"//matchup/name/text()")
'Match One'
iex(4)> result = doc |> xpath(~x"//matchup/name/text()"s)
"Match OneMatch TwoMatch Three"

For the s modifier, I would expect the result: "Match One"

New release / Hex package ?

The Hex package is at 0.6.5 and was pushed in February 2017.

Is it planned to have a new package released ?

Inline DTD allows XML bomb attack

This is Wiki page for the vulnerability, it is a very well known XML parser vulnerability: https://en.wikipedia.org/wiki/Billion_laughs_attack.

To replicate this issue in SweetXml, you can do the following in an iex session and watch in the observer:

4 hours later, and it's still running! Memory usage is slowly climbing past 900MB, scheduler 1 utilization hovers 70%.

I was looking into xmerl a bit to see if there's a way to disable inline DTD when using xpath before opening this issue, but I'm not familiar enough with it yet. Hoping someone else may be able to chime in. The closest thing I could find was in the release notes for xmerl 1.2.3 there's an option to turn off external DTD parsing. That sounds like that wouldn't solve this issue though, because internal DTD is the problem.

Maybe something can be done by changing the arguments passed to xmerl_xpath.string/3?

One suggestion to make this safer from @ellispritchard is to set the max heap size for the process calling xmerl functions: http://erlang.org/doc/man/erlang.html#process_flag_max_heap_size
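The max heap size suggestion could look roughly like the sketch below. BoundedRun is a hypothetical helper, not part of SweetXml or xmerl, and the default limits are arbitrary assumptions. It runs a function in a separate monitored process whose heap is capped, so a runaway parse kills only that process.

```elixir
defmodule BoundedRun do
  # Hypothetical helper: runs `fun` in a monitored process whose heap is
  # capped at `max_words` machine words. Returns {:ok, result} or {:error, reason}.
  def run(fun, max_words \\ 10_000_000, timeout \\ 5_000) when is_function(fun, 0) do
    parent = self()
    ref = make_ref()

    {pid, monitor} =
      spawn_monitor(fn ->
        # kill: true makes the VM kill this process once the cap is exceeded.
        Process.flag(:max_heap_size, %{size: max_words, kill: true, error_logger: false})
        send(parent, {ref, fun.()})
      end)

    receive do
      {^ref, result} ->
        Process.demonitor(monitor, [:flush])
        {:ok, result}

      {:DOWN, ^monitor, :process, ^pid, reason} ->
        {:error, reason}
    after
      timeout ->
        # Give up on work that takes too long, whatever its memory use.
        Process.exit(pid, :kill)
        Process.demonitor(monitor, [:flush])
        {:error, :timeout}
    end
  end
end

IO.inspect(BoundedRun.run(fn -> 2 + 2 end))
```

With SweetXml available, the function passed in would wrap the parsing call, for example `BoundedRun.run(fn -> SweetXml.parse(untrusted_xml, dtd: :none) end)`, where untrusted_xml stands for whatever input is being handled. The heap limit is checked at garbage collection, so it bounds memory growth rather than CPU time; the timeout covers the latter.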

Cleanup repo

Update build status and / or use GH actions for it

Selecting nodes based on value of attributes

Given the following xml:

<Names>
        <Translation Language="de">german</Translation>
        <Translation Language="nl">dutch</Translation>
        <Translation Language="en">english</Translation>
        <Translation Language="fr">french</Translation>
    </Names>

How can I get the value of the "de" node?

I tried:

german: ~x"./Names/Translation/@Language/text()"

and

german: ~x"./Names/Translation/@Language='de'/text()"

But I do only get

** (exit) {:primary_expression, {:not_implemented, {:comp, :=, {:path, :rel, {:refine, {:refine, {:refine, {:step, {:self, {:node_type, :node}, []}}, {:step, {:child, {:name, {:Names, [], 'Names'}}, []}}}, {:step, {:child, {:name, {:Translation, [], 'Translation'}}, []}}}, {:step, {:attribute, {:name, {:Language, [], 'Language'}}, []}}}}, {:path, :rel, {:step, {:child, {:name, {:de, [], 'de'}}, []}}}}}}
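In XPath 1.0 a condition on an attribute is written as a predicate in square brackets on the element step, rather than as a comparison at the end of the path. A minimal sketch, assuming Names is the root of the document (inside a nested spec the path would start with ./Names instead):

```elixir
# Standalone script sketch using the <Names> fragment from the issue.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

doc = """
<Names>
  <Translation Language="de">german</Translation>
  <Translation Language="nl">dutch</Translation>
  <Translation Language="en">english</Translation>
  <Translation Language="fr">french</Translation>
</Names>
"""

# [@Language='de'] keeps only the Translation whose attribute equals "de".
german = xpath(doc, ~x"/Names/Translation[@Language='de']/text()"s)

IO.inspect(german)
```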

XPath error


iex(1)> import SweetXml
SweetXml
iex(2)> doc = "<h1><a>Some linked title</a></h1>"
"<h1><a>Some linked title</a></h1>"
iex(3)> doc |> xpath(~x"//a/text()")
** (UndefinedFunctionError) function :xmerl_scan.string/2 is undefined (module :xmerl_scan is not available)
    :xmerl_scan.string('<h1><a>Some linked title</a></h1>', [])
    lib/sweet_xml.ex:230: SweetXml.parse/2
    lib/sweet_xml.ex:418: SweetXml.xpath/2
iex(3)> doc |> xpath(~x"//a/text()")

Looping via xmap

<Query>
        <cars>
            <bmw>
                <specs>
                    <model>5 seriesw</model>
                    <engine>4.4</engine>
                </specs>
            </bmw>
            <bmw>
                <specs>
                    <model>3 seriesw</model>
                    <engine>3.0</engine>
                </specs>
            </bmw>
        </cars>
</Query>

I'm parsing this via SweetXml:

request_body |> parse |> xmap(
  cars: [
    ~x"//Query/cars"k,
    bmw: [
      ~x"./bmw"kl,
      specs: [
        ~x"./specs"k,
          model: ~x"./model/text()"s,
          engine: ~x"./engin/text()"s
      ]
    ]
  ]
)

I want to get data like this format:

[
  Query: [
   cars: [
     bmw: [ 
        specs: [
          model: "5series"
          engine: 4,4
        ]
    ], 
    bmw: [ 
        specs: [
          model: "5series"
          engine: 4,4
        ]
      ]
    ]
  ]
]

Problem is that I get

[cars: [bmw: [specs: [[model: "5 seriesw", engin: ""]]]]]

It's not looping on Query/cars/*
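One possible approach, sketched here with maps rather than the keyword lists requested above, is to put the l modifier on the repeated bmw element so that the sub-spec is applied to each match. Note that the path must read ./specs/engine/text(); the engin spelling in the snippet above matches nothing and yields "".

```elixir
# Standalone script sketch based on the <Query> document from the issue.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

request_body = """
<Query>
  <cars>
    <bmw><specs><model>5 seriesw</model><engine>4.4</engine></specs></bmw>
    <bmw><specs><model>3 seriesw</model><engine>3.0</engine></specs></bmw>
  </cars>
</Query>
"""

# `l` on the repeated element gives one map per <bmw>.
cars =
  xpath(request_body, ~x"//Query/cars/bmw"l,
    model: ~x"./specs/model/text()"s,
    engine: ~x"./specs/engine/text()"s
  )

IO.inspect(cars)
```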

Debugging Help in error logs

I tried digging in a bit to see if I could add this, but I couldn't figure it out. This test encapsulates the behavior that I'm looking for.

A problem that I have is that if the XML document is not the shape that I am expecting e.g. missing a non-optional field, my process crashes. I log the error and stack trace and throw the correlated message on my rabbitMQ error queue. But, there is no way to quickly say what the problem is.

The stack trace doesn't say the line where the missing element caused the problem, typically: [error] ** (Protocol.UndefinedError) protocol Enumerable not implemented for nil of type Atom.

If anyone could help point me in the right direction, that would be great, too.

  test "gives useful info with xpath with route that doesn't exist", %{simple: simple} do
    fun = fn -> 
      xpath(simple, ~x"//ListBucketResult",
        name: ~x"./Name/text()"s,
        is_truncated: ~x"./IsTruncated/text()"s,
        owner: [
            ~x"./Owner",
            id: ~x"./ID/text()"s]) == nil
    end

    expected = "ElementNotFound: <ListBucketResult> is not found in <html>"

    assert capture_log(fun) =~ expected
  end

Could not start application xmerl: could not find application file: xmerl.app

Hello when I install your package and run

iex -S mix

I get the following error message from iex:

** (Mix) Could not start application xmerl: could not find application file: xmerl.app

Am I missing something?
I use elixir 1.2 and OTP 18

Is it possible to turn an xmlElement into text

For example:

<features>
  <feature>example 1</feature>
  <feature>example 1</feature>
</features>

xpath(doc, ~x"./features") |> to_string()

And I would like it to print the exact block in XML format instead of an xmlElement

<features>
  <feature>example 1</feature>
  <feature>example 1</feature>
</features>
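One way this might be done is to fetch the element with the e modifier and hand it to xmerl's own exporter, :xmerl.export_simple_content/2 with the :xmerl_xml callback module. This is a sketch under the assumption that a plain re-serialization is enough; whitespace and attribute quoting follow xmerl's exporter and may not be byte-identical to the input.

```elixir
# Standalone script sketch; the <root> wrapper is added for illustration.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

doc =
  "<root><features><feature>example 1</feature><feature>example 1</feature></features></root>"

# `e` returns the xmlElement record instead of a value.
features = xpath(doc, ~x"//features"e)

# The exporter returns nested character data; List.to_string/1 flattens it.
xml_text =
  [features]
  |> :xmerl.export_simple_content(:xmerl_xml)
  |> List.to_string()

IO.puts(xml_text)
```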

XPath doesn't match elements using `xmlns`

With this XML:

<things xmlns:foo="http://example.com/foo/" xmlns="http://example.com/foo/">
  <bar />
</things>

This XPath matches as expected //things

But this incorrectly fails to match the same elements: //foo:things despite the xmlns being the same as xmlns:foo

Try it out here: https://www.freeformatter.com/xpath-tester.html
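Following the namespace section of the README above, a prefixed path is expected to match only when the document is parsed with namespace_conformant: true and the prefix is registered with add_namespace. A minimal sketch under that assumption:

```elixir
# Standalone script sketch using the <things> document from the issue.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

xml =
  ~s(<things xmlns:foo="http://example.com/foo/" xmlns="http://example.com/foo/"><bar /></things>)

# Namespace-conformant parsing lets nodes be matched independently of prefix.
doc = parse(xml, namespace_conformant: true)

things =
  xpath(
    doc,
    ~x"//foo:things"e |> add_namespace("foo", "http://example.com/foo/")
  )

IO.inspect(things != nil)
```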

Listing the troubles with the SweetXpath modifiers

I'm listing here the github issues regarding the behavior of the modifiers.
As a lot of information is lost between the call of xpath/xmap, the internal call of :xmerl_xpath.string, and the final result, a lot of issues have been raised, and some merge requests have been proposed.

At the moment, I am not able to conceive a "peaceful" path of evolution. The library has a lot of cruft, and the recent evolutions regarding security, while remaining backward compatible, contribute negatively to the usage of the library. People wanting to benefit from them must put in more effort to bring restrictions, while I believe the effort for the user should be when we want relaxed security.
So I tend to want to go toward a new set of APIs, but that's no different from an entirely new version, aka a new library.

Anyway, the main topic:

#62
#53
#37
#23
#20
#14
#10
#8
#4

#87
#80
#72
#70
#33 (to dig)
#30
#17

Considering the timeline of the merge requests, it feels understandable that the modifiers evolved to the current state.

I'll see if I have the time in the coming weeks to formalize the needs around this feature.

Escaped HEX codes break the parser

Elixir 1.5.2
sweet_xml 0.6.5

The following XML file will break the parser

SweetXml.parse("<test>Some Text&#x1A;</test>")

The error is:

** (exit) {:fatal, {{:error, {:wfc_Legal_Character, 26}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 22}}}
    (xmerl) xmerl_scan.erl:4117: :xmerl_scan.fatal/2
    (xmerl) xmerl_scan.erl:2908: :xmerl_scan.scan_char_ref_hex/3
    (xmerl) xmerl_scan.erl:2572: :xmerl_scan.scan_content/11
    (xmerl) xmerl_scan.erl:2133: :xmerl_scan.scan_element/12
    (xmerl) xmerl_scan.erl:575: :xmerl_scan.scan_document/2
    (xmerl) xmerl_scan.erl:291: :xmerl_scan.string/2
    (sweet_xml) lib/sweet_xml.ex:233: SweetXml.parse/2
[error] 3429- fatal: {:error, {:wfc_Legal_Character, 26}}

It is incorrectly unescaping the escaped HEX value for ASCII 26, which is an invalid XML character.

I have worked around the issue for my own needs.

Selecting nodes based on value of attributes.

I would like to select nodes based on attribute values.

What I have so far is

   result = file_stream
   |> stream_tags([:matchup])
   |> Stream.map(fn
       {_, doc} ->
         xpath(doc, ~x"///name/@teamtype='TEAM_ONE'"l)
     end)
   |> Enum.to_list

Right now I get the error

** (exit) {:primary_expression, {:not_implemented, {:comp, :=, {:path, :abs, {:refine, {:step, {:descendant_or_self, {:node_type, :node}, []}}, {:step, {:attribute, {:name, {:teamtype, [], 'teamtype'}}, []}}}}, {:literal, 'TEAM_ONE'}}}}

I would like to access the value of all nodes with a teamtype attribute that equals “TEAM_ONE”

<game>
 <matchups>
   <matchup winner-id="1">
     <name>Match One</name>
     <teams>
       <team>
         <id>1</id>
         <name teamtype="TEAM_ONE">Team One</name>
       </team>
       <team>
         <id>2</id>
         <name teamtype="TEAM_ONE">Team Two</name>
       </team>
     </teams>
   </matchup>
   <matchup winner-id="2">
     <name>Match Two</name>
     <teams>
       <team>
         <id>2</id>
         <name>Team Two</name>
       </team>
       <team>
         <id>3</id>
         <name>Team Three</name>
       </team>
     </teams>
   </matchup>
   <matchup winner-id="1">
     <name>Match Three</name>
     <teams>
       <team>
         <id>1</id>
         <name>Team One</name>
       </team>
       <team>
         <id>3</id>
         <name>Team Three</name>
       </team>
     </teams>
   </matchup>
 </matchups>
</game>
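The comparison has to be written as a predicate on the name step, and text() then selects the content of the matching nodes. A minimal sketch on a shortened, hypothetical version of the document, without the streaming step:

```elixir
# Standalone script sketch; the document is a shortened illustration.
Mix.install([{:sweet_xml, "~> 0.7.4"}])
import SweetXml

doc = """
<game>
  <team><id>1</id><name teamtype="TEAM_ONE">Team One</name></team>
  <team><id>2</id><name teamtype="TEAM_ONE">Team Two</name></team>
  <team><id>3</id><name>Team Three</name></team>
</game>
"""

# Only <name> elements carrying teamtype="TEAM_ONE" are kept.
names = xpath(doc, ~x"//name[@teamtype='TEAM_ONE']/text()"sl)

IO.inspect(names)
```

Inside the Stream.map callback from the snippet above, the same path would be written relative to each matchup, that is starting with .//name.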

How to handle parsing errors?

Right now I have to do something like this (not sure if right):

defp parse_xml(doc) do
  try do
    points = xpath(doc, ~x"//trk/trkseg/trkpt"l, lat: ~x"./@lat"F, lng: ~x"./@lon"F, ele: ~x"./ele/text()"I)
    {:ok, points}
  catch
    :exit, _ -> {:error, "unable to parse .gpx file"}
  end
end

Why doesn't it follow something like this:

case xpath(doc, ~x"whatever") do
  {:ok, result} -> do something
  {:error, message} -> do something else
end

While this works it still dumps error into stdout 20:34:30.188 [error] 3904- fatal: :expected_element_start_tag. Kinda gross.

I'm kinda new to this whole elixir thing, maybe I'm doing something wrong? I looked into Task but it doesn't appear to solve this issue.
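A tagged-tuple interface can be layered on top of xpath with a small wrapper. SafeXml.safely/1 below is a hypothetical helper, not part of SweetXml. It handles both exits and raised exceptions, since which of the two a parse failure produces may depend on the SweetXml version. It does not silence what xmerl itself logs.

```elixir
# Standalone script sketch; SafeXml is a hypothetical helper module.
Mix.install([{:sweet_xml, "~> 0.7.4"}])

defmodule SafeXml do
  # Runs `fun` and converts failures into {:error, reason}.
  def safely(fun) when is_function(fun, 0) do
    {:ok, fun.()}
  rescue
    error -> {:error, error}
  catch
    :exit, reason -> {:error, reason}
  end
end

import SweetXml

good = SafeXml.safely(fn -> xpath("<a><b>1</b></a>", ~x"//b/text()"s) end)

# The closing </b> tag is missing, so parsing fails.
bad = SafeXml.safely(fn -> xpath("<a><b></a>", ~x"//b/text()"s) end)

IO.inspect({good, bad})
```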

Logger.warn/1 is deprecated on Elixir 1.15

Logger.warn/1 is deprecated on Elixir 1.15 and should (conditionally ?) be replaced with Logger.warning/2

==> sweet_xml
Compiling 2 files (.ex)
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
  (sweet_xml 0.7.3) lib/sweet_xml/options.ex:107: SweetXml.Options.set_up/2
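A conditional replacement could be sketched as below. CompatLog is a hypothetical module name and this is not necessarily how the library resolved the warning. The check runs at compile time, so only the branch matching the installed Elixir version defines the function.

```elixir
defmodule CompatLog do
  # Hypothetical shim: picks the logging macro available at compile time.
  require Logger

  if macro_exported?(Logger, :warning, 2) do
    # Elixir versions that provide Logger.warning/2.
    def warn(message), do: Logger.warning(message)
  else
    # Older Elixir versions that only provide Logger.warn/1.
    def warn(message), do: Logger.warn(message)
  end
end

CompatLog.warn("DTD option not set")
```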

library doesn't work with OTP 20 (i was wrong. please use the hex.pm version number -- not git tag for latest version)

===

I'm putting this here as a warning because new users could potentially waste a lot of time on this like i did. The latest version of this library can be found on hex.pm NOT in the list of git tags on this repo. There is simply no way to know this unless someone tells you.

===

warning: the dependency :sweet_xml requires Elixir "~> 1.0.0-rc2" but you are running on v1.5.0

followed by a long list of deprecations warnings:

excerpt:

warning: parentheses are required when piping into a function call. For example:

    foo 1 |> bar 2 |> baz 3

is ambiguous and should be written as

    foo(1) |> bar(2) |> baz(3)

Ambiguous pipe found at:
  lib/sweet_xml.ex:374

warning: String.to_char_list/1 is deprecated, use String.to_charlist/1
  lib/sweet_xml.ex:150

warning: variable "self" does not exist and is being expanded to "self()", please use parentheses to remove the ambiguity or change the variable name
  lib/sweet_xml.ex:302

warning: variable "make_ref" does not exist and is being expanded to "make_ref()", please use parentheses to remove the ambiguity or change the variable name
  lib/sweet_xml.ex:302

warning: Dict.put/3 is deprecated, use the Map module for working with maps or the Keyword module for working with keyword lists
  lib/sweet_xml.ex:451

warning: Dict.put/3 is deprecated, use the Map module for working with maps or the Keyword module for working with keyword lists
  lib/sweet_xml.ex:456

followed by an error that breaks compilation:

Could not start application markdown: could not find application file: markdown.app

"** (Mix) Could not start application sweet_xml: could not find application file: sweet_xml.app" error after 'uninstalling'

By "'uninstalling'", I mean that I installed SweetXml, but with one particular branch of my project's Git repo checked-out. I've switched to another second branch that does NOT include adding SweetXml but when I try to run my project via Mix (via iex -S mix specifically), I get this error:

** (Mix) Could not start application sweet_xml: could not find application file: sweet_xml.app

I tried mix clean && mix deps.clean --unused && mix deps.get && mix compile and the app compiles fine.

AFAICT, there are NO references to SweetXml in any Elixir files with the second branch checked-out.

I did finally try mix deps.clean --all && mix deps.get and that resolved the error, but that was mildly painful as it took a few minutes.

Any ideas on what might have persisted in my project's repo that caused the app to fail to start after 'uninstalling' SweetXml?

Some possibly relevant lines from my project's .gitignore file:

/_build
/deps

Simple to_map(xmlElement) should be made available

Need: quickly parse (into map) xml block at a given root without knowing in advance what the exact structure is.

Was easy to do in Ruby by using .each iterators and adding element names and content to maps, but I fail to see how that can be currently done with sweet_xml. All examples assume predefined xpaths.

I have a simple working code that traverses a raw entity and recursively extracts elements, text and attributes into a usable map but did not want to submit it if such feature is already available in some form.

Is there such a feature, and if not, are you open to adding it?

[Proposal] Add support for `transform_by` to handle complex value transformation

This opens up the possibility to perform complex value transformation which cannot be handled simply by adding sigil_x/2 modifiers.

Consider the following example, in which we attach a transformation function to a sweet_xpath spec; the function is applied just before xpath returns. In the example, we are able to capitalize the name and return a Range type for wind_speed.

    iex> import SweetXml
    iex> string_to_range = fn str ->
    ...>     [first, last] = str |> String.split("-", trim: true) |> Enum.map(&String.to_integer/1)
    ...>     first..last
    ...>   end
    iex> doc = "<weather><zone><name>north</name><wind-speed>5-15</wind-speed></zone></weather>"
    iex> doc
    ...> |> xpath(
    ...>      ~x"//weather/zone"l,
    ...>      name: ~x"//name/text()"s |> transform_by(&String.capitalize/1),
    ...>      wind_speed: ~x"./wind-speed/text()"s |> transform_by(string_to_range)
    ...>    )
    [%{name: "North", wind_speed: 5..15}]

In fact with this, we can split up the parsing code for different xml entities into separate functions and apply them to construct different nested structures as needed.

I can put up a pull request for this proposal. :)

(RuntimeError) DTD not allowed: lol1

Tested on Elixir 1.13.4-otp-25 / Erlang 25.0.3.

git clone https://github.com/kbrw/sweet_xml
cd sweet_xml
mix deps.get

$ mix test
warning: use Mix.Config is deprecated. Use the Config module instead                             
  config/config.exs:4                                                                            
                                                                                                 
...........................................                                                      
17:38:50.785 [error] Process #PID<0.277.0> raised an exception
** (RuntimeError) DTD not allowed: lol1                                                 
    (sweet_xml 0.7.3) lib/sweet_xml/options.ex:55: anonymous fn/6 in SweetXml.Options.handle_dtd/2
    (xmerl 1.3.29) xmerl_scan.erl:1972: :xmerl_scan.scan_entity/2                                    
    (xmerl 1.3.29) xmerl_scan.erl:1693: :xmerl_scan.scan_markup_decl/2
    (xmerl 1.3.29) xmerl_scan.erl:1278: :xmerl_scan.scan_doctype3/3  
    (xmerl 1.3.29) xmerl_scan.erl:730: :xmerl_scan.scan_prolog/4     
    (xmerl 1.3.29) xmerl_scan.erl:571: :xmerl_scan.scan_document/2
    (xmerl 1.3.29) xmerl_scan.erl:294: :xmerl_scan.string/2           
                                                                                                 
17:38:50.788 [error] Process #PID<0.284.0> raised an exception  
** (SweetXml.DTDError) DTD not allowed: lol1                                                     
    (sweet_xml 0.7.3) lib/sweet_xml/options.ex:55: anonymous fn/6 in SweetXml.Options.handle_dtd/2
    (xmerl 1.3.29) xmerl_scan.erl:1972: :xmerl_scan.scan_entity/2
    (xmerl 1.3.29) xmerl_scan.erl:1693: :xmerl_scan.scan_markup_decl/2
    (xmerl 1.3.29) xmerl_scan.erl:1278: :xmerl_scan.scan_doctype3/3
    (xmerl 1.3.29) xmerl_scan.erl:730: :xmerl_scan.scan_prolog/4
    (xmerl 1.3.29) xmerl_scan.erl:571: :xmerl_scan.scan_document/2
    (xmerl 1.3.29) xmerl_scan.erl:294: :xmerl_scan.string/2  
....
17:38:50.893 [error] Process #PID<0.293.0> raised an exception
** (SweetXml.DTDError) no external entity allowed
    (sweet_xml 0.7.3) lib/sweet_xml/options.ex:23: anonymous fn/2 in SweetXml.Options.handle_dtd/2
    (xmerl 1.3.29) xmerl_scan.erl:1304: :xmerl_scan.fetch_and_parse/3
    (xmerl 1.3.29) xmerl_scan.erl:2002: :xmerl_scan.scan_entity_def/3
    (xmerl 1.3.29) xmerl_scan.erl:1964: :xmerl_scan.scan_entity/2
    (xmerl 1.3.29) xmerl_scan.erl:1693: :xmerl_scan.scan_markup_decl/2
    (xmerl 1.3.29) xmerl_scan.erl:1278: :xmerl_scan.scan_doctype3/3
    (xmerl 1.3.29) xmerl_scan.erl:730: :xmerl_scan.scan_prolog/4
    (xmerl 1.3.29) xmerl_scan.erl:571: :xmerl_scan.scan_document/2
    (xmerl 1.3.29) xmerl_scan.erl:294: :xmerl_scan.string/2
.......

Finished in 0.3 seconds (0.00s async, 0.3s sync)
16 doctests, 38 tests, 0 failures

& split the result in two strings

I face a problem with escaped " or &. As an example, I changed the test input data, and this is the result I get:

  1) test xpath with sweet_xpath as only argment (SweetXmlTest)
     test/sweet_xml_test.exs:37
     Assertion with == failed
     code: result == ['O&ne', 'Two', 'Three', 'Four', 'Five']
     lhs:  ['O', '&ne', 'Two', 'Three', 'Four', 'Five']
     rhs:  ['O&ne', 'Two', 'Three', 'Four', 'Five']
     stacktrace:
       test/sweet_xml_test.exs:48

HTML entities in element content confuses xpath

HTML entities in the element content appear to confuse xpath. It either seems to truncate the string on certain valid entities (eg, <) or blows up entirely.

Example failures:
_the_following_data_ |> SweetXml.xpath( ~x"//soapenv:Body/*[1]/*", message: ~x"name(.)", part: ~x"./text()")

<?xml version=\"1.0\" encoding=\"UTF-8\"?><soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><soapenv:Body><ns1:loginResponse soapenv:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:ns1=\"http://www.someplace.com/webservices/\"><loginReturn xsi:type=\"soapenc:string\" xmlns:soapenc=\"http://schemas.xmlsoap.org/soap/encoding/\">vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&lt;separator&gt;LfhRIM7U9B0=+_+Blahblah</loginReturn></ns1:loginResponse></soapenv:Body></soapenv:Envelope>

<?xml version=\"1.0\" encoding=\"UTF-8\"?><soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><soapenv:Body><ns1:loginResponse soapenv:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:ns1=\"http://www.someplace.com/webservices/\"><loginReturn xsi:type=\"soapenc:string\" xmlns:soapenc=\"http://schemas.xmlsoap.org/soap/encoding/\">vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&dlt;separator&xgt;LfhRIM7U9B0=+_+Blahblah</loginReturn></ns1:loginResponse></soapenv:Body></soapenv:Envelope>

Remove the ampersands in the loginReturn bodies and the query works.

Ets problem with Xmerl "Too many db tables"

** (SystemLimitError) a system limit has been reached
    (stdlib) :ets.new(:rules, [:set, :public])
    (xmerl) xmerl_scan.erl:418: :xmerl_scan.initial_state/2
    (xmerl) xmerl_scan.erl:300: :xmerl_scan.int_string/4
    (xmerl) xmerl_scan.erl:291: :xmerl_scan.string/2
    (sweet_xml) lib/sweet_xml.ex:233: SweetXml.parse/2
    (sweet_xml) lib/sweet_xml.ex:421: SweetXml.xpath/2
    (sweet_xml) lib/sweet_xml.ex:454: SweetXml.xpath/3
    (ex_aws) lib/ex_aws/sqs/parsers.ex:99: ExAws.SQS.Parsers.parse/2
10:22:19.693 [error] ** Too many db tables **

Probably this link provides a solution?

Or is this a problem within the usage by ExAws?

parsing utf-8 encoding xml string

Add a scan option to xmerl_scan; otherwise it raises an exception when the XML document contains UTF-8 characters:
:xmerl_scan.string(doc, [encoding: 'iso-10646-utf-1'])

Memory leak in streaming support

I just tried to parse a 4 GB file (http://www.litres.ru/static/ds/detailed_data.xml.gz) with stream_tags, and memory usage kept growing until the BEAM crashed.

My code example:

File.stream!("/home/ssbb/tmp/detailed_data.xml", [:utf8])
|> stream_tags(:art)
|> Stream.each(fn {tag, doc} ->
  IO.inspect DateTime.utc_now
end)
|> Stream.run

https://f001.backblazeb2.com/file/ssbb-me/1478737792.png

crash_dump.zip

Search and replace XML value

Hello, I have a nested value that I'd like to search and replace. I've read the README.md and I'm unsure how to go about it. Any help or pointers in the right direction would be immensely appreciated. Thank you.

For example:

<response>
          <record>
            <id>123</id>
          </record>
</response>

\r by itself is mapped to \n, which causes things like AWS S3 key names to lose information when parsed by SweetXml

Can SweetXml be configured (or motivated) not to map single \r characters to \n? This causes issues when parsing XML from the AWS S3 storage service when object names contain \r (which, although not smart to use, is legal).

Working with this XML returned from Amazon AWS S3 storage service:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>obfuscated_bucket_name</Name>
  <Prefix/>
  <NextContinuationToken>1ILUW_obfuscated_continuation_token_1yMUM</NextContinuationToken>
  <KeyCount>1000</KeyCount>
  <MaxKeys>1000</MaxKeys>
  <Delimiter/>
  <IsTruncated>true</IsTruncated>
  <Contents>
    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;</Key>
    <LastModified>2022-03-25T23:52:22.122Z</LastModified>
    <ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
    <Size>0</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;inmiddle</Key>
    <LastModified>2022-03-26T23:52:22.122Z</LastModified>
    <ETag>"c41d8cd98f00b204e9800998ecf8427e"</ETag>
    <Size>0</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

I used the SweetXml xpath snippet from lib/ex_aws/s3/parsers.ex in the ex_aws/ex_aws_s3 project, as-is:

iex(19)>  xml
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">\n  <Name>obfuscated_bucket_name</Name>\n  <Prefix></Prefix>\n  <NextContinuationToken>1ILUW_obfuscated_continuation_token_1yMUM</NextContinuationToken>\n  <KeyCount>1000</KeyCount>\n  <MaxKeys>1000</MaxKeys>\n  <Delimiter></Delimiter>\n  <IsTruncated>true</IsTruncated>\n  <Contents>\n    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;</Key>\n    <LastModified>2022-03-25T23:52:22.122Z</LastModified>\n    <ETag>&quot;d41d8cd98f00b204e9800998ecf8427e&quot;</ETag>\n    <Size>0</Size>\n    <StorageClass>STANDARD</StorageClass>\n  </Contents>\n  <Contents>\n    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;inmiddle</Key>\n    <LastModified>2022-03-26T23:52:22.122Z</LastModified>\n    <ETag>&quot;c41d8cd98f00b204e9800998ecf8427e&quot;</ETag>\n    <Size>0</Size>\n    <StorageClass>STANDARD</StorageClass>\n  </Contents>\n</ListBucketResult>"

iex(22)> parsed_body =
...(22)>         xml |> SweetXml.xpath(~x"//ListBucketResult",
...(22)>           name: ~x"./Name/text()"s,
...(22)>           is_truncated: ~x"./IsTruncated/text()"s,
...(22)>           prefix: ~x"./Prefix/text()"s,
...(22)>           marker: ~x"./Marker/text()"s,
...(22)>           next_continuation_token: ~x"./NextContinuationToken/text()"s,
...(22)>           key_count: ~x"./KeyCount/text()"s,
...(22)>           max_keys: ~x"./MaxKeys/text()"s,
...(22)>           next_marker: ~x"./NextMarker/text()"s,
...(22)>           contents: [
...(22)>             ~x"./Contents"l,
...(22)>             key: ~x"./Key/text()"s,
...(22)>             last_modified: ~x"./LastModified/text()"s,
...(22)>             e_tag: ~x"./ETag/text()"s,
...(22)>             size: ~x"./Size/text()"s,
...(22)>             storage_class: ~x"./StorageClass/text()"s,
...(22)>             owner: [
...(22)>               ~x"./Owner"o,
...(22)>               id: ~x"./ID/text()"s,
...(22)>               display_name: ~x"./DisplayName/text()"s
...(22)>             ]
...(22)>           ],
...(22)>           common_prefixes: [
...(22)>             ~x"./CommonPrefixes"l,
...(22)>             prefix: ~x"./Prefix/text()"s
...(22)>           ]
...(22)>         )
%{
  common_prefixes: [],
  contents: [
    %{
      e_tag: "\"d41d8cd98f00b204e9800998ecf8427e\"",
      key: "workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -> MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon\n",
      last_modified: "2022-03-25T23:52:22.122Z",
      owner: nil,
      size: "0",
      storage_class: "STANDARD"
    },
    %{
      e_tag: "\"c41d8cd98f00b204e9800998ecf8427e\"",
      key: "workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -> MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon\ninmiddle",
      last_modified: "2022-03-26T23:52:22.122Z",
      owner: nil,
      size: "0",
      storage_class: "STANDARD"
    }
  ],
  is_truncated: "true",
  key_count: "1000",
  marker: "",
  max_keys: "1000",
  name: "obfuscated_bucket_name",
  next_continuation_token: "1ILUW_obfuscated_continuation_token_1yMUM",
  next_marker: "",
  prefix: ""
}

Note that the S3 keys with \r in them have their \r characters mapped to \n. In general this is probably "right" and in conformance with the XML specs with respect to end-of-line handling, but ideally we should be able to use SweetXml in some way against this type of input and have \r and similar characters mapped to their XML escape equivalents (e.g., &#13; in the \r case).

Can SweetXml be used in a way in which this mapping of \r to \n is not done?

possible leak of sweet_xml library

Hi everybody,

I'm a newbie in Elixir.
I wrote a little program to parse OSM XML data and used the sweet_xml library.

I've already posted a question on StackOverflow; please see that post. The answers there said it might be a problem in the sweet_xml library, but it could also be a consequence of the way the Erlang VM works.

Any suggestions are welcome.

bug in casting values to integer or float

Hi,

I'm trying to parse an XML feed with sweet_xml. Due to the nature of the web ;-) there may be feeds that are invalid. In this case I had a feed with a missing element, and thus the XPath expression could not be resolved.

However, this clause matches for an integer cast in this case:

defp to_cast(value, :integer, _is_opt?), do: String.to_integer(to_string(value))

This leads to an argument error because String.to_integer gets called with an empty string:

** (ArgumentError) argument error
    :erlang.binary_to_integer("")
    (sweet_xml) lib/sweet_xml.ex:665: SweetXml.to_cast/3
    (sweet_xml) lib/sweet_xml.ex:441: SweetXml.xpath/2
    (sweet_xml) lib/sweet_xml.ex:531: anonymous fn/3 in SweetXml.xmap/3
    (elixir) lib/map.ex:791: Map.get_and_update/3
    (sweet_xml) lib/sweet_xml.ex:531: SweetXml.xmap/3
    (sweet_xml) lib/sweet_xml.ex:530: SweetXml.xmap/3

Is this a thing you will want to fix? All other XPath expressions don't lead to a crash, so it seems like a bug to me. On the other hand, the input is invalid; what's the right thing to do here?

To fix this, one could add another function clause for empty strings in casts to float or integer, but the cast would then still fail if the input string is not a number.

That's why I didn't create a pull request, I'm really not sure what your thoughts are about this.
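The trade-off described in this report can be sketched in plain Elixir. The following is only an illustration, not SweetXml's actual code: `SafeCast` and its `to_integer/1` are hypothetical names, and the choice to return `:error` for both empty and non-numeric input is an assumption about what callers might want.

```elixir
defmodule SafeCast do
  # Hypothetical helper, not part of SweetXml.
  # Accepts a binary, a charlist, or nil and returns {:ok, integer} or :error
  # instead of raising, covering both the empty and the non-numeric case.
  def to_integer(value) do
    case value |> to_string() |> String.trim() |> Integer.parse() do
      # Only accept input that is consumed completely by the parser.
      {int, ""} -> {:ok, int}
      # Empty string, trailing garbage, or no digits at all.
      _ -> :error
    end
  end
end
```

Using `Integer.parse/1` rather than `String.to_integer/1` means one clause handles both failure modes mentioned above; whether a failed cast should surface as `nil`, `:error`, or an exception remains a design decision for the library.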

& splits the result into two strings

As pointed out here: #17

Throws exception when file starts with byte order marker.

The byte order marker (BOM) is not recommended, but it is a legal part of UTF-8 files. Currently, if a file starts with a BOM, the parser raises an error because it expects an element. Is this something this library can support, or is there something in xmerl that prevents it from parsing such files?
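Until the parser handles this itself, one possible workaround is to strip the BOM before handing the document over. This is a minimal sketch in plain Elixir; `BomStrip` is a hypothetical helper name, and it assumes the whole document is already in memory as a UTF-8 binary.

```elixir
defmodule BomStrip do
  # Hypothetical helper, not part of SweetXml.
  # The UTF-8 byte order marker is the three bytes EF BB BF.
  def strip(<<0xEF, 0xBB, 0xBF, rest::binary>>), do: rest
  # Anything else is returned unchanged.
  def strip(xml) when is_binary(xml), do: xml
end
```

The result of `BomStrip.strip/1` can then be passed to the parser in place of the raw file contents.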

stream_tags hangs when used with Stream.take

Hello! Thanks for making sweet_xml!

The following code hangs:

"<feed></feed>" |> stream_tags(:feed) |> Stream.take(1) |> Enum.to_list

... but removing the Stream.take lets the code run to completion as expected:

"<feed></feed>" |> stream_tags(:feed) |> Enum.to_list

Is this project still maintained?

Hello,
there have been no new commits since Feb 24, 2019 and some of the PRs (most importantly #53 and #36) are left unresolved.
We (Kontomatik) would love to take over the maintenance of sweet_xml if you no longer have the time, need, or desire to do it.

Broken nested nodes order when taking text()

I am trying to accomplish what I thought was a simple task. I have this XML:

<title>
Hello <hlword>world</hlword>!
</title>

What I need is to convert this to string:

Hello world!
Or even better if I could do something like:
Hello **world**!

For the second option, I don't yet know how to implement it in a handy way (mostly because I am new to Elixir). But even the first option has turned out to be unsolvable for me, because when I try to run

doc |> xpath(~x"//title/descendant-or-self::*/text()"s)

I get the elements in a broken order, something like world!Hello. I cannot tell exactly what order I will get in this situation; usually it is not even reversed but seemingly random, although the nested element almost always comes last.
I don't know whether sweet_xml is what breaks the order of the text() nodes, but perhaps someone can at least help me solve this task using this library.

Error with Erlang 19.2 and Elixir 1.4.0

Hi,

We started getting the following error after we bumped our versions of Erlang and Elixir:

[error] Process #PID<0.212.0> raised an exception
** (CaseClauseError) no case clause matching: {:halted, []}
    (sweet_xml) lib/sweet_xml.ex:582: anonymous fn/4 in SweetXml.continuation_opts/2
    (xmerl) xmerl_scan.erl:825: :xmerl_scan.scan_misc/4
    (xmerl) xmerl_scan.erl:578: :xmerl_scan.scan_document/2
    (xmerl) xmerl_scan.erl:291: :xmerl_scan.string/2

The error happens when we run Elixir 1.4.0, SweetXml 0.6.4, and Erlang 19.0 or 19.2. If we run the same code on Elixir 1.3.2, SweetXml 0.6.4, and Erlang 19.0 or 19.2 it works properly.

You can find a repository reproducing the error here: https://github.com/kdisneur/sweat_xml.
The README includes the error logs we have after trying the same script on different versions of Erlang and Elixir.

I hope the error report is clear enough and don't hesitate to ask if I can be of any help :)

Thanks

How to access attributes?

Thanks for the great package!

I'm new to Elixir, so I'm sure I'm missing something very obvious, but I can't figure out how to access the id and url attributes in the example XML below.

<propertyList>
  <property>
    <uniqueID>R2-796018</uniqueID>
    <headline>Lorem ipsum dolor</headline>
    <description>...</description>
    <images>
      <img id="1" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
      <img id="2" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
      <img id="3" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
      <img id="4" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
    </images>
  </property>
  <property>
    <uniqueID>R1-495025</uniqueID>
    <headline>Exerci quaestio ad</headline>
    <description>...</description>
    <images>
      <img id="1" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
      <img id="2" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
      <img id="3" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
      <img id="4" url="/path/to/image.jpg" modTime="2016-08-01-14:30:31" />
    </images>
  </property>
</propertyList>

So far I have this which works perfectly but I'm lost with the images.

  defp decode_response(body) do
    body
    |> xpath(~x"//property"l)
    # Each element of the list is one <property> node.
    |> Enum.map(fn property ->
      %{
        unique_id: property |> xpath(~x"./uniqueID/text()"s),
        headline: property |> xpath(~x"./headline/text()"s),
        description: property |> xpath(~x"./description/text()"s)
        # images: property |> xpath(~x"./objects/img[@url]" ?????)
      }
    end)
  end

Thanks!
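One possible approach, sketched under assumptions: XPath selects attributes with `@name`, and `xpath/3` accepts a keyword list of sub-queries that are evaluated relative to each matched node, as in the nested specs shown elsewhere on this page. `ImageAttrs` is a hypothetical module name, and the `Mix.install/1` call is only there to make the snippet runnable as a standalone script.

```elixir
# Assumption: run as a standalone script; in a Mix project, add the
# dependency to mix.exs instead of calling Mix.install/1.
Mix.install([{:sweet_xml, "~> 0.7"}])

defmodule ImageAttrs do
  import SweetXml

  # Hypothetical helper: returns one map per <img> element,
  # with its id and url attributes cast to strings.
  def images(xml) do
    xpath(xml, ~x"//property/images/img"l,
      id: ~x"./@id"s,
      url: ~x"./@url"s
    )
  end
end
```

Inside the `Enum.map` from the question, the same sub-spec could be applied per property with a relative path such as `~x"./images/img"l`, which should yield a list of maps with `:id` and `:url` keys.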

{:not_a_core_function, :substring}

Steps to reproduce

xml = "<foo multiLine=\"1\">bar</foo>\n"
xml |> xpath(~x"substring(name(//*), 1)"s)

Expected result

"foo"

Actual result

** (exit) {:not_a_core_function, :substring}
        (xmerl) xmerl_xpath_lib.erl:52: :xmerl_xpath_lib.primary_expr/2
        (xmerl) xmerl_xpath.erl:361: :xmerl_xpath.eval_primary_expr/2
        (xmerl) xmerl_xpath.erl:156: :xmerl_xpath.string/5
    (sweet_xml) lib/sweet_xml.ex:654: SweetXml.get_current_entities/2
    (sweet_xml) lib/sweet_xml.ex:441: SweetXml.xpath/2

Environments

Erlang/OTP 19
Elixir 1.4.0
SweetXml 0.6.5

How to get the raw XML content under a node

I want to parse the result of a SPARQL XML result like this:

<sparql xmlns="http://www.w3.org/2005/sparql-results#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">
  <head>
    <variable name="blurb"/>
  </head>
  <results>
    <result>
      <binding name="blurb">
        <literal datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral">
          <p xmlns="http://www.w3.org/1999/xhtml">My name is <b>alice</b></p>
        </literal>
      </binding>
    </result>
  </results>
</sparql>

and get everything under the literal node as a string, i.e. My name is alice.

Is this possible with SweetXML somehow?

The optional modifier produces empty strings instead of nil

Strings don't behave as described by the README:

"If you use the optional modifier o together with a soft cast modifier (uppercase), then the value is set to nil"

~x"./NAME/text()"So produces an empty string instead of nil.

Someone attempted to fix this with #53, but it would've been better to open an issue to discuss possible solutions and backwards compatibility.

Please create releases for new versions

I recently installed sweet_xml and put in version 0.3.0 since that was what I found in the releases pages. After upgrading to elixir 1.4, I had a few issues – only after lots of troubleshooting in slack, I found that the latest version is actually 0.6.3.

Is it possible to create releases for each new version? Or is there a better way to know which is the latest version? (without looking at the mix.exs file – because that doesn't necessarily mean that's the latest version available)

XSD validation

Hi!
Validating XML against an XSD would be a good feature. xmerl already provides http://erlang.org/doc/man/xmerl_xsd.html#validate-2. How should I proceed to implement it?

warning: String.to_char_list/1 is deprecated, use String.to_charlist/1

While using the following versions of elixir/erlang:

$ elixir --version
Erlang/OTP 20 [erts-9.0.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Elixir 1.5.1

I encounter the following warnings:

$ iex -S mix phx.server
Erlang/OTP 20 [erts-9.0.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

==> sweet_xml
Compiling 1 file (.ex)
warning: String.to_char_list/1 is deprecated, use String.to_charlist/1
  lib/sweet_xml.ex:192

warning: Kernel.to_char_list/1 is deprecated, use Kernel.to_charlist/1
  lib/sweet_xml.ex:210

warning: Kernel.to_char_list/1 is deprecated, use Kernel.to_charlist/1
  lib/sweet_xml.ex:210

Generated sweet_xml app

There is a pull request (#49) that seems to correct those warnings.

Excessive memory usage

Hello! We're using SweetXML in production and we've been having some excessive memory usage that we've not been able to debug.

In our latest test, an 80 MB XML file used more than 9 GB of memory, which caused the VM to crash.

We are parsing XML in a format like this:

<?xml version='1.0' encoding='utf-8'?>
<ArrayOfCommercialDetail xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <CommercialDetail>
    <Status>text</Status>
    <ActionDate>text</ActionDate>
    <MusicDetails>
      <Title>text</Title>
      <Arranger></Arranger>
      <Composer>text</Composer>
      <Duration>text</Duration>
    </MusicDetails>
    <ActionedBy>text</ActionedBy>
    <ClockNumber>text</ClockNumber>
    <VODFinalAction>text</VODFinalAction>
    <FinalAction>text</FinalAction>
    <CommercialRestrictions>
      <Restriction>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Restriction>
    </CommercialRestrictions>
    <VODFinalActionId>text</VODFinalActionId>
    <CommercialPresentations>
      <Presentation>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Presentation>
      <Presentation>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Presentation>
    </CommercialPresentations>
    <CommercialArtists>
      <Artist>
        <Name>text</Name>
        <Type>text</Type>
        <ID>text</ID>
      </Artist>
      <Artist>
        <Name>text</Name>
        <Type>text</Type>
        <ID>text</ID>
      </Artist>
    </CommercialArtists>
    <FinalActionId>text</FinalActionId>
    <StatusId>text</StatusId>
  </CommercialDetail>

  <!-- Many more CommercialDetail here... -->

</ArrayOfCommercialDetail>

We parse this XML like so:

    data =
      xpath(
        xml,
        ~x"//CommercialDetail"l,
        clock_number: ~x"./ClockNumber/text()"s,
        actioned_at: ~x"./ActionDate/text()"s,
        presentation_codes: ~x"./CommercialPresentations/Presentation/Code/text()"sl,
        restriction_codes: ~x"./CommercialRestrictions/Restriction/Code/text()"sl,
        status: ~x"./Status/text()"s,
        vod_final_action_id: ~x"./VODFinalActionId/text()"s,
        final_action_id: ~x"./FinalActionId/text()"s
      )

Here's the load charts while iterating over and parsing XML files, and then discarding the result. It spikes each time the XML is parsed

What are we doing wrong here?

Extra note: We rewrote this code to use the streaming API, which used slightly less memory. Most of our XML will not have newlines in it, so this did not seem to be the right path to a solution, and we would expect lower memory usage from both the eager and the streaming API.

After digging into the source it seems that memory spikes when :xmerl_scan.string/1 is called.

Thanks,
Louis

Unicode problems with xmerl

First I want to thank you for this really useful library, but I run into an error when I pipe text containing non-ASCII characters into the xpath/2 function. I'm unable to resolve this problem so I hope that you have an idea to fix this.

Interactive Elixir (1.0.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> import SweetXml
nil
iex(2)> "<title>Hallöchen</title>" |> xpath(~x"//title/text()")
3414- fatal: {error,{wfc_Legal_Character,{error,{bad_character,246}}}}
** (exit) {:fatal, {{:error, {:wfc_Legal_Character, {:error, {:bad_character, 246}}}}, {:file, :file_name_unknown}, {:line, 1}, {:col, 14}}}
    xmerl_scan.erl:4102: :xmerl_scan.fatal/2
    xmerl_scan.erl:2703: :xmerl_scan.scan_char_data/5
    xmerl_scan.erl:2615: :xmerl_scan.scan_content/11
    xmerl_scan.erl:2128: :xmerl_scan.scan_element/12
    xmerl_scan.erl:570: :xmerl_scan.scan_document/2
    xmerl_scan.erl:286: :xmerl_scan.string/2
    lib/sweet_xml.ex:133: SweetXml.parse/1
    lib/sweet_xml.ex:177: SweetXml.xpath/2

Get first element name

Hi, I've spent a day just trying to get the name of the first element of the XML.
For example, how do I get "cars" from this example:

<cars>
  <car>
    <make>audi</make>
    <model>a4</model>
  </car>
</cars>
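A minimal sketch of one way to read the root element's name, using :xmerl directly (the library SweetXml wraps). It assumes the document fits in memory and relies on the `:xmlElement` record layout, in which the element name is the second tuple field; `RootName` is a hypothetical helper name.

```elixir
defmodule RootName do
  # Hypothetical helper, not part of SweetXml.
  # Parses the document with :xmerl_scan and returns the root element's
  # name as an atom.
  def of(xml) when is_binary(xml) do
    {root, _rest} = :xmerl_scan.string(String.to_charlist(xml))
    # An :xmlElement record is a tuple {:xmlElement, name, expanded_name, ...}.
    elem(root, 1)
  end
end
```

Note that xmerl creates an atom for each element name, so this sketch is best kept away from untrusted input with unbounded tag names.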

Markdown dependency

Is the devinus/markdown dep listed in mix.exs used for anything in production? markdown relies on hoedown which has a nif, and nifs make cross compilation a greater pain when making releases.

Can markdown be marked only: :dev?

xpath(~x"/some/path"l) transforms :xmlText nodes into char lists

"<p>some nested <em>node</em> content</p>" |> xpath(~x"/node()/node()"l)

returns

['some nested ',
 {:xmlElement, :em, :em, [], {:xmlNamespace, [], []}, [p: 1], 2, [],
  [{:xmlText, [em: 2, p: 1], 1, [], 'node', :text}], [],
  '/some/directory', :undeclared}, ' content']

instead of

[{:xmlText, [p: 1], 1, [], 'some nested ', :text},
 {:xmlElement, :em, :em, [], {:xmlNamespace, [], []}, [p: 1], 2, [],
  [{:xmlText, [em: 2, p: 1], 1, [], 'node', :text}], [],
  '/some/directory', :undeclared},
 {:xmlText, [p: 1], 3, [], ' content', :text}]

Note how the text nodes have been unwrapped into char-lists. Is this correct behavior? I'm running into this issue when passing an element around via transform_by and performing further matching.