tafia / quick-xml

Rust high performance xml reader and writer

License: MIT License

Rust 100.00% HTML 0.01%

xml-parser writer serialization deserialization html xml pull-parser performance-xml

quick-xml's Introduction

quick-xml

High performance xml pull reader/writer.

The reader:

is almost zero-copy (uses Cow whenever possible)
is easy on memory allocation (the API provides a way to reuse buffers)
supports various encodings (with the encoding feature), namespace resolution, and special characters.

Syntax is inspired by xml-rs.

Example

Reader

use quick_xml::events::Event;
use quick_xml::reader::Reader;

let xml = r#"<tag1 att1 = "test">
                <tag2><!--Test comment-->Test</tag2>
                <tag2>Test 2</tag2>
             </tag1>"#;
let mut reader = Reader::from_str(xml);
reader.config_mut().trim_text(true);

let mut count = 0;
let mut txt = Vec::new();
let mut buf = Vec::new();

// The `Reader` does not implement `Iterator` because it outputs borrowed data (`Cow`s)
loop {
    // NOTE: this is the generic case when we don't know about the input BufRead.
    // when the input is a &str or a &[u8], we don't actually need to use another
    // buffer, we could directly call `reader.read_event()`
    match reader.read_event_into(&mut buf) {
        Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
        // exits the loop when reaching end of file
        Ok(Event::Eof) => break,

        Ok(Event::Start(e)) => {
            match e.name().as_ref() {
                b"tag1" => println!("attributes values: {:?}",
                                    e.attributes().map(|a| a.unwrap().value)
                                    .collect::<Vec<_>>()),
                b"tag2" => count += 1,
                _ => (),
            }
        }
        Ok(Event::Text(e)) => txt.push(e.unescape().unwrap().into_owned()),

        // There are several other `Event`s we do not consider here
        _ => (),
    }
    // if we don't keep a borrow elsewhere, we can clear the buffer to keep memory usage low
    buf.clear();
}

Writer

use quick_xml::events::{Event, BytesEnd, BytesStart};
use quick_xml::reader::Reader;
use quick_xml::writer::Writer;
use std::io::Cursor;

let xml = r#"<this_tag k1="v1" k2="v2"><child>text</child></this_tag>"#;
let mut reader = Reader::from_str(xml);
reader.config_mut().trim_text(true);
let mut writer = Writer::new(Cursor::new(Vec::new()));
loop {
    match reader.read_event() {
        Ok(Event::Start(e)) if e.name().as_ref() == b"this_tag" => {

            // creates a new element ... alternatively we could reuse `e` by calling
            // `e.into_owned()`
            let mut elem = BytesStart::new("my_elem");

            // collect existing attributes
            elem.extend_attributes(e.attributes().map(|attr| attr.unwrap()));

            // adds a new my-key="some value" attribute after the copied ones
            elem.push_attribute(("my-key", "some value"));

            // writes the event to the writer
            assert!(writer.write_event(Event::Start(elem)).is_ok());
        },
        Ok(Event::End(e)) if e.name().as_ref() == b"this_tag" => {
            assert!(writer.write_event(Event::End(BytesEnd::new("my_elem"))).is_ok());
        },
        Ok(Event::Eof) => break,
        // we can either move or borrow the event to write, depending on your use-case
        Ok(e) => assert!(writer.write_event(e).is_ok()),
        Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
    }
}

let result = writer.into_inner().into_inner();
let expected = r#"<my_elem k1="v1" k2="v2" my-key="some value"><child>text</child></my_elem>"#;
assert_eq!(result, expected.as_bytes());

Serde

When using the serialize feature, quick-xml can be used with serde's Serialize/Deserialize traits. The mapping between XML and Rust types, and in particular the syntax that allows you to specify the distinction between elements and attributes, is described in detail in the documentation for deserialization.

Credits

This has largely been inspired by serde-xml-rs. quick-xml follows its convention for deserialization, including the $value special name.

Parsing the "value" of a tag

If you have an input of the form <foo abc="xyz">bar</foo>, and you want to get at the bar, you can use either the special name $text, or the special name $value:

struct Foo {
    #[serde(rename = "@abc")]
    pub abc: String,
    #[serde(rename = "$text")]
    pub body: String,
}

Read about the difference in the documentation.

Performance

Note that despite not focusing on performance (there are several unnecessary copies), it remains about 10x faster than serde-xml-rs.

Features

encoding: support non-UTF-8 XML documents
serialize: support serde Serialize/Deserialize

Performance

Benchmarking is hard and the results depend on your input file and your machine.

Here, on my particular file, quick-xml is around 50 times faster than the xml-rs crate.

// quick-xml benches
test bench_quick_xml            ... bench:     198,866 ns/iter (+/- 9,663)
test bench_quick_xml_escaped    ... bench:     282,740 ns/iter (+/- 61,625)
test bench_quick_xml_namespaced ... bench:     389,977 ns/iter (+/- 32,045)

// same bench with xml-rs
test bench_xml_rs               ... bench:  14,468,930 ns/iter (+/- 321,171)

// serde-xml-rs vs serialize feature
test bench_serde_quick_xml      ... bench:   1,181,198 ns/iter (+/- 138,290)
test bench_serde_xml_rs         ... bench:  15,039,564 ns/iter (+/- 783,485)

For a feature and performance comparison, you can also have a look at RazrFalcon's parser comparison table.

Contribute

Any PR is welcome!

License

MIT

quick-xml's Issues

streaming bzip/gzip support

Hey!

Thanks for providing this library!

I was wondering if there could be big performance wins from reduced io by supporting streaming from gzip/bzip.

I was wondering how easy it would be to plug this in, or what benefits we might see. I'm happy to lead the work if you had any ideas about where I might start?

Thanks
Ben

panic when using clean buffers

I am getting a panic from the following code.

extern crate quick_xml;


use quick_xml::reader::Reader;
use quick_xml::events::Event;



pub fn main(){
    let ont_s = br#"<?xml version="1.0"?>
<Ontology xmlns="http://www.w3.org/2002/07/owl#"
     xml:base="http://example.com/iri"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:xml="http://www.w3.org/XML/1998/namespace"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     ontologyIRI="http://example.com/iri"
     versionIRI="http://example.com/viri">
  <Prefix name="o" IRI="http://example.com/iri#"/>
  <Prefix name="owl" IRI="http://www.w3.org/2002/07/owl#"/>
  <Prefix name="rdf" IRI="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
  <Prefix name="xml" IRI="http://www.w3.org/XML/1998/namespace"/>
  <Prefix name="xsd" IRI="http://www.w3.org/2001/XMLSchema#"/>
  <Prefix name="rdfs" IRI="http://www.w3.org/2000/01/rdf-schema#"/>
</Ontology>
"#;

    let mut reader = Reader::from_reader(&ont_s[..]);
    //let mut buf = vec![];
    //let mut ns_buf = vec![];

    loop{

        let mut buf = vec![];
        let mut ns_buf = vec![];
        match reader.read_namespaced_event(&mut buf, &mut ns_buf){
            Ok((_, Event::Eof)) => break,
            a=> {
                println!("Event:{:?} {:?}", reader.buffer_position(), a);
            }
        }
    }
}

thread 'main' panicked at 'index 30 out of range for slice of length 0', /checkout/src/libcore/slice/mod.rs:745:4

If I comment out the declaration of buf and ns_buf inside the
loop, and just use a single pair defined outside instead, then,
magically it all works.

I should say, here, that I am new to rust, so it's possibly me being
daft. I don't really understand what I am supposed to be using these
two vectors for; however, to my mind, using a clean one for every
event should work, even if it is inefficient. Having it panic here,
but not with a shared buffer seems, to me, counterintuitive.

Thanks for the library!

Do not use unchecked multiply

thread '<unnamed>' panicked at 'attempt to multiply with overflow', /home/linkmauve/data/cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.10.0/src/escape.rs:166:9

Here is the testcase generated by cargo-fuzz:

\x00\x00<\x00\x00\x0a>&#44444444401?#\x0a413518#\x0a\x0a\x0a;<:<)(<:\x0a\x0a\x0a\x0a;<:\x0a\x0a<:\x0a\x0a\x0a\x0a\x0a<\x00*\x00\x00\x00\x00

The &#44444444401 part seems to be the issue: multiplying that by ten does not fit in a u32, causing the panic.

This bug has been found while fuzzing xmpp-parsers which uses minidom, you would benefit a lot from fuzzing quick-xml itself.
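
For illustration, here is a minimal sketch of how the digits of a decimal character reference could be accumulated with checked arithmetic, so that an oversized input like the one above yields `None` instead of a panic. The function name `parse_decimal_char_ref` is hypothetical and this is not the code in escape.rs; it uses only `u32::checked_mul`, `u32::checked_add` and `char::from_u32` from the standard library.

```rust
/// Parses the digits of a decimal character reference, e.g. the `44` in `&#44;`.
/// Hypothetical helper: returns `None` on an empty input, a non-digit,
/// an arithmetic overflow, or a value that is not a valid `char`.
fn parse_decimal_char_ref(digits: &str) -> Option<char> {
    if digits.is_empty() {
        return None;
    }
    let mut code: u32 = 0;
    for b in digits.bytes() {
        let digit = match b {
            b'0'..=b'9' => u32::from(b - b'0'),
            _ => return None,
        };
        // checked_mul / checked_add return None on overflow instead of
        // panicking (debug builds) or silently wrapping (release builds).
        code = code.checked_mul(10)?.checked_add(digit)?;
    }
    char::from_u32(code)
}
```

With this shape, the caller can turn `None` into an ordinary escape error rather than unwinding.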

Is it faster than html5ever/xml5ever?

Basically 2 questions:

Could I use quick-xml to also parse HTML?
And, if so, is html5ever from servo slower or faster? https://github.com/servo/html5ever

Kind regards,
Melroy van den Berg

Add a structure to facilitate .as_str()/.as_bytes()

At the moment, Element has two methods, as_bytes() and as_str(), to allow access to the raw bytes or conversion to a UTF-8 string. Also, Element has no way to access its entire contents (from start to end) as either raw bytes or str, only as String (which is incidentally why the writer cannot use write_wrapped_str for start elements). Attributes, on the other hand, iterates over raw keys with converted strings. Maybe we could introduce a "Text" structure or something similar, that:

automatically converts to [u8] (this is what the Deref trait allows, if I understand correctly)
has an as_str() method to do the explicit conversion

Then Attributes would iterate over (Text, Text), and Element could have two functions: name(), which returns a Text over the bytes from start to name_end, and content(), which returns a Text over the bytes from start to end.

Not sure if Text is a good name, but you get the idea :-)
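
A rough sketch of what such a wrapper might look like, assuming it simply borrows a byte slice from the parser's buffer. The `Text` type here is the hypothetical one proposed above, not an existing quick-xml type.

```rust
use std::ops::Deref;
use std::str::{self, Utf8Error};

/// Hypothetical wrapper over bytes borrowed from the parser's buffer.
struct Text<'a>(&'a [u8]);

impl<'a> Deref for Text<'a> {
    type Target = [u8];

    // Deref lets a `Text` be used wherever a `&[u8]` is expected.
    fn deref(&self) -> &[u8] {
        self.0
    }
}

impl<'a> Text<'a> {
    /// Explicit, fallible conversion to UTF-8; nothing is copied.
    fn as_str(&self) -> Result<&'a str, Utf8Error> {
        str::from_utf8(self.0)
    }
}
```

Slice methods such as `len()` or indexing then work directly on a `Text`, while the UTF-8 check only runs when `as_str()` is called.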

Another fuzzing issue

thread '<unnamed>' panicked at 'slice index starts at 7 but ends at 5', /checkout/src/libcore/slice/mod.rs:751:4

Here is the testcase generated by cargo-fuzz:

(\x00\x00(<!--

I haven’t been able to figure out where in the code this panic happens, perhaps in the comment parsing code.

get_encoding method for Reader

Would you be open to the idea of exposing the encoding used by Reader via a get_encoding method?

In my use case I'm interested in more than the decode method (i.e. the name of the encoding).
Sometimes I pass a &Reader into a function only so that it can call decode. Being able to pass an &Encoding would express my intent more clearly and help me get rid of the generic parameter introduced by Reader, further reducing noise.

Usually I am against too many getter functions, since they are a sign of a leaky abstraction. Yet I am using quick-xml precisely because it is not an abstraction, but allows me to make many low-level decisions.

That being said, I do have a workaround. So if this collides with your design, I would not mind too much.

A way to get the row/column of a Reader

xml-rs provides TextPosition for this, while quick-xml only has Reader::buffer_position() to get the byte position. It would be useful for error reporting if there was a way to compute the row and column from this (preferably without overhead if unused).
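
As a sketch of the "no overhead if unused" idea: when the caller still has the input bytes, the row and column can be computed after the fact from the byte offset that Reader::buffer_position() reports. The helper name `line_col` is hypothetical, and columns are counted in bytes rather than characters.

```rust
/// Returns the 1-based (line, column) of byte offset `pos` within `input`.
/// Hypothetical helper; the column counts bytes, not Unicode characters.
fn line_col(input: &[u8], pos: usize) -> (usize, usize) {
    // Clamp so an offset at or past the end still gives a usable answer.
    let pos = pos.min(input.len());
    let before = &input[..pos];
    // Every newline before `pos` starts a new line.
    let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
    // The current line starts right after the last newline, or at offset 0.
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    (line, pos - line_start + 1)
}
```

This scans the input only when an error is actually reported, so the parsing hot path is unaffected. It does not help when the input is a stream that has already been consumed.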

Panic on overflow in subtraction

Found by cargo-fuzz by @frewsxcv

extern crate quick_xml;

use quick_xml::reader::Reader;
use std::io::Cursor;
fn main() {
    let data : &[u8] = b"\xe9\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n(\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00<>\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00<<\x00\x00\x00";
    let cursor = Cursor::new(data);
    let mut reader = Reader::from_reader(cursor);
    let mut buf = vec![];
    loop {
        match reader.read_event(&mut buf) {
            Ok(quick_xml::events::Event::Eof) | Err(..) => break,
            _ => buf.clear(),
        }
    }
}

🐇 RUST_BACKTRACE=1 ../target/debug/read_xml
thread 'main' panicked at 'attempt to subtract with overflow', /home/manishearth/.cargo/git/checkouts/quick-xml-df13d551d3762172/0fd7fbb/src/reader.rs:368
stack backtrace:
   1:     0x560c727f24b9 - std::sys::imp::backtrace::tracing::imp::write::hbb14611794d3841b
                        at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
   2:     0x560c727f44ce - std::panicking::default_hook::{{closure}}::h6ed906c7818ac88c
                        at /checkout/src/libstd/panicking.rs:351
   3:     0x560c727f40d4 - std::panicking::default_hook::h23eeafbf7c1c05c3
                        at /checkout/src/libstd/panicking.rs:367
   4:     0x560c727f48cb - std::panicking::rust_panic_with_hook::hd0067971b6d1240e
                        at /checkout/src/libstd/panicking.rs:545
   5:     0x560c727f4754 - std::panicking::begin_panic::h1fd1f10a3de8f902
                        at /checkout/src/libstd/panicking.rs:507
   6:     0x560c727f46c9 - std::panicking::begin_panic_fmt::haa043917b5d6f21b
                        at /checkout/src/libstd/panicking.rs:491
   7:     0x560c727f4657 - rust_begin_unwind
                        at /checkout/src/libstd/panicking.rs:467
   8:     0x560c72819c5d - core::panicking::panic_fmt::he9c7f335d160b59d
                        at /checkout/src/libcore/panicking.rs:69
   9:     0x560c72819b94 - core::panicking::panic::hb790668694ff6b20
                        at /checkout/src/libcore/panicking.rs:49
  10:     0x560c727c56d1 - <quick_xml::reader::Reader<B>>::read_start::h4ca5c41cb76479cf
                        at /home/manishearth/.cargo/git/checkouts/quick-xml-df13d551d3762172/0fd7fbb/src/reader.rs:368
  11:     0x560c727c6635 - <quick_xml::reader::Reader<B>>::read_until_close::hfbfc33da61c25d63
                        at /home/manishearth/.cargo/git/checkouts/quick-xml-df13d551d3762172/0fd7fbb/src/reader.rs:209
  12:     0x560c727c4fe7 - <quick_xml::reader::Reader<B>>::read_event::h44d5632c6f14e52c
                        at /home/manishearth/.cargo/git/checkouts/quick-xml-df13d551d3762172/0fd7fbb/src/reader.rs:393
  13:     0x560c727d1729 - read_xml::main::h4120cc96af0987c8
                        at /home/manishearth/mozilla/fuzz/targets/quick-xml/read_xml.rs:12
  14:     0x560c727fb1fa - __rust_maybe_catch_panic
                        at /checkout/src/libpanic_unwind/lib.rs:98
  15:     0x560c727f4e16 - std::rt::lang_start::hb7fc7ec87b663023
                        at /checkout/src/libstd/panicking.rs:429
                        at /checkout/src/libstd/panic.rs:361
                        at /checkout/src/libstd/rt.rs:57
  16:     0x560c727d1852 - main
  17:     0x7fbb9807082f - __libc_start_main
  18:     0x560c727c2a58 - _start
  19:                0x0 - <unknown>

cc @pnkfelix

BytesStart .with_attributes() and .extend_attributes() seem redundant

Both of these methods do the same thing; in fact, .with_attributes() calls .extend_attributes() to perform its function. The only real difference is that .with_attributes() consumes self while .extend_attributes() uses &mut self. Since there's no real value in having both of these, in my opinion, I propose we remove .with_attributes() and instead only use .extend_attributes().
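
To make the difference concrete, here is a small sketch using a hypothetical `Start` type as a stand-in for BytesStart (it is not the real struct, and its attributes are plain string pairs for brevity). The consuming variant is just a thin wrapper over the in-place one.

```rust
/// Hypothetical stand-in for a start tag; not quick-xml's BytesStart.
#[derive(Debug, Default, PartialEq)]
struct Start {
    attrs: Vec<(String, String)>,
}

impl Start {
    /// In-place variant: borrows mutably and returns `&mut Self` for chaining.
    fn extend_attributes<I>(&mut self, attrs: I) -> &mut Self
    where
        I: IntoIterator<Item = (String, String)>,
    {
        self.attrs.extend(attrs);
        self
    }

    /// Consuming variant: takes `self` by value and delegates to the
    /// in-place variant, which is why the two look redundant.
    fn with_attributes<I>(mut self, attrs: I) -> Self
    where
        I: IntoIterator<Item = (String, String)>,
    {
        self.extend_attributes(attrs);
        self
    }
}
```

The consuming form is mostly a convenience for building a value in a single expression; everything it does can be written with a `let mut` binding and the `&mut self` form.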

`read_event` vs. `read_namespaced_event`

Interleaving calls to read_event and read_namespaced_event could lead to surprising behavior (internal stack of namespaces would not be properly maintained?). It seems the Reader should always be either tracking or not tracking the namespaces, and as such should expose some configuration fn parse_namespaces(self, parse: bool) -> Self. That, or perhaps there should be separate implementations, EventReader, NamespacedEventReader, with only the latter exposing fns such as read_namespaced_event.

Docs missing

e.g. http://tafia.github.io/quick-xml/quick_xml/error/type.ResultPos.html

escaping BytesText does not work in 0.10

The BytesText::escaped function no longer escapes in 0.10. The problem stems from the following change in #96: https://github.com/tafia/quick-xml/pull/96/files#diff-8fa47ebb5ed7bcd20efc67fa222b62aaL344

Improve API for creating new elements

Instead of Element::new(name, attributes), we could separate the text and attributes parts:

Element::new(text) // works for any kind of element: text, comment, start, end...
element.add_attributes(iter) // only useful for start elements

This way, no need to pass an empty iterator for elements where attributes don't make sense (text or comment).

I would also change the Element::new to use an AsRef<[u8]> for the name, to allow faster/easier creation of elements from existing ones.

quick_xml reader is private

use quick_xml::reader::Reader;

error[E0603]: module reader is private

Escape characters

https://www.w3.org/TR/xml/#syntax
http://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents

Add a fn on Element or on AsStr, so the escape is done on demand and there is no perf penalty when xml is known to be simple enough.
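
A minimal sketch of on-demand escaping, assuming the five predefined XML entities are the only ones of interest. The free function `escape` is hypothetical here; the point is that returning a `Cow` lets the common "nothing to escape" case borrow the input without allocating.

```rust
use std::borrow::Cow;

/// Escapes the five XML special characters, allocating only when needed.
/// Hypothetical helper for illustration.
fn escape(raw: &str) -> Cow<'_, str> {
    // Fast path: no special character, so hand the input back untouched.
    if !raw.contains(|c: char| matches!(c, '<' | '>' | '&' | '\'' | '"')) {
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len() + 8);
    for c in raw.chars() {
        match c {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            '\'' => out.push_str("&apos;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    Cow::Owned(out)
}
```

Callers who know their text is already safe simply never call the function, so they pay nothing.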

Doctypes and entities

I'm working on parsing XML files that may utilize external entities. Currently when I throw them through quick-xml parsing the doctype is totally fine but when it gets to the entity there is an error because it's not an entity that is explicitly in the code. I definitely don't want the external entities to be able to bring in files or execute anything but do you have an opinion on allowing them to register new entities so that parsing will at least succeed? The replacement would essentially not be done but it would be able to check for the presence of a document defined entity.

Newlines in attributes are lost

Newlines in attributes are not encoded as XML escape sequences and are lost on re-parsing.

(https://stackoverflow.com/questions/2004386/how-to-save-newlines-in-xml-attribute)
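
For reference, a sketch of the kind of escaping that would preserve them. XML attribute-value normalization replaces literal tab, newline and carriage-return characters with spaces, while numeric character references survive, so a writer has to emit `&#10;` and friends. The function name `escape_attr_value` is hypothetical and assumes a double-quoted attribute.

```rust
/// Escapes a value for use inside a double-quoted XML attribute.
/// Hypothetical helper: whitespace control characters are written as
/// numeric character references so they survive a parse round trip.
fn escape_attr_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '"' => out.push_str("&quot;"),
            '\n' => out.push_str("&#10;"),
            '\r' => out.push_str("&#13;"),
            '\t' => out.push_str("&#9;"),
            _ => out.push(c),
        }
    }
    out
}
```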

Absence of git tags

Would it be possible to get tags onto release commits?

Parse str instead of u8

Parsers like nom support parsing either [u8] or str. Since XML is text, it makes sense to base the parser on text. This reduces the complexity of the parsers that use quick-xml and should make it easier to avoid encoding bugs: there is no more encoding to worry about.

Xml declaration error

I am now getting a panic with the message 'Malformed("Xml declaration must start with '?xml '") at position 130', but the XML does start with <?xml.

Other charsets are not supported

Looks like it works with UTF-8 only.
At least I can confirm that it doesn't work with KOI8-R.

Invalid files silently succeed

Most invalid files silently succeed, producing Text, Start, and Eof events (others are probably possible too).

Examples:

The empty file (produces just Eof)
foo (produces Text followed by Eof)
< (produces Start(BytesStart { buf: [], name_len: 0 }), followed by Eof)
<foo> (produces Start, Eof)

Benchmarks

Add benchmarks

small file
other libs (xml-rs in particular)

CDATA should not be treated as markup

How would you feel about quick-xml not emitting CDATA events? Instead, I would suggest handling CDATA as part of decode(). After all, writing <![CDATA[<example>]]> is just another way of writing &lt;example&gt;. To make matters worse, the two approaches may be mixed: &lt;exa<![CDATA[mple>]]>. With the current API it is really hard to get this right.
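
A sketch of what a unified decode could do, assuming it is handed the raw character data of one element (escaped text possibly interleaved with CDATA sections). Both function names are hypothetical, and only the five predefined entities are handled.

```rust
/// Replaces the five predefined entities; anything else is left untouched.
fn unescape(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        // `&amp;` goes last so that "&amp;lt;" becomes "&lt;" and not "<".
        .replace("&amp;", "&")
}

/// Decodes character data that may mix escaped text and CDATA sections.
/// Hypothetical helper: text outside CDATA is unescaped, text inside is
/// copied verbatim.
fn decode_content(mut rest: &str) -> String {
    const OPEN: &str = "<![CDATA[";
    const CLOSE: &str = "]]>";
    let mut out = String::new();
    while let Some(start) = rest.find(OPEN) {
        out.push_str(&unescape(&rest[..start]));
        let after = &rest[start + OPEN.len()..];
        match after.find(CLOSE) {
            Some(end) => {
                out.push_str(&after[..end]);
                rest = &after[end + CLOSE.len()..];
            }
            None => {
                // Unterminated section: treat the remainder as literal text.
                out.push_str(after);
                rest = "";
            }
        }
    }
    out.push_str(&unescape(rest));
    out
}
```

With this, the plain, CDATA-only and mixed spellings of the same text all decode to the same string.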

Support single quote in attributes

Hello,

May I ask if you could consider adding support for single quote in attributes?

For example current version 0.12.1 produces:

<tag key="value"></tag>

I wish there could be an option to choose between double quote and single quote, so this could be doable:

<tag key='value'></tag>

My main concern is that some existing projects use double quotes and some use single quotes. If the crate supported both, it would help prevent noise in git...

Thank you,
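
To illustrate the request, a sketch of an attribute writer with a selectable quote character. The `Quote` enum and `write_attr` function are hypothetical, not part of the quick-xml API; note that only the quote character actually used as the delimiter needs escaping inside the value.

```rust
/// Hypothetical option selecting the attribute delimiter.
#[derive(Clone, Copy)]
enum Quote {
    Double,
    Single,
}

/// Appends ` key=<q>value<q>` to `out`, escaping the value as needed.
fn write_attr(out: &mut String, key: &str, value: &str, quote: Quote) {
    let q = match quote {
        Quote::Double => '"',
        Quote::Single => '\'',
    };
    out.push(' ');
    out.push_str(key);
    out.push('=');
    out.push(q);
    for c in value.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            // Only the active delimiter has to be escaped.
            '"' if q == '"' => out.push_str("&quot;"),
            '\'' if q == '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out.push(q);
}
```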

panic on read_namespaced_event with different buffers

Environment: Debian 9 amd64
Reproducibility: Always
Version: quick_xml 0.12.1. Also known to be reproducible with 0.11.0.
Steps to reproduce: compile and run this:

extern crate quick_xml; // version "0.11.0" or "0.12.1".

fn main()
{
    let xml = r#"<?xml version='1.0'?><a:a xmlns:a='http://example.org/something' xmlns='b:c'><a:d><hello xmlns='x:y:z'><world>earth</world></hello></a:d>"#;
    let mut parser = quick_xml::Reader::from_str(xml); // for quick_xml 0.11.0, use "quick_xml::reader::Reader".
    let mut buf = Vec::new();
    let mut buf_ns = Vec::new();
    for _ in 0..4 {
      let _ = parser.read_namespaced_event(&mut buf, &mut buf_ns);
    }
    let mut buf = Vec::new();
    let mut buf_ns = Vec::new();
    for _ in 0..4 {
        let _ = parser.read_namespaced_event(&mut buf, &mut buf_ns); // <-- panics here
    }
}

Actual result: it panics:

$ RUST_BACKTRACE=1 cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `/home/willem/e/crash-quick-xml/target/debug/crash-quick-xml`
thread 'main' panicked at 'index 29 out of range for slice of length 0', libcore/slice/mod.rs:785:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at libstd/sys_common/backtrace.rs:59
             at libstd/panicking.rs:380
   3: std::panicking::default_hook
             at libstd/panicking.rs:396
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:576
   5: std::panicking::begin_panic
             at libstd/panicking.rs:537
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:521
   7: rust_begin_unwind
             at libstd/panicking.rs:497
   8: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   9: core::slice::slice_index_len_fail
             at libcore/slice/mod.rs:785
  10: <core::ops::range::Range<usize> as core::slice::SliceIndex<[T]>>::index
             at /checkout/src/libcore/slice/mod.rs:916
  11: core::slice::<impl core::ops::index::Index<I> for [T]>::index
             at /checkout/src/libcore/slice/mod.rs:767
  12: quick_xml::reader::Namespace::prefix
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:855
  13: quick_xml::reader::NamespaceBufferIndex::find_namespace_value::{{closure}}
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:902
  14: core::iter::traits::DoubleEndedIterator::rfind::{{closure}}
             at /checkout/src/libcore/iter/traits.rs:580
  15: <core::slice::Iter<'a, T> as core::iter::traits::DoubleEndedIterator>::try_rfold
             at /checkout/src/libcore/slice/mod.rs:1319
  16: core::iter::traits::DoubleEndedIterator::rfind
             at /checkout/src/libcore/iter/traits.rs:579
  17: <core::iter::Rev<I> as core::iter::iterator::Iterator>::find
             at /checkout/src/libcore/iter/mod.rs:459
  18: quick_xml::reader::NamespaceBufferIndex::find_namespace_value
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:899
  19: <quick_xml::reader::Reader<B>>::read_namespaced_event
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:566
  20: crash_quick_xml::main
             at src/main.rs:15
  21: std::rt::lang_start::{{closure}}
             at /checkout/src/libstd/rt.rs:74
  22: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:479
  23: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  24: std::rt::lang_start_internal
             at libstd/panicking.rs:458
             at libstd/panic.rs:358
             at libstd/rt.rs:58
  25: std::rt::lang_start
             at /checkout/src/libstd/rt.rs:74
  26: main
  27: __libc_start_main
  28: _start

Expected result:

if the API is not supposed to be used like this, and if it is possible to enforce that with Rust's typesystem: the API is written in such a way that Rust's typesystem forbids to use the library like this
if the API is not supposed to be used like this, and if it is not possible to enforce it with Rust's typesystem: the English API specifications state it must not be used like this
if the API does not intend to forbid to use it like this: it should not panic.

Getting to deeply nested XML

Hi,

Is there an easy way of wading through deeply nested XML?

For example, to get to:

{configData.xsd}configData/{genericNrm.xsd}SubNetwork/{genericNrm.xsd}SubNetwork/{genericNrm.xsd}MeContext/{genericNrm.xsd}ManagedElement/{utranNrm.xsd}RncFunction/{utranNrm.xsd}attributes/{utranNrm.xsd}mcc")

Would I need a nested loop+match for each level?

Or is there a way of finding out the parents of a StartEvent?

Cheers,
James
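
One common way to avoid a nested loop per level with a pull parser is to keep a stack of open element names and compare it against the wanted path. The sketch below uses a simplified, hypothetical `Ev` enum in place of the reader's real events, so it only shows the bookkeeping, not the quick-xml calls.

```rust
/// Hypothetical, simplified events; a real reader yields its own event type.
enum Ev<'a> {
    Start(&'a str),
    End,
    Text(&'a str),
}

/// Returns the first text found directly under the element path `wanted`.
fn find_text<'a>(events: &[Ev<'a>], wanted: &[&str]) -> Option<&'a str> {
    // The stack always holds the names of the currently open elements,
    // i.e. the parents of whatever event comes next.
    let mut path: Vec<&str> = Vec::new();
    for ev in events {
        match *ev {
            Ev::Start(name) => path.push(name),
            Ev::End => {
                path.pop();
            }
            Ev::Text(t) if path == wanted => return Some(t),
            Ev::Text(_) => {}
        }
    }
    None
}
```

The same stack also answers the second question: at any Start event, the parents are simply the current contents of `path`.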

Add examples

Improve docs on `Reader#read_event()`, `BytesText#decode` and buf arguments

I was recently stuck on an issue when migrating one of my tools to quick-xml. The problem was that the parser accumulates bytes read into the given buffer, and the method BytesText#unescape_and_decode was simply grabbing everything in the buffer into a String, which included previously read content that was not even part of the text segment. Apparently this means that users of this API must clear the buffer before reading the next event. This isn't intuitive at the moment.

Is there something I'm missing that should be included in the documentation? Besides, what is the purpose of this buf argument? Couldn't reader objects simply own the buffer?
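
The behavior described above resembles that of the standard library's `BufRead::read_until`, which also appends to the caller's buffer rather than overwriting it. The sketch below is only an analogy built on std, not quick-xml's implementation; the function `chunks` and its flag are hypothetical.

```rust
use std::io::{BufRead, Cursor};

/// Splits `input` after each `>` using one reused buffer, and records what
/// the buffer contains after every read.
fn chunks(input: &[u8], clear_between_reads: bool) -> Vec<String> {
    let mut reader = Cursor::new(input);
    let mut buf = Vec::new();
    let mut seen = Vec::new();
    loop {
        // read_until appends to `buf`; it never truncates it first.
        let n = reader
            .read_until(b'>', &mut buf)
            .expect("reading from memory does not fail");
        if n == 0 {
            break;
        }
        seen.push(String::from_utf8_lossy(&buf).into_owned());
        if clear_between_reads {
            // Keeps the allocation but drops the previous contents.
            buf.clear();
        }
    }
    seen
}
```

Without the `clear()`, each read sees everything accumulated so far. Letting the caller own the buffer is what allows one allocation to be reused across all events.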

More tests

Try implementing all of xml-rs' tests, for a start?

Question: Element body

This is more a question than an issue:
Given an Event for the start of an element, how can I obtain the entire element body, from its start to its end?

Namespaces

Add on-demand namespace management

whenever there is an Event::Start, insert/update the namespace hashmap
whenever there is an Event::End, remove/update the namespace hashmap
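
One possible shape for this, sketched as a stack of per-element maps rather than a single hashmap (an assumption of this sketch, chosen so that closing an element restores the outer bindings automatically). The `NamespaceStack` type is hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical scope stack: one map of prefix -> URI per open element.
#[derive(Default)]
struct NamespaceStack {
    scopes: Vec<HashMap<String, String>>,
}

impl NamespaceStack {
    /// Call on a start event with the (prefix, uri) pairs declared on it.
    /// The empty prefix "" stands for the default namespace.
    fn push_scope(&mut self, decls: &[(&str, &str)]) {
        let scope = decls
            .iter()
            .map(|&(prefix, uri)| (prefix.to_string(), uri.to_string()))
            .collect();
        self.scopes.push(scope);
    }

    /// Call on the matching end event.
    fn pop_scope(&mut self) {
        self.scopes.pop();
    }

    /// Looks a prefix up, innermost scope first.
    fn resolve(&self, prefix: &str) -> Option<&str> {
        self.scopes
            .iter()
            .rev()
            .find_map(|scope| scope.get(prefix))
            .map(String::as_str)
    }
}
```

Users who never call `resolve` only pay for pushing and popping scopes. A real implementation would likely avoid the per-element `String` allocations.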

Attributes containing "=" result in an error

If you pass in an input like <a att1="a=b"/>, instead of receiving an attribute with the value "a=b" you'll get an error that says Err((Malformed("Got 2 '=' tokens"), 9)).
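
The expected behavior can be sketched by splitting on the first '=' only and treating everything inside the quotes as the value. The helper `split_attr` is hypothetical and handles a single `key="value"` pair, not a whole tag.

```rust
/// Splits one `key="value"` (or `key='value'`) pair.
/// Hypothetical helper: only the first '=' separates key from value, so any
/// later '=' inside the quotes is part of the value.
fn split_attr(raw: &str) -> Option<(&str, &str)> {
    let (key, rest) = raw.split_once('=')?;
    let rest = rest.trim();
    // The value must start with a quote and end with the same quote.
    let quote = rest.chars().next().filter(|c| *c == '"' || *c == '\'')?;
    let value = rest[1..].strip_suffix(quote)?;
    Some((key.trim(), value))
}
```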

Strip namespace from node name

This could be either a BytesStart method or an update of read_namespaced_event.

Linked to tafia/calamine#73
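
A minimal sketch of the stripping itself, working on the raw bytes of a qualified name. The free function `local_name` is hypothetical; it returns a sub-slice of its input, so nothing is copied.

```rust
/// Returns the part of a qualified name after the prefix separator ':',
/// or the whole name when there is no prefix. Hypothetical helper.
fn local_name(qname: &[u8]) -> &[u8] {
    match qname.iter().position(|&b| b == b':') {
        Some(i) => &qname[i + 1..],
        None => qname,
    }
}
```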

Documentation On How to Iterate Through a Series of XML Elements and Make Objects Out of Those Elements.

For example, how could I iterate through the following xml elements and extract each text:
<tag1>text1</tag1><tag1>text2</tag1><tag1>text3</tag1><tag1><tag2>text4</tag2></tag1>

EscapeError(UnterminatedEntity) when doing unescape_and_decode on a CDATA containing an ampersand

I'm trying to use unescape_and_decode() on a CDATA similar to the following. I don't even know if the source XML is valid or if the ampersand should be encoded.

<tag><![CDATA[some & thing]]></tag>

error

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: EscapeError(UnterminatedEntity(14..39))', libcore/result.rs:945:5

Any way to avoid copying attribute values when writing XML?

Some of the attributes I have are pretty long, so I would like to avoid copying them.

However, it does not look like there is a way to do that. I have to pass them through Event::Start (so indentation works as expected), and that requires placing them into BytesStart (which would do the copying).

P.S. doing it without copying probably means escaping them on the fly.

Panic when escaping values

Steps to reproduce:

fetch my xml branch of cobalt.
RUST_BACKTRACE=1 cargo test --test mod

(sorry I've not taken time to dig into my output to RSS or RSS's output to quick-xml)

I'm making the assumption this is a quick-xml bug and not a rss bug because of my quick glance at the quick-xml code. If this is wrong, I'll move this over to rss.

Backtrace

        thread 'rss' panicked at 'slice index starts at 28 but ends at 26', /checkout/src/libcore/slice/mod.rs:741:4
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
...
   8: core::panicking::panic_fmt
             at /checkout/src/libcore/panicking.rs:92
   9: core::slice::slice_index_order_fail
             at /checkout/src/libcore/slice/mod.rs:741
  10: <core::ops::range::Range<usize> as core::slice::SliceIndex<[T]>>::index
             at /checkout/src/libcore/slice/mod.rs:864
  11: core::slice::<impl core::ops::index::Index<I> for [T]>::index
             at /checkout/src/libcore/slice/mod.rs:717
  12: quick_xml::escape::escape
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.9.2/src/escape.rs:57
  13: quick_xml::events::BytesText::escaped
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.9.2/src/events/mod.rs:344
  14: <quick_xml::writer::Writer<W>>::write_event
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.9.2/src/writer.rs:81
  15: <quick_xml::writer::Writer<W> as rss::toxml::WriterExt>::write_text_element
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/rss-1.0.0/src/toxml.rs:59
  16: <rss::item::Item as rss::toxml::ToXml>::to_xml
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/rss-1.0.0/src/item.rs:617
  17: <&'a T as rss::toxml::ToXml>::to_xml
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/rss-1.0.0/src/toxml.rs:20
  18: <quick_xml::writer::Writer<W> as rss::toxml::WriterExt>::write_objects
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/rss-1.0.0/src/toxml.rs:103
  19: <rss::channel::Channel as rss::toxml::ToXml>::to_xml
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/rss-1.0.0/src/channel.rs:1284
  20: rss::channel::Channel::write_to
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/rss-1.0.0/src/channel.rs:1048
  21: <rss::channel::Channel as alloc::string::ToString>::to_string
             at /home/epage/.cargo/registry/src/github.com-1ecc6299db9ec823/rss-1.0.0/src/channel.rs:1058
  22: cobalt::cobalt::create_rss
             at src/cobalt.rs:295
  23: cobalt::cobalt::build
             at src/cobalt.rs:197
...

Attributes with `>` character are not properly managed

Broken tests:
https://github.com/tafia/quick-xml/blob/master/tests/xmlrs_reader_tests.rs#L45-L80

BytesStart unescaped method returns the attributes in the same result

If you had, for example, the following XML snippet <namespace:hello> greeting='wave'> and the corresponding BytesStart for it (let's call it element), element.unescaped() would return namespace:hello> greeting='wave', while .name() and .local_name() include only the actual tag part (namespace:hello> and hello> respectively), not the attributes.

This goes against the documentation which says it should just return the unescaped tag name.

Writer planned?

Very promising crate, I like the emphasis on performance, and the simple API. I would gladly use it, except that my use case is to read XML, transform it (replace text, include other XML files, etc.), and then write the result.

Do you envision the crate to support writing XML as well? If so, this is something I would be glad to contribute! Hopefully with a quick writer, too :-)

Is there a recommended way to obtain the line number to a buffer position?

I would like to know the line in which an error occurred. Is there already a way of doing this?
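
One possible approach, sketched here under assumptions rather than taken from the crate: if the whole input is still available as a byte slice, a byte offset such as the one returned by `reader.buffer_position()` can be turned into a line and column by counting newlines before it. The helper name `line_and_column` is hypothetical, and the column is counted in bytes, not characters.

```rust
/// Converts a byte offset into a 1-based (line, column) pair.
/// Hypothetical helper, not part of quick-xml. Offsets past the end of
/// the input are clamped to the input length. The column is in bytes.
fn line_and_column(input: &[u8], offset: usize) -> (usize, usize) {
    let end = offset.min(input.len());
    let consumed = &input[..end];
    // Every `\n` before the offset starts a new line.
    let line = consumed.iter().filter(|&&b| b == b'\n').count() + 1;
    // The column is the distance in bytes from the last newline seen.
    let column = match consumed.iter().rposition(|&b| b == b'\n') {
        Some(newline) => end - newline,
        None => end + 1,
    };
    (line, column)
}

fn main() {
    let xml = "<a>\n  <b>\n</a>";
    // Offset 6 is the `<` of `<b>` on the second line.
    let (line, column) = line_and_column(xml.as_bytes(), 6);
    println!("line {}, column {}", line, column);
}
```

This only works when the input is held in memory. For a streaming source the newlines would have to be counted while reading.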

Change `position` name

It conflicts with the position method on the Iterator trait, which XmlReader also implements.
It represents the current buffer position, not the exact position where the error was found (this could probably be improved later on), so the name could be misleading.

read_namespaced_event mutable reference to reader

First, thank you for the nice package. I'm using read_namespaced_event in the following example.
The problem is that I can't read the element's inner text (unescape_and_decode) because the borrow checker tells me that a mutable reference has already been taken for "reader" (on the read_namespaced_event line). How can I solve this?

extern crate quick_xml;
use quick_xml::reader::Reader;
use quick_xml::events::Event;
use std::io::Read;

struct Resource {
    etag: String,
    calendar_data: String
}

struct Prop {
    namespace: String,
    local_name: String,
    value: String
}

impl Prop {
    fn new() -> Prop {
        Prop {
            namespace: String::new(),
            local_name: String::new(),
            value: String::new()
        }
    }
}

struct PropStat {
    status: String,
    props: Vec<Prop>
}

impl PropStat {
    fn new() -> PropStat {
        PropStat {
            status: String::new(),
            props: Vec::<Prop>::new()
        }
    }
}

struct Response {
    href: String,
    propstats: Vec<PropStat>
}

impl Response {
    fn new() -> Response {
        Response {
            href: String::new(),
            propstats: Vec::<PropStat>::new()
        }
    }
}

fn parse_report(xml_data: &str) -> Vec<Resource> {
    let result = Vec::<Resource>::new();

    let mut reader = Reader::from_str(xml_data);
    reader.trim_text(true);

    let mut count = 0;
    let mut buf = Vec::new();

    #[derive(Clone, Copy)]
    enum State {
        Root,
        MultiStatus,
        Response,
        Success,
        Error
    };

    let mut responses = Vec::<Response>::new();
    let mut current_response = Response::new();
    let mut current_prop_stat = PropStat::new();
    let mut current_prop = Prop::new();

    let mut depth = 0;
    let mut state = State::MultiStatus;

    loop {
        match reader.read_namespaced_event(&mut buf) {
            Ok((ref namespace_value, Event::Start(ref e))) => {
                let namespace_value = namespace_value.unwrap_or_default();
                let local_name = e.local_name();
                match (depth, state, namespace_value, local_name) {
                    (0, State::Root, b"DAV:",b"multistatus") => { state = State::MultiStatus },
                    (1, State::MultiStatus, b"DAV:", b"response") => { state = State::Response; current_response = Response::new(); }
                    (2, State::Response, b"DAV:", b"href") => { current_response.href = String::from(e.unescape_and_decode(&reader).unwrap()); }
                    _ => {}
                }
                depth += 1;
            },
            Ok((ref namespace_value, Event::End(ref e))) => {
                let namespace_value = namespace_value.unwrap_or_default();
                let local_name = e.local_name();
                match (depth, state, namespace_value, local_name) {
                    (1, State::MultiStatus, b"DAV:", b"multistatus") => { state = State::Root }
                    (2, State::MultiStatus, b"DAV:", b"multistatus") => { state = State::MultiStatus }
                    _ => {}
                }
                depth -= 1;
            },
            Ok((_, Event::Eof)) => break,
            Err(e) => break,
            _ => ()
        }
    }
    result
}

fn main() {
    let test_data = r#" 
<?xml version="1.0" encoding="UTF-8"?>
<D:multistatus xmlns:D="DAV:" xmlns:caldav="urn:ietf:params:xml:ns:caldav" xmlns:cs="http://calendarserver.org/ns/" xmlns:ical="http://apple.com/ns/ical/">
 <D:response xmlns:carddav="urn:ietf:params:xml:ns:carddav" xmlns:cm="http://cal.me.com/_namespace/" xmlns:md="urn:mobileme:davservices">
  <D:href>/caldav/v2/johndoh%40gmail.com/events/07b7it7uonpnlnvjldr0l1ckg8%40google.com.ics</D:href>
  <D:propstat>
   <D:status>HTTP/1.1 200 OK</D:status>
   <D:prop>
    <D:getetag>"63576798396"</D:getetag>
    <caldav:calendar-data>BEGIN:VCALENDAR</caldav:calendar-data>
   </D:prop>
  </D:propstat>
 </D:response>
</D:multistatus>
"#;

    parse_report(test_data);
}

generate an empty text event on expanding self closing tags

I'd like to suggest generating empty Text events when self-closing tags are expanded and empty Text events are not trimmed. How do you feel about it?

Corrupt symbols when reading 'ascii' encoding with non-ASCII escapes

I'm not 100% sure, but I think unescape_and_decode performs its operations in the wrong order.

I have an XML file which declares 'ascii' encoding and uses escapes for non-ASCII characters. The current implementation unescapes the characters first (yielding UTF-8) and then decodes, which I think corrupts the UTF-8 characters, since the document encoding is set to 'ascii'.

Link to the XML: http://unitsofmeasure.org/ucum-essence.xml

See the "ampere" definition at line 252.

(Obviously, the workaround for me is to change the encoding from 'ascii' to 'UTF-8', as 'ascii' is a subset.)
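
To illustrate why the order matters, here is a minimal standard-library sketch. It does not use quick-xml's real decoder or unescaper: `decode_ascii` and `unescape_hex_refs` are hypothetical stand-ins, and U+03A9 is an arbitrary non-ASCII character chosen for the example, not necessarily the one in the linked file.

```rust
/// Hypothetical stand-in for a decoder: treats the input as 7-bit ASCII
/// and replaces every byte outside that range with U+FFFD.
fn decode_ascii(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| if b.is_ascii() { b as char } else { '\u{FFFD}' })
        .collect()
}

/// Hypothetical stand-in for unescaping: resolves only hexadecimal
/// character references such as `&#x3A9;` and copies the rest through.
fn unescape_hex_refs(text: &str) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find("&#x") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 3..];
        match after.find(';') {
            Some(end) => {
                let resolved = u32::from_str_radix(&after[..end], 16)
                    .ok()
                    .and_then(char::from_u32);
                match resolved {
                    Some(c) => out.push(c),
                    // Leave malformed references untouched.
                    None => out.push_str(&rest[start..start + 3 + end + 1]),
                }
                rest = &after[end + 1..];
            }
            None => {
                // No terminating `;`: copy the remainder as-is.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let raw = b"&#x3A9;"; // ASCII-only bytes, as stored in the file

    // Decode first, then unescape: the reference survives decoding intact.
    let good = unescape_hex_refs(&decode_ascii(raw));

    // Unescape first (producing multi-byte UTF-8), then decode as ASCII.
    let unescaped = unescape_hex_refs(std::str::from_utf8(raw).unwrap());
    let bad = decode_ascii(unescaped.as_bytes());

    println!("decode -> unescape: {}", good);
    println!("unescape -> decode: {}", bad);
}
```

In this sketch, decoding before unescaping keeps the character, while the reverse order turns each of its UTF-8 bytes into a replacement character.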

Improve documentation

Explain the design: what Element is, and why everything is based on one big vector instead of splitting all properties directly ....

More examples as well

resolving namespaces for things other than tags

After calling reader.read_namespaced_event, the namespace buffer is borrowed to the first element of the returned tuple. As long as that is borrowed, it is not possible to resolve other qnames that appear in attributes or values.

The only solution I can think of is to pass the namespace buffer along, so

pub fn read_namespaced_event<'a, 'b, 'c>(
    &'a mut self,
    buf: &'b mut Vec<u8>,
    namespace_buffer: &'c mut Vec<u8>,
) -> Result<(&'c [u8], Option<&'c [u8]>, Event<'b>)> {
}

Panic when trim text is active

When parsing the following XML

<Run>
<!B>
</Run>

with reader.trim_text(true); set, the reader panics with:

        thread 'parse::livesplit_fuzz_crash' panicked at 'index 5 out of range for slice of length 4', src\libcore\slice\mod.rs:748:4
stack backtrace:
   0:           0xd563cf - std::sys_common::backtrace::_print::h788ffa2f5ef861b7
                               at src\libstd\sys\windows\backtrace/mod.rs:65
                               at src\libstd\sys_common/backtrace.rs:71
   1:           0xd67b76 - std::panicking::default_hook::{{closure}}::h07b19ee06ce409af
                               at src\libstd\sys_common/backtrace.rs:60
                               at src\libstd/panicking.rs:381
   2:           0xd6785a - std::panicking::default_hook::h5f4807df908f2485
                               at src\libstd/panicking.rs:391
   3:           0xd6803c - std::panicking::rust_panic_with_hook::h4b94a3a8ebb03363
                               at src\libstd/panicking.rs:611
   4:           0xd67f08 - std::panicking::begin_panic::hc1afa6ec7886430a
                               at src\libstd/panicking.rs:572
   5:           0xd67deb - std::panicking::begin_panic_fmt::h745ccd445d0a2438
                               at src\libstd/panicking.rs:522
   6:           0xd67d6f - rust_begin_unwind
                               at src\libstd/panicking.rs:498
   7:           0xd7b01e - core::panicking::panic_fmt::h5c59f72fc2fa4c71
                               at src\libcore/panicking.rs:71
   8:           0xd7b111 - core::slice::slice_index_len_fail::h0eddb005211069d8
                               at src\libcore\slice/mod.rs:748
   9:           0x4a1821 - <core::ops::range::Range<usize> as core::slice::SliceIndex<[T]>>::index::hd1fc40186de97550
                               at C:\projects\rust\src\libcore\slice/mod.rs:879
  10:           0x47c5b0 - core::slice::<impl core::ops::index::Index<I> for [T]>::index::hdb6bef459e9f75c3
                               at C:\projects\rust\src\libcore\slice/mod.rs:730
  11:           0x40265c - <alloc::vec::Vec<T> as core::ops::index::Index<core::ops::range::Range<usize>>>::index::hec15d839b3cc4514
                               at C:\projects\rust\src\liballoc/vec.rs:1575
  12:           0x45da41 - <quick_xml::reader::Reader<B>>::read_bang::hbb0897709c9e1f08
                               at C:\Users\Christopher Serr\.cargo\registry\src\github.com-1ecc6299db9ec823\quick-xml-0.9.3\src/reader.rs:277
  13:           0x45c723 - <quick_xml::reader::Reader<B>>::read_until_close::h364d8540707f02bd
                               at C:\Users\Christopher Serr\.cargo\registry\src\github.com-1ecc6299db9ec823\quick-xml-0.9.3\src/reader.rs:224
  14:           0x45b00a - <quick_xml::reader::Reader<B>>::read_event::h2d8f1f169a3604f1
                               at C:\Users\Christopher Serr\.cargo\registry\src\github.com-1ecc6299db9ec823\quick-xml-0.9.3\src/reader.rs:454
  15:           0x45baa2 - <quick_xml::reader::Reader<B>>::read_until_open::hab15a807d188a780
                               at C:\Users\Christopher Serr\.cargo\registry\src\github.com-1ecc6299db9ec823\quick-xml-0.9.3\src/reader.rs:167
  16:           0x45b01d - <quick_xml::reader::Reader<B>>::read_event::h2d8f1f169a3604f1
                               at C:\Users\Christopher Serr\.cargo\registry\src\github.com-1ecc6299db9ec823\quick-xml-0.9.3\src/reader.rs:455
  17:           0x40ad84 - livesplit_core::run::parser::quick_xml_util::parse_children::h655e1cbb19ec474e
                               at C:\Projekte\livesplit-core\src\run\parser/quick_xml_util.rs:193
  18:           0x446339 - livesplit_core::run::parser::quick_livesplit::parse::{{closure}}::hf7a50f3cece4b789
                               at C:\Projekte\livesplit-core\src\run\parser/quick_livesplit.rs:354
  19:           0x404511 - livesplit_core::run::parser::quick_xml_util::parse_base::h8be9f426055b57a7
                               at C:\Projekte\livesplit-core\src\run\parser/quick_xml_util.rs:219
  20:           0x445c6d - livesplit_core::run::parser::quick_livesplit::parse::had2e9529462e5375
                               at C:\Projekte\livesplit-core\src\run\parser/quick_livesplit.rs:346
  21:           0x4a2e7b - parsing::parse::livesplit_fuzz_crash::h739424198cfa3d47
                               at tests/parsing.rs:22
  22:           0x4b38c6 - <F as test::FnBox<T>>::call_box::h8652f97c00eee037
                               at src\libtest/lib.rs:1480
                               at C:\projects\rust\src\libcore\ops/function.rs:223
                               at src\libtest/lib.rs:141
  23:           0xd6b6ce - _rust_maybe_catch_panic
                               at src\libpanic_unwind/lib.rs:99
  24:           0x4a406c - std::sys_common::backtrace::__rust_begin_short_backtrace::h6087b56ec1e9aa74
                               at C:\projects\rust\src\libstd/panicking.rs:459
                               at C:\projects\rust\src\libstd/panic.rs:361
                               at src\libtest/lib.rs:1419
                               at C:\projects\rust\src\libstd\sys_common/backtrace.rs:136
  25:           0x4a4ce7 - std::panicking::try::do_call::hd464f7013f49753a
                               at C:\projects\rust\src\libstd\thread/mod.rs:394
                               at C:\projects\rust\src\libstd/panic.rs:296
                               at C:\projects\rust\src\libstd/panicking.rs:480
  26:           0xd6b6ce - _rust_maybe_catch_panic
                               at src\libpanic_unwind/lib.rs:99
  27:           0x4ac9ac - <F as alloc::boxed::FnBox<A>>::call_box::h8657cf680c887452
                               at C:\projects\rust\src\libstd/panicking.rs:459
                               at C:\projects\rust\src\libstd/panic.rs:361
                               at C:\projects\rust\src\libstd\thread/mod.rs:393
                               at C:\projects\rust\src\liballoc/boxed.rs:682
  28:           0xd65f90 - std::sys::imp::thread::Thread::new::thread_start::h58762d9deebb6e43
                               at C:\projects\rust\src\liballoc/boxed.rs:692
                               at src\libstd\sys_common/thread.rs:21
                               at src\libstd\sys\windows/thread.rs:51
  29:     0x7fff0ff12773 - unit_addrs_search
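
The backtrace shows the panic coming from a range index into a Vec inside read_bang, where the buffer is shorter than the requested range. As a hedged illustration only (this is not quick-xml's code, and `BangKind` and `classify_bang` are hypothetical names), prefix checks written with `starts_with` and `get` return a fallback for short input such as `!B` instead of panicking:

```rust
/// Hypothetical classification of markup that starts with `<!`.
#[derive(Debug, PartialEq)]
enum BangKind {
    Comment,
    CData,
    DocType,
    Unknown,
}

/// Classifies the bytes that follow `<`, without ever indexing past the
/// end of the buffer. Hypothetical helper, not part of quick-xml.
fn classify_bang(buf: &[u8]) -> BangKind {
    if buf.starts_with(b"!--") {
        BangKind::Comment
    } else if buf.starts_with(b"![CDATA[") {
        BangKind::CData
    } else if buf
        .get(..8) // `None` when fewer than 8 bytes are available
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case(b"!DOCTYPE"))
    {
        BangKind::DocType
    } else {
        BangKind::Unknown
    }
}

fn main() {
    // The two bytes from the failing input `<!B>` no longer cause a panic.
    println!("{:?}", classify_bang(b"!B"));
}
```

A caller could map the `Unknown` case to a parse error rather than letting the slice index abort the thread.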
