snoyberg / xml

Various XML utility packages for Haskell

Haskell 100.00%

xml's Issues

Constraints on `renderBytes`, `renderText`

It may be that I misunderstand the docs but the types of renderBytes and renderText are unexpected to me following from the phrase "this module does not provide IO and ST variants, since the underlying rendering operations are pure functions":

renderBuilder :: Monad m => RenderSettings -> Conduit Event m Builder
renderBytes :: (MonadBase base m, PrimMonad base) => RenderSettings -> ConduitM Event ByteString m ()
renderText :: (MonadThrow m, MonadBase base m, PrimMonad base) => RenderSettings -> ConduitM Event Text m ()

The PrimMonad constraints force you to deal quite explicitly with ST or IO to get the answers out. It also seems weird that renderText gets MonadThrow while the other two do not - it's not clear why that one would throw an exception any more than the other two; renderBytes by design can only generate valid UTF8 ByteStrings so conversion to Text can never fail.

xml-conduit: EventBeginElement does not expose the default namespace of its children

I'm trying to use the Event-based conduits from xml-conduit to write an XMPP server. This leads to the following problem:

Every connection starts with a bit of XML of this format:

<stream:stream
        from='[email protected]'
        to='im.example.com'
        version='1.0'
        xml:lang='en'
        xmlns='jabber:client'
        xmlns:stream='http://etherx.jabber.org/streams'>

This is parsed by Text.XML.Stream.Parse.parseText as:

EventBeginElement
    (Name {nameLocalName = "stream", nameNamespace = Just "http://etherx.jabber.org/streams", namePrefix = Just "stream"})
    [ (Name {nameLocalName = "from", nameNamespace = Nothing, namePrefix = Nothing}, [ContentText "[email protected]"])
    , (Name {nameLocalName = "to", nameNamespace = Nothing, namePrefix = Nothing}, [ContentText "im.example.com"])
    , (Name {nameLocalName = "version", nameNamespace = Nothing, namePrefix = Nothing}, [ContentText "1.0"])
    , (Name {nameLocalName = "lang", nameNamespace = Just "http://www.w3.org/XML/1998/namespace", namePrefix = Just "xml"}, [ContentText "en"])]

Note that the 'jabber:client' namespace is completely gone in this format. Differentiating between which namespace is used here is important (a server needs to respond differently to 'jabber:client' and 'jabber:server'), and currently it can only be found by inspecting the children of the element (which aren't there yet, initially).

Rendering has the same problem: I cannot find a way to create an element that has a prefix and a namespace for that prefix, but a different default namespace for its children.

I'm not sure this is the right place, and not a bug in xml-types instead.

Add xml-conduit docs to Hackage

Not having them on Hackage makes the beginner experience significantly worse.

Example in Text.XML.Stream.Parse requires type signatures

If I try and run the first example as-is, I get the following error:

ghc: panic! (the 'impossible' happened)
  (GHC version 8.0.2 for x86_64-apple-darwin):
	nameModule system $dShow_acbe

If, however, I add type signatures, then I get the expected result (though not as shown in the docs: the provided People type isn't a record type, but the printed result is).

performance question

Hi, for simple DOM parsing my program takes 1.20 s, while the same in F# (.NET) takes 0.100 s, roughly 10x slower :( As I am a newbie in Haskell, my question is: am I doing something wrong, or is this just how it is?

{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Text.XML as Xml
import           Formatting
import           Formatting.Clock
import           System.Clock

main :: IO ()
main = do
    start <- getTime Monotonic
    _ <- Xml.readFile Xml.def "..\\data\\file1.xml"
    end <- getTime Monotonic
    fprint ("parse_xml: " % timeSpecs % "\n") start end

parse_xml:      1.20 s
Execution time: 1.55 s

Streaming seems to break when encountering `xmlns` attribute

I'm playing around with this library for the first time, here's my hello-world:

// in file "foo.xml"
<foo>hello</foo>

#!/usr/bin/env stack
{- stack
     --resolver lts-7.15
     runghc
     --package conduit
     --package xml-conduit-1.4.0.2
-}

{-# language OverloadedStrings #-}

module XmlParsing where

import Conduit
import Text.XML.Stream.Parse

main :: IO ()
main = do
  txt <- runConduitRes (parseFile def "foo.xml" .| tagIgnoreAttrs "foo" content)
  print txt

Output:

Just "hello"

However, when I add an xmlns attribute in foo.xml:

<foo xmlns="bar">hello</foo>

The output becomes:

Nothing

Is that supposed to happen, or is it a bug? Thanks!
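One plausible reading, not confirmed in this thread: with xmlns="bar" the element is named foo in the namespace bar, so a matcher for the unqualified name "foo" no longer applies. The sketch below uses the DOM API rather than the streaming one to show which name the parser reports. The helper rootName is hypothetical, and the Clark-notation literal "{bar}foo" relies on the IsString instance of Name.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Text.Lazy as TL
import Text.XML (Document (..), Element (..), Name (..), def, parseText)

-- Hypothetical helper: parse a document and return the name of its root element.
rootName :: TL.Text -> Maybe Name
rootName src = case parseText def src of
  Left _    -> Nothing
  Right doc -> Just (elementName (documentRoot doc))

main :: IO ()
main = do
  -- Without a default namespace the root is the plain name "foo".
  print (rootName "<foo>hello</foo>")
  -- With xmlns="bar" the same tag carries a namespace.
  print (rootName "<foo xmlns=\"bar\">hello</foo>")
  -- Compare against the namespaced name written in Clark notation.
  print (rootName "<foo xmlns=\"bar\">hello</foo>" == Just "{bar}foo")
```

If that reading is right, passing "{bar}foo" instead of "foo" to tagIgnoreAttrs would be the thing to try in the streaming version.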

Offer error mode that pretty prints and diffs XML documents

Currently, figuring out what you did wrong is a pain. Pretty printing and diffing "expected"/"actual" for a user's query into the document would help.

Doctypes seem to be ignored?

From what I can understand, the Document type has a field for doctypes, namely prologueDoctype. However, the doctype in the input document seems to be ignored and is not put into the prologueDoctype field.

From reading test/main.hs, it seems that this is intended?

Parsing pretty printed XML results in whitespace

This behavior surprised me. When pretty-printing XML with xml-conduit, whitespace is added not only before and after tags but also around contents. Parsing this again results in node contents with that whitespace.

Code here
Output:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    bar
</foo>
"\n    bar\n"

I would have expected a pretty printer that outputs <foo>bar</foo>, since "bar" is not a node. But the behavior seems intentional.
However, the parser output seems a logical consequence.
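A minimal sketch of one possible workaround, assuming the added whitespace is never significant for the document at hand: trim the text nodes after parsing. The helper stripWhitespace is hypothetical and not part of xml-conduit; it would also discard meaningful whitespace in mixed content.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Text as T
import Text.XML (Element (..), Node (..))

-- Hypothetical helper: trim every text node and drop those that become empty.
stripWhitespace :: Element -> Element
stripWhitespace el = el { elementNodes = concatMap go (elementNodes el) }
  where
    go (NodeElement child) = [NodeElement (stripWhitespace child)]
    go (NodeContent t)
      | T.null stripped = []
      | otherwise       = [NodeContent stripped]
      where stripped = T.strip t
    -- Comments and processing instructions are kept unchanged.
    go other = [other]

main :: IO ()
main = print (elementNodes (stripWhitespace example))
  where
    -- The element as it comes back from parsing the pretty-printed output.
    example = Element "foo" mempty [NodeContent "\n    bar\n"]
```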

blaze-builder 0.4.0.1 is released, please test and relax upper bounds

Thanks! https://hackage.haskell.org/package/blaze-builder-0.4.0.1

Update to data-default-0.5.0

Hi, if you have the chance, could you please release an update of xml-conduit that compiles with the current version of the data-default package?

DTD ENTITY definitions can themselves have entities in them

Prelude Text.XML> parseText def "<!DOCTYPE foo [<!ENTITY A \"&#65;\" >]><foo>&A;</foo>"
Right (Document {documentPrologue = Prologue {prologueBefore = [], prologueDoctype = Just (Doctype {doctypeName = "foo", doctypeID = Nothing}), prologueAfter = []}, documentRoot = Element {elementName = Name {nameLocalName = "foo", nameNamespace = Nothing, namePrefix = Nothing}, elementAttributes = fromList [], elementNodes = [NodeContent "&#65;"]}, documentEpilogue = []})

Note the NodeContent; when this is rendered, it becomes &#65;, rather than A.

Also note that entities can reference other entities, which is the root of the infamous 'billion laughs' attack; here be dragons. Character entities are safe, though.

This might not be worth supporting properly, but it should definitely explicitly error out rather than producing garbage.

Example on Text.XML.Stream.Parse page doesn't work

Hi, I'm pretty new to haskell, and was thrown off when the example below didn't work.

http://hackage.haskell.org/packages/archive/xml-enumerator/0.4.3.1/doc/html/Text-XML-Stream-Parse.html

This does work: It looks like there are some weird string things going on, and the arguments to parseFile_ are ParseSettings -> FilePath, not FilePath -> ParseSettings. Thanks!

{-# LANGUAGE OverloadedStrings #-}

import Text.XML.Stream.Parse
import Data.Text (Text, unpack) 
import Data.Enumerator
import Data.XML.Types

data Person = Person { age :: Int, name :: Text }
    deriving Show

parsePerson :: Iteratee Event IO (Maybe Person) 
parsePerson = tagName "person" (requireAttr "age") $ \age -> do
    name <- content
    return $ Person (read $ unpack age) name

parsePeople :: Iteratee Event IO (Maybe [Person])
parsePeople = tagNoAttr "people" $ many parsePerson

main = parseFile_ def "people.xml" $ force "people required" parsePeople

row and column information

This is a feature request.

If I'm not mistaken, xml-conduit does not keep track of row and column information (e.g. in the nodes). My use case would benefit from something like that, because potential errors in the xml document are fed back to the user.

Any thoughts on that?

How to combine element and attributeIs axis?

Hi, sorry to bother you. I got confused using xml-conduit cursor.

In XPath, I could filter nodes like
/bookstore/book[@category='WEB']/title
With Cursor,
element "bookstore" &/ element "book" &/ element "title"
works, except for the category attribute filter.
If I use "element &| attributeIs", then I cannot use &/ to combine more axes.

May I know how to do this? Thanks.
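One way this is commonly written, as a sketch rather than a maintainer-endorsed answer: attributeIs is itself an axis, so it can be chained onto element with Kleisli composition (>=>) and then continued with &/. The function webTitles and the sample document are made up for illustration; the cursor is assumed to start on the bookstore element.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad ((>=>))
import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import Text.XML (def, parseText)
import Text.XML.Cursor (Cursor, attributeIs, content, element, fromDocument, ($/), (&/))

-- Roughly book[@category='WEB']/title/text(), applied to the children of the
-- cursor it is given (assumed to sit on <bookstore>).
webTitles :: Cursor -> [Text]
webTitles root =
  root $/ element "book" >=> attributeIs "category" "WEB" &/ element "title" &/ content

-- Illustrative input only.
sample :: TL.Text
sample = TL.concat
  [ "<bookstore>"
  , "<book category=\"WEB\"><title>Learning XML</title></book>"
  , "<book category=\"COOKING\"><title>Everyday Italian</title></book>"
  , "</bookstore>"
  ]

main :: IO ()
main = case parseText def sample of
  Left err  -> print err
  Right doc -> print (webTitles (fromDocument doc))
```

All of these operators are right-associative at the same precedence, so the chain reads left to right without extra parentheses.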

Parsing unordered elements

In many cases, one wants to parse a set of XML elements in whichever order they come. While the current xml-conduit API shines at parsing ordered elements, it takes some effort to use with unordered ones.
Would you accept contributions to support this not-so-uncommon use case? Also, do you have suggestions about how to design and implement it?

In the end, I'd like to be able to write something like:

-- New run-like functions, to choose which behavior to adopt
ordered = undefined
unordered = undefined

-- New consumer, inspired from 'ignoreAttrs'
ignoreElements = undefined

-- 1/ Parse <parentTag>, and some unordered children elements.
-- Fail if there is an unknown child.
tagName "parentTag" ignoreAttrs $ \_ -> unordered $ do
  child1 <- tagName "childTag1" ignoreAttrs (const content)
  -- ...
  childN <- tagName "childTagN" ignoreAttrs (const content)
  return $ Something child1 {- ... -} childN

-- 2/ Parse <parentTag>, and some unordered children elements.
-- Ignore unknown children.
tagName "parentTag" ignoreAttrs $ \_ -> unordered $ do
  child1 <- tagName "childTag1" ignoreAttrs (const content)
  -- ...
  childN <- tagName "childTagN" ignoreAttrs (const content)
  ignoreElements
  return $ Something child1 {- ... -} childN

-- 3/ Parse <parentTag>, and some ordered children elements.
tagName "parentTag" ignoreAttrs $ \_ -> ordered $ do
  child1 <- tagName "childTag1" ignoreAttrs (const content)
  -- ...
  childN <- tagName "childTagN" ignoreAttrs (const content)
  return $ Something child1 {- ... -} childN

-- This requires a change of the signature of 'tagName' (and many other functions)

Rendering optional attributes still runs even if the attribute should not be shown

Consider the following:

[xml|<foo :False:end=#{error "Oops"}>|]

I would expect this to generate just <foo>. However, it seems that the content is evaluated, which leads to the error. This makes it impossible to use isJust/fromJust with optional attributes.

As a more general question, do you think that $maybe should work inside element declarations?

xml-conduit-1.1.0.7 unable to parse CDATA properly

Text.XML.Stream.Parse.parseToken treats a CDATA section as a 'Begin' element

code:

import qualified Data.ByteString.Lazy.Char8 as LB
import qualified Text.XML as XML

main = do
    xml     <- LB.readFile "x.xml"
    print $ XML.parseLBS XML.def xml

xml:

<a><![CDATA[www.google.com]]></a>

output for xml-conduit-1.1.0.7:

Left 1:29-1:33: Expected end element for: Name {nameLocalName = "![CDATA[www.google.com]]", nameNamespace = Nothing, namePrefix = Nothing}, but received: EventEndElement (Name {nameLocalName = "a", nameNamespace = Nothing, namePrefix = Nothing})

output for xml-conduit-1.1.0.3:

Right (Document {documentPrologue = Prologue {prologueBefore = [], prologueDoctype = Nothing, prologueAfter = []}, documentRoot = Element {elementName = Name {nameLocalName = "a", nameNamespace = Nothing, namePrefix = Nothing}, elementAttributes = fromList [], elementNodes = [NodeContent "www.google.com"]}, documentEpilogue = []})

Error in parsing file with extra spaces

In version 0.7.0.3 (and 0.7.0.2) simple parsing like

runResourceT $ XP.parseFile XP.def "test.xml" $$ XP.tagNoAttr "root" $ return ()

works only if file content is "".
If one adds an extra space or LF (" "), one gets an error: ParseError {errorContexts = ["demandInput"], errorMessage = "not enough input"}

In version 0.7.0.1 it was correct.

The possible reason is
sinkToken de = sinkParser ((endOfInput >> return Nothing) <|> fmap Just (parseToken $ psDecodeEntities de)) -- 0.7.3.1
=>
sinkToken = sinkParser . parseToken . psDecodeEntities -- 0.7.3.2

How to extract multiple attributes from one cursor?

Given a node like this:

<foo a="1" b="2"/>

Is there a way to extract both a and b (i.e. a tuple (1,2)) using Text.XML.Cursor?
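A minimal sketch of one approach, assuming both attributes live on the node the cursor points at: call attribute twice on the same cursor and combine the results. The name attrsAB is hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Maybe (listToMaybe)
import Data.Text (Text)
import Text.XML (def, parseText)
import Text.XML.Cursor (Cursor, attribute, fromDocument)

-- Read both attributes from the same cursor; Nothing if either is missing.
attrsAB :: Cursor -> Maybe (Text, Text)
attrsAB c = (,) <$> listToMaybe (attribute "a" c) <*> listToMaybe (attribute "b" c)

main :: IO ()
main = case parseText def "<foo a=\"1\" b=\"2\"/>" of
  Left err  -> print err
  Right doc -> print (attrsAB (fromDocument doc))
```

The values come back as Text; converting them to numbers is left to the caller.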

parseLBS fails to parse (valid?) xml

I'm trying to parse a (maybe not very well formatted) RSS feed with xml-conduit:

> res <- simpleHttp "http://www.semencespaysannes.org/rss.php"
> parseLBS def res
Left Data.Conduit.Text.decode: Error decoding stream of UTF-8 bytes. Error encountered in stream at offset 98. Encountered at byte sequence "\233sea"
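The byte \233 is "é" in ISO-8859-1, so the feed may simply not be UTF-8. A hedged sketch of a workaround under that assumption: decode the bytes as Latin-1 yourself and hand the resulting text to parseText. The helper parseLatin1 and the file name feed.xml are hypothetical; how an encoding declaration inside the document interacts with this is not verified here.

```haskell
module Main where

import Control.Exception (SomeException)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy.Encoding as TLE
import Text.XML (Document, def, parseText)

-- Assumption: the input is ISO-8859-1, where every byte maps to one code point,
-- so decoding itself cannot fail.
parseLatin1 :: BL.ByteString -> Either SomeException Document
parseLatin1 = parseText def . TLE.decodeLatin1

main :: IO ()
main = do
  bytes <- BL.readFile "feed.xml" -- hypothetical local copy of the feed
  either print (const (putStrLn "parsed")) (parseLatin1 bytes)
```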

incorrect handling of iso8859-1

When parsing an iso8859-1 html page, with the header

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

the accented characters are rendered as �s.

Minimal example:

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.Text.IO as TIO
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor

url :: String
url = "http://www.confederationpaysanne.fr/petites_annonces.php"

main :: IO ()
main = do
    -- Fetch the page and parse it (the original snippet read from an undefined `file`)
    cursor <- (fromDocument . parseLBS) `fmap` simpleHttp url
    mapM_ TIO.putStrLn $ cursor $// element "title" &// content

Parsing <article> tags

This is all a bit above my head, so all I really know is that the code below works until I change the commenting on the part marked with ********* at which point I get empty responses.

{-# LANGUAGE OverloadedStrings #-}

module HtmlParser where

import Network.HTTP.Conduit (simpleHttp)
import Prelude hiding (concat, putStrLn)
import Data.Text (concat)
import Data.Text.IO (putStrLn)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, content, element, fromDocument, ($//), (&//), (&/), (&|))

-- The URL we're going to search
url = "http://www.amsterdamfoodie.nl/2015/wine-beer-food-restaurants-troost/"

cursorFor :: String -> IO Cursor
cursorFor u = do
    page <- simpleHttp u
    return $ fromDocument $ parseLBS page

-- The data we're going to search for
-- (&/) :: Axis node -> (Cursor node -> [a]) -> Cursor node -> [a]
-- Combine two axes so that the second works on the children of the results of the first.

 -- cursor $// element "p" &/ element "a"
findNodes :: Cursor -> [Cursor]
-- *****************
-- findNodes = element "article" &/ element "a"
findNodes = element "p" &/ element "a"
-- ********************

-- Extract the data from each node in turn
-- attribute :: Name -> Cursor -> [Text]
-- extractData :: Cursor -> Text
extractData = concat . attribute "href"

-- Process the list of data elements
processData = mapM_ putStrLn

{-
($//) :: Cursor node -> (Cursor node -> [Text]) -> [Text]
Apply an axis to the descendants of a 'Cursor node'.

(&|) :: (Cursor node -> [Cursor]) -> (Cursor -> Text) -> Cursor node -> [Text]
Apply a function to the result of an axis.

Cursor Node -> [Text]
:t findNodes &| extractData
-}
parseAF :: IO ()
parseAF = do
     cursor <- cursorFor url
     processData $ cursor $// (findNodes &| extractData)

Not in scope: data constructor ‘ParseSettings’

If I modify parser-sample.hs to provide a different ParseSettings, e.g.

let ps = ParseSettings { psDecodeEntities = decodeXmlEntities, psRetainNamespaces = True }
people <- runResourceT $
    parseFile ps "people.xml" $$ force "people required" parsePeople

I get

Not in scope: data constructor ‘ParseSettings’

Is there something I'm doing wrong? or should ParseSettings be exported as ParseSettings(..)?
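A sketch of the usual way around a hidden constructor, assuming the field accessors themselves are exported (psDecodeEntities is used this way elsewhere on this page): start from def and override fields with record-update syntax. The name htmlEntitySettings is made up for illustration.

```haskell
module Main where

import Text.XML.Stream.Parse (ParseSettings, decodeHtmlEntities, def, psDecodeEntities)

-- Start from the default settings and override a single field.
-- Record update works even when the data constructor is not in scope.
htmlEntitySettings :: ParseSettings
htmlEntitySettings = def { psDecodeEntities = decodeHtmlEntities }

main :: IO ()
main = htmlEntitySettings `seq` putStrLn "settings built"
```

The resulting value can be passed wherever def was passed before, for example as the first argument of parseFile.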

checking attrs of tag

The most common stream parser for XML tags is currently Text.XML.Stream.Parse.tag, but it ignores attributes.
Having a checking function of type Name -> [(Text, Text)] -> Maybe a could be convenient.

Inconsistent LICENSE file and cabal license: declaration

LICENSE file is BSD2 while the cabal license: says MIT.

License in cabal file should be BSD2 according to the accompanying LICENSE file, not BSD3

xml-conduit too greedy

I'm trying to parse a streaming XML protocol (XMPP) with xml-conduit, but it seems either xml-conduit or something else in the pipeline is pulling data too eagerly. For example in

{-# LANGUAGE OverloadedStrings #-}
module Test where

import qualified Data.ByteString as BS
import Data.Conduit
import Data.Default
import qualified Data.Conduit.List as CL
import qualified Text.XML.Stream.Parse as XP

xml =
   [ "<?xml version='1.0'?>"
   , "<stream:stream xmlns='jabber:client' "
   , "xmlns:stream='http://etherx.jabber.org/streams' id='1365401808' "
   , "from='examplehost.org' version='1.0' xml:lang='en'>"
   , "<stream:features>"
   , "<starttls xmlns='urn:ietf:params:xml:ns:xmpp-tls'/>"
   , error "Booh!"
   ] :: [BS.ByteString]


main :: IO ()
main = (runResourceT $ CL.sourceList xml $= XP.parseBytes def $$ CL.peek )
         >>= print

Executing main will trigger the error. (Note that the "stream" tag will only be closed when the session is terminated.)
(Tested with xml-conduit 0.6.1 and conduit 0.3.0)

Alternate event parsing interface in Text.XML.Stream.Parse

So, I may be totally wrong here, but I think it may make sense to have
in Text.XML.Stream.Parse a version of tag (and the other analogous functions) that
does not try to absorb the balancing closed tag.

The example I have in mind is an RSS reader that, when updating some feed, only
wants to parse the XML stream until an item node with a certain pubDate
child has been read, and thus stops parsing even though some balancing
closing tags of the parents have not yet been reached.

I think it is not too difficult to create a modified version of tag for that
(actually I will try to do it over the weekend). However I don't know
xml-conduit well and I may be asking for something that is already possible, or
that doesn't make sense, sorry in advance if that is the case.

xml-conduit-1.4.0.2:Text.XML compile fails with blaze-markup-0.8.0.0

[6 of 7] Compiling Text.XML         ( Text/XML.hs, dist/dist-sandbox-b4493aa4/build/Text/XML.o )
Text/XML.hs:331:16:
    Couldn't match type ‘a0 -> BI.MarkupM a0’ with ‘BI.MarkupM ()’
    Expected type: B5.Html
      Actual type: a0 -> BI.MarkupM a0
    Probable cause: ‘BI.Leaf’ is applied to too few arguments
    In the expression: BI.Leaf tag open (fromString " />")
    In an equation for ‘leaf’:
        leaf = BI.Leaf tag open (fromString " />")

I guess an upper-bound would avoid this. :)

Data instance for Element / Node / Document

It would be useful if there was a Data instance available (even if it is just from deriving), since it would mean that SYB or Uniplate could be used out of the box to make quick XML transformations.

Example of stream parsing with choice and nesting.

The example on how to use Text.XML.Stream.Parse is very simple and easy to follow but I'd like to see a fuller example that tackles some more complexity in the XML such as choice and nesting. For instance, how would this be parsed?

<?xml version="1.0" encoding="utf-8"?>
<people>
  <person age="25">Michael</person>
  <anonymous></anonymous>
  <person age="2">Eliezer</person>
  <anonymous></anonymous>
  <family children="1">
      <adult age="27">Trevor</adult>
      <adult age="32">Donna</adult>
      <count named="2" anonymous="1"></count>
  </family>
  <anonymous></anonymous>
  <count named="2" anonymous="3" family="3"></count>
</people>

`takeAllTreesContent` consumes more than implied by the documentation

The documentation of takeAllTreesContent compares its behaviour to ignoreAllTreesContent, which should consume a single tag and whatever tree of tags/content is inside it. However, takeAllTreesContent in reality just consumes every single event left in the stream. This is due to every single case having a recursive call to takeAllTreesContent, for example:

     Just e@(EventBeginElement name _) -> do
       yield e
       takeAllTreesContent
       endEvent <- await
       case endEvent of
         Just e@(EventEndElement name') | name == name' -> yield e >> takeAllTreesContent
         _ -> lift $ monadThrow $ InvalidEndElement name endEvent

For it to work as implied by the docs, it should clearly terminate after consuming a single tag's tree. That means removing the above recursion and replacing uses of takeAllTreesContent with many takeAllTreesContent.

[xml-conduit] Suppress empty xmlns in child elements.

A RenderSettings option would be useful to tell renderers not to reset namespaces for empty elements if a parent has a namespace in its name.

Now:

<body>
  <request xmlns="my-ns">
    <spam xmlns="">salad</spam>
    <sausage xmlns="another-ns">
      <eggs xmlns="">chips</eggs>
    </sausage>
  </request>
</body>

With rsNoResetNS:

<body>
  <request xmlns="my-ns">
    <spam>salad</spam>
    <sausage xmlns="another-ns">
      <eggs>chips</eggs>
    </sausage>
  </request>
</body>

Rationale: some servers are very picky about element namespaces and can't find data which is under their nose. Pleasing them is a royal PITA and leads to massive code duplication or wrappers galore.

Alternative (read: nasty hack): Instead of using renderLBS, render to lazy Text, do a replace "xmlns=\"\"" "" on it and, finally, encode to bytestring.
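The hack described above could look roughly like the sketch below. The function renderWithoutEmptyNs is hypothetical; it matches the attribute together with its leading space so no stray space is left behind. It is a blunt textual replacement, so it would also alter any text content or attribute value that happens to contain the same literal string.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Text.XML (Document (..), Element (..), Node (..), Prologue (..), def, renderText)

-- Render to lazy Text, drop every empty default-namespace reset, then encode.
renderWithoutEmptyNs :: Document -> BL.ByteString
renderWithoutEmptyNs = TLE.encodeUtf8 . TL.replace " xmlns=\"\"" "" . renderText def

main :: IO ()
main = do
  let spam = Element "spam" mempty [NodeContent "salad"]
      root = Element "{my-ns}request" mempty [NodeElement spam]
      doc  = Document (Prologue [] Nothing []) root []
  BL.putStr (renderWithoutEmptyNs doc)
  putStrLn ""
```

Note that the result is no longer namespace-faithful: after the replacement, spam is read back as belonging to my-ns.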

xml-conduit fails to compile with HEAD

There is a bug in GHC which makes xml-conduit fail to compile with HEAD. #11276 is the issue tracking the problem.

A workaround is to add a type signature to dropWS which is a local definition in pad which is defined in the Text.XML.Parse module.

Streaming interface needs functions to return xmlns and prefixed tags

I just released a SVG parser for diagrams that uses xml-conduit:
https://github.com/diagrams/svg-diagrams

I need a way to read the namespace, although I don't need it yet, because the SVG spec is pretty stable and xmlns in SVG is only used to signal which version of SVG is used. But the parser complains about attributes that have not been consumed. Prefixed tags also cause problems: they do not seem to be recognized.

Misleading error message for empty documents

When trying to parse empty documents (which in turn is not intended, but still it has happened), a ContentAfterRoot error is raised.

Test case:
sourceList [] $$ sinkDoc def

Text.XML.Stream.Parse.choose seems to be awkward to use.

The choose method seems to be quite troublesome to use:

getTagContent :: Name -> Consumer Event IO (Maybe (Name, Text))
getTagContent tag = fmap (\content -> fmap (\c -> (tag, c)) content) $ tagIgnoreAttrs tag content

getTagsContent :: [Name] -> Consumer Event IO (Maybe (Name, Text))
getTagsContent tags = choose $ (fmap getTagContent tags)

The compile fails with:

    Couldn't match type ‘ConduitM Event o0 IO (Maybe (Name, Text))’
                   with ‘forall o1. ConduitM Event o1 IO (Maybe (Name, Text))’
    Expected type: Name -> Consumer Event IO (Maybe (Name, Text))
      Actual type: Name -> ConduitM Event o0 IO (Maybe (Name, Text))
    In the first argument of ‘fmap’, namely ‘getTagContent’
    In the second argument of ‘($)’, namely ‘(fmap getTagContent tags)’

See also (someone else having this issue): http://stackoverflow.com/questions/33606038/xml-conduit-combining-tagparsers

Asking in #haskell-beginners on Freenode there was a suggestion that the forall in the expansion of Consumer was at the root of this, which seems to make sense, but I've been unable to determine why. Removing the type definition has no material effect on the error either.
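A sketch that follows the suggestion from IRC: Consumer is a synonym that hides a forall over the output type, and a list of such values would need an impredicative type. Spelling the conduit type out with an ordinary type variable o avoids that. The code assumes recent versions (conduit 1.3 or later for ConduitT, xml-conduit 1.5 or later for matching), which differ from the API the report was written against.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad.Catch (MonadThrow)
import Data.Conduit (ConduitT, runConduit, (.|))
import Data.Text (Text)
import Data.XML.Types (Event, Name)
import Text.XML.Stream.Parse (choose, content, def, matching, parseLBS, tagIgnoreAttrs)

-- The output type `o` is a plain type variable here, not hidden inside `Consumer`.
getTagContent :: MonadThrow m => Name -> ConduitT Event o m (Maybe (Name, Text))
getTagContent tag = fmap (fmap ((,) tag)) (tagIgnoreAttrs (matching (== tag)) content)

-- All list elements now share the same `o`, so `choose` type-checks.
getTagsContent :: MonadThrow m => [Name] -> ConduitT Event o m (Maybe (Name, Text))
getTagsContent = choose . map getTagContent

main :: IO ()
main = do
  r <- runConduit (parseLBS def "<b>hi</b>" .| getTagsContent ["a", "b"])
  print r
```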

html-conduit errors with decoding umlauts

I am trying to read a file with an 'a' umlaut (\228) and it is failing with this error:

*** Exception: Error parsing XML file tstfile.txt: Data.Conduit.Text.decode: Error decoding stream of UTF-8 bytes. Error encountered in stream at offset 5. Encountered at byte sequence "\228\223ga"

There does not appear to be any parseText option unlike in xml-conduit.
Does it make sense to provide the user the ability to supply a decoder (eg Latin1) and/or a parseText option?
Or is there another way to get this to work.

Thanks,
Grant

Text.HTML.DOM handles XHTML poorly?

I was doing some parsing of HTML off the web and having some trouble with doing some traversals using xml-lens—I could pull titles off some pages, but not others. After many false starts, I came to realize that Text.HTML.DOM doesn't handle XHTML well; given that there's a lot of XHTML being served out there as text/html, that seems like a potentially significant issue.

The misbehavior is pretty clear if you print the document---you end up with a bogus "?xml" tag, and no DOCTYPE declaration. Remove those, and everything parses fine. Conversely, the XML parser has no problem. I think all this is demonstrated in the script below.

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.DOM
import Text.XML

main = do
  let htmlDoc = Text.HTML.DOM.parseLBS source
      xmlParse = Text.XML.parseLBS def source
  print $ renderText def htmlDoc
  case xmlParse of
    Left e -> print e
    Right xmlDoc -> print $ renderText def xmlDoc
  where
    source = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html>\n<head>\n<title>A short title</title>\n</head>\n<body>\n<p>This is some simple text</p>\n<a href=\"http://nowhere.com/\">This Is Nowhere</a>\n<img src=\"img.png\" />\n</body>\n</html>\n"

I'm happy to look into this, though I'm still relatively new to Haskell, and I'm not sure how quickly I'm likely to progress...

quasi-quoter is using "otherwise", breaks with import qualified Prelude

nothing important, but ...

import qualified Prelude
import Text.Hamlet.XML
foo = [xml|$if Prelude.True 
    foo|]

gives Not in scope: ‘otherwise’, because of https://github.com/snoyberg/xml/blob/master/xml-hamlet/Text/Hamlet/XML.hs#L72

The code should probably use Prelude.otherwise or _. Here's another breakage for plain otherwise:

import Text.Hamlet.XML
foo = let otherwise = "bar" in [xml|$if True 
    foo|]

xml-conduit fails to compile with ghc-8.rc2 (type errors)

Note: issue #71 (which links to https://ghc.haskell.org/trac/ghc/ticket/11276) has a similar title, but seems to be a different error (#71: type checker hangs; this one: type checking results in errors).

xml-conduit commit 498b81b ,
ghc --version 8.0.0.20160204 (rc2)

cabal install:
(conflict: xml-conduit => transformers>=0.2 && <0.5)
...

cabal install --allow-newer:

[4 of 7] Compiling Text.XML.Stream.Parse ( Text/XML/Stream/Parse.hs, dist/dist-sandbox-c639d5f8/build/Text/XML/Stream/Parse.o )

Text/XML/Stream/Parse.hs:204:28: error:
    • Couldn't match type ‘a1’ with ‘a0 -> [(TName, [Content])]’
      ‘a1’ is a rigid type variable bound by
        a type expected by the context:
          forall a1. a1 -> a1
        at Text/XML/Stream/Parse.hs:204:28
      Expected type: forall a. a -> a
        Actual type: (a0 -> [(TName, [Content])])
                     -> a0 -> [(TName, [Content])]

There are in fact a lot more errors shown, but I don't believe them, because of https://ghc.haskell.org/trac/ghc/ticket/11541 (fixed in HEAD, which I don't have, and present in rc2).

render element

In xml-enumerator, a simple element was rendered as

in xml-conduit-0.5.1 as

which is a little bit worse.

ParseSettings constructor is not exposed

Hello,

It seems that:
https://github.com/snoyberg/xml/blob/master/xml-conduit/Text/XML/Stream/Parse.hs
does not expose the ParseSettings constructor?

I wondered if this was on purpose. In particular it seems to prevent me from building a ParseSettings using decodeHtmlEntities. Am I missing something? Should I not want to do that in the first place?

html-conduit: missing lower bound on conduit-extra for decodeUtf8Lenient

The documentation for decodeUtf8Lenient from conduit-extra says it has been available since 1.1.1 (and I didn't verify this). The current version of Text.HTML.DOM uses it. However, there is no lower bound on conduit-extra in html-conduit.cabal. I ran into this issue when html-conduit failed to compile with conduit-extra-1.0.0.1.

Issues decoding leading escaped spaces when using Text.XML.Stream.Parse

content and maybeContent both skip over leading escaped whitespace.

A minimal example is:

<term name="foo">&#160;&#39;&#160;</term>

where the content parses as: "'\160", rather than "\160'\160".

Note that trailing characters are treated correctly.

Namespaces in attribute values

I am having trouble with an XML document that has namespaced attribute values. For example:

<Object xmlns:q1="example.org.nz" xsi:type="q1:ObjectType" Other="attributes" are="here" />

I would like to compare the value of the "type" attribute for equality with
"{example.org.nz}ObjectType", but can't do that. The "q1" is just noise.

See this MSDN documentation for where these things come from.

In the meantime I am going to ignore the prefixes in the values, but it is not ideal.
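A sketch of what resolving such values by hand might look like, assuming the caller has collected the prefix bindings separately (for example from the raw xmlns:* declarations before they are dropped). The helper resolveQName is hypothetical and not part of xml-conduit; it returns the same "{namespace}local" form used in the comparison above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Map.Strict as Map
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical helper: expand "prefix:local" to "{namespace}local" using
-- prefix bindings supplied by the caller. Values without a prefix, or with
-- an unknown prefix, give Nothing.
resolveQName :: Map.Map Text Text -> Text -> Maybe Text
resolveQName bindings qname =
  case T.breakOn ":" qname of
    (_, "")        -> Nothing
    (prefix, rest) -> do
      ns <- Map.lookup prefix bindings
      pure (T.concat ["{", ns, "}", T.drop 1 rest])

main :: IO ()
main = print (resolveQName (Map.fromList [("q1", "example.org.nz")]) "q1:ObjectType")
```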

XPath

Hi!

Is there any plan to add an XPath parser?
I thought I could give it a try for my own purposes, but I saw that anyName was added in 1.5.0. I wonder if this is heading towards an XPath parser; if that is the case, I should not bother with my own efforts.

Namespace prefixes are forgotten, breaking CURIEs and RDFa

Currently, xml-conduit forgets about namespace binding information besides element/attribute resolution. Effectively, namespace names are alpha-renamed. However, this information is needed for some purposes (including CURIE syntax and RDFa); alpha-renaming can break the semantics of documents.

Consider this example:

*Main Text.XML> parseText def "<html xmlns:wiki=\"http://en.wikipedia.org/wiki/\"><a href=\"[wiki:Foo]\" /></html>"
Right (Document {documentPrologue = Prologue {prologueBefore = [], prologueDoctype = Nothing, prologueAfter = []}, documentRoot = Element {elementName = Name {nameLocalName = "html", nameNamespace = Nothing, namePrefix = Nothing}, elementAttributes = fromList [], elementNodes = [NodeElement (Element {elementName = Name {nameLocalName = "a", nameNamespace = Nothing, namePrefix = Nothing}, elementAttributes = fromList [(Name {nameLocalName = "href", nameNamespace = Nothing, namePrefix = Nothing},"[wiki:Foo]")], elementNodes = []})]}, documentEpilogue = []})

It's not possible to determine where the CURIE "[wiki:Foo]" points to because the "wiki" namespace isn't directly used in any attribute or element names and so it's totally forgotten (and, even if it was, there would be the problem of silent shadowing).

The general solution is to have Element grow another field, recording which namespaces are bound and their prefixes, and to also honour that information when rendering XML. (As a side note, this would also subsume Name's prefix field, but I'm not sure that change would be worth making.)

Strange XML render problem

The following code:

let doc = Document (Prologue [] Nothing []) (Element "Root" [] []) []
renderText def doc

Works fine with ghc 7.04 64-bit OS X and gives me the expected output:

<?xml version="1.0" encoding="UTF-8"?>
<Root/>

On both 32-bit and 64-bit flavors of Linux, I get either "" for renderText or Empty for renderLBS.

This code works fine, however:

import qualified Data.Text as T
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL

main = do
        let doc = Document (Prologue [] Nothing []) (Element "Root" [] []) []
            source = renderBytes def doc
        x <- runResourceT $ (source $= (CT.decode CT.utf8) $$ CL.consume)

        putStrLn $ T.unpack $ T.concat x

The problem seems to stem from the laziness introduced in the function Text.XML.renderLBS (which utilizes unsafePerformIO to run the resource outside of the IO monad), but I'm not sure why the host OS should make a difference here. Any ideas?

Thanks!
