alschemist's People

Contributors

ashinn

alschemist's Issues

`html->sxml` with escaped quotes breaks text into multiple nodes

This is admittedly something I noticed in CHICKEN Scheme, not Chibi, but I was told that the CHICKEN bindings for html-parser are just a thin wrapper around the same code for Chibi.

Anyway, html->sxml behaves oddly when the text contains escaped quotes. A short example should illustrate the problem I'm encountering:

    (html->sxml "<p>foo&apos;bar&quot;baz</p>")
    ;=> (*TOP* (p "foo" "'" "bar" "\"" "baz"))

As a counter-example, I'll use the ssax egg:

    (call-with-input-string "<p>foo&apos;bar&quot;baz</p>"
      (cut ssax:xml->sxml <> '()))
    ;=> (*TOP* (p "foo'bar\"baz"))

Fundamentally, the question is whether this should be one text node or several. I would argue that in this particular case it should be a single node. I have been using html-parser to scrape some web pages, and this behavior is extremely unexpected, especially if one uses txpath / sxpath on the final result, as `//p/text()` queries will not necessarily behave as expected. To get the full contents of the text, you would have to evaluate `(apply string-append ((txpath "//p/text()") sxml))`.
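As an alternative to concatenating at every query site, the tree could be normalized once before running txpath / sxpath over it. The following is only a minimal sketch: `merge-adjacent-text` is a hypothetical helper, not part of html-parser, ssax, or sxml-tools, and it assumes the SXML tree consists of proper lists.

```scheme
;; Hypothetical helper (not provided by html-parser or sxml-tools):
;; walk an SXML tree and join runs of adjacent string children into one string.
;; Assumes every list in the tree is a proper list.
(define (merge-adjacent-text node)
  (if (pair? node)
      (let loop ((items node) (acc '()))
        (cond
          ((null? items) (reverse acc))
          ;; The current item and the most recently collected item are both
          ;; strings, so replace the collected one with their concatenation.
          ((and (string? (car items)) (pair? acc) (string? (car acc)))
           (loop (cdr items)
                 (cons (string-append (car acc) (car items)) (cdr acc))))
          ;; Otherwise recurse into the item (a no-op for symbols and strings).
          (else
           (loop (cdr items)
                 (cons (merge-adjacent-text (car items)) acc)))))
      node))

(merge-adjacent-text '(*TOP* (p "foo" "'" "bar" "\"" "baz")))
;=> (*TOP* (p "foo'bar\"baz"))
```

Strings separated by a child element stay separate, so `(p "a" (b "x") "c" "d")` becomes `(p "a" (b "x") "cd")`. This only works around the symptom; it does not answer whether the parser itself should emit a single node.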

Is there a rationale for this, or is it a limitation of the parser? I know that tags may also contain sub-tags in HTML, but I'm not sure a new node should be created when a tag's contents are not HTML tags themselves.

See copy of this ticket on the CHICKEN bug tracker.
