Using kramdown to mark up a digitized book, I've been converting the OCR-generated ast

The original goal was twofold: to create a faster pure Ruby Ma

The kramdown parser does its thing in multiple steps: First th

Incorrect smart_quote handling around underscore markup about kramdown HOT 10 CLOSED

gettalong commented on July 26, 2024

Incorrect smart_quote handling around underscore markup

from kramdown.

Comments (10)

gettalong commented on July 26, 2024

The handling of quotes is correct as per the kramdown spec. Other libraries like rubypants do the same as kramdown when given your test cases.

I recommend putting the underscores outside of the quotes.

from kramdown.

Vaguery commented on July 26, 2024

So when a quotation starts with an italic word, the italics should overlap the front quote? Not how typesetting works.

Just curious: why does it work correctly for single asterisks, but not for underscores?

"_Single underscores_ do not work."

"*Single asterisks* work as they should."

'__Double underscores__ do not work.'

'**Double underscores** do not work.'

from kramdown.

gettalong commented on July 26, 2024

That's a good question! It should definitely work the same. rubypants shows the same, incorrect behavior...

Will have a look at it! And sorry for my dismissive behavior, should have investigated a bit more pedantically!

from kramdown.

Vaguery commented on July 26, 2024

It seems like a little edge case unless you've actually had first-hand experience trying to write a book in it, I know :)

I've already started writing a new parser, with specs and discrete acceptance tests for specified features.

Do you have any particular technical goals for kramdown, such as speed or ability to handle very large input files, or is accuracy of all the edge cases your main goal? Would it be a problem if I were to write a comprehensive suite of explicit unit and integration specs, rather than the large suite of fixtures you use now?

from kramdown.

gettalong commented on July 26, 2024

The original goal was twofold:

to create a faster pure Ruby Markdown parser than Maruku and
to provide a syntax spec for the kramdown specific input so that the edge cases are all documented

Since the internal structure of the parsers and the converters doesn't really matter, I thought it would be better to provide test inputs and their respective outputs to ensure that changing the code doesn't invalidate the spec. Therefore the large suite of test cases.

You can use whatever you like and what seems best for you for testing your parser :)

from kramdown.

Vaguery commented on July 26, 2024

I'm just taking a little while to tease apart all the integration tests you have here into individual specs. A few "small changes" to the current parser broke a surprising variety of them in a lot of uninterpretable ways, so single-responsibility tests will help me step carefully.

Could you say a few words about the overall architecture? My problem visualizing the whole is that it feels as though each parser is applied in the order they appear in the @block_parsers (and then @span_parsers) list. Is that the case? They're being serially applied in that order to each component, recursively over each branch found?

So as a side effect of this serial ordering, I don't need to worry (for architectural reasons) about accidentally applying a smart_quotes parser to an ERB block? Because the "only things left" to parse will be "extra raw" text that should have the quotes curled, dashes replaced and so forth?

from kramdown.

gettalong commented on July 26, 2024

The kramdown parser does its thing in multiple steps:

First the document is parsed for block level elements, the big building blocks.
Then the created document tree is updated by parsing raw_text elements with the span level parser and applying the IALs.
The last step is replacing abbreviations with the correct kramdown Element object.

The block and span parsers are applied in a serial kind of way, in the order they appear in @block_parser and @span_parsers.

Each parser defines a regexp that is used to identify the block or span level element. Since some parsers use similar regexps (e.g. setext header and paragraphs, both need a line of text) the order of the parsers is important. Also, by arranging the parsers that are not order-dependent in a most-commonly-used way it is possible to squeeze some more speed out of the kramdown parser.

In case of the block parsers, they are just tested on the current position of the document in the order they appear and the first one wins.

In case of the span parsers, the start regexps for all spans parsers are stitched together into one huge regexp. A match for this regexp is searched for and then the span parsers are tested in the order they appear in the @span_parsers array. This is done to avoid scanning each and every character.

Regarding an ERB block parser: This depends on how you create the element. If you create a custom :erb element with the ERB text in the #value accessor, you have nothing to worry about. Text will only be parsed by the span level parser if you create an element of type :raw_text with the ERB text!

from kramdown.

gettalong commented on July 26, 2024

So, I have found a way to make your examples work without breaking my test suite :) So your presented use cases will work in the next release.

from kramdown.

Vaguery commented on July 26, 2024

Thank you! I worked on it for quite a while and was elbow-deep into a full parser rewrite before I got stuck. So I'm glad you found an incremental solution.

Do you have the sense that, especially as Markdown flavors get more popular, the ability to extend and revise the pattern-matching is starting to trump the speed of the implementations?

from kramdown.

gettalong commented on July 26, 2024

Not really, no. There are also some fast implementations that are not regexp-based like the original Perl Markdown from Gruber but which are based on a PEG grammar that can more easily be modified.

I also have to admit that I don't immediately recognize what all the code of the kramdown parser does :) It takes me some time to understand some really complex parts (should have add more comments).

from kramdown.

Incorrect smart_quote handling around underscore markup about kramdown HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent