jgm / commonmark-hs

Pure Haskell commonmark parsing library, designed to be flexible and extensible

Makefile 0.82% Haskell 99.18%

commonmark-hs's Issues

stack overflow

This occurs on the benchmark for pathological <?, but it can be reproduced without this:

% python -c 'print("?" * 4000)' | commonmark +RTS -K90000 -xc
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
  MAIN.MAIN
*** Exception (reporting due to +RTS -xc): (THUNK), stack trace:
  MAIN.MAIN
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
  Data.Text.Internal.IO.readTextDevice
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
  Commonmark.Inlines.pSymbol,
  called from Commonmark.Inlines.defaultInlineParser,
  called from Commonmark.Inlines.pInline,
  called from Commonmark.Inlines.attrParser,
  called from Commonmark.Inlines.parseChunks,
  called from Commonmark.Blocks.Commonmark.Blocks.runInlineParser,
  called from Commonmark.Blocks.restOfLine,
  called from Commonmark.Blocks.block_starts,
  called from Commonmark.Blocks.blockContinues,
  called from Commonmark.Blocks.Commonmark.Blocks.processLines
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
  Commonmark.Inlines.pSymbol,
  called from Commonmark.Inlines.defaultInlineParser,
  called from Commonmark.Inlines.pInline,
  called from Commonmark.Inlines.attrParser,
  called from Commonmark.Inlines.parseChunks,
  called from Commonmark.Blocks.Commonmark.Blocks.runInlineParser,
  called from Commonmark.Blocks.restOfLine,
  called from Commonmark.Blocks.block_starts,
  called from Commonmark.Blocks.blockContinues,
  called from Commonmark.Blocks.Commonmark.Blocks.processLines
Commonmark.Blocks.Commonmark.Blocks.processLines (src/Commonmark/Blocks.hs:108:1-12)
Commonmark.Blocks.blockContinues (src/Commonmark/Blocks.hs:(223,8)-(238,64))
Commonmark.Blocks.block_starts (src/Commonmark/Blocks.hs:(151,28)-(163,13))
Commonmark.Blocks.restOfLine (src/Commonmark/Blocks.hs:170:34-43)
Commonmark.Blocks.Commonmark.Blocks.runInlineParser (src/Commonmark/Blocks.hs:385:1-15)
Commonmark.Inlines.parseChunks (src/Commonmark/Inlines.hs:(70,34)-(72,45))
Commonmark.Inlines.attrParser (src/Commonmark/Inlines.hs:(331,33)-(333,71))
Commonmark.Inlines.pInline (src/Commonmark/Inlines.hs:333:30-71)
Commonmark.Inlines.defaultInlineParser (src/Commonmark/Inlines.hs:(84,35)-(94,28))
Commonmark.Inlines.pSymbol (src/Commonmark/Inlines.hs:(431,43)-(435,28))
*** Exception (reporting due to +RTS -xc): (THUNK_1_0), stack trace:
  Commonmark.Inlines.pSymbol,
  called from Commonmark.Inlines.defaultInlineParser,
  called from Commonmark.Inlines.pInline,
  called from Commonmark.Inlines.attrParser,
  called from Commonmark.Inlines.parseChunks,
  called from Commonmark.Blocks.Commonmark.Blocks.runInlineParser,
  called from Commonmark.Blocks.restOfLine,
  called from Commonmark.Blocks.block_starts,
  called from Commonmark.Blocks.blockContinues,
  called from Commonmark.Blocks.Commonmark.Blocks.processLines
commonmark: Stack space overflow: current size 33568 bytes.
commonmark: Use `+RTS -Ksize -RTS' to increase it.

Implement `example_lists` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-example_lists
including back-references.

Implement in commonmark-core, commonmark-pandoc, commonmark-cli.

Documentation improvements

Example of adding an inline parser (say, abbreviations)
Example of modifying existing HTML output
Example of creating a new output format, e.g. roff man. In this case there are some complexities, because the effect of an 'emph' might be \f[I], but it might be \f[BI] (if we're already in a boldface section). We can handle that by making the IsInline type for roff be a newtype embedding State EmphState Builder or something like that. Then the effect can be state dependent. (Maybe we should consider changing all the implementations to be like this? It would remove some of the complexity for e.g. footnotes.)
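A minimal sketch of the state-dependent roff idea, using only base. It is not the library's API: the names Roff, FontCtx, withFont, emphR and strongR are hypothetical, and a plain function from the current font context stands in for the State EmphState Builder mentioned above.

```haskell
-- Hypothetical sketch; none of these names exist in commonmark-hs.
-- The current font context: are we inside bold and/or italic text?
data FontCtx = FontCtx { inBold :: Bool, inItalic :: Bool }

-- An inline is a function from the surrounding font context to roff text,
-- so the same constructor can render differently depending on its parents.
newtype Roff = Roff { runRoff :: FontCtx -> String }

instance Semigroup Roff where
  Roff f <> Roff g = Roff (\ctx -> f ctx <> g ctx)

instance Monoid Roff where
  mempty = Roff (const "")

-- The roff font escape that selects the font for a given context.
fontCode :: FontCtx -> String
fontCode (FontCtx b i) = case (b, i) of
  (True, True)   -> "\\f[BI]"
  (True, False)  -> "\\f[B]"
  (False, True)  -> "\\f[I]"
  (False, False) -> "\\f[R]"

-- Plain text ignores the context.
str :: String -> Roff
str s = Roff (const s)

-- Switch to the modified font, render the contents, then restore the
-- font that was active outside.
withFont :: (FontCtx -> FontCtx) -> Roff -> Roff
withFont change (Roff inner) = Roff $ \ctx ->
  let ctx' = change ctx
  in fontCode ctx' <> inner ctx' <> fontCode ctx

emphR, strongR :: Roff -> Roff
emphR   = withFont (\c -> c { inItalic = True })
strongR = withFont (\c -> c { inBold = True })

-- Render starting from the regular font.
render :: Roff -> String
render r = runRoff r (FontCtx False False)
```

With these definitions, render (strongR (str "a" <> emphR (str "b"))) emits \f[BI] for the nested emphasis and returns to \f[B] afterwards, which is the behaviour the issue asks for.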

Quadratic time.

Originally posted by @jgm in #40 (comment)

Footnote in AST lacks index information

The HasFootnote instance for Pandoc AST ignores the footnote identifiers and labels:

commonmark-hs/commonmark-pandoc/src/Commonmark/Pandoc.hs

Lines 215 to 219 in c9afe7c

    
instance (Rangeable (Cm a B.Inlines), Rangeable (Cm a B.Blocks))
      => HasFootnote (Cm a B.Inlines) (Cm a B.Blocks) where
  footnote _num _lab _x = mempty
  footnoteList _xs = mempty
  footnoteRef _num _lab contents = B.note <$> contents

And it looks like Pandoc is doing its own state management to compute the identifier, and render the references accordingly. Is storing these footnote identifiers/labels in the Pandoc AST explicitly out of scope?

Loop in highlighting with footnote extension

% commonmark --highlight -xfootnotes
[^1]

[^1]: a
commonmark: <<loop>>

New pathological parsing for fenced divs

python -c 'n=800; print("::: {#id}\n" * n + "a\n" + ":::\n" * n)'

This is a problem with the new fix for detecting fence closers.
At each close we need to iterate through all subordinate fenced divs in the stack.

Unable to set attributes on table

When I add an attribute like {.overflows .collapsing .compact .sortable} immediately above the table, in the AST I get a wrapping div (with these attributes), which wraps the table element. I'd expect these attributes to be applied to the <table> tag itself.

Implement `simple_tables` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-simple_tables

Implement in commonmark-core, commonmark-cli, commonmark-pandoc.

Note that we parse line by line, with no lookahead. The first line will be parsed as a paragraph line. See the way setext headers are currently handled.
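The line-by-line constraint can be illustrated with a toy parser, sketched here under loud assumptions: Block, step and parseBlocks are hypothetical names, and this is not how commonmark-hs itself is structured. Each line is first accepted as a paragraph line, and the open paragraph is reinterpreted only when a later line (here a setext underline) reveals what it was. A simple table header would need the same kind of retroactive conversion.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, foldl')

-- Hypothetical toy AST: only paragraphs and setext headings.
data Block = Para [String] | Heading Int String
  deriving (Eq, Show)

-- Parser state: the lines of the currently open paragraph (newest first),
-- plus the finished blocks (newest first).
type St = (Maybe [String], [Block])

-- Recognize a setext underline and return the heading level.
setextLevel :: String -> Maybe Int
setextLevel line = case dropWhileEnd isSpace line of
  cs@('=' : _) | all (== '=') cs -> Just 1
  cs@('-' : _) | all (== '-') cs -> Just 2
  _                              -> Nothing

-- Close the open paragraph, if any.
close :: Maybe [String] -> [Block] -> [Block]
close Nothing   done = done
close (Just ls) done = Para (reverse ls) : done

-- Consume one line without looking ahead.
step :: St -> String -> St
step (open, done) line
  | all isSpace line = (Nothing, close open done)
  -- An underline retroactively turns the open paragraph into a heading.
  | Just ls <- open, Just lvl <- setextLevel line =
      (Nothing, Heading lvl (unwords (reverse ls)) : done)
  -- Otherwise the line is (provisionally) a paragraph line.
  | otherwise = (Just (line : maybe [] id open), done)

parseBlocks :: String -> [Block]
parseBlocks input = reverse (close open done)
  where (open, done) = foldl' step (Nothing, []) (lines input)
```

The design point is that no line is ever re-read: the decision is deferred by keeping the paragraph open in the state until a later line settles it.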

Split emoji module from pandoc into separate package so it can be used here too

Hackage has an emoji package
https://hackage.haskell.org/package/emoji
but it only provides one-way lookup; it doesn't allow you to go from the emoticons to their string descriptions. It also has fewer emojis than pandoc (1400 vs 1757), and it uses String rather than Text. The module in pandoc is superior.
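For illustration, two-way lookup needs nothing more than one table consulted in both directions. This is a sketch with a two-entry hypothetical table and String keys for brevity; the function names are assumptions, and a real package would presumably use Text and a map.

```haskell
import Data.Tuple (swap)

-- Illustrative two-entry table of (alias, emoji); not the real data set.
emojiTable :: [(String, String)]
emojiTable = [("gem", "💎"), ("smile", "😄")]

-- Forward lookup: alias to emoji.
emojiFromName :: String -> Maybe String
emojiFromName name = lookup name emojiTable

-- Reverse lookup: emoji back to its alias, using the same table swapped.
nameFromEmoji :: String -> Maybe String
nameFromEmoji e = lookup e (map swap emojiTable)
```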

Make source positions a parser option?

Instead of handling it with typeclasses.
This would allow simpler typeclasses: Html, Pandoc.
It might also allow us to improve performance by avoiding the work of storing and computing ranges.

loop with --highlight option (source map)

% commonmark --highlight -xall benchmark.md
commonmark: <<loop>>

Extra block included in fenced_div

commonmark -xfenced_divs
::: {#id}
- a
- b
:::

Next para.
^D
<div id="id">
<ul>
<li>a
</li>
<li>b
</li>
</ul>
<p>Next para.</p>
</div>

The paragraph at the end should be outside the div.

Implement `line_blocks` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-line_blocks

Implement in commonmark-core, commonmark-pandoc, and commonmark-cli.

Implement `multiline_tables` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-multiline_tables

Implement in commonmark-core, commonmark-pandoc, commonmark-cli.

pathological case parsing inline CDATA tag

commonmark/cmark#299 affects commonmark-hs as well.

python -c 'print("a <![CDATA[" * 10000)' | time cmark > /dev/null
0.40user 0.00system 0:00.42elapsed 95%CPU (0avgtext+0avgdata 9720maxresident)k

python -c 'print("a <![CDATA[" * 20000)' | time cmark > /dev/null
1.60user 0.00system 0:01.62elapsed 98%CPU (0avgtext+0avgdata 17760maxresident)k

python -c 'print("a <![CDATA[" * 40000)' | time cmark > /dev/null
6.20user 0.02system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 34372maxresident)k

Bad definition list parsing

% commonmark -xdefinition_lists
## Blah

`-v`, `--version`

:   Print version.

`-h`, `--help`

:   Show usage message.
^D
<h2 id="blah">Blah</h2>
<dl>
<dt><code>-v</code>, <code>--version</code></dt>
<dd>
<p>Print version.</p>
</dd>
</dl>
<dl>
<dt><code>-h</code>, <code>--help</code></dt>
<dd>
<p>Show usage message.</p>
</dd>
</dl>

This produces two dls rather than one. But if you remove the heading, you get one as intended.

New system for source map

The current system (defining a new typeclass instance for the constructors) doesn't give fine-grained enough information (e.g. it doesn't distinguish code span delimiters from the code). And it is awkwardly designed, so that for example it's easy to write instances that produce loops.

Better to put a field in state that keeps a source map, and maybe another field that controls whether to populate it (for efficiency this can be turned off). Then the individual constructors can be designed to insert whatever fine-grained mapping data would be useful.

Alternatively, instead of making these parsers work for any Monad, limit to the HasSourceMap typeclass and define dummy default instances for common monads.
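The state-field proposal could look roughly like the following sketch. All names (ParserState, trackSourceMap, addEntry, recordCodeSpan) are hypothetical, and an association list stands in for whatever map type the real implementation would choose.

```haskell
-- Hypothetical sketch of a source map kept in parser state.
data SourceRange = SourceRange
  { startPos :: (Int, Int)  -- (line, column)
  , endPos   :: (Int, Int)
  } deriving (Eq, Show)

data ParserState = ParserState
  { trackSourceMap :: Bool                     -- turn off for efficiency
  , sourceMap      :: [(SourceRange, String)]  -- newest entry first
  } deriving (Eq, Show)

-- Record an entry only when tracking is enabled.
addEntry :: SourceRange -> String -> ParserState -> ParserState
addEntry range name st
  | trackSourceMap st = st { sourceMap = (range, name) : sourceMap st }
  | otherwise         = st

-- A code span with single-backtick delimiters starting at the given column:
-- the delimiters and the code itself get separate, fine-grained entries.
recordCodeSpan :: Int -> Int -> Int -> ParserState -> ParserState
recordCodeSpan line col codeLen =
    addEntry (rng (col + codeLen + 1) (col + codeLen + 1)) "codeDelimiter"
  . addEntry (rng (col + 1) (col + codeLen)) "code"
  . addEntry (rng col col) "codeDelimiter"
  where
    rng a b = SourceRange (line, a) (line, b)
```

Because the constructors never call themselves through a typeclass here, this style also sidesteps the accidental loops described above.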

Unable to link images

The parser doesn't create correct pandoc AST nodes for this syntax:

[![asciicast](https://asciinema.org/a/329911.svg)](https://asciinema.org/a/329911)

Here's how one would expect it to render:

Implement `task_lists` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-task_lists

Implement in commonmark-core, commonmark-pandoc, and commonmark-cli.

Fix Windows test failures

See the test suite under Actions.
Needs investigating whether there are issues in the library itself, or just in the test suite.

Improve performance

See notes on performance in the README.md.

rawHtmlSpec gets in the way of parsing special links

In neuron we support special links of the form <a34sfef4> (as well as <z:zettels?tag=foo>), which the app has to process and replace with some custom HTML.

I wrote a syntax spec defining the parser in syntaxInlineParsers, but when actually using it it was only getting applied for <1hello> but not <hello> (which gets rendered as raw HTML).

rawHtmlSpec in defaultBlockSpecs is the cause of this behaviour. What is the recommended way to turn this behaviour off, so that applications have the flexibility to parse syntax with angle brackets?

Implement `implicit_figures` extension

As in pandoc: see https://pandoc.org/MANUAL.html#extension-implicit_figures

Implement in commonmark-core, commonmark-cli, commonmark-pandoc.

Consider Megaparsec

A quick question. Perhaps this repo is a good place to try using Megaparsec instead of Parsec?

Megaparsec has some niceties compared to Parsec, which include better error reporting and Unicode support, and it's also reportedly faster. The only downside I can think of (apart from switching parsers being potentially a lot of work) is that Megaparsec is not as stable in terms of API and features -- but that comes with the territory with an actively-maintained package.

Since this repo is not (yet) directly tied to Pandoc, I thought that this might be a good place for such experiments.

Just a thought, no pressure.

Test issue for GitHub heading identifiers

Heading with emoji 💎

Other heading with emoji 💎

Release to hackage?

Hello! Thanks for commonmark!

The newest version of https://github.com/srid/neuron uses the commonmark library (quite successfully, I might add, speaking mainly as a user). To bring the new features to all users (and e.g. nixpkgs) we would like to make a new neuron release. Sadly, releasing to hackage with a dependency that is not on hackage is a bad idea.
So for this situation it would be marvelous if we could make a first commonmark release to hackage soon.

@jgm Would you be willing to do that?

If you don't want to maintain commonmark on hackage, @srid has offered to do the hackage maintenance, if you agree to that.

Bug in parsing some HTML tags immediately followed by eof

% echo -n "<\!A>" | commonmark
"stdin" (line 1, column 5):
unexpected end of input
% echo -n "<\!-- hi -->" | commonmark
"stdin" (line 1, column 12):
unexpected end of input
% echo -n "<?" | commonmark
"stdin" (line 1, column 3):
unexpected end of input
% echo -n "<? hi ?>" | commonmark
"stdin" (line 1, column 9):
unexpected end of input

Incorrect recognition of indented code blocks when pipe_tables enabled

% commonmark -xpipe_tables
    iconv -t utf-8 input.txt | pandoc | iconv -f utf-8
<p>iconv -t utf-8 input.txt | pandoc | iconv -f utf-8</p>

highlighting loop with fenced_divs extension

% commonmark --highlight -xfenced_divs
::: {.blue}
hi
:::
commonmark: <<loop>>

Implicit heading references break when used with `smart`

% commonmark -ximplicit_heading_references -xauto_identifiers -xsmart
# hi

See [hi].

# Jo's heading

See [Jo's heading]
^D
<h1 id="hi">hi</h1>
<p>See <a href="#">hi</a>.</p>
<h1 id="jos-heading">Jo’s heading</h1>
<p>See [Jo’s heading]</p>

Implement `pandoc_title_block` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-pandoc_title_block

Examples to show off the library

Markdown-aware spell checker: emit misspelled words + source locations.
In-place transformations: do a specific transformation on an existing markdown document (e.g., capitalizing all heading titles, or changing indented code blocks to fenced style) without changing anything else about the document.

No source map entries for reference link definitions

% commonmark --highlight
[hi]

[hi]: url

<!DOCTYPE html>
...
<pre><span class="paragraph" title="paragraph"><span class="link" title="link">[<span class="str">hi</span>]</span>

[hi]: url

How does `Commonmark.Pandoc` differ from `Text.Pandoc.Readers.CommonMark`?

More of a question than an issue, but yeah it seems like Pandoc already knows how to convert commonmark into a Pandoc AST?

source map/highlight issue with link in table cell

% commonmark  -xall --highlight
| Sample                   |showdown  |commonmark|marked    |markdown-it|
|--------------------------|---------:|---------:|---------:|----------:|
|[README.md]               |         1|       3.6|       3.1|        3.9|

[README.md]: url

yields (snipping relevant part)

| <span class="str">Sample</span>                   |<span class="str">showdown</span>  |<span class="str">commonmark</span>|<span class="str">marked</span>    |<span class="str">markdown</span><span class="str">-</span><span class="str">it</span>|
|--------------------------|---------:|---------:|---------:|----------:|
|<span class="link" title="link">[<span class="str">README</span><span class="str">.</span><span class="str">md</span>]               |         <span class="str">1</span>|       <span class="str">3</span><span class="str">.</span><span class="str">6</span>|       <span class="str">3</span><span class="str">.</span><span class="str">1</span>|        <span class="str">3</span><span class="str">.</span><span class="str">9</span>|

in which the link isn't closed in the right place.

Nonlinear parsing time for inline link openers without closers

See benchmarks

benchmarking pathological/inline link openers without closers/commonmark/800
time                 12.22 ms   (12.01 ms .. 12.43 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 12.28 ms   (12.21 ms .. 12.35 ms)
std dev              144.8 μs   (107.0 μs .. 185.6 μs)

benchmarking pathological/inline link openers without closers/commonmark/1200
time                 26.03 ms   (25.76 ms .. 26.32 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 26.21 ms   (26.05 ms .. 26.49 ms)
std dev              350.4 μs   (183.2 μs .. 539.2 μs)

benchmarking pathological/inline link openers without closers/commonmark/1600
time                 47.06 ms   (46.08 ms .. 47.64 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 47.86 ms   (47.42 ms .. 48.95 ms)
std dev              995.8 μs   (346.9 μs .. 1.569 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking pathological/inline link openers without closers/commonmark/2000
time                 73.23 ms   (71.16 ms .. 75.25 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 75.10 ms   (74.27 ms .. 76.17 ms)
std dev              1.334 ms   (813.3 μs .. 1.975 ms)
variance introduced by outliers: 14% (moderately inflated)
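To put a number on "nonlinear", the growth exponent between consecutive runs can be estimated as log(t2/t1) / log(n2/n1). The sketch below uses the reported `time` figures from the benchmarks above; the function names are illustrative.

```haskell
-- Estimate k in t ~ n^k from two (input size, time in ms) measurements.
growthExponent :: (Double, Double) -> (Double, Double) -> Double
growthExponent (n1, t1) (n2, t2) = log (t2 / t1) / log (n2 / n1)

-- (input size, reported time in ms), copied from the benchmark output.
samples :: [(Double, Double)]
samples = [(800, 12.22), (1200, 26.03), (1600, 47.06), (2000, 73.23)]

-- Exponents for each pair of consecutive measurements.
exponents :: [Double]
exponents = zipWith growthExponent samples (drop 1 samples)
```

Each estimate comes out close to 2, which is consistent with roughly quadratic growth on this input.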

Implement `yaml_metadata_block` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-yaml_metadata_block

We should avoid a yaml dependency. HsYAML might be an acceptable dependency, but an alternative could be to parse the whole block as a literal string and put it in a data-yaml attribute of a Div, so it can be extracted and processed in a second pass.
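The dependency-free alternative could start from something like this sketch, which only splits off the block as literal text and leaves YAML parsing to a second pass. The function names are hypothetical. The delimiters follow the pandoc manual: an opening line of three hyphens, closed by three hyphens or three dots. For brevity the sketch handles only a block at the very top of the input.

```haskell
-- Split a leading metadata block from the rest of the document.
-- Returns (literal YAML text, remaining body), or Nothing if the input
-- does not start with a properly closed block.
splitYamlBlock :: String -> Maybe (String, String)
splitYamlBlock input = case lines input of
  ("---" : rest) -> case break isCloser rest of
    (yaml, _closer : body) -> Just (unlines yaml, unlines body)
    _                      -> Nothing  -- opener without a closing line
  _ -> Nothing
  where
    isCloser l = l == "---" || l == "..."

-- The attribute a wrapping Div could carry; the YAML stays unparsed.
yamlAttribute :: String -> (String, String)
yamlAttribute yaml = ("data-yaml", yaml)
```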

Not an issue ;-)

Hi,
correct tracking of source positions is a great undertaking, especially when it comes to integration with proofreading software. Much lesser attempts for LaTeX and Markdown can be seen in
TeXtidote and Tex2txt.

Many thanks already for pandoc, and good progress with this project!
Matthias

Implement `grid_tables` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-grid_tables

Implement in commonmark-core, commonmark-pandoc, commonmark-cli.

Understanding your intent

Would you be willing to write a short para about your intent here? Is this package meant to ultimately replace the Pandoc markdown parser? Or are you using this as a place to trial extensions that might land in the CommonMark spec? Or... ?

I've been reading your code and trying to get my head around whether I should be attempting to contribute here, or to mmark, or somehow using cmark-gfm, or... Mostly I need one of the table extensions; my target is LaTeX fragments but I'm mostly interested in seeing what comes of a newer AST as compared to using pandoc-types' "native" AST.

Warm regards to Pandoc's author.

AfC

Parsing `*aa bb* ` times N has strange performance characteristics

E.g. with N = 6000, it is much faster than with N = 5000.
With N = 10000, it takes over 20 seconds.
With N = 12000, 2 seconds.

time python -c 'print("*aa bb* "*10000)' | commonmark +RTS -t

Bad sourcepos

% commonmark --sourcepos
[hi]   ok

[hi]: url
<p data-sourcepos="stdin@1:1-1:10"><a data-sourcepos="stdin@1:1-@1:1" href="url"><span data-sourcepos="stdin@1:2-1:4">hi</span></a>   <span data-sourcepos="stdin@1:8-1:10">ok</span></p>

We could also use some more extensive test cases for source positions.

loop in highlighting with definition_list extension

instance (HasDefinitionList il bl, Semigroup bl, Semigroup il)
        => HasDefinitionList (WithSourceMap il) (WithSourceMap bl) where
  definitionList spacing items = definitionList spacing items
                                   <* addName "definitionList"

% commonmark --highlight -xdefinition_lists
hi
:    there

commonmark: <<loop>>
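One plausible reading of the instance quoted above is that the right-hand side calls definitionList at the same wrapper type, so the method invokes itself. The toy below sketches the pitfall and a non-looping shape. IsList, Html and Named are hypothetical stand-ins (Named plays the role of WithSourceMap), not library code.

```haskell
-- A toy constructor class, standing in for HasDefinitionList.
class IsList a where
  bulletList :: [a] -> a

newtype Html = Html String
  deriving (Eq, Show)

instance IsList Html where
  bulletList items =
    Html ("<ul>" ++ concat [ "<li>" ++ s ++ "</li>" | Html s <- items ] ++ "</ul>")

-- A wrapper that records constructor names alongside the wrapped value.
data Named a = Named [String] a
  deriving (Eq, Show)

-- The looping version would read:
--   bulletList items = bulletList items   -- resolves to this same instance
-- The fix is to unwrap the arguments so the inner call is at type `a`.
instance IsList a => IsList (Named a) where
  bulletList items =
    Named (concat [ names | Named names _ <- items ] ++ ["bulletList"])
          (bulletList [ x | Named _ x <- items ])
```

The type checker accepts both versions, which is why this mistake is easy to make: only the unwrapping forces the recursive call onto the underlying instance.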

Implement `citations` extension

As in pandoc: https://pandoc.org/MANUAL.html#extension-citations

Implement in commonmark-core, commonmark-cli, commonmark-pandoc.

This is tricky: we need to make sure this comes first in bracketed specs to avoid
interpretation as a span.

Parsing * * * * * * … takes quadratic time

$ python3 -c 'print(end="* "*1000)' | time commonmark > /dev/null
0.46user 0.16system 0:00.35elapsed 177%CPU (0avgtext+0avgdata 52080maxresident)k
0inputs+0outputs (0major+15662minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*2000)' | time commonmark > /dev/null
1.45user 0.63system 0:01.07elapsed 193%CPU (0avgtext+0avgdata 52120maxresident)k
0inputs+0outputs (0major+16320minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*4000)' | time commonmark > /dev/null
6.27user 2.81system 0:04.52elapsed 201%CPU (0avgtext+0avgdata 52292maxresident)k
0inputs+0outputs (0major+18973minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*8000)' | time commonmark > /dev/null
35.88user 15.90system 0:25.93elapsed 199%CPU (0avgtext+0avgdata 51780maxresident)k
0inputs+0outputs (0major+25323minor)pagefaults 0swaps

One of the extensions seems to use FFI

I saw this in GHCJS for a particular markdown content. It doesn't happen with bare commonmark parser, but only when the extensions are enabled. I'll get to debugging and isolating the problem one of the following days, but it would be great if someone already knew what it could be off the top of their head ...

![asciicast](https://asciinema.org/a/329911.svg){#ident .centered .big}

generates:

<p>
  <img class="" id="" src="https://asciinema.org/a/329911.svg" title="">
  .centered .big}
</p>
