jgm / commonmark-hs
Pure Haskell commonmark parsing library, designed to be flexible and extensible
This occurs on the benchmark for pathological <?, but it can be reproduced without it:
% python -c 'print("?" * 4000)' | commonmark +RTS -K90000 -xc
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
MAIN.MAIN
*** Exception (reporting due to +RTS -xc): (THUNK), stack trace:
MAIN.MAIN
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
Data.Text.Internal.IO.readTextDevice
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
Commonmark.Inlines.pSymbol,
called from Commonmark.Inlines.defaultInlineParser,
called from Commonmark.Inlines.pInline,
called from Commonmark.Inlines.attrParser,
called from Commonmark.Inlines.parseChunks,
called from Commonmark.Blocks.Commonmark.Blocks.runInlineParser,
called from Commonmark.Blocks.restOfLine,
called from Commonmark.Blocks.block_starts,
called from Commonmark.Blocks.blockContinues,
called from Commonmark.Blocks.Commonmark.Blocks.processLines
*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
Commonmark.Inlines.pSymbol,
called from Commonmark.Inlines.defaultInlineParser,
called from Commonmark.Inlines.pInline,
called from Commonmark.Inlines.attrParser,
called from Commonmark.Inlines.parseChunks,
called from Commonmark.Blocks.Commonmark.Blocks.runInlineParser,
called from Commonmark.Blocks.restOfLine,
called from Commonmark.Blocks.block_starts,
called from Commonmark.Blocks.blockContinues,
called from Commonmark.Blocks.Commonmark.Blocks.processLines
Commonmark.Blocks.Commonmark.Blocks.processLines (src/Commonmark/Blocks.hs:108:1-12)
Commonmark.Blocks.blockContinues (src/Commonmark/Blocks.hs:(223,8)-(238,64))
Commonmark.Blocks.block_starts (src/Commonmark/Blocks.hs:(151,28)-(163,13))
Commonmark.Blocks.restOfLine (src/Commonmark/Blocks.hs:170:34-43)
Commonmark.Blocks.Commonmark.Blocks.runInlineParser (src/Commonmark/Blocks.hs:385:1-15)
Commonmark.Inlines.parseChunks (src/Commonmark/Inlines.hs:(70,34)-(72,45))
Commonmark.Inlines.attrParser (src/Commonmark/Inlines.hs:(331,33)-(333,71))
Commonmark.Inlines.pInline (src/Commonmark/Inlines.hs:333:30-71)
Commonmark.Inlines.defaultInlineParser (src/Commonmark/Inlines.hs:(84,35)-(94,28))
Commonmark.Inlines.pSymbol (src/Commonmark/Inlines.hs:(431,43)-(435,28))
*** Exception (reporting due to +RTS -xc): (THUNK_1_0), stack trace:
Commonmark.Inlines.pSymbol,
called from Commonmark.Inlines.defaultInlineParser,
called from Commonmark.Inlines.pInline,
called from Commonmark.Inlines.attrParser,
called from Commonmark.Inlines.parseChunks,
called from Commonmark.Blocks.Commonmark.Blocks.runInlineParser,
called from Commonmark.Blocks.restOfLine,
called from Commonmark.Blocks.block_starts,
called from Commonmark.Blocks.blockContinues,
called from Commonmark.Blocks.Commonmark.Blocks.processLines
commonmark: Stack space overflow: current size 33568 bytes.
commonmark: Use `+RTS -Ksize -RTS' to increase it.
As in pandoc: https://pandoc.org/MANUAL.html#extension-example_lists
including back-references.
Implement in commonmark-core, commonmark-pandoc, commonmark-cli.
\f[I], but it might be \f[BI] (if we're already in a boldface section). We can handle that by making the IsInline type for roff be a newtype embedding State EmphState Builder or something like that. Then the effect can be state-dependent. (Maybe we should consider changing all the implementations to be like this? It would remove some of the complexity for e.g. footnotes.)
Quadratic time.
Originally posted by @jgm in #40 (comment)
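Returning to the roff emphasis idea above: a minimal sketch of state-dependent rendering, under stated assumptions. The names EmphState, Roff, fontEscape, withState, emph and strong are hypothetical and not part of commonmark-hs. For brevity the sketch passes the enclosing state as a plain function argument and builds a String, where the text above suggests State EmphState Builder.

```haskell
module Main where

-- Hypothetical type: which font attributes are active in the enclosing context.
data EmphState = EmphState { inBold :: Bool, inItalic :: Bool }

-- A rendered fragment is a function of the enclosing emphasis state.
newtype Roff = Roff { runRoff :: EmphState -> String }

instance Semigroup Roff where
  Roff f <> Roff g = Roff (\s -> f s ++ g s)

instance Monoid Roff where
  mempty = Roff (const "")

-- Plain text ignores the state.
str :: String -> Roff
str = Roff . const

-- The font escape that matches a given state.
fontEscape :: EmphState -> String
fontEscape (EmphState b i) = case (b, i) of
  (True, True)   -> "\\f[BI]"
  (True, False)  -> "\\f[B]"
  (False, True)  -> "\\f[I]"
  (False, False) -> "\\f[R]"

-- Switch to the changed state for the inner fragment, then restore the outer font.
withState :: (EmphState -> EmphState) -> Roff -> Roff
withState change (Roff inner) = Roff $ \outer ->
  let new = change outer
  in fontEscape new ++ inner new ++ fontEscape outer

emph, strong :: Roff -> Roff
emph   = withState (\s -> s { inItalic = True })
strong = withState (\s -> s { inBold = True })

main :: IO ()
main = putStrLn $
  runRoff (strong (str "a " <> emph (str "b") <> str " c")) (EmphState False False)
```

With this shape, emphasis nested inside strong emits \f[BI] and then falls back to \f[B] instead of resetting to the roman font.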
The HasFootnote instance for Pandoc AST ignores the footnote identifiers and labels:
commonmark-hs/commonmark-pandoc/src/Commonmark/Pandoc.hs
Lines 215 to 219 in c9afe7c
And it looks like Pandoc is doing its own state management to compute the identifier, and render the references accordingly. Is storing these footnote identifiers/labels in the Pandoc AST explicitly out of scope?
% commonmark --highlight -xfootnotes
[^1]
[^1]: a
commonmark: <<loop>>
python -c 'n=800; print("::: {#id}\n" * n + "a\n" + ":::\n" * n)'
This is a problem with the new fix for detecting fence closers.
At each close we need to iterate through all subordinate fenced divs in the stack.
When I add an attribute like {.overflows .collapsing .compact .sortable} immediately above the table, in the AST I get a wrapping div (with these attributes), which wraps the table element. I'd expect these attributes to be applied to the <table> tag itself.
As in pandoc: https://pandoc.org/MANUAL.html#extension-simple_tables
Implement in commonmark-core, commonmark-cli, commonmark-pandoc.
Note that we parse line by line, with no lookahead. The first line will be parsed as a paragraph line. See the way setext headers are currently handled.
Hackage has an emoji package
https://hackage.haskell.org/package/emoji
but it only provides one-way lookup; it doesn't allow you to go from the emoticons to their string descriptions. It also has fewer emojis than pandoc (1400 vs 1757), and it uses String rather than Text. The module in pandoc is superior.
Instead of handling it with typeclasses.
This would allow simpler typeclasses: Html, Pandoc.
It might also allow us to improve performance by avoiding the work of storing and computing ranges.
% commonmark --highlight -xall benchmark.md
commonmark: <<loop>>
commonmark -xfenced_divs
::: {#id}
- a
- b
:::
Next para.
^D
<div id="id">
<ul>
<li>a
</li>
<li>b
</li>
</ul>
<p>Next para.</p>
</div>
The paragraph at the end should be outside the div.
As in pandoc: https://pandoc.org/MANUAL.html#extension-line_blocks
Implement in commonmark-core, commonmark-pandoc, and commonmark-cli.
As in pandoc: https://pandoc.org/MANUAL.html#extension-multiline_tables
Implement in commonmark-core, commonmark-pandoc, commonmark-cli.
commonmark/cmark#299 affects commonmark-hs as well.
python -c 'print("a <![CDATA[" * 10000)' | time cmark > /dev/null
0.40user 0.00system 0:00.42elapsed 95%CPU (0avgtext+0avgdata 9720maxresident)k
python -c 'print("a <![CDATA[" * 20000)' | time cmark > /dev/null
1.60user 0.00system 0:01.62elapsed 98%CPU (0avgtext+0avgdata 17760maxresident)k
python -c 'print("a <![CDATA[" * 40000)' | time cmark > /dev/null
6.20user 0.02system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 34372maxresident)k
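One way to read the three timings above is to estimate the exponent k in t ~ n^k between consecutive runs. The sketch below does that arithmetic on the user times copied from the transcript; the function names are illustrative only.

```haskell
module Main where

-- (input size, user seconds), copied from the cmark runs above.
timings :: [(Double, Double)]
timings = [(10000, 0.40), (20000, 1.60), (40000, 6.20)]

-- Estimated exponent k in t ~ n^k between two measurements.
exponentBetween :: (Double, Double) -> (Double, Double) -> Double
exponentBetween (n1, t1) (n2, t2) = logBase (n2 / n1) (t2 / t1)

-- One estimate per pair of consecutive measurements.
growthExponents :: [(Double, Double)] -> [Double]
growthExponents xs = zipWith exponentBetween xs (drop 1 xs)

main :: IO ()
main = mapM_ print (growthExponents timings)
```

Both estimates come out close to 2, which is consistent with quadratic behaviour: doubling the input roughly quadruples the time.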
% commonmark -xdefinition_lists
## Blah
`-v`, `--version`
: Print version.
`-h`, `--help`
: Show usage message.
^D
<h2 id="blah">Blah</h2>
<dl>
<dt><code>-v</code>, <code>--version</code></dt>
<dd>
<p>Print version.</p>
</dd>
</dl>
<dl>
<dt><code>-h</code>, <code>--help</code></dt>
<dd>
<p>Show usage message.</p>
</dd>
</dl>
This produces two dls rather than one. But if you remove the heading, you get one as intended.
The current system (defining a new typeclass instance for the constructors) doesn't give fine-grained enough information (e.g. it doesn't distinguish code span delimiters from the code). And it is awkwardly designed, so that for example it's easy to write instances that produce loops.
Better to put a field in state that keeps a source map, and maybe another field that controls whether to populate it (for efficiency this can be turned off). Then the individual constructors can be designed to insert whatever fine-grained mapping data would be useful.
Alternatively, instead of making these parsers work for any Monad, limit to the HasSourceMap typeclass and define dummy default instances for common monads.
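A rough sketch of the first proposal above (a source map kept in state, plus a switch that controls whether it is populated). Every name here (SourceRange, ParserState, trackSource, recordRange, recordCodeSpan) is hypothetical and not taken from commonmark-hs; ranges are half-open (line, column) pairs, and the code span is assumed to use single-backtick delimiters.

```haskell
module Main where

-- Hypothetical half-open range: start is inclusive, end is exclusive.
data SourceRange = SourceRange
  { startPos :: (Int, Int)   -- (line, column)
  , endPos   :: (Int, Int)
  } deriving (Eq, Show)

-- Hypothetical parser state with a source map and a switch to disable it.
data ParserState = ParserState
  { trackSource :: Bool                      -- turn off to skip the bookkeeping
  , sourceMap   :: [(String, SourceRange)]   -- newest entry first
  } deriving (Eq, Show)

-- Record a named range only when tracking is enabled.
recordRange :: String -> SourceRange -> ParserState -> ParserState
recordRange name range st
  | trackSource st = st { sourceMap = (name, range) : sourceMap st }
  | otherwise      = st

-- A code span reports its delimiters and its content as separate entries.
-- Arguments: line, column of the opening backtick, length of the content.
recordCodeSpan :: Int -> Int -> Int -> ParserState -> ParserState
recordCodeSpan line col len =
    recordRange "code-close" (SourceRange (line, col + 1 + len) (line, col + 2 + len))
  . recordRange "code-content" (SourceRange (line, col + 1) (line, col + 1 + len))
  . recordRange "code-open" (SourceRange (line, col) (line, col + 1))

main :: IO ()
main = do
  print (recordCodeSpan 1 1 2 (ParserState True []))
  print (recordCodeSpan 1 1 2 (ParserState False []))
```

The point of the design is that each constructor decides how fine-grained its entries are, and the disabled case does no extra work.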
As in pandoc: https://pandoc.org/MANUAL.html#extension-task_lists
Implement in commonmark-core, commonmark-pandoc, and commonmark-cli.
See the test suite under Actions.
Needs investigating whether there are issues in the library itself, or just in the test suite.
See notes on performance in the README.md.
In neuron we support special links of the format <a34sfef4> (as well as <z:zettels?tag=foo>) that have to be processed by the app and replaced with some custom HTML stuff.
I wrote a syntax spec defining the parser in syntaxInlineParsers, but when actually using it, it was only getting applied for <1hello> but not <hello> (which gets rendered as raw HTML).
rawhtmlspec in defaultBlockSpecs is the cause of this behaviour. What is the recommended way to turn this behaviour off, so that applications have the flexibility to parse syntax with angle brackets?
As in pandoc: see https://pandoc.org/MANUAL.html#extension-implicit_figures
Implement in commonmark-core, commonmark-cli, commonmark-pandoc.
A quick question. Perhaps this repo is a good place to try using Megaparsec instead of Parsec?
Megaparsec has some niceties compared to Parsec, which include better error reporting and Unicode support, and it's also reportedly faster. The only downside I can think of (apart from switching parsers being potentially a lot of work) is that Megaparsec is not as stable in terms of API and features -- but that comes with the territory for an actively maintained package.
Since this repo is not (yet) directly tied to Pandoc, I thought that this might be a good place for such experiments.
Just a thought, no pressure.
Hello! Thanks for commonmark!
The newest version of https://github.com/srid/neuron uses the commonmark library (quite successfully, I might add, speaking mainly as a user). To bring the new features to all users (and e.g. nixpkgs) we would like to make a new neuron release. Sadly, releasing to hackage with a dependency that is not on hackage is a bad idea.
So for this situation it would be marvelous if we could make a first commonmark release to hackage soon.
@jgm Would you be willing to do that?
If you don't want to maintain commonmark on hackage, @srid has offered to do the hackage maintenance, if you were to agree with that.
% echo -n "<\!A>" | commonmark
"stdin" (line 1, column 5):
unexpected end of input
% echo -n "<\!-- hi -->" | commonmark
"stdin" (line 1, column 12):
unexpected end of input
% echo -n "<?" | commonmark
"stdin" (line 1, column 3):
unexpected end of input
% echo -n "<? hi ?>" | commonmark
"stdin" (line 1, column 9):
unexpected end of input
% commonmark -xpipe_tables
iconv -t utf-8 input.txt | pandoc | iconv -f utf-8
<p>iconv -t utf-8 input.txt | pandoc | iconv -f utf-8</p>
% commonmark --highlight -xfenced_divs
::: {.blue}
hi
:::
commonmark: <<loop>>
% commonmark -ximplicit_heading_references -xauto_identifiers -xsmart
# hi
See [hi].
# Jo's heading
See [Jo's heading]
^D
<h1 id="hi">hi</h1>
<p>See <a href="#">hi</a>.</p>
<h1 id="jos-heading">Jo’s heading</h1>
<p>See [Jo’s heading]</p>
% commonmark --highlight
[hi]
[hi]: url
<!DOCTYPE html>
...
<pre><span class="paragraph" title="paragraph"><span class="link" title="link">[<span class="str">hi</span>]</span>
[hi]: url
More of a question than an issue, but yeah it seems like Pandoc already knows how to convert commonmark into a Pandoc AST?
% commonmark -xall --highlight
| Sample |showdown |commonmark|marked |markdown-it|
|--------------------------|---------:|---------:|---------:|----------:|
|[README.md] | 1| 3.6| 3.1| 3.9|
[README.md]: url
yields (snipping relevant part)
| <span class="str">Sample</span> |<span class="str">showdown</span> |<span class="str">commonmark</span>|<span class="str">marked</span> |<span class="str">markdown</span><span class="str">-</span><span class="str">it</span>|
|--------------------------|---------:|---------:|---------:|----------:|
|<span class="link" title="link">[<span class="str">README</span><span class="str">.</span><span class="str">md</span>] | <span class="str">1</span>| <span class="str">3</span><span class="str">.</span><span class="str">6</span>| <span class="str">3</span><span class="str">.</span><span class="str">1</span>| <span class="str">3</span><span class="str">.</span><span class="str">9</span>|
in which the link isn't closed in the right place.
See benchmarks
benchmarking pathological/inline link openers without closers/commonmark/800
time 12.22 ms (12.01 ms .. 12.43 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 12.28 ms (12.21 ms .. 12.35 ms)
std dev 144.8 μs (107.0 μs .. 185.6 μs)
benchmarking pathological/inline link openers without closers/commonmark/1200
time 26.03 ms (25.76 ms .. 26.32 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 26.21 ms (26.05 ms .. 26.49 ms)
std dev 350.4 μs (183.2 μs .. 539.2 μs)
benchmarking pathological/inline link openers without closers/commonmark/1600
time 47.06 ms (46.08 ms .. 47.64 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 47.86 ms (47.42 ms .. 48.95 ms)
std dev 995.8 μs (346.9 μs .. 1.569 ms)
variance introduced by outliers: 11% (moderately inflated)
benchmarking pathological/inline link openers without closers/commonmark/2000
time 73.23 ms (71.16 ms .. 75.25 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 75.10 ms (74.27 ms .. 76.17 ms)
std dev 1.334 ms (813.3 μs .. 1.975 ms)
variance introduced by outliers: 14% (moderately inflated)
As in pandoc: https://pandoc.org/MANUAL.html#extension-yaml_metadata_block
We should avoid a yaml dependency. HsYAML might be an acceptable dependency, but an alternative could be to parse the whole block as a literal string and put it in a data-yaml attribute of a Div, so it can be extracted and processed in a second pass.
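The second option above (keep the YAML as an uninterpreted string in a data-yaml attribute) might look roughly like this. It is only a sketch: Block is a toy type rather than the Pandoc or commonmark-hs AST, splitYaml and withMetadata are hypothetical names, and the block is assumed to start on the first line with --- and to end at a line that is --- or ... as in pandoc's yaml_metadata_block.

```haskell
module Main where

-- Toy block type for illustration; not the Pandoc or commonmark-hs AST.
data Block
  = Div [(String, String)] [Block]
  | Para String
  deriving (Eq, Show)

-- Split off a leading metadata block; returns (raw YAML, remaining input).
splitYaml :: String -> Maybe (String, String)
splitYaml input =
  case lines input of
    ("---" : rest) ->
      case break isCloser rest of
        (yaml, _ : body) -> Just (unlines yaml, unlines body)
        _                -> Nothing   -- no closing line: not a metadata block
    _ -> Nothing
  where
    isCloser l = l == "---" || l == "..."

-- Store the raw YAML in a data-yaml attribute; no YAML parsing happens here.
withMetadata :: String -> [Block]
withMetadata input =
  case splitYaml input of
    Just (yaml, body) -> [Div [("data-yaml", yaml)] [], Para body]
    Nothing           -> [Para input]

main :: IO ()
main = print (withMetadata "---\ntitle: x\n---\nhi\n")
```

A second pass can then look for the data-yaml attribute and hand its value to whichever YAML library the application prefers.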
As in pandoc: https://pandoc.org/MANUAL.html#extension-grid_tables
Implement in commonmark-core, commonmark-pandoc, commonmark-cli.
Would you be willing to write a short para about your intent here? Is this package meant to ultimately replace the Pandoc markdown parser? Or are you using this as a place to trial extensions that might land in the CommonMark spec? Or... ?
I've been reading your code and trying to get my head around whether I should be attempting to contribute here, or to mmark, or somehow using cmark-gfm, or... Mostly I need one of the table extensions; my target is LaTeX fragments but I'm mostly interested in seeing what comes of a newer AST as compared to using pandoc-types' "native" AST.
Warm regards to Pandoc's author.
AfC
E.g. with N = 6000, it is much faster than with N = 5000.
With N = 10000, it takes over 20 seconds.
With N = 12000, 2 seconds.
time python -c 'print("*aa bb* "*10000)' | commonmark +RTS -t
% commonmark --sourcepos
[hi] ok
[hi]: url
<p data-sourcepos="stdin@1:1-1:10"><a data-sourcepos="stdin@1:1-@1:1" href="url"><span data-sourcepos="stdin@1:2-1:4">hi</span></a> <span data-sourcepos="stdin@1:8-1:10">ok</span></p>
We could also use some more extensive test cases for source positions.
instance (HasDefinitionList il bl, Semigroup bl, Semigroup il)
        => HasDefinitionList (WithSourceMap il) (WithSourceMap bl) where
  definitionList spacing items = definitionList spacing items
                                 <* addName "definitionList"
% commonmark --highlight -xdefinition_lists
hi
: there
commonmark: <<loop>>
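In the instance shown above, the right-hand definitionList is applied to the same arguments and combined with <*, so it is resolved at the same WithSourceMap type and the method calls itself. The toy example below illustrates that pattern and one way to avoid it; HasList, Plain and Tagged are hypothetical stand-ins, not library types, and this is not the fix used in commonmark-hs.

```haskell
module Main where

-- Toy class standing in for a constructor typeclass such as HasDefinitionList.
class HasList a where
  list :: [a] -> a

newtype Plain = Plain String deriving (Eq, Show)

instance HasList Plain where
  list xs = Plain (concat [s | Plain s <- xs])

-- Toy wrapper that carries a list of names next to the wrapped value.
data Tagged a = Tagged a [String] deriving (Eq, Show)

instance HasList a => HasList (Tagged a) where
  -- Looping version (do not use): `list` on the right-hand side is resolved
  -- at the same `Tagged a` type, so the method calls itself:
  --   list xs = list xs
  -- Terminating version: unwrap the items first, then call `list` at the
  -- inner type and re-wrap the result with the collected names.
  list xs =
    Tagged (list [v | Tagged v _ <- xs])
           (concat [ns | Tagged _ ns <- xs] ++ ["list"])

main :: IO ()
main = print (list [Tagged (Plain "a") ["x"], Tagged (Plain "b") []])
```

The type checker accepts both versions, which is why this kind of loop only shows up at run time.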
As in pandoc: https://pandoc.org/MANUAL.html#extension-citations
Implement in commonmark-core, commonmark-cli, commonmark-pandoc.
This is tricky: we need to make sure this comes first in bracketed specs to avoid
interpretation as a span.
$ python3 -c 'print(end="* "*1000)' | time commonmark > /dev/null
0.46user 0.16system 0:00.35elapsed 177%CPU (0avgtext+0avgdata 52080maxresident)k
0inputs+0outputs (0major+15662minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*2000)' | time commonmark > /dev/null
1.45user 0.63system 0:01.07elapsed 193%CPU (0avgtext+0avgdata 52120maxresident)k
0inputs+0outputs (0major+16320minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*4000)' | time commonmark > /dev/null
6.27user 2.81system 0:04.52elapsed 201%CPU (0avgtext+0avgdata 52292maxresident)k
0inputs+0outputs (0major+18973minor)pagefaults 0swaps
$ python3 -c 'print(end="* "*8000)' | time commonmark > /dev/null
35.88user 15.90system 0:25.93elapsed 199%CPU (0avgtext+0avgdata 51780maxresident)k
0inputs+0outputs (0major+25323minor)pagefaults 0swaps
I saw this in GHCJS for a particular markdown content. It doesn't happen with bare commonmark parser, but only when the extensions are enabled. I'll get to debugging and isolating the problem one of the following days, but it would be great if someone already knew what it could be off the top of their head ...
There aren't any right now.
For example, for duplicate link references.
This for example:
![asciicast](https://asciinema.org/a/329911.svg){#ident .centered .big}
generates:
<p>
<img class="" id="" src="https://asciinema.org/a/329911.svg" title="">
.centered .big}
</p>