3b / 3bmd

markdown processor in CL using esrap parser

License: MIT License

Common Lisp 98.31% Python 1.69%

parsing peg common-lisp grammar

3bmd's People

Contributors

archimag daewok vseloved puercopop shinmera eadmund davidalphafox cl-kuthirgal asimpson mdbergmann mrlightningbolt svetlyak40wt melisgl

3bmd's Issues

Escaping headings

I have a bit of a headache with the parse/print consistency of headings.

First, and this may be how markdown works, if there is no newline after the "heading", then it's parsed as :PLAIN:

CL-USER> (3bmd::parse-doc "x
#y
")
((:PARAGRAPH "x") (:HEADING :LEVEL 1 :CONTENTS ("y")))
NIL
T
CL-USER> (3bmd::parse-doc "x
#y")
((:PARAGRAPH "x") (:PLAIN "#" "y"))
NIL
T

When the latter is printed, an extra newline is inserted:

CL-USER> (3bmd:print-doc-to-stream (3bmd::parse-doc "x
#y") t :format :markdown)
x

#y
NIL

When the heading is escaped, the parse is good, but printing loses the escape:

CL-USER> (3bmd::parse-doc "x
\\#y
")
((:PLAIN "x" "
"
  "#" "y"))
NIL
T
CL-USER> (3bmd:print-doc-to-stream (3bmd::parse-doc "x
\\#y
") t)
x
#y
NIL

If this output is parsed again, then we get a :HEADING. Thus print/parse consistency is lost.

The quick fix would be to escape all # characters in print-md-escaped, but that produces unnecessarily cluttered output, which goes against the spirit of markdown. The right solution seems to be to escape only in column 0, but that's not easily and portably available.
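A minimal sketch of the "escape only in column 0" idea, assuming the text of a paragraph is available as a single string before it is written out. The function name escape-leading-hashes is hypothetical and not part of 3bmd. It must only be applied to paragraph text, not to the whole printed document, since real headings have to keep their #.

```lisp
(defun escape-leading-hashes (text)
  "Return TEXT with a backslash inserted before any # that starts a line."
  (with-output-to-string (out)
    (let ((at-line-start t))
      (loop for char across text
            do (when (and at-line-start (char= char #\#))
                 (write-char #\\ out))
               (write-char char out)
               ;; The next character is in column 0 only after a newline.
               (setf at-line-start (char= char #\Newline))))))

;; The # on the second line gets a backslash; the one in mid-line is left alone.
(escape-leading-hashes (format nil "x~%#y~%a #b"))
```

Tracking the column inside the function avoids relying on stream column information, which is the non-portable part mentioned above.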

Undefined function: parse-string

Hello,

This is surprising, but why not:

(3bmd:parse-string "rst")

; in: 3BMD:PARSE-STRING "rst"
;     (3BMD:PARSE-STRING "rst")
; 
; caught STYLE-WARNING:
;   undefined function: 3BMD:PARSE-STRING

Slime offers this completion and I find parse-string as an exported symbol, but neither my grep nor my eyes could find a function definition.

This is with the January Quicklisp dist.

Regards

mailto randomness

When a mailto element is printed to html there is randomness injected into the encoding supposedly to make life more difficult for spammers:

(defun encode-email (text)
  (with-output-to-string (s)
    (loop for i across text
       for r = (random 1.0)
       do (cond
            ((< r 0.1) (write-char i s))
            ;; fixme: make this portable to non-unicode/ascii lisps?
            ((< r 0.6) (format s "&#x~x;" (char-code i)))
            (t (format s "&#~d;" (char-code i)))))))

Unfortunately, this has the side effect of introducing spurious diffs when the generated html is version controlled. Would a deterministic solution be acceptable?
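One possible deterministic variant, as a sketch only: the choice between a literal character, a hexadecimal reference, and a decimal reference is derived from the character's position and code instead of from RANDOM. The name encode-email-deterministic and the selector formula are assumptions made for illustration; the 10% / 50% / 40% split mirrors the thresholds in the function above.

```lisp
(defun encode-email-deterministic (text)
  "Encode TEXT as a mix of literal characters and HTML character references.
The choice for each character depends only on its position and its code."
  (with-output-to-string (s)
    (loop for char across text
          for index from 0
          for code = (char-code char)
          ;; Repeatable selector in the range 0..9, replacing (random 1.0).
          for selector = (mod (+ (* 7 index) code) 10)
          do (cond
               ((< selector 1) (write-char char s))
               ((< selector 6) (format s "&#x~x;" code))
               (t (format s "&#~d;" code))))))

;; The same input always yields the same output, so diffs stay stable.
(encode-email-deterministic "me@example.com")
```

The output still mixes three encodings, but regenerating the HTML no longer changes it.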

Is there any intent to meet the CommonMark spec

Is there any intent to meet the new spec at http://spec.commonmark.org/?

Fails on ABCL-1.9.2

It fails as:

:info:build Caught UNBOUND-VARIABLE while processing --eval option "(asdf:operate (quote asdf:build-op) (quote 3bmd-tests))":
:info:build   The variable DEF-GRAMMAR-TEST is unbound.
:info:build Command failed: env XDG_CACHE_HOME=$HOME/.cache /opt/local/bin/abcl --noinit --batch --eval '(require "asdf")' --eval '(setf asdf:*central-registry* (list* (quote *default-pathname-defaults*) #p"/opt/local/var/macports/build/_Users_catap_src_macports-ports_lisp_cl-3bmd/cl-3bmd/work/build/system/" #p"/opt/local/share/common-lisp/system/" asdf:*central-registry*))' --eval '(asdf:operate (quote asdf:build-op) (quote 3bmd-tests))' 2>&1

SBCL, ECL, CLISP and CCL work.

Metadata for Quicklisp

Please consider adding :description, :author and :license information to your ASDF system(s). This will greatly help Quicklisp users and make it easier for them to report bugs.

More information:

http://blog.quicklisp.org/2015/05/looking-for-more-metadata.html
https://www.quicklisp.org
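For illustration, a system definition carrying the requested metadata might look like the sketch below. The system name and author are placeholders, not the real values for 3bmd; the description and license are taken from the repository summary at the top of this page.

```lisp
;; Hypothetical system definition; only the metadata options matter here.
(asdf:defsystem "example-markdown-system"
  :description "markdown processor in CL using esrap parser"
  :author "Your Name <you@example.com>"   ; placeholder
  :license "MIT")
```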

Improve formatting of generated html source, make extra whitespace optional

Currently HTML is generated with some newlines and no indentation. Working indentation would be nice to have, as would an option to add no extra whitespace including newlines.

*PADDING* and PADDED sound like they should modify indentation, but apparently just do something with newlines?

support CommonMark?

3bmd is older than CommonMark, so it tries to implement the original markdown syntax with reference to behavior of other markdown processors where that was ambiguous. That strategy has all the problems that motivated CommonMark, and CommonMark seems popular enough now that not matching it is annoying and/or confusing to users (ex #45).

Unfortunately, it looks like it would be difficult or impossible to write a proper PEG/TDPL grammar for the entire CommonMark spec at once, so it would probably be hard to maintain compatibility with existing 3bmd extensions.

It probably wouldn't be too hard to write a new parser using something like the multiple-pass lines -> blocks -> inlines strategy suggested by the spec. The inlines pass might be able to reuse a lot of the 3bmd inline grammar, possibly with some limitations on the length of code span delimiters and similar. In that case, inline extensions might be usable without too many changes (I'd probably want to clean up the AST in the process though, so they would need updates to match that). Block elements would need to be rewritten though; I'm not sure if that pass would use esrap for parsing or if it would need something more complicated to handle the arbitrary indentation in lists/blockquotes. Possibly a hybrid, with an esrap rule to detect the start of a block, which then lets the block parse the following lines however it wants.
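As a toy illustration of the first two passes only (lines, then blocks), the sketch below splits the input into lines and groups consecutive non-blank lines into blocks. It ignores everything that makes CommonMark hard (indentation, lists, blockquotes, fences), and the function names are hypothetical.

```lisp
(defun split-lines (text)
  "Split TEXT into a list of lines at newline characters."
  (loop with start = 0
        for end = (position #\Newline text :start start)
        collect (subseq text start end)
        while end
        do (setf start (1+ end))))

(defun blank-line-p (line)
  "True if LINE contains only spaces, tabs or carriage returns."
  (every (lambda (char) (member char '(#\Space #\Tab #\Return))) line))

(defun lines-to-blocks (lines)
  "Group consecutive non-blank LINES into blocks (lists of lines)."
  (let ((blocks '())
        (current '()))
    (flet ((finish-block ()
             (when current
               (push (nreverse current) blocks)
               (setf current '()))))
      (dolist (line lines)
        (if (blank-line-p line)
            (finish-block)
            (push line current)))
      (finish-block))
    (nreverse blocks)))

;; Two blocks: the first with lines "a" and "b", the second with "c".
(lines-to-blocks (split-lines (format nil "a~%b~%~%c")))
```

Each block could then be handed to a separate inline pass.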

I don't have any current plans to work on such a thing though, since my current limited markdown needs are satisfied by 3bmd as it is and I have other things that are higher priority for now (unless someone has a pile of money to throw at a commonmark parser or something). It does seem interesting enough that I might try to at least do a proof-of-concept between other projects at some point, but will probably be a while if so.

some related links:
CommonDoc : probable replacement for the ad-hoc AST in 3bmd in a rewrite
commondoc-markdown : Project using 3bmd with CommonDoc, possibly supporting CommonMark in the future.
cl-cmark : CommonMark processor using FFI to libcmark

README documents colorize-name-map wrong

The docs say 3bmd:*colorize-name-map* but it should be 3bmd-code-blocks::*colorize-name-map* (also it's unexported)

# interpreted as title inside of lists

Something like * #not a title will be rendered as

<ul>
<li><h1>not a title</h1></li>
</ul>

when it should just be

<ul>
<li>#not a title</li>
</ul>

Clean up parse tree and add to public interface.

The current parse tree is mostly derived from the grammar rather than having any thought put into it.

Would be nicer to have a more logical parse tree as an officially supported part of the API, for people who want to modify it or add other output formats.

3bmd-math in README should be 3bmd-ext-math

Code blocks in list items

Code blocks lose the indent when printed:

CL-USER> (let ((3bmd-code-blocks:*code-blocks* t))
           (3bmd-grammar:parse-doc "
- xxx

    ```
    0123456789
            89
    ```
"))
((:BULLET-LIST
  (:LIST-ITEM (:PARAGRAPH "xxx")
   (3BMD-CODE-BLOCKS::CODE-BLOCK :LANG "" :PARAMS NIL :CONTENT "0123456789
        89"))))
NIL
T
CL-USER> (let ((3bmd-code-blocks:*code-blocks* t))
           (3bmd:print-doc-to-stream * *standard-output* :format :markdown))
- xxx

    ```
0123456789
        89
```

smart quotes and backslash

Running this:

(let ((3bmd-grammar:*smart-quotes* t))
  (3bmd:parse-string-and-print-to-stream "\\'" *standard-output*))

gives the error:

Cannot FUNCALL the SYMBOL-FUNCTION of special operator QUOTE.

make sure parser always returns something useful

The grammar should match all input, but in case of bugs it would be nice to (optionally?) catch parse errors and return something useful anyway.

first step would probably be to add a catch-all (* character) to the end of the doc rule, and add it to the blocks (maybe as an extra plain block?)
"incomplete parse" errors should probably be handled the same way (maybe not even bother with trying to catch extra in the doc grammar if this needs to be here anyway?)
"parse failed" should just return the original input?

One giant paragraph instead of separate paragraph tags

Hello!

Thank you for your amazing project. I am using it to write a static site generator and am running into an issue where it outputs a single paragraph tag for an entire string of text with newlines instead of separating it into new paragraphs on the newlines.

The input below:

This is a first post. I am excited to have this post in place. I am using a new blogging engine I wrote myself in Common Lisp.

This is me hoping the paragraph gets formatted properly.

Gives me the following output:

<p>This is a first post. I am excited to have this post in place. I am using a new blogging engine I wrote myself in Common Lisp.This is me hoping the paragraph gets formatted properly.</p>

Any assistance with this issue would be much appreciated. I am running with the latest build from quicklisp on SBCL for macOS.

Failed tests on clisp

I'm using clisp from https://gitlab.com/gnu-clisp/clisp/-/commit/66924971790e4cbee3d58f36e530caa0ad568e5f and an attempt to run the tests via MacPorts leads to a failure:

:info:test   INDENT-BY-TAB-SHOULD-BE-REPLACED-WITH-SPACES........................................................................[ OK ]
:info:test   BLANK-LINE-TEST2........................................................................[ OK ]
:info:test   BLANK-LINE-TEST1........................................................................[ OK ]
:info:test   NEWLINE........................................................................[ OK ]
:info:test   MULTIPLE-SPACES-MIXED-WITH-TABS........................................................................[ OK ]
:info:test   TAB-AS-SPACE........................................................................[ OK ]
:info:test   SPACE-TEST........................................................................[ OK ]
:info:test   EOF-TEST........................................................................[ OK ]
:info:test Test run had 1 failure:
:info:test   Failure 1: FAILED-ASSERTION when running 3BMD-TESTS::PARSE-LIST-WITH-CARRIAGE-RETURN
:info:test     Binary predicate (EQUALP X Y) failed.
:info:test     x: 3BMD-TESTS::RESULT => 
:info:test     ((:BULLET-LIST
:info:test       (:LIST-ITEM
:info:test        (:PLAIN "x"
:info:test         "
:info:test     "
:info:test         "y"
:info:test         "
:info:test     "
:info:test         "Not" " " "verbatim"))))
:info:test     y: 3BMD-TESTS::EXPECTED => 
:info:test     ((:BULLET-LIST
:info:test       (:LIST-ITEM
:info:test        (:PARAGRAPH
:info:test         "x
:info:test     y")
:info:test        (:PARAGRAPH "Not" " " "verbatim"))))
:info:test *** - tests failed

Escaping curly brackets

In commit 18a59d3, I changed print-md-escaped to escape the [] and {} characters. The former was necessary for print/parse consistency, while the latter wasn't because {} are not parsed specially (except for allowing them to be backslash escaped). However, in melisgl/mgl-pax#28, we find that escaping curly brackets makes outputting latex-in-markdown for pandoc a pain.

Do you think not escaping them would be correct?

":REFERENCE-LINK fell through ETYPECASE expression."

[a][a[a]]

confuses the reference-link-double grammar, shouldn't match it at all

Processing instructions are escaped instead of passed through

Out of the box 3bmd doesn't recognise that processing instructions are valid:

cl-user> (3bmd:parse-string-and-print-to-stream "<?this is a valid processing instruction?>" t)
<p>&lt;?this is a valid processing instruction?&gt;</p>

At least according to the CommonMark spec, processing instructions are allowed as blocks and inlines, and should be passed through verbatim.

README link out of date

Per https://github.com/nikodemus/esrap, the new location of esrap is https://github.com/scymtym/esrap if you want to update the README link.

Definition lists extension does not work because of error

There is no applicable method for the generic function #<STANDARD-GENERIC-FUNCTION 3BMD-EXT:PRINT-MD-TAGGED-ELEMENT (35)> when called with arguments (3BMD-DEFINITION-LISTS::DEFINITION-LIST #<SB-IMPL::STRING-OUTPUT-STREAM {666A6F3}> ((:TERMS ((3BMD-DEFINITION-LISTS::DEFINITION-TERM "test" " " "definition")) :DEFINITIONS ((3BMD-DEFINITION-LISTS::DEFINITION-LIST-ITEM (:PLAIN "The" " " "definition" " " "test")))) (:TERMS ((3BMD-DEFINITION-LISTS::DEFINITION-TERM "second" " " "item")) :DEFINITIONS ((3BMD-DEFINITION-LISTS::DEFINITION-LIST-ITEM (:PLAIN "Nother" " " "definition" " " "test")))))).

The code expects a PRINT-MD-TAGGED-ELEMENT method, but the extension defines PRINT-TAGGED-ELEMENT.

Don't require blank line around ``` code block

github and commonmark don't require blank lines before or after fenced code blocks, so 3bmd probably shouldn't require them either.

Pygments option to ext-code-blocks extension should probably be secured better

Currently the Pygments mode of ext-code-blocks passes user input to pygmentize for the language and options. The code tries to do so safely by avoiding going through a shell and rejecting the cssfile option, but it would probably be better to whitelist the allowed options and either whitelist the languages (possibly querying them from pygmentize on first use?) or at least restrict the characters allowed.
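A rough sketch of the character-restriction and option-whitelist checks suggested above. Nothing here is taken from ext-code-blocks: the function names, the allowed character set, the length limit and the contents of the whitelist are all assumptions chosen for illustration.

```lisp
(defparameter *language-name-chars*
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-_.#"
  "Characters accepted in a language name (illustrative choice).")

(defparameter *allowed-pygments-options* '("linenos" "hl_lines" "style")
  "Illustrative whitelist of formatter options; anything else is rejected.")

(defun safe-language-name-p (name)
  "True if NAME is a short string of allowed characters not starting with a dash."
  (and (stringp name)
       (<= 1 (length name) 32)
       ;; A leading dash could be mistaken for a command-line switch.
       (char/= (char name 0) #\-)
       (every (lambda (char) (find char *language-name-chars*)) name)))

(defun allowed-option-p (option-name)
  "True if OPTION-NAME is on the whitelist."
  (and (member option-name *allowed-pygments-options* :test #'string=) t))

(safe-language-name-p "c++")   ; accepted
(allowed-option-p "cssfile")   ; rejected, it is not on the whitelist
```

A whitelist fails closed: an option nobody thought about is rejected by default instead of being passed through.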

Memory usage on large inputs

I'm using the per-block implementation in parse-doc, but it's still fairly easy to run out of memory with large %blocks with something like this:

CL-USER> (time
          (let ((input (with-output-to-string (out)
                         (loop repeat 100000
                               do (format out "- ~A ~A ~A ~A~%"
                                          (random 1000000) (random 1000000)
                                          (random 1000000) (random 1000000))))))
            (3bmd-grammar::parse-doc input)
            (length input)))
Evaluation took:
  12.364 seconds of real time
  12.371129 seconds of total run time (11.750773 user, 0.620356 system)
  [ Run times consist of 5.771 seconds GC time, and 6.601 seconds non-GC time. ]
  100.06% CPU
  37,030,481,562 processor cycles
  15,570,202,368 bytes consed
  
2955662
CL-USER> (/ 15570202368 2955662.0)
5267.924

This example uses a bulleted list because it is probably the worst offender, but a large paragraph behaves similarly.

According to time, consing scales linearly with the number of repeats, which is good. Perhaps 5267 bytes per character is too high, but I suspect that the main problem is that maximum size of the working set also scales linearly.

add support for/try tests from CommonMark

aka "standard markdown", "common markdown"
http://commonmark.org/

figure out proper handling of lists with some blank lines

in lists with some items separated by blank lines, we currently treat all elements as paragraphs.

markdown.pl only treats entries before/after blank lines as paragraphs (2,3,4 in example).

Github treats everything starting from the item before the first blank line as a paragraph (2,3,4,5,6 in example below).

see http://babelmark.bobtfish.net/ for a comparison of various other implementations, all 3 behaviours seem reasonably common

test case:

* l0
* l1
* l2

* l3

* l4
* l5
* l6

test case as displayed by github:

Code blocks highlighting

It seems that regular, indented code blocks are treated differently from unindented (marked with ```) code blocks in terms of highlighting? IMO, even though there's currently no way to set a language, the indented code blocks should also undergo highlighting; at least that's how other formatters, e.g. Stack Overflow, render them.

`3bmd::ensure-paragraph` undefined

When trying to print as markdown I get an error that the function 3bmd::ensure-paragraph is undefined, and inspection shows that 3bmd::end-paragraph also is.

Hypothesis: these were renamed 3bmd::ensure-block and 3bmd::end-block and the one call site didn't get edited. Can you confirm, @melisgl?

:description

Would you please consider adding a :description option to your system definition of 3bmd, 3bmd-ext-code-blocks and 3bmd-ext-wiki-links?

:REFLINK :DEFINITION that looks like :EMPH

Is this not valid input?

(3bmd-grammar:parse-doc "[l][*x*]")
.. debugger invoked on SB-KERNEL:CASE-FAILURE:
..   :EMPH fell through ETYPECASE expression.
..   Wanted one of (STRING CHARACTER LIST).

"# foo" fails to parse

... with error "Incomplete parse, stopped at 6.". If I add a newline and some more text, it works.

We use Markdown because it's able to output meaningful HTML no matter how bad the input is; so more generally, it would be nice to have an option to just accept any input and never throw a parse error.
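Such an option could be approximated by a wrapper like the sketch below. It assumes the parser signals a condition that is a subtype of ERROR on failure, and it returns the input as a single (:PLAIN ...) block in that case. The name parse-with-fallback is hypothetical, and the parser is passed in as a function so the sketch does not depend on any particular 3bmd entry point.

```lisp
(defun parse-with-fallback (parser input)
  "Call PARSER on INPUT; if it signals an error, return INPUT as one plain block."
  (handler-case (funcall parser input)
    (error ()
      (list (list :plain input)))))

;; A stand-in parser that always fails, to show the fallback path.
(parse-with-fallback (lambda (string)
                       (declare (ignore string))
                       (error "Incomplete parse"))
                     "# foo")
```

Catching ERROR this broadly also hides genuine bugs, so a real implementation would probably want to make the fallback optional.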

Accepting empty cells in table

3bmd-ext-tables ignores empty cells in tables. The following text is not rendered correctly by 3bmd.

| a |   |
| - | - |
|   | b |

I expect it to render like the following.

a
	b

I guess the cause is line 37 of 3bmd/tables.lisp (as of commit 5b301ad):

(+ (and (! (or (and sp #\|) endline)) inline))

Thank you.

Code blocks nested in list items aren't supported

There are two problems:

The parsed code becomes a sibling of the list item, even though it has the same indentation as the item's content.
It is also parsed as inline code instead of a CODE-BLOCK.

40ANTS-DOC-TEST/UTILS-TEST> (let ((3bmd-code-blocks:*code-blocks* t))
                              (3bmd-grammar:parse-doc "
* Added a warning mechanism, which will issue such warnings on words which looks
  like a symbol, but when real symbol or reference is absent:

  ```
  WARNING: Unable to find symbol \"API\" mentioned in (CL-INFO:@INDEX SECTION)
  ```
"))
((:BULLET-LIST
  (:LIST-ITEM
   (:PLAIN "Added" " " "a" " " "warning" " " "mechanism," " " "which" " "
    "will" " " "issue" " " "such" " " "warnings" " " "on" " " "words" " "
    "which" " " "looks" "
"
    "  " "like" " " "a" " " "symbol," " " "but" " " "when" " " "real" " "
    "symbol" " " "or" " " "reference" " " "is" " " "absent:")))
 (:PLAIN "  "
  (:CODE "
  WARNING: Unable to find symbol \"API\" mentioned in (CL-INFO:@INDEX SECTION)
")))
NIL
T

When there is no indentation, the code block is parsed correctly:

40ANTS-DOC-TEST/UTILS-TEST> (let ((3bmd-code-blocks:*code-blocks* t))
                              (3bmd-grammar:parse-doc "
```
WARNING: Unable to find symbol \"API\" mentioned in (CL-INFO:@INDEX SECTION)
```
"))
((3BMD-CODE-BLOCKS::CODE-BLOCK :LANG "" :PARAMS NIL :CONTENT
  "WARNING: Unable to find symbol \"API\" mentioned in (CL-INFO:@INDEX SECTION)"))
NIL
T

Warn when reference links have no definition

This should produce a warning when printed:

[something][non-existent]

add option for less-strict html blocks

cl-mongo README.md embeds documentation created with documentation template, which doesn't close <p> tags, and the embedded chunks are <p> followed immediately by a <blockquote> in 1 html block rather than separated into 2 as I understand the markdown docs to require.

Github parses these chunks as HTML blocks when rendering documentation, but doesn't seem to in issues.

Probably can get reasonable parsing by optionally allowing multiple html-block-in-tags in one html-block, and adding a variant of <p> that is closed by a html-block-in-tags (or possibly only a subset of block tags?) rather than </p>