intellij-markdown

A multiplatform Markdown processor written in Kotlin.

Introduction

intellij-markdown is an extensible Markdown processor written in Kotlin. It aims to suit the following needs:

  • Use one code base for both client and server-side processing
  • Produce consistent output on different platforms
  • Support different Markdown flavours
  • Be easily extensible

The processor is written in pure Kotlin (with a little JFlex), so it can be compiled not only for the JVM target, but also for JS and Native. This allows the processor to be used on any of these platforms.

Usage

Adding intellij-markdown as a dependency

The library is hosted in the Maven Central Repository, so to be able to use it, you need to configure the central repository:

repositories {
  mavenCentral()
}

If you have Gradle >= 5.4, you can just add the main artifact as a dependency:

dependencies {
  implementation("org.jetbrains:markdown:<version>")
}

Gradle should resolve your target platform and decide which artifact (JVM or JS) to download.

For multiplatform projects, you can add the single dependency to the commonMain source set:

commonMain {
  dependencies {
    implementation("org.jetbrains:markdown:<version>")
  }
}

If you are using Maven or an older Gradle version, you need to specify the correct artifact for your platform, e.g.:

  • org.jetbrains:markdown-jvm:<version> for the JVM version
  • org.jetbrains:markdown-js:<version> for the JS version

Using intellij-markdown for parsing and generating HTML

One of the goals of this project is to provide flexibility in terms of the tasks being solved. The Markdown plugin for JetBrains IDEs is one example: it processes Markdown in several stages:

  1. Parse the block structure without parsing inlines, to provide lazily parsable blocks for the IDE
  2. Quickly parse the inlines of a given block, to provide faster syntax highlighting updates
  3. Generate HTML for preview

These tasks may be completed independently according to the current needs.

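Simple HTML generation (Kotlin)

import org.intellij.markdown.flavours.commonmark.CommonMarkFlavourDescriptor
import org.intellij.markdown.html.HtmlGenerator
import org.intellij.markdown.parser.MarkdownParser

val src = "Some *Markdown*"
val flavour = CommonMarkFlavourDescriptor()
val parsedTree = MarkdownParser(flavour).buildMarkdownTreeFromString(src)
val html = HtmlGenerator(src, parsedTree, flavour).generateHtml()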

Simple HTML generation (Java)

final String src = "Some *Markdown*";
final MarkdownFlavourDescriptor flavour = new GFMFlavourDescriptor();
final ASTNode parsedTree = new MarkdownParser(flavour).buildMarkdownTreeFromString(src);
final String html = new HtmlGenerator(src, parsedTree, flavour, false).generateHtml();

Development gotchas

The only non-Kotlin files are .flex lexer definitions. They are used for generating lexers, which are the first stage of inline element parsing. Unfortunately, the built-in Java-to-Kotlin conversion crashes on these files because of bugs. For that reason, converting .flex files to the respective Kotlin files requires some manual steps:

  1. Install the Grammar Kit plugin. The IDE should suggest it when you open any .flex file.
  2. Install the jflexToKotlin plugin (you will need to build it and then install it manually, via settings).
  3. Run the "Run JFlex Generator" action with the .flex file open.
    • On the first run, a dialog will ask where to download JFlex. Select the project root, then delete the extra .skeleton file that gets downloaded.
  4. A corresponding _<SomeName>Lexer.java will be generated somewhere. Move it next to the existing _<SomeName>Lexer.kt.
  5. Delete the .kt lexer.
  6. Run the "Convert JFlex Lexer to Kotlin" action with the new .java file open.
  7. Fix small problems, such as imports, in the generated .kt file. There should be no major issues. Please try to minimize the number of changes to the generated files; this keeps the Git history clean.

Parsing algorithm

The parsing process consists of two logical stages:

  1. Splitting the document into blocks of logical structure (lists, blockquotes, paragraphs, etc.)
  2. Parsing the inline structure of the resulting blocks

This is the same approach that the CommonMark spec proposes.

Building the logical structure

Each (future) node (list, list item, blockquote, etc.) is associated with a so-called MarkerBlock. The rollback-free parsing algorithm processes every token in the file, one by one. Tokens are passed to the open marker blocks, and each block chooses one of the following actions:

  • do nothing
  • drop itself
  • complete itself

The MarkerProcessor stores the blocks, executes the actions chosen by the blocks, and, possibly, adds some new ones.
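The loop described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: Token, Block, ParagraphBlock, ClosedBlock and processTokens are hypothetical names, not the library's MarkerBlock or MarkerProcessor API, and the only block kind modelled is a paragraph that completes itself on a blank line.

```kotlin
// Hypothetical types for illustration; not the library's API.
enum class Action { NOTHING, DROP, DONE }

data class Token(val type: String, val start: Int, val end: Int)

data class ClosedBlock(val kind: String, val start: Int, val end: Int)

interface Block {
    val kind: String
    val start: Int
    fun onToken(token: Token): Action
}

// A paragraph completes itself when it sees a blank line.
class ParagraphBlock(override val start: Int) : Block {
    override val kind: String = "PARAGRAPH"
    override fun onToken(token: Token): Action =
        if (token.type == "BLANK_LINE") Action.DONE else Action.NOTHING
}

fun processTokens(tokens: List<Token>): List<ClosedBlock> {
    val open = mutableListOf<Block>()
    val closed = mutableListOf<ClosedBlock>()
    for (token in tokens) {
        // Every open block sees each token exactly once, so nothing is re-parsed.
        val iterator = open.iterator()
        while (iterator.hasNext()) {
            val block = iterator.next()
            when (block.onToken(token)) {
                Action.NOTHING -> {}
                Action.DROP -> iterator.remove()
                Action.DONE -> {
                    closed.add(ClosedBlock(block.kind, block.start, token.start))
                    iterator.remove()
                }
            }
        }
        // The processor may also open a new block at this token.
        if (open.isEmpty() && token.type == "TEXT") {
            open.add(ParagraphBlock(token.start))
        }
    }
    // Blocks still open at the end of the input are completed there.
    val endOffset = tokens.lastOrNull()?.end ?: 0
    open.forEach { closed.add(ClosedBlock(it.kind, it.start, endOffset)) }
    return closed
}
```

Feeding it a TEXT token at 0..5, a BLANK_LINE token at 5..7 and a TEXT token at 7..12 yields two paragraphs, 0..5 and 7..12.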

Parsing inlines

For the sake of speed and parsing convenience, the text is passed to the MarkdownLexer first. The resulting tokens are then processed as described below.

Some inline constructs in Markdown have priorities: if two different constructs overlap, the parsing result depends on their types, not their positions. For example, *code, `not* emph` and `code, *not` emph* both contain a code span plus literal asterisks. This means that ordinary recursive parsing is not applicable.

Still, the parsing of inline elements is quite straightforward. For each inline construct, there is a particular SequentialParser which accepts some input text and returns:

  1. The parsed ranges found in this text;
  2. The sub-text(s), which are to be passed to the subsequent inline parsers.
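A minimal sketch of this chaining, assuming a much simplified model: it works on raw characters instead of lexer tokens, treats each leftover range separately instead of gluing ranges together, and uses hypothetical names (InlineParser, Found, Result, delimited, runParsers) that are not part of the library.

```kotlin
// Hypothetical, simplified model of a sequential inline parser chain.
data class Found(val type: String, val range: IntRange)

data class Result(val found: List<Found>, val remaining: List<IntRange>)

fun interface InlineParser {
    fun parse(text: String, ranges: List<IntRange>): Result
}

// Finds spans wrapped in `delimiter` that lie entirely inside one of the given ranges.
fun delimited(type: String, delimiter: Char) = InlineParser { text, ranges ->
    val found = mutableListOf<Found>()
    val remaining = mutableListOf<IntRange>()
    for (range in ranges) {
        var cursor = range.first
        while (cursor <= range.last) {
            val open = text.indexOf(delimiter, cursor)
            if (open == -1 || open > range.last) break
            val close = text.indexOf(delimiter, open + 1)
            if (close == -1 || close > range.last) break
            // Text before the span is handed on to the next parser.
            if (open > cursor) remaining.add(cursor until open)
            found.add(Found(type, open..close))
            cursor = close + 1
        }
        if (cursor <= range.last) remaining.add(cursor..range.last)
    }
    Result(found, remaining)
}

// Runs the parsers in order; each one only sees what the previous ones left over.
fun runParsers(text: String, parsers: List<InlineParser>): List<Found> {
    var ranges = listOf(text.indices)
    val all = mutableListOf<Found>()
    for (parser in parsers) {
        val result = parser.parse(text, ranges)
        all.addAll(result.found)
        ranges = result.remaining
    }
    return all
}
```

With the code span parser placed before the emphasis parser, both example inputs above produce a single code span and no emphasis, because the second asterisk is either consumed by the code span or left in a different range. Reversing the order would wrongly produce an emphasis span for the first input.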

Building AST

After building the logical structure and parsing inline elements, the parser has a set of ranges corresponding to Markdown entities (i.e. nodes). To work with the results effectively, this set is converted to an AST.

As a result, a root ASTNode corresponding to the parsed Markdown document is returned. Each AST node has its own type, called IElementType as in the IntelliJ Platform.
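One common way to turn a set of ranges into a tree is a single pass with a stack, sketched below. It assumes the ranges are properly nested (no partial overlaps) and use exclusive end offsets; MarkedRange, Node, buildTree and render are hypothetical names, and the library's own tree builder may work differently.

```kotlin
// Hypothetical types; end offsets are exclusive.
data class MarkedRange(val type: String, val start: Int, val end: Int)

class Node(val type: String, val start: Int, val end: Int) {
    val children = mutableListOf<Node>()
}

fun buildTree(rootType: String, length: Int, ranges: List<MarkedRange>): Node {
    val root = Node(rootType, 0, length)
    val stack = ArrayDeque<Node>()
    stack.addLast(root)
    // Outer ranges come first: sort by start ascending, then by end descending.
    val sorted = ranges.sortedWith(compareBy<MarkedRange> { it.start }.thenByDescending { it.end })
    for (range in sorted) {
        // Leave every node that ends before this range starts (the root always stays).
        while (stack.size > 1 && stack.last().end <= range.start) stack.removeLast()
        val node = Node(range.type, range.start, range.end)
        stack.last().children.add(node)
        stack.addLast(node)
    }
    return root
}

// Compact textual form of the tree, e.g. FILE(PARAGRAPH(TEXT)).
fun Node.render(): String =
    if (children.isEmpty()) type else type + children.joinToString(",", "(", ")") { it.render() }
```

For the ranges PARAGRAPH 0..7, EMPH 2..5, TEXT 3..4 and TEXT 5..7, render() on the result gives FILE(PARAGRAPH(EMPH(TEXT),TEXT)) when the root type is FILE.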

Generating HTML

For a given AST root, a special visitor is created to generate the resulting HTML. Using a given mapping from IElementType to GeneratingProvider, it traverses the parsed tree in depth-first order, generating a piece of HTML on each node visit.
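The idea of a visitor driven by a type-to-provider map can be sketched as follows. All names here (AstNode, Provider, HtmlVisitor, wrapIn, escapedText, generateHtml) are hypothetical stand-ins using string type names; the real HtmlGenerator and GeneratingProvider have richer signatures.

```kotlin
// Hypothetical, simplified model of provider-based HTML generation.
class AstNode(val type: String, val text: String = "", val children: List<AstNode> = emptyList())

fun interface Provider {
    fun process(visitor: HtmlVisitor, node: AstNode)
}

class HtmlVisitor(private val providers: Map<String, Provider>) {
    private val out = StringBuilder()

    fun append(html: String) {
        out.append(html)
    }

    // Depth-first: a provider decides what to emit and when to descend.
    fun visit(node: AstNode) {
        val provider = providers[node.type]
        if (provider != null) provider.process(this, node) else visitChildren(node)
    }

    fun visitChildren(node: AstNode) {
        node.children.forEach { visit(it) }
    }

    fun result(): String = out.toString()
}

// Wraps the children's HTML in the given tag.
fun wrapIn(tag: String) = Provider { visitor, node ->
    visitor.append("<$tag>")
    visitor.visitChildren(node)
    visitor.append("</$tag>")
}

// Emits leaf text with the basic HTML escapes applied.
val escapedText = Provider { visitor, node ->
    visitor.append(node.text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;"))
}

fun generateHtml(root: AstNode): String {
    val providers = mapOf("PARAGRAPH" to wrapIn("p"), "EMPH" to wrapIn("em"), "TEXT" to escapedText)
    val visitor = HtmlVisitor(providers)
    visitor.visit(root)
    return visitor.result()
}
```

Node types without a provider (such as a file root) simply delegate to their children, so customizing the output comes down to swapping entries in the map.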

Extending the parser

Many routines in the above process can be extended or redefined by creating a different Markdown flavour. The minimal default flavour is CommonMark which is implemented in this project.

GitHub Flavoured Markdown is an example of extending the CommonMark flavour implementation. It can be used as a reference for implementing your own Markdown features.

API

  • MarkdownFlavourDescriptor is a base class for extending the Markdown parser.
    • markerProcessorFactory is responsible for block structure customization.

      • stateInfo lets the processor keep state during document parsing.

        updateStateInfo(pos: LookaheadText.Position) is called when processing of each position begins

      • populateConstraintsTokens is called to create nodes for block structure markers at the beginning of the lines (for example, > characters constituting blockquotes)

      • getMarkerBlockProviders is a place to (re)define types of block structures

    • sequentialParserManager

      getParserSequence defines the inline parsing procedure. The method must return a list of SequentialParsers in which earlier parsers have higher precedence. For example, to parse code spans and emphasis elements with the correct priority, the list should be [CodeSpanParser, EmphParser], not the reverse.

      SequentialParser has only one method:

      parse(tokens: TokensCache, rangesToGlue: List<IntRange>): ParsingResult

      • tokens is a special holder for the tokens returned by lexer

      • rangesToGlue is a list of ranges in the document which are to be searched for the structures in question.

        Consider the input A * emph `code * span` b * c. For the emph parser, the ranges [A * emph , b * c] mean that emphasis must be searched for in the input A * emph | b * c.

        The method must return the parsing result (nodes for the found structures) and the parts of the text to be passed to the subsequent parsers.

        For the same input, the code span parser would return `code * span` with the type "code span", and the delegated pieces would be [A * emph , b * c].

    • createInlinesLexer should return the lexer that splits the text into tokens before the inline parsing procedure runs.

    • createHtmlGeneratingProviders(linkMap: LinkMap, baseURI: URI?) is the place where the generated HTML is customized. This method should return a map which defines how to handle particular kinds of nodes in the resulting tree.

      linkMap here is precalculated information about the links defined in the document by means of link definitions. baseURI is the URI to be used as the base path for resolving relative links. For example, given baseURI = '/user/repo-name/blob/master', the link foo/bar.png should be transformed to /user/repo-name/blob/master/foo/bar.png.

      Each returned provider must implement processNode(visitor: HtmlGenerator.HtmlGeneratingVisitor, text: String, node: ASTNode) where

      • text is the whole document being processed,
      • node is the node being given to the provider,
      • visitor is a special object responsible for the HTML generation. See GeneratingProviders.kt for the samples.
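The baseURI example above can be reproduced with the JVM's java.net.URI. This is a JVM-only sketch with a hypothetical helper name (resolveLink); it is not how the library resolves links internally, and it returns the link unchanged when it is not a syntactically valid URI.

```kotlin
import java.net.URI

// Hypothetical helper: resolves a link against an optional base path.
fun resolveLink(baseUri: String?, link: String): String {
    if (baseUri == null) return link
    // URI.resolve replaces the last path segment unless the base ends with '/'.
    val base = URI(if (baseUri.endsWith("/")) baseUri else "$baseUri/")
    return try {
        base.resolve(link).toString()
    } catch (e: IllegalArgumentException) {
        // Not a valid URI (for example, it contains a space): leave it as written.
        link
    }
}
```

Calling resolveLink("/user/repo-name/blob/master", "foo/bar.png") returns /user/repo-name/blob/master/foo/bar.png, while an absolute link such as https://example.com/a.png is returned as is.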

markdown's People

Contributors

ajalt, berezhkoe, dtretyakov, dzharkov, firsttimeinforever, hurricup, ilya-g, jmfayard, jolanrensen, jsmonk, kamildoleglo, kkononov, kradima, kvanttt, ligi, lukellmann, max-kammerer, piotrtomiak, sdeken, sebastianaigner, valich, vladimir-koshelev, yole, zolotov

markdown's Issues

Problem with links containing umlauts

Clicking on [Foo](https://de.wikipedia.org/wiki/Spezial:Zufällige_Seite) gives the error

[Screenshot of the error, 2021-02-15]

Such links worked some weeks ago.

If one replaces the "ä", as in [Foo](https://de.wikipedia.org/wiki/Spezial:Zufallige_Seite), the error does not appear (Wikipedia does not have such a page).

Support XSS protection

Similarly to markdown-it, we should trim links which try to execute something on the system.

The implementation is actually pretty similar, using similar regexes for filtering links.

All kinds of links and images should be filtered by default.

Using intellij-markdown as a maven/gradle dependency

I tried to use intellij-markdown in my projects but it seems that there is no public Maven repository (maven central, jcenter, etc) which can be used to download intellij-markdown from. How can I automatically add intellij-markdown to my project as a dependency?

Version 0.1.45 on Maven Central requires JDK 11

Similar to this other issue, the latest release of 0.1.45 in Maven Central requires Java11 in its Gradle module file.

This wasn't an issue in the jcenter publication because it was published with a previous version of Gradle and did not include a .module file.

This will make builds fail in setups where Maven Central has higher precedence than JCenter and Java 8 is used.

2 possible solutions:

  • removing the .module file from Maven Central, not sure how possible this is
  • publishing a patched 0.1.46 version that declares Java8 compatibility (see square/kotlinpoet#1000). It looks like markdown is included transitively by dokka, so users would have to explicitly upgrade markdown, but that gives a path forward

Support for Kotlin multiplatform

Hey @valich, have you considered converting this to a pure Kotlin library and adding support for multiplatform? I skimmed through some files and it looks like the usage of Java classes is fairly low. It would be great to be able to use a single markdown parser on multiple platforms and not just Java or JS.

Blockquote ending not detected

Consider the following test markdown

a

> quote

b

This is rendered incorrectly with both CommonMarkFlavourDescriptor and GFMFlavourDescriptor like this (version 0.2.3):

<p>a

</p><blockquote><p>quote

b</p></blockquote>

However, the markdown should render with b after the </blockquote>, like this (especially when using the GitHub flavour):

a

quote

b

Generated HTML is wrapped in <p> tag

The code below generates HTML wrapped in a <p> tag by default. Is there a way to exclude the wrapper tag? It's causing styling issues.

  val markDownText = "Some text"
  val flavour = new GFMFlavourDescriptor
  val parsedTree = new MarkdownParser(flavour).buildMarkdownTreeFromString(markDownText)
  val html = new HtmlGenerator(markDownText, parsedTree, flavour, false).generateHtml

Documentation and/or example usage

The README is rather lacking in details on how to use intellij-markdown. For example, I can't tell whether I can use it to create an AST of a given markdown file that I can use for other purposes. The README links to some files which do not look like an API, and there is no reference to any programmatic API I can use.

Cannot escape any characters using backslash

Escaping a restricted element with a backslash returns a single LeafASTNode:TEXT containing all characters in the range (including the backslash).

So for the input aaa\*bbb I expect four leaf nodes, split something like LeafASTNode:TEXT, LeafASTNode:Escape, LeafASTNode:Asterisk, LeafASTNode:TEXT, so that I can process the Asterisk.

Indented code blocks aren't supported

Here goes the code:

    print("hello world") # four spaces in the beginning

The indented line should be rendered as a code block. A code block is indeed rendered, but it's empty.

README.md: Java example doesn't compile

Besides the String variable being called text instead of src on line 3, the generateHtml method on line 4 uses default parameters that have to be explicitly added when calling it from Java code.

CRLF line endings are not parsed correctly.

The CommonMark spec requires CR and CRLF to be supported as line endings, but this library parses CRLF incorrectly. For example:

MarkdownParser(CommonMarkFlavourDescriptor()).buildMarkdownTreeFromString("a \r\nb")

parses as:

Markdown:PARAGRAPH
  Markdown:TEXT 'a'
  Markdown:BR '·␍'
  Markdown:EOL '␊'
  Markdown:TEXT 'b'

and renders as:

<body><p>a<br />
b</p></body>

Enable iOS build target

The iOS target is currently disabled due to the fact that enabling it would cause tests to fail on CI. This is because there are a number of tests that rely on files in the test resources directory, but iOS tests run on the simulator, and don't have access to the local file system.

Here are a few of the ways we could do this:

  1. Just enable the target and accept failing CI builds. Not ideal.
  2. Move the file-based tests to jvmTest so that they run only on the JVM. A couple of tests, like the performance tests, are already JVM-only.
  3. Create an intermediate source set like nonIosTest or desktopTest and move the file-based tests there.
  4. Remove all file-based tests currently in commonTest. Not ideal, since some of them test functionality not covered by the spec tests.
  5. Find a way to copy the test resources into the simulator before running the tests.

It would be great to publish the iOS target in the next release, so I'd love to hear your thoughts on if there's any approach you prefer.

[HELP] Highlighted Syntax code blocks

Hi,
thanks for this precious job!

I am trying to integrate this library for formatting code documentation. How did you integrate syntax highlighting?
Is there an extension for that?

Thanks a lot, Davide.

Two adjacent links fail to render

The following should be rendered as two adjacent links, but instead there's an error:

[Rust](https://jetbrains.team/team?team=Rust-MNVpW0Zos2C)[💬](https://jetbrains.team/m/Arseniy.Pendryak)

markdown-it parser has no problem with this.

Incorrect unescaping of HTML entities

With input containing &lt; and &gt; (in the escaped form), we end up having incorrect HTML.

Example input: &lt;wrongTag&gt;
Output HTML: <wrongTag>

Not sure, but the same problem may be with &amp; (in the escaped form)

The reason is in what EntityConverter#replaceEntities(text: CharSequence, processEntities: Boolean, processEscapes: Boolean) does when called with true values of arguments.

Markdown parsing fails on Kotlin/JS

The basic example given on the project page fails to run in a Kotlin/JS application with an Invalid regular expression error.

val src = "Some *markdown*"
val flavour = GFMFlavourDescriptor()
val parsedTree = MarkdownParser(flavour).buildMarkdownTreeFromString(src)
val html = HtmlGenerator(src, parsedTree, flavour).generateHtml()
log("Markdown : $html")

Version Used:

 kotlin("js") version "1.5.0-RC"
 implementation("org.jetbrains:markdown:0.2.2")

kotlin_kotlin.js?215f:19299 SyntaxError: Invalid regular expression: /^ {0,3}(\-+|=+) *$/: Invalid escape
    at new RegExp (<anonymous>)
    at Regex (webpack-internal:///./kotlin/kotlin_kotlin.js:18373:26)
    at Regex_init_$Init$_0 (webpack-internal:///./kotlin/kotlin_kotlin.js:18294:11)
    at Regex_init_$Create$_0 (webpack-internal:///./kotlin/kotlin_kotlin.js:18298:12)
    at new Companion_126 (webpack-internal:///./kotlin/kotlin_org_jetbrains_markdown.js:9168:21)
    at Companion_getInstance_125 (webpack-internal:///./kotlin/kotlin_org_jetbrains_markdown.js:9181:7)
    at new SetextHeaderProvider (webpack-internal:///./kotlin/kotlin_org_jetbrains_markdown.js:9185:5)
    at GFMMarkerProcessor.CommonMarkMarkerProcessor [as constructor] (webpack-internal:///./kotlin/kotlin_org_jetbrains_markdown.js:1234:122)
    at new GFMMarkerProcessor (webpack-internal:///./kotlin/kotlin_org_jetbrains_markdown.js:1875:31)
    at Factory_1.createMarkerProcessor_1 (webpack-internal:///./kotlin/kotlin_org_jetbrains_markdown.js:1861:12)

Autolink fails in some cases

Example of input handled incorrectly:

JetBrains Account portal https://account.jetbrains.com
JetBrains Online Store https://www.jetbrains.com/store

Result:

[Screenshot of the rendered result]

Full reference LINK_TEXT should not allow links

The following markdown

[foo [bar](/a)][ref]

[ref]: /b

is parsed into

Markdown:MARKDOWN_FILE
  Markdown:PARAGRAPH
    Markdown:FULL_REFERENCE_LINK
      Markdown:LINK_TEXT
        Markdown:[
        Markdown:TEXT
        WHITE_SPACE
        Markdown:INLINE_LINK
        Markdown:]
      Markdown:LINK_LABEL
  Markdown:EOL
  Markdown:EOL
  Markdown:LINK_DEFINITION
    Markdown:LINK_LABEL
    WHITE_SPACE
    Markdown:LINK_DESTINATION

But according to the spec (https://spec.commonmark.org/0.30/#example-531) there should be 2 SHORT_REFERENCE_LINK.

MarkdownFlavourDescriptor improvement

It would be nice if MarkdownFlavourDescriptor allowed passing arbitrary context to createHtmlGeneratingProviders instead of the hardcoded LinkMap and URI.

Add JS-IR target to publication

Please add IR target to the project publication (either use BOTH or IR-only). It seems that IR will be the default target for all platforms since 1.5 and JS-LEGACY could not be called from IR.

Improve link autodetection

Currently, for a link to be detected as such, it has to start with http(s):// or www. This is too simplistic. Please add TLDs to the link detection function.

Handling of &nbsp; and other HTML entities in GFM

According to the specification, GFM supports HTML entities like &nbsp;. Are they supported by this library? If so, how?

Given a test case:

Hello World! Docs with period issue, e.g.&nbsp;this.

using:

val gfmFlavourDescriptor = GFMFlavourDescriptor()
val markdownAstRoot = IntellijMarkdownParser(gfmFlavourDescriptor).buildMarkdownTreeFromString(testString)

I get a text node from 37 to 52 instead of some code or other form of marker indicating that this fragment contains an HTML entity.

"Failed to construct URL" error in JS rendering

I am encountering this problem in Chrome (it does not occur in Firefox).

The markdown input causing error:

[ML.NET](https://blogs.msdn.microsoft.com%2Fdotnet%2F2018%2F05%2F07%2Fintroducing-ml-net-cross-platform-proven-and-open-source-machine-learning-framework%2F&formCheck=f8307abbbb11b8c559607432591df0ae)

Stacktrace:

Uncaught TypeError: Failed to construct 'URL': Invalid URL
    at URI.resolve_61zpoe$ (ui.js:7571)
    at InlineLinkGeneratingProvider.LinkGeneratingProvider.makeAbsoluteUrl_0 (ui.js:10005)
    at InlineLinkGeneratingProvider.LinkGeneratingProvider.renderLink_pm9a16$ (ui.js:10017)
    at InlineLinkGeneratingProvider.LinkGeneratingProvider.processNode_swx2my$ (ui.js:10000)
    at HtmlGenerator$HtmlGeneratingVisitor.visitNode_6c73xi$ (ui.js:10224)
    at accept (ui.js:7775)
    at CheckedListItemGeneratingProvider$SubParagraphGeneratingProvider.InlineHolderGeneratingProvider.processNode_swx2my$ (ui.js:9791)
    at CheckedListItemGeneratingProvider.processNode_swx2my$ (ui.js:8512)
    at HtmlGenerator$HtmlGeneratingVisitor.visitNode_6c73xi$ (ui.js:10224)
    at accept (ui.js:7775)

Unordered list items incorrectly come back as paragraphs

When Dokka processes includes, it imports the Markdown using this parser. When it reaches an unordered list item (maybe the same applies to ordered ones), the parser treats the text of the list item as a new paragraph.

* something

is parsed as unordered list, list item, paragraph, text "something"

That seems wrong and breaks how Dokka sees the input. It causes this bug:
Kotlin/dokka#71
