jgm / pandoc-types

types for representing structured documents

Home Page: http://johnmacfarlane.net/pandoc

License: Other

Haskell 100.00%

pandoc-types's Issues

Crazy ideas: table structure

Here are a few thoughts I had while implementing table features in pandoc readers and writers. None of the suggestions have been thought through, so take the below with a grain of salt.

Promote RowHeadColumns from TableBody to the full table. The number of row heads will typically be constant for all table bodies; it is also relevant for, and should also apply to, the table head and foot.
Use grid-based data structure. Going from a grid structure to a list of rows and cells seems much simpler than the other way around. Most writers need to compute the table grid, so it might as well be the main structure. This could provide additional type-level guarantees and make it easier to access cells column-wise, e.g. when checking for the most frequent cell alignment in a column.
A possible data structure would be Array from package array. It is already a transitive dependency of pandoc-types (through deepseq) and described in the Haskell 2010 report.
HTML limits rowspan to a max value of 65534 (= 2¹⁶ - 2), and it would be reasonable to adopt this limit. RowSpan could then be a newtype wrapper for Word16.
The colspan attribute is limited to a max value of 1000 in HTML. Like for rowspan, this seems like a reasonable limit. ColSpan could also wrap Word16.
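
A rough sketch of the grid and bounded-span ideas, under stated assumptions: `Grid`, `mkRowSpan`, `column` and `mostFrequent` are hypothetical names that do not exist in pandoc-types, `Alignment` is a local stand-in, and the lower bound of 1 for a row span is an assumption (only the upper limit is given above).

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.List (maximumBy)
import qualified Data.Map as M
import Data.Ord (comparing)
import Data.Word (Word16)

-- Local stand-in for pandoc's Alignment type.
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Ord, Show)

-- Hypothetical bounded row span, wrapping Word16 as suggested above.
newtype RowSpan = RowSpan Word16
  deriving (Eq, Show)

-- Smart constructor enforcing the HTML upper limit of 65534.
-- The lower bound of 1 is an assumption of this sketch.
mkRowSpan :: Int -> Maybe RowSpan
mkRowSpan n
  | n >= 1 && n <= 65534 = Just (RowSpan (fromIntegral n))
  | otherwise = Nothing

-- A table grid indexed by (row, column); each slot holds only an alignment here.
type Grid = Array (Int, Int) Alignment

-- Column-wise access is a simple index range over the grid.
column :: Grid -> Int -> [Alignment]
column g c = [g ! (r, c) | r <- [r0 .. r1]]
  where
    ((r0, _), (r1, _)) = bounds g

-- Most frequent alignment in a list (ties are resolved arbitrarily).
mostFrequent :: [Alignment] -> Maybe Alignment
mostFrequent [] = Nothing
mostFrequent xs =
  Just . fst . maximumBy (comparing snd) . M.toList $
    M.fromListWith (+) [(x, 1 :: Int) | x <- xs]

main :: IO ()
main = do
  let grid = listArray ((0, 0), (1, 1))
               [AlignLeft, AlignRight, AlignLeft, AlignCenter] :: Grid
  print (mostFrequent (column grid 0)) -- both cells in column 0 are AlignLeft
  print (mkRowSpan 70000)              -- rejected: above the 65534 limit
```

A real grid would also have to record which slots are covered by a spanning cell; the sketch leaves that out.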

Cc: @despresc

Please accept QuickCheck 2.12

The build excludes the latest version of QuickCheck as a dependency, but when that constraint is lifted, the build succeeds and works just fine.

Why do we need isNull?

pandoc-types/Text/Pandoc/Builder.hs

Lines 196 to 197 in 154b91b

    
isNull :: Many a -> Bool
isNull = Seq.null . unMany

Is there a reason that we need isNull? Many is Foldable, so isn't it always the same as null? Can we mark it as deprecated now and eventually remove it?
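
A small self-contained check of that claim, using a local stand-in for Many (the real type lives in Text.Pandoc.Builder; the derived Foldable instance here is an assumption of the sketch):

```haskell
{-# LANGUAGE DeriveFoldable #-}

import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Local stand-in for Text.Pandoc.Builder.Many.
newtype Many a = Many { unMany :: Seq a }
  deriving (Show, Foldable)

isNull :: Many a -> Bool
isNull = Seq.null . unMany

main :: IO ()
main = do
  let empty = Many (Seq.empty :: Seq Int)
      full = Many (Seq.fromList [1, 2, 3 :: Int])
  -- Foldable's null and isNull agree on both values.
  print (isNull empty, null empty)
  print (isNull full, null full)
```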

ListAttributes Int meaning missing

The documentation of the ListAttributes tuple does not explain the meaning of its Int component.

pandoc-types documentation on Hackage is not built properly?

If I go to http://hackage.haskell.org/package/pandoc-types, I cannot see or open any module, such as the Text.Pandoc.Definition module. Some nice Samaritan on #pandoc (IRC) pointed me to the source code (https://github.com/jgm/pandoc-types/blob/master/Text/Pandoc/Definition.hs), which was a big help in finding out the various types used by pandoc and their definitions, but formatted API documentation would be nice to have.

Broken dependency constraints for pandoc-types 1.20

While working on msp-strath/Mary#46, I have noticed that pandoc-types-1.20
has the constraint QuickCheck (>=2.4 && <2.14) but it uses liftShrink in
Text.Pandoc.Arbitrary and that was only introduced in QuickCheck-2.10.

Is it possible to update the dependencies on hackage? Cheers!

pandoc-types-1.17.3.1 does not compile with GHC 8.4.1

The error is:

Text/Pandoc/Definition.hs:97:10: error:
    • No instance for (Semigroup Pandoc)
        arising from the superclasses of an instance declaration
    • In the instance declaration for ‘Monoid Pandoc’
   |
97 | instance Monoid Pandoc where
   |          ^^^^^^^^^^^^^

Text/Pandoc/Definition.hs:106:10: error:
    • No instance for (Semigroup Meta)
        arising from the superclasses of an instance declaration
    • In the instance declaration for ‘Monoid Meta’
    |
106 | instance Monoid Meta where
    |          ^^^^^^^^^^^
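
For reference, a minimal sketch of the kind of change this error calls for on GHC 8.4 and later, where Semigroup is a superclass of Monoid. The `Meta` type below is a simplified stand-in, and the left-biased union is an arbitrary choice for illustration, not necessarily what pandoc-types does.

```haskell
import qualified Data.Map as M

-- Simplified stand-in for pandoc's Meta type.
newtype Meta = Meta { unMeta :: M.Map String String }
  deriving (Eq, Show)

-- Since GHC 8.4 a Monoid instance needs a Semigroup instance first.
instance Semigroup Meta where
  -- Left-biased union: an assumption of this sketch.
  Meta m1 <> Meta m2 = Meta (M.union m1 m2)

-- mappend defaults to (<>), so only mempty is required here.
instance Monoid Meta where
  mempty = Meta M.empty

main :: IO ()
main = print (Meta (M.fromList [("title", "a")]) <> mempty)
```

Supporting older compilers at the same time would need conditional compilation, which the sketch does not cover.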

What is the purpose of Table's ShortCaption?

The Table Block currently includes the Caption data type, which is made up of an optional ShortCaption. I didn't see any mention of a short caption in the User Guide and also no way to specify one for a table in Pandoc's Markdown. I then assumed that this would be a field utilised by pandoc's LaTeX reader and writer for storing a short caption in LaTeX tables for use in a List of Tables, like so:

\caption[short caption]{long caption}

However, when I try this using an example .tex file containing a table with a caption of the above form, the resulting AST's Table element does not contain a ShortCaption. Hence my assumption that the ShortCaption would be utilised by pandoc's LaTeX reader and writer is wrong. I am thus left wondering what the purpose of the ShortCaption is.

containers >= 0.3

The dropWhileL and dropWhileR functions used in Text/Pandoc/Builder.hs require containers >= 0.3, as you can see by comparing the exports from containers-0.2.1 (http://hackage.haskell.org/packages/archive/containers/0.2.0.1/doc/html/Data-Sequence.html). I was going to make the obvious suggestion that "containers >= 0.3" be specified, but I think the intention is for pandoc to build with older versions of base, and versions of containers with these functions appear to require base >= 4.2, judging from http://hackage.haskell.org/package/containers-0.3.0.0

instance ToMetaValue String in v1.17.5 is useless as it is

pandoc-types/Text/Pandoc/Builder.hs

Lines 280 to 281 in 71c5782

    
instance ToMetaValue String where
  toMetaValue = MetaString

I've been defining a similar instance in pandoc-crossref for a while now. Now I can't due to duplicate instances. But the instance defined here is pretty useless too, because trying to use it will fail every time due to overlapping instances:

    • Overlapping instances for ToMetaValue [Char]
      Matching instances:
        instance ToMetaValue String -- Defined in ‘Text.Pandoc.Builder’
        instance ToMetaValue a => ToMetaValue [a]
          -- Defined in ‘Text.Pandoc.Builder’

Consider adding an {-# OVERLAPPING #-} pragma:

instance {-# OVERLAPPING #-} ToMetaValue String where
  toMetaValue = MetaString

otherwise, it's very problematic for me and in general.

nullAttr doesn't seem to work as before with pandoc filter

I get a runtime error with nullAttr (from Text.Pandoc.Definition) in pandoc-types-1.23.1 when I use it within a pandoc filter like the one below:

$ cat myfilter.hs
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter block

block (Para _) = Div nullAttr []
block b = b

This leads to an unexpected error instead of the expected result [ Div ( "" , [] , [] ) [] ]:

$ pandoc --filter myfilter.hs -f markdown -t native
foo
(hit ctrl-d)
Error running filter myfilter.hs:
Filter returned error status -11

This filter worked as expected before, though I'm not sure up to which version. I can also get the expected result now if I hide the imported nullAttr and redefine it with the same definition as the original, ("",[],[]).

Does anybody have a clue regarding this error?

-- Sorry in advance if this was to be issued in pandoc itself instead of here.

Emph with custom class

Can Emph be made to take a custom Attr?

I would like this capability in order to define FontAwesome icons when using Semantic UI which requires the HTML tag to be <i> (rather than anything else, say <span>).

In other words, can the following be generated from a Pandoc AST?

<i class="icons tag" />

Issue with compiling pandoc-theorem

There is an issue compiling pandoc-theorem against pandoc-types-1.23:

pandoc-theorem/app/Main.hs:8:8: error:
    • No instance for (Text.Pandoc.JSON.ToJSONFilter
                         IO
                         (Text.Pandoc.Definition.Block -> [Text.Pandoc.Definition.Block]))
        arising from a use of ‘toJSONFilter’
        (maybe you haven't applied a function to enough arguments?)
    • In the expression: toJSONFilter $ toList . convertBlock
      In an equation for ‘main’:
          main = toJSONFilter $ toList . convertBlock
  |
8 | main = toJSONFilter $ toList . convertBlock
  |        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The relevant source can be found here.

pandoc-theorem relies on an instance ToJSONFilter IO (Block -> [Block]), which was removed in 183af9d.

Should this issue be fixed on pandoc-theorem's side or could the instance be re-introduced?
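
One possible workaround on the filter side, sketched under the assumption that the pandoc-types version in use still offers the ToJSONFilter instance for functions of type a -> a together with a Walkable [Block] Pandoc instance: lift the Block -> [Block] function to [Block] -> [Block] with concatMap. The `expand` function is a hypothetical stand-in for toList . convertBlock.

```haskell
import Text.Pandoc.Definition (Block (..))
import Text.Pandoc.JSON (toJSONFilter)

-- Hypothetical stand-in for pandoc-theorem's `toList . convertBlock`:
-- drops horizontal rules and keeps every other block unchanged.
expand :: Block -> [Block]
expand HorizontalRule = []
expand b = [b]

main :: IO ()
main = toJSONFilter (concatMap expand :: [Block] -> [Block])
```

The type annotation selects the list-based instance, so the function is applied to every block list in the document.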

Improving tables

See the main todo list and the relevant issue. I would like to start implementing better table handling in Pandoc. Specifically, I would implement all but the last of these bullet points using one of the designs below (or a modified version of one of them).

I think something like this recently outlined approach is a good way forward for now. The representation is a little loose (any table in the intermediate representation is valid, so there are multiple ways to write a given table, but only one normalized way), but it should allow the readers and writers to be switched more easily. This is a slightly modified version of that approach:

type RowSpan = Int
type ColSpan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type ColWidth = Maybe Double
data CellType = DataCell | HeaderCell
data Cell = Cell Attr CellType (Maybe Alignment) RowSpan ColSpan [Block]
type Row = [Cell]

data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [Row]
  ...

The Maybe Alignment on the individual cells allows the cells to override the alignment of the column(s) in which they reside. This makes it easier to specify one's intentions when a cell spans multiple columns with conflicting alignments, and has the advantage of allowing better \multicolumn and \multirow support in the LaTeX reader and writer. It also comes up naturally when one thinks of possible extensions to the supported markdown table formats.

A similar design has the following modifications:

data Cell = Cell Attr (Maybe Alignment) RowSpan ColSpan [Block]
data HeaderRow = Row Attr [Cell]
data BodyRow = Row Attr [Cell] [Cell]

data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [HeaderRow] [BodyRow] [HeaderRow]
  ...

This has the advantage of making explicit the table head/body/foot and row head/body structure that seems to be assumed in the first approach, where the first entirely header rows become the table head, and the last such rows become the table foot. Cells in the head and foot sections would correspond to th cells, and cells in body section would correspond to td cells. It does not require a CellType, but one could still be added, making these even more similar to HTML tables. This approach has the disadvantage of making the table representation more complex.

I assume that the tables are normalized (laid on a grid with a given width so that overlapping cells and empty spaces can be dealt with in the table) like so, informally:

Empty rows are filtered out from the table
The grid has a height equal to the number of rows in the table, and some fixed width.
Rows are laid on the grid from top to bottom.
The top of each cell is as far down on the grid as it is on the table.
The top-left corner of each cell, in turn, is placed on the leftmost empty grid space on the row, if it exists within the grid width, and is otherwise dropped. If it would overlap a cell on a previous row or extend past the remaining grid width, its width (ColSpan) would be lowered to fit. If it would extend past the bottom of the grid, its height (RowSpan) would be lowered to fit.
If there are too few cells in a row to fill the available width, then blank cells are added to the end of the row.
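
The informal steps above can be sketched as follows. This is one reading of them rather than a reference implementation: cells are reduced to (RowSpan, ColSpan) pairs, a span below 1 is raised to 1 (an assumption), and `normalize` and `layRow` are hypothetical names.

```haskell
-- A cell reduced to its (RowSpan, ColSpan); contents are omitted.
type Cell = (Int, Int)

-- Normalize one table section against a grid of the given width.
normalize :: Int -> [[Cell]] -> [[Cell]]
normalize width rows = go (replicate width 0) (length kept) kept
  where
    kept = filter (not . null) rows -- empty rows are filtered out
    go _ _ [] = []
    go overhang rowsLeft (r : rest) =
      let (placed, overhang') = layRow rowsLeft overhang r
       in placed : go (map (\n -> max 0 (n - 1)) overhang') (rowsLeft - 1) rest

-- Lay one row on the grid. The Int list holds, per column, how many rows
-- (counting the current one) are still covered by a cell from an earlier row.
layRow :: Int -> [Int] -> [Cell] -> ([Cell], [Int])
layRow _ [] _ = ([], []) -- grid width exhausted: leftover cells are dropped
layRow rowsLeft (o : os) cells
  | o > 0 =
      -- Column is covered from above: skip it.
      let (placed, os') = layRow rowsLeft os cells
       in (placed, o : os')
  | otherwise =
      let (h, w, remaining) = case cells of
            [] -> (1, 1, []) -- too few cells: pad with a blank 1x1 cell
            (ch, cw) : more -> (ch, cw, more)
          free = 1 + length (takeWhile (== 0) os) -- contiguous free columns
          w' = max 1 (min w free) -- lower ColSpan to fit
          h' = max 1 (min h rowsLeft) -- lower RowSpan to fit
          (placed, os') = layRow rowsLeft (drop (w' - 1) os) remaining
       in ((h', w') : placed, replicate w' h' ++ os')

main :: IO ()
main = print (normalize 3 [[(2, 1), (1, 5)], [], [(1, 1)]])
-- prints [[(2,1),(1,2)],[(1,1),(1,1)]]
```

In the example the empty row disappears, the oversized column span of 5 is lowered to 2, and the second row gets a blank cell appended because its first column is covered by the cell spanning two rows.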

The table head, table foot, row head (the list of row head sections without the row body), and row body (the list of row body sections without the row head) should be normalized independently in any design where these exist (implicitly in the first, or explicitly in the second). The overall table width would be the length of the [(Alignment, ColWidth)] list, and the row head/body width would add to that width. (The row head width would be the width of the first row in the row head).

add width field to Image

Issue 332 in pandoc jgm/pandoc#332
requires an additional data field that goes with an Image.
Allowed values would be integers between 0 and 100, or maybe 200.
If empty, 100 is default.

Walkable instance (and newtype) for Attributes

It would be nice to be able to walk an AST and perform a replacement or query on all attributes. (We need to do that, e.g., in the EPUB writer.)

The instances wouldn't be hard to write, but probably this would require putting Attributes in a newtype or data type.

Pandoc doesn't parse textile +underlined text+

Discussion jgm/pandoc#463 (comment)

"Functored" AST

Instead of doing e.g.

data Block = ...Block...

what if we did

data BlockF block = ...block...
   deriving Functor
newtype Block = MkBlock (BlockF Block)

This trick allows ASTs to be extensible, among other benefits.

I generally like this approach, but I also had a specific problem in mind that would benefit from it, which I would like to share. https://github.com/obsidiansystems/dombuilder-pandoc/blob/master/src/Reflex/Dom/Builder/Pandoc.hs is some code that translates the Pandoc AST to a reflex-dom "dom builder action", in order to use it within a website built with reflex-dom.

I would like to have my own custom handling of the pandoc AST (e.g. parsing relative URLs in links into a Route AST) without having to copy and paste that function. But if we do the above "functored AST" approach, I can replace

block :: DomBuilder t m => Block -> m ()

with

blockF :: DomBuilder t m => BlockF a -> m ()

blockF :: DomBuilder t m => BlockF (m ()) -> m ()

These can be thought of as specialized versions of traverse_ or sequence_, and they are very nice to work with!
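
A toy illustration of the idea, with a three-constructor `BlockF` standing in for the real type; `foldBlock` and `renderF` are hypothetical names, and IO plays the role of the DomBuilder monad.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Toy base functor; the real Block type has many more constructors.
data BlockF block
  = Para String
  | BlockQuote [block]
  | Div [block]
  deriving (Functor)

newtype Block = MkBlock (BlockF Block)

-- Generic fold: handle the children first, then the node itself.
foldBlock :: (BlockF a -> a) -> Block -> a
foldBlock alg (MkBlock b) = alg (fmap (foldBlock alg) b)

-- A handler in the style of blockF :: BlockF (m ()) -> m (), with m = IO.
-- Each child has already been turned into an action that renders it.
renderF :: BlockF (IO ()) -> IO ()
renderF (Para s) = putStrLn ("<p>" ++ s ++ "</p>")
renderF (BlockQuote kids) =
  putStrLn "<blockquote>" >> sequence_ kids >> putStrLn "</blockquote>"
renderF (Div kids) =
  putStrLn "<div>" >> sequence_ kids >> putStrLn "</div>"

main :: IO ()
main =
  foldBlock renderF $
    MkBlock (Div [MkBlock (Para "hello"), MkBlock (BlockQuote [MkBlock (Para "quoted")])])
```

Custom handling of a single constructor then only needs a different `renderF` case; the recursion in `foldBlock` is shared.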

block list filters don't get correctly applied to blocks nested in a bullet list or in a table cell

I stumbled upon the error within a table, and then I found out that it also applies to bullet lists, which contain lists of lists of blocks, like tables do.

I was not able to extend the property tests to reproduce this error. As far as I can see, arbitrary block lists are generated correctly, so this case should appear sooner or later. One hypothesis is that everywhere, which is used to double-check the correctness of transformations, has the same error, but that seems unlikely.

In any case, I could pinpoint the error with some specific tests, and any suggestions about how to improve them are welcome.

Support for deepseq-generics 0.2.x

NFData seems removed with the new version, and I've tried to use the one from deepseq 1.4.1.1, but got the following error:

[1 of 5] Compiling Text.Pandoc.Generic ( Text/Pandoc/Generic.hs, dist/build/Text/Pandoc/Generic.o )
[2 of 5] Compiling Text.Pandoc.Definition ( Text/Pandoc/Definition.hs, dist/build/Text/Pandoc/Definition.o )

Text/Pandoc/Definition.hs:70:23: parse error on input `|'

I'm still very new in Haskell. Thank you and it would be really appreciated to support the new deepseq-generics :)

Remove Null?

We have a Block constructor Null. I think this was added before we were using Builder in the readers; I wonder if there's any reason for it to exist. Note that there's nothing like this for Inline.

BSD License and GPL source file headers conflict

1f4e239 changed the license to BSD however the source headers still say GPL:

./Text/Pandoc/JSON.hs:23:   License     : GNU GPL, version 2 or above
./Text/Pandoc/Definition.hs:24:   License     : GNU GPL, version 2 or above
./Text/Pandoc/Builder.hs:25:   License     : GNU GPL, version 2 or above
./Text/Pandoc/Walk.hs:29:   License     : GNU GPL, version 2 or above
./Text/Pandoc/Generic.hs:22:   License     : GNU GPL, version 2 or above

Could you please update them?

Pandoc not parsing metadata when invoked inside Haskell program

I'm writing a Pandoc filter. Within my filter, I want to run Pandoc on another file to extract some information. I didn't find the right way to do this in the Pandoc manual or the Haddock documentation, so I made my best guess. Unfortunately, the metadata isn't parsed correctly.

Here's a file (test.md) I want to parse from within my code:

---
title: This is a test
...

Hello world!

Here's a tiny program demonstrating the issue:

import Text.Pandoc.Class (PandocIO, readFileStrict, runIOorExplode)
import Text.Pandoc.JSON (Pandoc)
import Text.Pandoc.Readers (readMarkdown)
import Text.Pandoc.Options (def)
import Text.Pandoc.UTF8 (toText)

main :: IO ()
main = do
  doc <- runIOorExplode $ readTitle "test.md"
  putStrLn . show $ doc

readTitle :: FilePath -> PandocIO Pandoc
readTitle f = do
  s <- readFileStrict f
  let t = toText s
  md <- readMarkdown def t
  return md

When I run the program, I get:

Pandoc (Meta {unMeta = fromList []}) [HorizontalRule,Para [Str "title:",Space,Str "This",Space,Str "is",Space,Str "a",Space,Str "test",SoftBreak,Str "..."],Para [Str "Hello",Space,Str "world!"]]

But if I parse it from the command line, Pandoc correctly picks up the metadata.

$ pandoc -s -t native test.md
Pandoc (Meta {unMeta = fromList [("title",MetaInlines [Str "This",Space,Str "is",Space,Str "a",Space,Str "test"])]})
[Para [Str "Hello",Space,Str "world!"]]
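
A likely explanation, offered as an assumption rather than a confirmed diagnosis: the default ReaderOptions given by def enable no extensions, so the YAML metadata block is not recognized, whereas the command-line tool enables pandoc's Markdown extensions by default. A minimal sketch that turns them on through pandocExtensions, assuming a pandoc library version whose readMarkdown accepts Text:

```haskell
import qualified Data.Text.IO as T
import Text.Pandoc

main :: IO ()
main = do
  t <- T.readFile "test.md"
  -- Enable pandoc's Markdown extensions, including YAML metadata blocks.
  let opts = def { readerExtensions = pandocExtensions }
  doc <- runIOorExplode (readMarkdown opts t)
  print doc
```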

Single constructor data types and JSON serialization

Hi everyone !

Until recently I worked under the assumption that a pandoc-types datatype with a single constructor (say Format) had its type erased from the JSON representation : instead of {"t": type, "c": content}, the representation was simply content.

I think that (maybe with the exception of Meta?) this assumption was valid until the recent changes to the document model. Now, AFAICT, some types with a single constructor have their types erased and some don't. I thought for a moment that the difference was that some were declared with the newtype keyword (type erasure) and some with the data keyword (no type erasure), which would make sense (if I understand the difference between the two keywords in Haskell correctly), but this second hypothesis doesn't hold either.

Could anyone explain to me whether there is a simple rule, based on the definition of the pandoc types, that says whether the type of the data will be erased in the JSON representation?

AFAICT, Format data (newtype) has its type erased in JSON, but RowSpan data (newtype) has its type serialized. Cell data (data) also have their types serialized. Unfortunately, I don't know enough of Haskell to pinpoint what parts of the code explain the difference between these cases ...

The context: I have developed a Python library (https://github.com/boisgera/pandoc) that reads the pandoc-types data models (for as many versions of pandoc as possible) to automatically reproduce the equivalent hierarchy of classes in Python, so that JSON data can be exchanged with the available pandoc executable to work with a pandoc document representation in Python. The target audience is people (first and foremost: me 😉) who need to analyze and transform a document with a nice AST and are fluent in Python but not so much in Haskell (or in Lua). To continue to do that, I need to be able to automatically infer the JSON serialization rule for each data type from the output of :browse Text.Pandoc.Definition in ghci. This is why a simple and mechanical rule would help!

Cheers,

Use Map for key-value pairs

pandoc-types/Text/Pandoc/Definition.hs

Line 196 in dd50a6e

type Attr = (Text, [Text], [(Text, Text)])

I'm not sure what your policy is on breaking changes, but when/if you next plan to do a breaking release, I think we should consider changing this to type Attr = (Text, [Text], M.Map Text Text).
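
A small sketch of what such a change would mean for existing data, with String standing in for Text to keep it free of extra dependencies: a Map cannot hold duplicate keys and does not preserve source order.

```haskell
import qualified Data.Map as M

-- String stands in for Text to keep the sketch dependency-free.
type KeyVals = [(String, String)]

main :: IO ()
main = do
  let kvs = [("width", "10"), ("height", "5"), ("width", "20")] :: KeyVals
      m = M.fromList kvs
  print (lookup "width" kvs) -- the list lookup returns the first occurrence
  print (M.lookup "width" m) -- M.fromList keeps the last value for a key
  print (M.toList m)         -- keys come back sorted; source order is lost
```

Whether collapsing duplicates and reordering attributes is acceptable would have to be decided as part of the breaking change.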

add ListAttributes support for BulletList

(this issue is copied from jgm/pandoc#9480)

since we have ListAttributes for OrderedList:

OrderedList ListAttributes [[Block]]
type ListAttributes = (Int, ListNumberStyle, ListNumberDelim)

so why not add ListAttributes for BulletList? the api of BulletList is just barely like:

BulletList [[Block]]

it will be great to add the attribute ListNumberStyle so the <ul> tag can be set to <ul style="list-style-type: circle;">

in my project, I want to distinguish between - list and * list, so I can set different style on them, but I can't do that now.

Tag 1.20 seems to be missing in git

There is a release 1.20 on GitHub, but running git tag does not include the 1.20 tag; it seems to be missing in git. The commit behind release 1.20 does not appear to be part of any branch; trying to checkout the commit id fails.

Filters that alter document structure

I have a complex html document that I've read into pandoc, and I'm trying to write filters that will isolate the content I'm after. Some examples of this are dropping certain Divs entirely, or replacing Tables by just the content of their rows.

I can write filters of the type Pandoc -> Pandoc which is workable for changing the top-level structure of a document, but would become very tedious when Blocks are nested. I could also write my functions to return Null :: Block when removing Blocks, but that doesn't feel like the right way to do it. Or is that perhaps precisely why Null is there in the first place?

I'd like to be able to write functions of type [Block] -> [Block] to use in filters, but I get the error

No instance for (Text.Pandoc.Walk.Walkable [Block] Pandoc).

I tried to think about how to write that instance, but it's hard to combine walking lists of Blocks with applying the function to them.

So I feel I'm either missing something obvious, or going about it in the wrong way. Is there perhaps a simple solution to what I want to do? (Sorry if this is a stupid or naive question)
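
A minimal sketch of the list-based approach, assuming a pandoc-types version that provides a Walkable [Block] Pandoc instance (recent releases do); `dropDivs` is a hypothetical example function.

```haskell
import Text.Pandoc.Definition (Block (..), Inline (..), Pandoc (..), nullAttr, nullMeta)
import Text.Pandoc.Walk (walk)

-- Remove every Div from a list of blocks.
dropDivs :: [Block] -> [Block]
dropDivs = filter (not . isDiv)
  where
    isDiv (Div _ _) = True
    isDiv _ = False

main :: IO ()
main = print (walk dropDivs doc)
  where
    -- A tiny document: one Div wrapping a paragraph, then a horizontal rule.
    doc = Pandoc nullMeta [Div nullAttr [Para [Space]], HorizontalRule]
```

Because walk applies the function to every block list in the document, nested Divs are removed as well, and a concatMap-based function can replace a block by several blocks in the same way.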

Add Changelog

The introduction of Attributes in Images and Links in version 1.16 should be documented in Changelog.

Work with a StateT transformer

It would be nice if we can also use a StateT monad transformer when walking through the AST and thus keep track of the section, and update the state based on content that we see when "walking" over the tree.
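
A sketch of how this can be expressed with walkM from Text.Pandoc.Walk, assuming pandoc-types 1.20 or later (where Str takes Text) and the State monad from the transformers package; `numberHeader` is a hypothetical example that prefixes each header with a running counter.

```haskell
import Control.Monad.Trans.State.Strict (State, evalState, get, put)
import qualified Data.Text as T
import Text.Pandoc.Definition (Block (..), Inline (..), Pandoc)
import Text.Pandoc.JSON (toJSONFilter)
import Text.Pandoc.Walk (walkM)

-- Prefix each header with a counter kept in the state.
numberHeader :: Block -> State Int Block
numberHeader (Header lvl attr inlines) = do
  n <- get
  put (n + 1)
  pure (Header lvl attr (Str (T.pack (show n ++ ".")) : Space : inlines))
numberHeader b = pure b

-- Run the stateful walk over the whole document, starting the counter at 1.
numberHeaders :: Pandoc -> Pandoc
numberHeaders doc = evalState (walkM numberHeader doc) 1

main :: IO ()
main = toJSONFilter numberHeaders
```

Since State s is StateT s Identity, the same pattern should carry over to StateT over IO when the walk also needs side effects.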

Backward- and forward-compatibility

Hello,

I would like to inquire about the possibility of increasing backward- and forward-compatibility in pandoc-types. For example, in pandoc-types 1.21, changes to the AST have broken the way a few filters work. The new Table block element is not crucial for many filters (e.g. pandoc-plot or pandoc-include-code), so the breakage is artificial.

I propose that the next time the AST is changed, a new block element, Unknown, be added. Then, the function toJSONFilter could be changed so that instead of throwing an error on incompatible pandoc-types version, the Unknown block element could be used as a placeholder for AST blocks that are not decoded appropriately.

This would allow for filters to be compatible with a wider range of pandoc-types and pandoc versions.

Let me know what you think.

Unable to build on Docker Hub: `ExitFailure (-9)`

Today, I tried to build pandoc via cabal on Docker Hub (which worked before, a few weeks ago), but now I only get an error. I've isolated the problem to pandoc-types, but unfortunately I can't make any sense out of the error messages. The Docker container builds correctly on my devices.

This is the output of the latest build: https://hub.docker.com/r/thriqon/full-pandoc/builds/be4dayresfnxycbghqowa65/ (near the bottom is the debug output produced with -v3).
The Dockerfile is this one: https://hub.docker.com/r/thriqon/full-pandoc/~/dockerfile/.

Many thanks in advance for any help!

base (>1 && <1) in version 1.19 on purpose?

With such restrictions, it's not possible to use this version. Is this on purpose? I don't see any mention in the changelog that this version is deprecated, either.

missing DeriveTraversable needed for Text.Pandoc.Builder to build with GHC 7.8

hello!
it's my semi-annual "get pandoc building with the new GHC RC" extravaganza

looks like it's just as simple as adding DeriveTraversable to the set of language pragmas for Text.Pandoc.Builder (assuming a suitably patched version of regex pcre builtins)

still have 1-2 issues i need to sort out for pandoc proper, will report those suitably too

COMPLETE pragmas in Legacy.Definition

It is possible to convince GHC that a collection of patterns is complete, say by writing {-# COMPLETE Format :: D.Format #-} in Legacy.Definition. This would make it unnecessary to silence the incomplete pattern match warnings. I did not know that when writing the modules, and by the time I got to pandoc it was easier to silence the warnings while working rather than add the pragmas. I can write a request that adds them, if that would be welcome.

Just now I tested them on the pandoc commit right after I switched it to Legacy.* and they did suppress the warnings, but I got other warnings like

    Pattern match checker exceeded (2000000) iterations in
    an equation for ‘fixBlocks’. (Use -fmax-pmcheck-iterations=n
    to set the maximun number of iterations to n)
    |               
846 |   let fixBlocks (b : CodeBlock attr x : rest)

so it may not be a perfect solution.
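
For readers unfamiliar with the pragma, a self-contained toy version of the setup: a new-style type with a legacy-style pattern synonym on top of it. The names mirror the text above, but the definitions are stand-ins and not the actual Legacy.Definition code.

```haskell
{-# LANGUAGE PatternSynonyms #-}

-- Stand-in for the new-style type (called D.Format in the text above).
newtype Format = MkFormat String

-- Legacy-style pattern synonym over the new constructor.
pattern Format :: String -> Format
pattern Format s = MkFormat s

-- Without this pragma GHC cannot tell that Format alone covers every value
-- and warns about incomplete patterns in formatName under -Wall.
{-# COMPLETE Format :: Format #-}

formatName :: Format -> String
formatName (Format s) = s

main :: IO ()
main = putStrLn (formatName (Format "html"))
```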

Improve Walk

Walk is a bit of a mess. It seems we should be able to do something more elegant, using recursion schemes or something.

For motivation see jgm/pandoc#7130.
Here we have walk fixLinks where fixLinks is [Inline] -> [Inline].
It works fine if applied to [Inline]. However, it behaves differently if you apply it to Inlines. Seems like instead of having instances specifically for lists, we should have general instances that work for all Traversable/Foldable structures, including Many.

github actions caching

See what https://github.com/reflex-frp/reflex/blob/develop/.github/workflows/haskell.yml does, for example, so that the dependencies are not built every time.
