jgm / pandoc-types
types for representing structured documents
Home Page: http://johnmacfarlane.net/pandoc
License: Other
Here are a few thoughts I had while implementing table features in pandoc readers and writers. None of the suggestions have been thought through, so take the below with a grain of salt.
Promote RowHeadColumns from TableBody to the full table. The number of row heads will typically be constant for all table bodies; it is also relevant for, and should also apply to, the table head and foot.
Use a grid-based data structure. Going from a grid structure to a list of rows and cells seems much simpler than the other way around. Most writers need to compute the table grid, so it might as well be the main structure. This could provide additional type-level guarantees and make it easier to access cells column-wise, e.g. when checking for the most frequent cell alignment in a column.
A possible data structure would be Array from package array. It is already a transitive dependency of pandoc-types (through deepseq) and is described in the Haskell 2010 report.
HTML limits rowspan to a max value of 65534 (= 2¹⁶ - 2), and it would be reasonable to adopt this limit. RowSpan could then be a newtype wrapper for Word16.
The colspan attribute is limited to a max value of 1000 in HTML. As with rowspan, this seems like a reasonable limit. ColSpan could also wrap Word16.
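The grid and span suggestions above can be sketched with only base and array. This is a minimal illustration, not a proposal for the actual pandoc-types API: the names Grid, mkGrid, columnAlignment, mkRowSpan and mkColSpan are hypothetical, the local Alignment type is a stand-in, the grid is assumed to be rectangular, and the lower bound of 1 for spans is an assumption.

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.List (group, maximumBy, sort)
import Data.Ord (comparing)
import Data.Word (Word16)

-- Stand-in for the alignment type; defined locally so the sketch is self-contained.
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Ord, Show)

-- Hypothetical bounded span types using the HTML limits quoted above.
newtype RowSpan = RowSpan Word16 deriving (Eq, Show)
newtype ColSpan = ColSpan Word16 deriving (Eq, Show)

-- Smart constructors: accept 1..limit (the lower bound is an assumption).
mkRowSpan :: Int -> Maybe RowSpan
mkRowSpan n
  | n >= 1 && n <= 65534 = Just (RowSpan (fromIntegral n))
  | otherwise            = Nothing

mkColSpan :: Int -> Maybe ColSpan
mkColSpan n
  | n >= 1 && n <= 1000 = Just (ColSpan (fromIntegral n))
  | otherwise           = Nothing

-- A table grid indexed by (row, column); here each slot holds only an alignment.
type Grid = Array (Int, Int) Alignment

-- Build a grid from rows, assuming all rows have the same length.
mkGrid :: [[Alignment]] -> Grid
mkGrid rows = listArray ((0, 0), (nRows - 1, nCols - 1)) (concat rows)
  where
    nRows = length rows
    nCols = case rows of
      []      -> 0
      (r : _) -> length r

-- Most frequent element of a list, if the list is non-empty.
mostFrequent :: Ord a => [a] -> Maybe a
mostFrequent xs =
  case [ (length grp, x) | grp@(x : _) <- group (sort xs) ] of
    []     -> Nothing
    counts -> Just (snd (maximumBy (comparing fst) counts))

-- Column-wise access: the most frequent alignment in column c.
columnAlignment :: Grid -> Int -> Maybe Alignment
columnAlignment g c = mostFrequent [ g ! (r, c) | r <- [r0 .. r1] ]
  where
    ((r0, _), (r1, _)) = bounds g
```

With a row-list representation, the same column query would first need the grid to be computed from the spans; with an array it is a direct index lookup.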
Cc: @despresc
The build excludes the latest version of QuickCheck as a dependency, but when that constraint is lifted the build succeeds and works just fine.
pandoc-types/Text/Pandoc/Builder.hs
Lines 196 to 197 in 154b91b
Is there a reason that we need isNull? Many is Foldable, so isn't it always the same as null? Can we mark it as deprecated now and eventually remove it?
The documentation of the ListAttributes tuple does not explain the meaning of the Int.
If I go to http://hackage.haskell.org/package/pandoc-types, I cannot see or open any module such as the Text.Pandoc.Definition module. Some nice Samaritan on #pandoc (IRC) pointed me to the source code (https://github.com/jgm/pandoc-types/blob/master/Text/Pandoc/Definition.hs), which was a big help in finding out the various types used by pandoc and their definitions, but formatted API documentation would be nice to have.
While working on msp-strath/Mary#46, I have noticed that pandoc-types-1.20 has the constraint QuickCheck (>=2.4 && <2.14) but it uses liftShrink in Text.Pandoc.Arbitrary, and that was only introduced in QuickCheck-2.10.
Is it possible to update the dependencies on hackage? Cheers!
The error is:
Text/Pandoc/Definition.hs:97:10: error:
• No instance for (Semigroup Pandoc)
arising from the superclasses of an instance declaration
• In the instance declaration for ‘Monoid Pandoc’
|
97 | instance Monoid Pandoc where
| ^^^^^^^^^^^^^
Text/Pandoc/Definition.hs:106:10: error:
• No instance for (Semigroup Meta)
arising from the superclasses of an instance declaration
• In the instance declaration for ‘Monoid Meta’
|
106 | instance Monoid Meta where
| ^^^^^^^^^^^
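Errors of this shape typically appear with base versions in which Semigroup is a superclass of Monoid, so each Monoid instance needs a matching Semigroup instance. The sketch below shows the general pattern on toy stand-ins; the Meta and Pandoc types here are simplified, hypothetical versions defined locally, not the real definitions from Text.Pandoc.Definition, and the way metadata is combined is an assumption.

```haskell
-- Toy stand-ins for the real types (assumption: simplified for illustration).
newtype Meta = Meta [(String, String)]
  deriving (Eq, Show)

data Pandoc = Pandoc Meta [String]
  deriving (Eq, Show)

-- The Semigroup instance carries the combining operation.
instance Semigroup Meta where
  Meta a <> Meta b = Meta (a ++ b)

-- The Monoid instance only needs to supply the identity; mappend defaults to (<>).
instance Monoid Meta where
  mempty = Meta []

instance Semigroup Pandoc where
  Pandoc m1 bs1 <> Pandoc m2 bs2 = Pandoc (m1 <> m2) (bs1 <> bs2)

instance Monoid Pandoc where
  mempty = Pandoc mempty mempty
```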
The Table Block currently includes the Caption data type, which is made up of an optional ShortCaption. I didn't see any mention of a short caption in the User Guide and also no way to specify one for a table in Pandoc's Markdown. I then assumed that this would be a field utilised by pandoc's LaTeX reader and writer for storing a short caption in LaTeX tables for use in a List of Tables, like so:
\caption[short caption]{long caption}
However, when I try this using an example .tex file containing a table with a caption of the above form, the resulting AST's Table element does not contain a ShortCaption. Hence my assumption that the ShortCaption would be utilised by pandoc's LaTeX reader and writer is wrong. I am thus left wondering what the purpose of the ShortCaption is.
The dropWhileL and dropWhileR functions used in Text/Pandoc/Builder.hs require containers >= 0.3, as you can see by comparing the exports from containers-0.2.1: http://hackage.haskell.org/packages/archive/containers/0.2.0.1/doc/html/Data-Sequence.html. I was going to make the obvious suggestion that "containers >= 0.3" be specified, but I think the intention is for pandoc to build with older versions of base. But it looks like versions of containers with these functions require base >= 4.2, to judge from http://hackage.haskell.org/package/containers-0.3.0.0
pandoc-types/Text/Pandoc/Builder.hs
Lines 280 to 281 in 71c5782
I've been defining a similar instance in pandoc-crossref for a while now. Now I can't due to duplicate instances. But the instance defined here is pretty useless too, because trying to use it will fail every time due to overlapping instances:
• Overlapping instances for ToMetaValue [Char]
Matching instances:
instance ToMetaValue String -- Defined in ‘Text.Pandoc.Builder’
instance ToMetaValue a => ToMetaValue [a]
-- Defined in ‘Text.Pandoc.Builder’
Consider adding an {-# OVERLAPPING #-} pragma:
instance {-# OVERLAPPING #-} ToMetaValue String where
  toMetaValue = MetaString
otherwise, it's very problematic for me and in general.
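To illustrate how the pragma resolves the ambiguity, here is a self-contained sketch with a toy class. The MetaValue type and ToMetaValue class below are simplified local stand-ins, not the definitions from Text.Pandoc.Builder; only the instance-resolution behaviour is the point.

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- Simplified stand-in for the metadata value type.
data MetaValue
  = MetaString String
  | MetaBool Bool
  | MetaList [MetaValue]
  deriving (Eq, Show)

class ToMetaValue a where
  toMetaValue :: a -> MetaValue

instance ToMetaValue Bool where
  toMetaValue = MetaBool

-- More specific instance: wins over the list instance for [Char].
instance {-# OVERLAPPING #-} ToMetaValue String where
  toMetaValue = MetaString

-- General list instance: used for every other element type.
instance ToMetaValue a => ToMetaValue [a] where
  toMetaValue = MetaList . map toMetaValue

-- Without the pragma, this use would be rejected as overlapping.
exampleString :: MetaValue
exampleString = toMetaValue "hello"

exampleList :: MetaValue
exampleList = toMetaValue [True, False]
```

Because String is strictly more specific than [a], GHC picks the String instance once it is marked as overlapping, and the general instance still applies to other lists.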
I get a runtime error with nullAttr (from Text.Pandoc.Definition) in pandoc-types-1.23.1 when I use it within a pandoc filter like the one below:
$ cat myfilter.hs
import Text.Pandoc.JSON
main :: IO ()
main = toJSONFilter block
block (Para _) = Div nullAttr []
block b = b
This leads to an unexpected error instead of the expected result [ Div ( "" , [] , [] ) [] ]:
$ pandoc --filter myfilter.hs -f markdown -t native
foo
(hit ctrl-d)
Error running filter myfilter.hs:
Filter returned error status -11
This filter worked as expected before, though I'm not sure up to which version. I can still get the expected result if I hide the imported definition and redefine nullAttr with the same definition as the original, ("",[],[]).
Does anybody have any clue regarding this error?
-- Sorry in advance if this should have been filed against pandoc itself instead of here.
Can Emph be made to take a custom Attr?
I would like this capability in order to define FontAwesome icons when using Semantic UI, which requires the HTML tag to be <i> (rather than anything else, say <span>).
In other words, can the following be generated from a Pandoc AST?
<i class="icons tag" />
There is an issue compiling pandoc-theorem against pandoc-types-1.23:
pandoc-theorem/app/Main.hs:8:8: error:
• No instance for (Text.Pandoc.JSON.ToJSONFilter
IO
(Text.Pandoc.Definition.Block -> [Text.Pandoc.Definition.Block]))
arising from a use of ‘toJSONFilter’
(maybe you haven't applied a function to enough arguments?)
• In the expression: toJSONFilter $ toList . convertBlock
In an equation for ‘main’:
main = toJSONFilter $ toList . convertBlock
|
8 | main = toJSONFilter $ toList . convertBlock
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The relevant source can be found here.
pandoc-theorem relies on an instance ToJSONFilter IO (Block -> [Block]) which has been removed in 183af9d.
Should this issue be fixed on pandoc-theorem's side or could the instance be re-introduced?
See the main todo list and the relevant issue. I would like to start implementing better table handling in Pandoc. Specifically, I would implement all but the last of these bullet points using one of the designs below (or a modified version of one of them).
I think something like this recently outlined approach is a good way forward for now. The representation is a little loose (any table in the intermediate representation is valid, so there are multiple ways to write a given table, but only one normalized way), but it should allow the readers and writers to be switched more easily. This is a slightly modified version of that approach:
type RowSpan = Int
type ColSpan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type ColWidth = Maybe Double
data CellType = DataCell | HeaderCell
data Cell = Cell Attr CellType (Maybe Alignment) RowSpan ColSpan [Block]
type Row = [Cell]
data Block =
...
| Table Attr Caption ShortCaption [(Alignment, ColWidth)] [Row]
...
The Maybe Alignment on the individual cells allows the cells to override the alignment of the column(s) in which they reside. This makes it easier to specify one's intentions when a cell spans multiple columns with conflicting alignments, and has the advantage of allowing better \multicolumn and \multirow support in the LaTeX reader and writer. It also comes up naturally when one thinks of possible extensions to the supported markdown table formats.
A similar design has the following modifications:
data Cell = Cell Attr (Maybe Alignment) RowSpan ColSpan [Block]
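-- Distinct constructor names, since two types in one module cannot share the constructor Row
data HeaderRow = HeaderRow Attr [Cell]
data BodyRow = BodyRow Attr [Cell] [Cell]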
data Block =
...
| Table Attr Caption ShortCaption [(Alignment, ColWidth)] [HeaderRow] [BodyRow] [HeaderRow]
...
This has the advantage of making explicit the table head/body/foot and row head/body structure that seems to be assumed in the first approach, where the first entirely header rows become the table head, and the last such rows become the table foot. Cells in the head and foot sections would correspond to th cells, and cells in the body section would correspond to td cells. It does not require a CellType, but one could still be added, making these even more similar to HTML tables. This approach has the disadvantage of making the table representation more complex.
I assume that the tables are normalized (laid on a grid with a given width so that overlapping cells and empty spaces can be dealt with in the table) like so, informally:
If a cell would extend past the right edge of the grid, its width (ColSpan) would be lowered to fit. If it would extend past the bottom of the grid, its height (RowSpan) would be lowered to fit.
The table head, table foot, row head (the list of row head sections without the row body), and row body (the list of row body sections without the row head) should be normalized independently in any design where these exist (implicitly in the first, or explicitly in the second). The overall table width would be the length of the [(Alignment, ColWidth)] list, and the row head and row body widths would add up to that width. (The row head width would be the width of the first row in the row head.)
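The "lowered to fit" step of the normalization can be sketched as a clamp on the spans. This is only an illustration under assumptions: the Placed record and clipToGrid are hypothetical names, positions are zero-based, the cell's top-left corner is assumed to lie inside the grid, and resolving overlaps between cells is not covered.

```haskell
-- Hypothetical placement of a cell: top-left corner plus its spans.
data Placed = Placed
  { cellRow  :: Int  -- zero-based row of the top-left corner
  , cellCol  :: Int  -- zero-based column of the top-left corner
  , rowSpan  :: Int
  , colSpan  :: Int
  } deriving (Eq, Show)

-- Lower the spans of a cell so that it stays within a grid of the
-- given height and width. Assumes the corner itself is inside the grid.
clipToGrid :: Int -> Int -> Placed -> Placed
clipToGrid nRows nCols p =
  p { rowSpan = clamp (nRows - cellRow p) (rowSpan p)
    , colSpan = clamp (nCols - cellCol p) (colSpan p)
    }
  where
    -- Keep the span between 1 and the room left in that direction.
    clamp room s = max 1 (min room s)
```

For example, in a 3x4 grid a cell at row 2, column 3 with spans of 5 in both directions is reduced to a 1x1 cell, while a cell that already fits is left unchanged.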
Issue 332 in pandoc (jgm/pandoc#332) requires an additional data field that goes with an Image.
Allowed values would be integers between 0 and 100, or maybe 200.
If empty, 100 is default.
It would be nice to be able to walk an AST and perform a replacement or query on all attributes. (We need to do that, e.g., in the EPUB writer.)
The instances wouldn't be hard to write, but probably this would require putting Attributes in a newtype or data type.
Discussion jgm/pandoc#463 (comment)
Instead of doing e.g.
data Block = ...Block...
what if we did
data BlockF block = ...block...
deriving Functor
newtype Block = MkBlock (BlockF Block)
This trick allows ASTs to be extensible, among other benefits.
I generally like this approach, but I had a specific problem in mind that would benefit from it, which I would like to share. https://github.com/obsidiansystems/dombuilder-pandoc/blob/master/src/Reflex/Dom/Builder/Pandoc.hs is some code that translates the Pandoc AST into reflex-dom "dom builder actions" in order to use it within a website built with reflex-dom.
I would like to have my own custom handling of the pandoc AST (e.g. parsing relative URLs in links into a Route AST) without having to copy and paste that function. But if we do the above "functored AST" approach, I can replace
block :: DomBuilder t m => Block -> m ()
with
blockF :: DomBuilder t m => BlockF a -> m ()
or
blockF :: DomBuilder t m => BlockF (m ()) -> m ()
These can be thought of as specialized versions of traverse_ or sequence_, and they are very nice to work with!
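A small self-contained sketch of the "functored AST" idea follows. The three-constructor BlockF here is a toy, not the real pandoc AST, and foldBlock, countParas and render are hypothetical helper names; the point is only that a handler of type BlockF a -> a sees its children already processed.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Toy base functor: recursive positions are the type parameter.
data BlockF block
  = Para String
  | BlockQuote [block]
  | Div [block]
  deriving (Functor)

-- Tie the knot to recover an ordinary recursive type.
newtype Block = MkBlock (BlockF Block)

-- Generic bottom-up fold: process children first, then the node itself.
foldBlock :: (BlockF a -> a) -> Block -> a
foldBlock alg (MkBlock b) = alg (fmap (foldBlock alg) b)

-- A handler in the style of blockF: children arrive as results already.
countParas :: Block -> Int
countParas = foldBlock alg
  where
    alg (Para _)        = 1
    alg (BlockQuote ns) = sum ns
    alg (Div ns)        = sum ns

-- Another handler over the same traversal, producing text.
render :: Block -> String
render = foldBlock alg
  where
    alg (Para s)        = s
    alg (BlockQuote ss) = "> " ++ unwords ss
    alg (Div ss)        = unwords ss
```

Swapping in a different handler (for example one that returns a monadic action) does not require rewriting the traversal, which is the reuse the comment above is after.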
I stumbled upon the error within a table, and then I found out that it also applies to bullet lists, which contain lists of lists of blocks like the tables.
I was not able to extend the property tests in order to reproduce this error. As far as I can see, arbitrary block lists are correctly generated, so this case should appear sooner or later. One hypothesis is that everywhere, which is used to double-check the correctness of transformations, has the same error ... but that seems unlikely.
In any case, I could pinpoint the error with some specific tests, and any suggestion about how to improve them is welcome.
NFData seems to have been removed in the new version, and I've tried to use the one from deepseq 1.4.1.1, but got the following error:
[1 of 5] Compiling Text.Pandoc.Generic ( Text/Pandoc/Generic.hs, dist/build/Text/Pandoc/Generic.o )
[2 of 5] Compiling Text.Pandoc.Definition ( Text/Pandoc/Definition.hs, dist/build/Text/Pandoc/Definition.o )
Text/Pandoc/Definition.hs:70:23: parse error on input `|'
I'm still very new to Haskell. Thank you; support for the new deepseq-generics would be really appreciated :)
We have a Block constructor Null. I think this was added before we were using Builder in the readers; I wonder if there's any reason for it to exist. Note that there's nothing like this for Inline.
1f4e239 changed the license to BSD however the source headers still say GPL:
./Text/Pandoc/JSON.hs:23: License : GNU GPL, version 2 or above
./Text/Pandoc/Definition.hs:24: License : GNU GPL, version 2 or above
./Text/Pandoc/Builder.hs:25: License : GNU GPL, version 2 or above
./Text/Pandoc/Walk.hs:29: License : GNU GPL, version 2 or above
./Text/Pandoc/Generic.hs:22: License : GNU GPL, version 2 or above
Could you please update them?
I'm writing a Pandoc filter. Within my filter, I want to run Pandoc on another file to extract some information. I didn't find the right way to do this in the Pandoc manual or the Haddock documentation, so I made my best guess. Unfortunately, the metadata isn't parsed correctly.
Here's a file (test.md) I want to parse from within my code:
---
title: This is a test
...
Hello world!
Here's a tiny program demonstrating the issue:
import Text.Pandoc.Class (PandocIO, readFileStrict, runIOorExplode)
import Text.Pandoc.JSON (Pandoc)
import Text.Pandoc.Readers (readMarkdown)
import Text.Pandoc.Options (def)
import Text.Pandoc.UTF8 (toText)
main :: IO ()
main = do
  doc <- runIOorExplode $ readTitle "test.md"
  putStrLn . show $ doc
readTitle :: FilePath -> PandocIO Pandoc
readTitle f = do
  s <- readFileStrict f
  let t = toText s
  md <- readMarkdown def t
  return md
When I run the program, I get:
Pandoc (Meta {unMeta = fromList []}) [HorizontalRule,Para [Str "title:",Space,Str "This",Space,Str "is",Space,Str "a",Space,Str "test",SoftBreak,Str "..."],Para [Str "Hello",Space,Str "world!"]]
But if I parse it from the command line, Pandoc correctly picks up the metadata.
$ pandoc -s -t native test.md
Pandoc (Meta {unMeta = fromList [("title",MetaInlines [Str "This",Space,Str "is",Space,Str "a",Space,Str "test"])]})
[Para [Str "Hello",Space,Str "world!"]]
Hi everyone !
Until recently I worked under the assumption that a pandoc-types datatype with a single constructor (say Format) had its type erased from the JSON representation: instead of {"t": type, "c": content}, the representation was simply content.
I think that (maybe with the exception of Meta?) this assumption was valid until the recent changes to the document model. Now, AFAICT, some types with a single constructor have their types erased and some don't. I thought for a moment that the difference was that some were declared with the newtype keyword (type erasure) and some with the data keyword (no type erasure), which would make sense (if I understand correctly the difference between the two keywords in Haskell), but this second hypothesis doesn't hold either.
Could anyone explain to me whether there is a simple rule, based on the definition of pandoc types, that says if the type of the data will be erased in the JSON representation?
AFAICT, Format data (newtype) has its type erased in JSON, but RowSpan data (newtype) has its type serialized. Cell data (data) also has its type serialized. Unfortunately, I don't know enough Haskell to pinpoint what parts of the code explain the difference between these cases ...
The context: I have developed a Python library (https://github.com/boisgera/pandoc) that reads the pandoc-types data models (for as many versions of pandoc as possible) to automatically reproduce the equivalent hierarchy of classes in Python, so that JSON data can be exchanged with the available pandoc executable to work with a pandoc document representation in Python. The target audience is people (first and foremost: me 😉) who need to analyze and transform a document with a nice AST and are fluent in Python but not so much in Haskell (or in Lua). To continue to do that, I need to be able to infer automatically, from the output of :browse Text.Pandoc.Definition in ghci, the JSON serialization rule for each data type. This is why a simple and mechanical rule would help!
Cheers,
SB
pandoc-types/Text/Pandoc/Definition.hs
Line 196 in dd50a6e
I'm not sure what your policy is on breaking changes, but when/if you next plan to do a breaking release, I think we should consider changing this to type Attr = (Text, [Text], M.Map Text Text).
(this issue is copied from jgm/pandoc#9480)
Since we have ListAttributes for OrderedList:
OrderedList ListAttributes [[Block]]
type ListAttributes = (Int, ListNumberStyle, ListNumberDelim)
why not add ListAttributes for BulletList as well? The API of BulletList is just:
BulletList [[Block]]
It would be great to add the ListNumberStyle attribute so that the <ul> tag can be rendered as <ul style="list-style-type: circle;">.
In my project, I want to distinguish between - lists and * lists so that I can style them differently, but I can't do that now.
There is a release 1.20 on GitHub, but the output of git tag does not include the 1.20 tag; it seems to be missing in git. The commit behind release 1.20 does not appear to be part of any branch; trying to check out the commit id fails.
I have a complex html document that I've read into pandoc, and I'm trying to write filters that will isolate the content I'm after. Some examples of this are dropping certain Divs entirely, or replacing Tables by just the content of their rows.
I can write filters of the type Pandoc -> Pandoc, which is workable for changing the top-level structure of a document, but would become very tedious when Blocks are nested. I could also write my functions to return Null :: Block when removing Blocks, but that doesn't feel like the right way to do it. Or is that perhaps precisely why Null is there in the first place?
I'd like to be able to write functions of type [Block] -> [Block] to use in filters, but I get the error
No instance for (Text.Pandoc.Walk.Walkable [Block] Pandoc).
I tried to think about how to write that instance, but it's hard to combine walking lists of Blocks with applying the function to them.
So I feel I'm either missing something obvious, or going about it in the wrong way. Is there perhaps a simple solution to what I want to do? (Sorry if this is a stupid or naive question)
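The idea of applying a [Block] -> [Block] function at every nesting level can be sketched on a toy AST. Everything below is hypothetical and defined locally: the two-constructor Block type, walkBlockLists and dropDivs are illustrations, not the Text.Pandoc.Walk API, and the string on Div stands in for a class name.

```haskell
-- Toy block type: a Div carries a class name and nested blocks.
data Block
  = Para String
  | Div String [Block]
  deriving (Eq, Show)

-- Apply a list-level transformation bottom-up: first rewrite the
-- children of every Div, then apply the function to the list itself.
walkBlockLists :: ([Block] -> [Block]) -> [Block] -> [Block]
walkBlockLists f = f . map descend
  where
    descend (Div cls bs) = Div cls (walkBlockLists f bs)
    descend b            = b

-- Example list-level filter: remove every Div with the given class.
dropDivs :: String -> [Block] -> [Block]
dropDivs cls = filter keep
  where
    keep (Div c _) = c /= cls
    keep _         = True
```

Because the function works on lists, a block can be removed outright at any depth instead of being replaced by a placeholder such as Null.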
The introduction of Attributes in Images and Links in version 1.16 should be documented in Changelog.
It would be nice if we could also use a StateT monad transformer when walking through the AST, and thus keep track of the section and update the state based on content that we see when "walking" over the tree.
Hello,
I would like to inquire about the possibility of increasing backward and forward compatibility in pandoc-types. For example, in pandoc-types 1.21, changes to the AST have broken the way a few filters work. The new Table block element is not crucial for many filters (e.g. pandoc-plot or pandoc-include-code), so the breakage is artificial.
I propose that the next time the AST is changed, a new block element, Unknown, be added. Then, the function toJSONFilter could be changed so that instead of throwing an error on an incompatible pandoc-types version, the Unknown block element could be used as a placeholder for AST blocks that are not decoded appropriately.
This would allow filters to be compatible with a wider range of pandoc-types and pandoc versions.
Let me know what you think.
Today, I tried to build pandoc via cabal on Docker Hub (which worked before, a few weeks ago), but now I only get an error. I've isolated the problem to pandoc-types, but unfortunately I can't make any sense of the error messages. The Docker container builds correctly on my devices.
This is the output of the latest build: https://hub.docker.com/r/thriqon/full-pandoc/builds/be4dayresfnxycbghqowa65/ (near the bottom is the debug output produced with -v3).
The Dockerfile is this one: https://hub.docker.com/r/thriqon/full-pandoc/~/dockerfile/.
Many thanks in advance for any help!
With such restrictions, it's not possible to use this version. Is this on purpose? I also don't see any mention in the changelog that this version is deprecated.
Hello!
It's my semi-annual "get pandoc building with the new GHC RC" extravaganza.
It looks like it's as simple as adding DeriveTraversable to the set of language pragmas for Text.Pandoc.Builder (assuming a suitably patched version of regex pcre builtins).
I still have 1-2 issues I need to sort out for pandoc proper, and will report those suitably too.
It is possible to convince GHC that a collection of patterns is complete, say by writing {-# COMPLETE Format :: D.Format #-} in Legacy.Definition. This would make it unnecessary to silence the incomplete pattern match warnings. I did not know that when writing the modules, and by the time I got to pandoc it was easier to silence the warnings while working rather than add the pragmas. I can write a request that adds them, if that would be welcome.
Just now I tested them on the pandoc commit right after I switched it to Legacy.* and they did suppress the warnings, but I got other warnings like
Pattern match checker exceeded (2000000) iterations in
an equation for ‘fixBlocks’. (Use -fmax-pmcheck-iterations=n
to set the maximun number of iterations to n)
|
846 | let fixBlocks (b : CodeBlock attr x : rest)
so it may not be a perfect solution.
Walk is a bit of a mess. It seems we should be able to do something more elegant, using recursion schemes or something.
For motivation see jgm/pandoc#7130.
Here we have walk fixLinks where fixLinks is [Inline] -> [Inline].
It works fine if applied to [Inline]. However, it behaves differently if you apply it to Inlines. It seems that instead of having instances specifically for lists, we should have general instances that work for all Traversable/Foldable structures, including Many.
See what https://github.com/reflex-frp/reflex/blob/develop/.github/workflows/haskell.yml does, for example, so that the dependencies are not built every time.