jhellingman / tei2html

XSLT stylesheets to convert TEI to HTML and ePub format.

License: GNU General Public License v3.0

Perl 15.59% XSLT 33.25% CSS 1.94% HTML 32.20% Max 12.85% C 0.83% Makefile 0.01% Batchfile 0.01% TeX 0.01% TSQL 0.31% Prolog 2.77% Raku 0.23%

xslt tei gutenberg epub html

tei2html's Issues

Handle @rendition attribute with multiple references (separated by spaces)

tei2html/css.xsl

Line 455 in a34ac7a

    
           <xsl:if test="not($node/ancestor::node()[last()]//tagsDecl/rendition[@id = $renditionId or @xml:id = $renditionId])">
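The check quoted above compares a single id. With several space-separated references, each token has to be looked up on its own. A minimal sketch of that lookup in Python (not the stylesheet code; the function name is hypothetical, and it assumes references may carry a leading '#'):

```python
def missing_renditions(rendition_attr, declared_ids):
    """Return the referenced rendition ids that are not declared in tagsDecl."""
    # @rendition may hold several references separated by whitespace,
    # each optionally written as a '#id' pointer (assumption).
    refs = [token.lstrip("#") for token in rendition_attr.split()]
    return [ref for ref in refs if ref not in declared_ids]
```

In XSLT 2.0 the same split could presumably be done with tokenize() on the attribute value before the existing test is applied to each token.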

Move generic copy template to preprocess.xsl

tei2html/normalize-table.xsl

Line 77 in 73af97e

<xsl:template match="@*|node()">

Cover image not always recognized in generated ePub

Open the ePub in various ePub readers. Depending on the device, the cover
image is not neatly rendered or not recognized at all.

Try to find out reasonable defaults:

http://blog.threepress.org/2009/11/20/best-practices-in-epub-cover-images/

Note: the Mobi format seems to use an extension:

<x-metadata><Cover>images\cover.jpg</Cover></x-metadata>

Original issue reported on code.google.com by jhellingman on 25 Apr 2010 at 9:50

Multiple section heads concatenated without spaces

What steps will reproduce the problem?
1. Create a TEI document with a chapter with two head elements (two-line
heading)
2. Generate an ePub from it.
3. Look at the table of contents in the left-hand pane (using the Calibre
viewer)

What is the expected output? What do you see instead?
Expected: multi-line headings are kept separate. Actual: they are joined together without spaces.
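One possible fix is to normalize each head separately and join the results with a space when building the TOC label. A rough sketch in Python, assuming un-namespaced TEI with head children directly under the division (toc_label is a hypothetical helper, not part of tei2html):

```python
import xml.etree.ElementTree as ET

def toc_label(div):
    """Build a single-line TOC label from all <head> children of a division."""
    heads = []
    for head in div.findall("head"):
        # Collapse internal whitespace of each head before joining.
        text = " ".join("".join(head.itertext()).split())
        if text:
            heads.append(text)
    # Join with a space so two-line headings do not run together.
    return " ".join(heads)
```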

Original issue reported on code.google.com by jhellingman on 7 Dec 2009 at 7:48

All IDs are generated

Currently, all IDs used for internal cross references in the output HTML
are generated, and thus are meaningless alphanumeric strings, whereas ids
in the source TEI are often manually added and have sensible meanings. It
would help to retain these IDs in the HTML output as well, and only
generate IDs when no ID is present in the TEI. This would also help to make
the HTML more stable, as now, every re-run of the stylesheet may generate
different IDs.
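The idea can be sketched as follows: keep the source id when there is one, and otherwise derive an id from something stable, such as the element's position in the document. This is an illustration in Python only; output_id and the "x" prefix are assumptions, not the stylesheet's behaviour.

```python
import hashlib

def output_id(source_id, element_path):
    """Prefer the id from the TEI source; otherwise derive a stable one."""
    if source_id:
        return source_id
    # Hash of the element's position in the document: the same input
    # yields the same id on every run, unlike a per-run generated id.
    digest = hashlib.sha1(element_path.encode("utf-8")).hexdigest()[:8]
    # Prefix with a letter so the result is a valid XML name.
    return "x" + digest
```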

Original issue reported on code.google.com by jhellingman on 1 Nov 2006 at 1:32

ePub support

We need to support ePub, for easier deployment of ebooks to eReaders.

Since ePub is basically XHTML with a subset of CSS 2.0, we need to do the
following:

* Generate valid XHTML
* Replace unsupported CSS features with supported constructs (Where this
leads to loss in functionality in the HTML version, this should depend on
an ePub switch.)
* Generate the required metadata files for ePub.
* Package the whole into an ePub-compliant zip archive.
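For the packaging step, the main point to watch is that the container expects the mimetype entry first and uncompressed. A minimal sketch in Python (package_epub and the shape of its files argument are illustrative assumptions, not part of tei2html):

```python
import zipfile

def package_epub(epub_path, files):
    """Write an ePub container; `files` maps archive names to bytes."""
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The OCF container format expects 'mimetype' as the first entry,
        # stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

The files mapping would hold META-INF/container.xml, the OPF and NCX metadata files, and the XHTML, CSS and image content.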

Original issue reported on code.google.com by jhellingman on 12 Nov 2009 at 2:14

Introduce page-lists in generated NCX file.

A proposed new feature to replace the non-standard Adobe page-map.

See
http://www.epubbooks.com/blog/20081209/marking-up-page-numbers-in-the-epub-ncx/

Original issue reported on code.google.com by jhellingman on 15 Jan 2010 at 7:57

Add support for running headers and footers.

Add support for running headers and footers.

See http://wiki.mobileread.com/wiki/EBook_Publisher for some implementation
hints, but realize that <header> and <footer> are not valid HTML tags.

Do something like:

in CSS:

 .pagehead {display:none; display:oeb-page-head}
 .pagefoot {display:none; display:oeb-page-foot}

And in HTML at opportune locations:

<div class="pagehead">Text of running header</div>

Original issue reported on code.google.com by jhellingman on 22 Apr 2010 at 7:30

Integrate table-normalization xslt into tables.xsl

Currently, the table-normalization step is done in a separate transformation 
before the main transformation. This requires additional glue code in Perl to 
run this transformation. To simplify this, this should be done at the same time 
the tables are formatted in the tables.xsl.

Currently, a simple integration of the table-normalization into the main 
stylesheet leads to a number of unexpected results, either due to other 
(unrelated) templates matching, or some other overlooked complication, which 
makes the trivial integration step (using a temporary node-tree in a variable) 
incorrect so far.

Original issue reported on code.google.com by jhellingman on 10 Sep 2014 at 5:29

Add support for PGTEI

Tei2html does not support PGTEI, as used by Project Gutenberg volunteers.

Minimal support will be needed for the following:

Understand unnumbered <div> elements. (DONE)
Understand plain CSS in @rend attributes. (should be a flag, consider pre-processing and use of @style attribute; @style support DONE.)
Understand <pgExtensions> elements.
Handle <q> elements by inserting quotation marks. (This should be a flag; a big issue here is that <q> elements are often used as wrappers for elements that do not directly fit the structure of a TEI document, so as to make them valid. We somehow need to distinguish those uses from the intended usage of text between quotation marks.)
Handle the various <divGen> types. (Partly DONE)

Original issue reported on code.google.com by jhellingman on 24 Mar 2011 at 10:50

Link back to TOC should point at relevant entry

Currently, the generated HTML contains a link back to the TOC. This points to 
the top level of the TOC. It would be neater if that link pointed to the 
relevant entry in the TOC when possible.

Two cases:

<divN id=toc>:

- For encoded pre-existing tocs, we need to look at the element with a target 
to the current chapter in the toc, and link to that. (Assuming we have one, if 
we have more, linking to the first will do.)

<divGen type=toc>:

- For generated tocs, we need to generate the links back to the toc, using 
generated ids that we will know of.

Third case: no toc. No link back will be generated.

Original issue reported on code.google.com by jhellingman on 24 Sep 2012 at 2:52

Errors by epubCheck


1. value of attribute "http-equiv" is invalid; must be a string matching the 
regular expression 
"([Dd][Ee][Ff][Aa][Uu][Ll][Tt]\-[Ss][Tt][Yy][Ll][Ee])|([Rr][Ee][Ff][Rr][Ee][Ss][Hh])"

See: https://code.google.com/p/epubcheck/issues/detail?id=135

Use <meta charset="utf-8" /> instead.

2. Obsolete or irregular DOCTYPE statement. External DTD entities are not 
allowed. Use '<!DOCTYPE html>' instead.

Tricky thing with HTML5 doctype (some post-processing)?

3. attribute "summary" not allowed here; expected attribute "accesskey", "... 
or "xml:space" (with xmlns:ns1="http://www.idpf.org/2007/ops" 
xmlns:ns2="http://www.w3.org/2001/10/synthesis")

Remove the summary attribute from tables.
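As an illustration of the third fix, the summary attribute could be dropped in a post-processing pass over the generated XHTML. A sketch in Python using the standard library parser (strip_table_summaries is a hypothetical helper; in tei2html itself the attribute would more likely be omitted in the table templates):

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

def strip_table_summaries(xhtml_text):
    """Return the XHTML with the summary attribute removed from all tables."""
    # Serialize XHTML elements without a namespace prefix.
    ET.register_namespace("", XHTML)
    root = ET.fromstring(xhtml_text)
    for table in root.iter("{%s}table" % XHTML):
        table.attrib.pop("summary", None)
    return ET.tostring(root, encoding="unicode")
```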

Original issue reported on code.google.com by jhellingman on 6 May 2014 at 2:19

Generate section with external references

Documents contain external references as hyperlinks.

In some formats those hyperlinks are live and can be followed; in others (print) 
they are not. In those cases it should be possible to collect them in a 
dedicated section generated by

 <divGen type="ExternalReferences"/>

This section includes all external references:

* Each reference occurs only once in the list.
* Each reference links back to the source(s) (as a link or by page number, 
depending on the output format).
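The two requirements amount to grouping links by target while remembering where each one was used. A minimal sketch in Python, assuming the links have already been extracted as (url, source anchor) pairs in document order (the function name and data shape are illustrative only):

```python
def collect_external_references(links):
    """Group external links by URL, keeping first-seen order.

    `links` is a list of (url, source_anchor) pairs in document order.
    """
    references = {}
    for url, anchor in links:
        if not url.startswith(("http://", "https://")):
            continue  # internal cross-reference, not listed
        # One entry per URL; every use is recorded for the back-links.
        references.setdefault(url, []).append(anchor)
    return references
```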

Original issue reported on code.google.com by jhellingman on 27 Aug 2010 at 9:26

Incorrect extraction of Korean characters

What steps will reproduce the problem?
1. I don't know exactly how to explain this, so please see the attached file.

What is the expected output? What do you see instead?
Expected: correct characters.

Actual: only half of the characters come out correctly. I'm not familiar with Perl.

What version of the product are you using? On what operating system?


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 13 Oct 2008 at 7:01

Deployment is complex

Currently, deploying this software is rather complex, requiring

* Download of various tools.
* Adjusting lines of code in various tools to point to the actual tool
locations.
* Heavy use of command-line interface.

Not all of this is avoidable:

* Not all tools can be bundled, because of licensing.
* Configuration needs to be done.

However, I can prepare compiled executables, and an archive of all required
files in the right locations to make life easier for those wishing to use
these XSLT files.

Original issue reported on code.google.com by jhellingman on 12 Nov 2009 at 2:07

Improve accessibility of generated HTML output

Add semantic annotations to elements.

See details at the IDPF

Original issue reported on code.google.com by jhellingman on 14 May 2014 at 9:50

Need messages.xml to .po conversion script

Write a short XSLT stylesheet that accepts the messages.xml file, and dumps
a .po file, and a Perl script that can reverse that process.

Put those .po files in a special directory

This will enable the use of transifex.net for translations.

Original issue reported on code.google.com by jhellingman on 19 Apr 2010 at 11:16

Mark corrections in footnotes as such in list of corrections

Mark corrections in footnotes as such in list of corrections.

That is, say "page 23 (footnote)" or similar if the correction appears in a 
footnote on that page.

Original issue reported on code.google.com by jhellingman on 14 Feb 2011 at 2:56

TEI P5: Add support for <graphic> element.

The <graphic> element replaces ad-hoc methods of including references to external images in P3.

Support element.
Support nested elements. (now non-standard element used as temporary hack)
Make sure elements for which the related image is not available can be suppressed in generated output.
Make sure legacy P3 files continue to be processed correctly (either directly or using a pre-processing step).

Including high-resolution images for high-resolution output devices

The current code is developed for producing web-based editions with 
low-resolution images. When we also want to produce versions suitable for 
printing, we need ways to also specify alternative, high resolution images that 
can be printed, preferably without changes to the master files themselves.

The idea is to pull the images from an alternative path containing the high 
resolution versions.

The idea is to have a hires/ folder next to images/, which should contain the 
high resolution versions of the low resolution images. In images.xml, we 
collect all information about both, and verify this is the case.

Now, when generating a PDF for print, we use the hires instead of images folder 
as source.
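The folder swap itself is simple. A sketch in Python, assuming image references are relative paths that start with images/ (image_source is a hypothetical helper; the consistency check against images.xml is not shown):

```python
from pathlib import PurePosixPath

def image_source(path, for_print):
    """Map an images/ path to its hires/ counterpart for print output."""
    parts = PurePosixPath(path).parts
    if for_print and parts and parts[0] == "images":
        # Same relative name, but taken from the high-resolution folder.
        return str(PurePosixPath("hires", *parts[1:]))
    return path
```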

Original issue reported on code.google.com by jhellingman on 27 Aug 2010 at 9:38

Can't locate SgmlSupport.pm

$ git clone https://github.com/jhellingman/tei2html
$ cd tei2html/tools
$ perl tei2html.pl
Can't locate SgmlSupport.pm in @INC (you may need to install the SgmlSupport module) (@INC contains: /usr/local/lib64/perl5/5.30 /usr/local/share/perl5/5.30 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at tei2html.pl line 14.
BEGIN failed--compilation aborted at tei2html.pl line 14.

Why isn't Perl loading the file from the current directory?

Validation issues on FlightDeck

These files use language codes that are not supported.

Validation:

MythsOfTheCherokee-pl-05.xhtml contains: la-x-bio
The xml:lang and lang attributes must use language codes that conform to 
[RFC5646]. The official list of language codes can be found here.

Best Practices:

Much of the Table of Contents is written with all capital letters. We recommend 
using mixed case.

Text color should not be set to black using CSS, because the text will be 
unreadable in Night Reading Mode on some devices. This EPUB has the following 
CSS rules which set text to black:

MythsOfTheCherokee.css – .transcribernote
MythsOfTheCherokee.css – .advertisment
MythsOfTheCherokee.css – body, a.hidden


This EPUB contains content files over 300 KB in size. While there is no formal 
standard regarding individual file sizes, larger file sizes can slow down the 
loading and response time on older eBook devices. Also, some major devices in 
the marketplace use the Adobe RMSDK, and that display engine has a hard limit 
of 300 KB on all content files. Please consider breaking these files into 
multiple parts to keep them under this limit.

See this file: MythsOfTheCherokee-ch2.xhtml
See this file: MythsOfTheCherokee-ch5.xhtml
See this file: MythsOfTheCherokee-ch6.xhtml
See this file: MythsOfTheCherokee-ix.xhtml
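A check like this can be reproduced locally before uploading. A minimal sketch in Python, assuming the 300 KB limit applies to the uncompressed size of each XHTML or HTML content file (the function name is illustrative):

```python
import zipfile

LIMIT = 300 * 1024  # 300 KB, the limit quoted in the report

def oversized_content_files(epub, limit=LIMIT):
    """List content files in an ePub whose uncompressed size exceeds `limit`."""
    with zipfile.ZipFile(epub) as archive:
        return [info.filename for info in archive.infolist()
                if info.filename.endswith((".xhtml", ".html"))
                and info.file_size > limit]
```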


This EPUB does not define the beginning of its main content. Some reading 
systems use this setting when a reader opens your book for the first time or 
wants to navigate to the beginning of the book.

Please see our Handbook for an example showing how to define the start location.

http://ebookflightdeck.com/handbook/startlocation



We recommend adding a landmark navigation section to your nav document. Some 
retailers recommend specifying the location of the cover image, table of 
contents, and the start of the text in the landmarks.

Please see our Handbook entry for a template landmarks section, and the 
Retailer Grid for more information about retailer requirements.

http://ebookflightdeck.com/handbook/landmarks

Original issue reported on code.google.com by jhellingman on 13 Jun 2014 at 11:40

Make handling of rend-attributes consistent

Currently the rend attributes are used in a somewhat haphazard way. The
idea is to make them map to CSS 2.0 features in a consistent way.

Basically, we can recognize three types of values in the rend attributes

1. Shortcut values, for example rend="sc" to make the content of an element
small caps. This typically works on selected elements only.
2. tei2html specific values, for example rend="image(example.gif)" to
achieve certain rendering effects during the transformation from TEI to HTML.
3. CSS pass-through values, for example rend="background-color(red)", which
is translated to a CSS style-sheet rule background-color: red.
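All three kinds of value share one surface syntax: a name, optionally followed by an argument in parentheses. A sketch of parsing that syntax and applying the pass-through case, in Python (the regular expression and function names are assumptions for illustration, not the stylesheet's actual parser):

```python
import re

# A rend item is a name, optionally followed by an argument in parentheses.
REND_ITEM = re.compile(r"([\w-]+)(?:\(([^)]*)\))?")

def parse_rend(rend):
    """Split a rend value into (name, argument) pairs; argument may be None."""
    return [(m.group(1), m.group(2)) for m in REND_ITEM.finditer(rend)]

def css_declarations(rend, css_properties):
    """Translate the pass-through items of a rend value into CSS declarations."""
    return "; ".join(
        f"{name}: {value}"
        for name, value in parse_rend(rend)
        if name in css_properties and value is not None
    )
```

Shortcut values such as sc come back with a None argument, and tei2html-specific values such as image(...) are left out of the CSS simply because their names are not in the set of known properties.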

Currently, the rend attributes are not handled for all elements. (Although
CSS style-sheet rules are produced for them.)

Original issue reported on code.google.com by jhellingman on 21 Jan 2010 at 2:05

Add support for TEI P5

The tei2html stylesheets have been developed with the now very old TEI P4 version of TEI. The stylesheets should also work with TEI P5 and later documents. For this, a few changes need to be made:

Handle TEI top-level element (DONE)
Handle new style of REF elements, for both internal and external links (DONE)
Handle CHOICE elements for corrections, etc. (also list of errata needs to be adjusted). (DONE)
Handle removal of DIV0 element (will work by default, but may need some tweaking).
Correctly handle metadata (when dealing with TEI namespace). (DONE)
Correctly handle TEI documents that use the TEI namespace. (DONE)
Correctly handle external references in table of external references.
Verify usage of xml:id and xml:lang attributes is correct.

TEI P5: Support nested TEI documents

For multivolume works, it is sometimes necessary to nest the top-level element. Support this in a reasonable way.

Use main TEI header for over-all output metadata.
Mostly ignore subsequent TEI headers, but generate a clear demarcation in the output.
Make sure that IDs for items that used to be unique within a single document remain unique in this case.

Footnotes outside div1 handled inconsistently

Footnotes are typically placed at the end of the div1 element they appear
in. This works fine, except when footnotes appear outside a div1 element,
in which case they have nowhere to go.

For this a number of solutions are possible.

1. Place at end of div1 (this may place some footnotes wildly out-of-order)
2. Place at end of chunk of div0 before its first div1.
3. Insert using explicit <divGen type=footnotes> element.

Original issue reported on code.google.com by jhellingman on 7 Dec 2009 at 7:46

Page numbers are not correctly placed in margin when they appear in verse

What steps will reproduce the problem?
1. Create a TEI file in which an <lg> element appears.
2. Create a <pb n="123"> tag somewhere in the <lg> element.
3. Render with the tei2html stylesheet.

What is the expected output? What do you see instead?

Expected: The page number is rendered in the right margin, aligned like all
other line numbers.

Actual: The page number is not fully in the right margin.

Original issue reported on code.google.com by jhellingman on 19 Oct 2006 at 10:07

Inclusion of images is complex


The inclusion of images is currently complex:

* image file names for images are derived from various types of information
** id attribute
** url attribute
** rend attribute
* File metadata is collected in, and retrieved from an imageinfo.xml file
generated on the fly.
* 'Standard' mechanisms are not used.
* Different images for use in screen and print output are not supported.
* XSLT code concerned is complex.

Need to investigate proper way to include illustrations in generated HTML,
and refactor the related code.

Original issue reported on code.google.com by jhellingman on 12 Nov 2009 at 2:11

Add support for ePub 3.1

ePub 3.1 has been out since January 2017. For details see http://www.idpf.org/epub/31/spec/epub-spec.html, and in particular http://www.idpf.org/epub/31/spec/epub-changes.html.

For now, the intent is to put this behind a setting.

Cleanup CSS for ePub generation

The current CSS stylesheets used contain various constructs that are not 
consistently supported in most ePub readers, to wit:

* margin: auto
* position: relative, absolute
* background-image: <url>
* float: <position>

These need to be replaced by constructs that do work even in simple ePub 
devices.

Original issue reported on code.google.com by jhellingman on 5 Jul 2010 at 12:11

Move setting of id to img tag, so SVG will be rendered with correct size.

tei2html/formulas.xsl

Line 85 in 474e1b1

    
           <xsl:copy-of select="f:set-class-attribute-with(., concat(f:formulaPosition(.), 'Math'))"/>

Title Page not rendered nicely in ADE

What steps will reproduce the problem?
1. Open any generated ePub in Adobe Digital Editions.
2. Look at the title page, it doesn't look nice, and is sometimes even
completely white.

What is the expected output? What do you see instead?

A neat looking titlepage, also recognizable as thumbnail.


The idea here is to do the following.

1. Some rules to recognize content that can be used for creating a neat
titlepage (for example, a specific illustration.)
2. Some standard title-page templates that can be used to create a
title-page from the metadata, more complex than the current simple title-page.
3. Place the title-page in a file of its own (taking care that references
leading to it still work).

Original issue reported on code.google.com by jhellingman on 10 Jan 2010 at 12:27

Configuration files for tei2html

Currently there are few options to change the generated output, other than 
setting parameters on the input, or changing the actual code or document.

It would be nice to activate or deactivate various features using a 
configuration file, e.g. tei2html.config, with the ability to set the following:

* Include images (Y/N/All/Important)
* Image path (<path>)
* Include external references (Y/N)
* Footnote location (Page/Chapter/Work)
* Generate colophon (Y/N)
* Generate table of contents (Front/Back/None)
* Additional CSS stylesheets (<name>)
* CSS stylesheet location (Internal/External)
* Generate marginal page-numbers (Y/N)
* Generate links to page-images (Y/N)

Things that can also be handled via CSS:

* Default table alignment (Left/Right/Center)
* Default verse alignment (Left/Right/Center)

Original issue reported on code.google.com by jhellingman on 1 Feb 2011 at 1:50

Output HTML is invalid when TEI contains nested paragraphs.

What steps will reproduce the problem?
1. Create a TEI file with a footnote in a paragraph with embedded text
document with verse  (that is <q><text><body><div1>....)
2. Render using tei2html
3. Validate in HTML validator

What is the expected output? What do you see instead?

Expected: Valid HTML.

Actual: Invalid HTML (additional </p> tags.)

Since the paragraph models of HTML and TEI do not match, the current code
closes HTML paragraphs at certain points where the stylesheet needs to
place an element (such as a table) in the HTML file. To do this, it checks
whether we are in a <p> in TEI. Sometimes, in TEI, however, we can be
inside a <p> twice (for example in <note>s). This is not handled correctly.

Original issue reported on code.google.com by jhellingman on 20 Oct 2006 at 7:12

heads with @type='super' are included in the ncx file

heads with @type='super' are included in the ncx file. They should be skipped.

Original issue reported on code.google.com by jhellingman on 10 Dec 2009 at 1:17

Handling of VIAF metadata

Add relevant attributes to documents, and process these to produce texts with consistent metadata.

See: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.canonical.html

Then add things like:

<name ref="https://viaf.org/viaf/109557338/" type="person">Seamus Heaney</name>

and

<author> <name key="Hugo, Victor (1802-1885)" ref="https://www.idref.fr/026927608/">Victor Hugo</name> </author>

OPF file contains entries for divisions hidden by rend="display(none)"

OPF file contains entries for divisions hidden by rend="display(none)"

Original issue reported on code.google.com by jhellingman on 18 Jan 2011 at 11:23