joseph / peregrin

A library for inspecting Zhooks, Ochooks and EPUBs, and converting between them.

Home Page: http://ochook.org/peregrin

License: MIT License

Ruby 99.97% Python 0.03%

peregrin's People

clee klacointe nono julien-c banux papernpen feedbooks omniplat rob-mcgrail f1nnix simonasdev

peregrin's Issues

Add unmanifested files to media array [EPUB]

Currently only files that appear in the OPF manifest are added to the array of media files for the book. This is correct according to the EPUB spec, but too strict in effect. Better to make any other files found in the archive available than to get 404s.
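A minimal sketch of how the extra files could be found, assuming the list of archive entries and the list of manifest hrefs are both already available as arrays of paths. The helper name `unmanifested_files` and its parameters are hypothetical and not part of Peregrin's API.

```ruby
# Hypothetical helper, not part of Peregrin.
# archive_paths:  every entry found in the EPUB zip archive.
# manifest_hrefs: the paths declared in the OPF manifest.
# opf_path:       the path of the OPF file itself.
def unmanifested_files(archive_paths, manifest_hrefs, opf_path)
  archive_paths.reject do |path|
    path.end_with?("/") ||              # directory entries
      path == "mimetype" ||             # required container file
      path.start_with?("META-INF/") ||  # container metadata
      path == opf_path ||
      manifest_hrefs.include?(path)     # already in the media array
  end
end

archive = ["mimetype", "META-INF/container.xml", "OEBPS/content.opf",
           "OEBPS/ch1.html", "OEBPS/images/", "OEBPS/images/extra.png"]
manifest = ["OEBPS/ch1.html"]
extras = unmanifested_files(archive, manifest, "OEBPS/content.opf")
```

The returned paths could then be appended to the media array after the manifested ones, so that manifest order is preserved.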

Use <html lang="xx"> to detect the language of a zhook/ochook

I initially thought that there was little point in zhooks requiring (or at least peregrin expecting) a meta tag to declare the language of a zhook, when <html lang="xx"> does that just fine and is actually becoming necessary anyway (for example, for proper hyphenation).

I was then about to ask that <html lang="xx"> be prioritized over <meta>.

Now, though, I am no longer sure. I read the question "Should I declare the language of my XHTML document using a language attribute, the Content-Language HTTP header, or a meta element?" on the W3C Internationalization website, and there is apparently a point in choosing meta over lang or vice versa. Does WHATWG have a specific guideline tied to HTML5 in that respect? Anyhow, I thought I would raise the question.

Another thing to take into account, maybe, is that EPUB allows multiple language declarations. With <html lang="xx"> you can declare only a single primary language, not several languages in equal measure. The OPF spec is not as explicit about what the language declaration actually means, but I would assume that it denotes the target audience of the book, which is arguably closer to what the meta tag seems to be for than to lang. Yet, while one could argue that it is advisable to always have a lang attribute on the <html> tag, the HTML5 spec's explanation of how the language of a node is determined seems to suggest that meta is only needed when there is more than one primary language (bilingual books?) or when, for whatever reason, the languages in meta and lang should differ (?).

At any rate, peregrin now seems to be ignoring even the standard way of using meta to declare content language: <meta http-equiv="Content-Language" content="de, fr, it">. That is what everybody is using, and what the HTML5 specification mandates regarding content language.

PS: the point of xml:lang next to, or instead of, plain lang beats me. I cannot decipher the hieroglyphs in the HTML5 spec regarding that. No need to help me with that on this thread though, unless it has anything to do with something that peregrin should be taking into account.
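To make the question concrete, here is a rough sketch of one possible lookup order: the Content-Language meta declaration first (it can list several languages), then the lang attribute of the <html> tag. The order is an assumption and could just as well be reversed. The helper `declared_languages` is hypothetical, and the regular expressions are a simplification that only handles double-quoted attributes with http-equiv written before content; a real implementation would go through the parsed document.

```ruby
# Hypothetical helper, not part of Peregrin.
# Returns the declared languages as an array of language tags.
def declared_languages(html)
  # 1. <meta http-equiv="Content-Language" content="de, fr, it">
  meta = html[/<meta\s+http-equiv="Content-Language"\s+content="([^"]*)"/i, 1]
  if meta
    langs = meta.split(",").map(&:strip).reject(&:empty?)
    return langs unless langs.empty?
  end

  # 2. <html lang="xx"> (xml:lang is deliberately not matched here)
  lang = html[/<html\b[^>]*?\slang="([^"]*)"/i, 1]
  lang && !lang.empty? ? [lang] : []
end
```

With this order, a zhook that only sets lang on the <html> tag still gets a language, while a bilingual book can list both languages in the meta tag.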

Converting to EPUB: preserve original unicode instead of converting to HTML entity

When converting to EPUB from a UTF-8 zhook, non-ASCII characters are converted to what I guess are hexadecimal HTML entities: for example, "Título" becomes "T&#xED;tulo".

According to Joseph, it is apparently a Nokogiri issue, and may be transparently fixed with new versions.

Nokogiri is the tool that does the parsing in Peregrin — it's using libxml2 behind the scenes to transform the document.
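Until the serializer itself keeps the original characters, one possible workaround is to post-process the generated markup and turn hexadecimal character references back into UTF-8. This is only a sketch: `restore_non_ascii` is a hypothetical helper, not something Peregrin or Nokogiri provides, and it assumes the references are well-formed.

```ruby
# Hypothetical post-processing step, not part of Peregrin.
def restore_non_ascii(markup)
  markup.gsub(/&#x([0-9a-fA-F]+);/) do |entity|
    codepoint = Regexp.last_match(1).hex
    # Leave ASCII references such as &#x26; (&) and &#x3C; (<) untouched,
    # because decoding them could produce malformed markup.
    codepoint > 0x7F ? [codepoint].pack("U") : entity
  end
end
```

Only code points above the ASCII range are decoded, so escaped markup characters stay escaped.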

These two links might be referring to the same issue (and offer a workaround?):

Comments in the HTML break the componentizer

It stops componentizing if it sees a sibling of an article that is not an article — looks like it's currently incorrectly treating a comment as a sibling.

Wrong paths in .ncx after epub.write

After changing the src of some components and writing the book to EPUB, the .opf is correct, but the paths in the .ncx are not changed. Check this:

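# Read an existing, valid EPUB into Peregrin's intermediate Book object.
book = Peregrin::Epub.read('my_validate_book.epub').to_book

# Rename every component.
book.components.each do |component|
  component.src = "new_" + component.src
end

# Write the modified book back out as an EPUB.
epub = Peregrin::Epub.new(book)
epub.write('new_not_proper_epub.epub')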

The problem is in the build_ncx method of lib/formats/epub.rb: xml.content(:src => chapter.src) should write the proper content src instead. Chapters do not each have to be in a separate file.

Ochook validation vs easy converting

Before converting from one format to another, Peregrin validates the original ebook. In the case of Ochooks, it checks for an ochook.manifest file and a reference to it in the <html> tag of index.html.

But when designing an ebook, it's helpful to test the conversion to EPUB with a single command, without having to zip up the directory into a Zhook first, and without having to manually maintain a manifest file.

Media files are dumped with wrong path

When converting an epub into a chook, like so:

peregrin book.epub book/

The images end up in book/OEBPS/images, and the CSS is in book/OEBPS/ instead of book/ which makes the links inside of the index.html not work.
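One way to think about the fix is to rewrite each media path so that it is relative to the directory that holds the OPF file before it is written out. The sketch below assumes that directory is known (here "OEBPS"); `relative_to_opf_root` is a hypothetical helper and not an existing Peregrin method.

```ruby
require "pathname"

# Hypothetical helper, not part of Peregrin.
# archive_path: path of the file inside the EPUB zip.
# opf_root:     directory containing the OPF file ("" if it sits at the top).
def relative_to_opf_root(archive_path, opf_root)
  return archive_path if opf_root.nil? || opf_root.empty?

  Pathname.new(archive_path)
          .relative_path_from(Pathname.new(opf_root))
          .to_s
end

css_path   = relative_to_opf_root("OEBPS/main.css", "OEBPS")
image_path = relative_to_opf_root("OEBPS/images/cover.png", "OEBPS")
```

Written out with these paths, the CSS would land in book/ and the images in book/images, which is where the links in index.html expect them.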

URL fragment in src

I have an epub where I end up with a chapter with

src="9781416595267_col01.html#col01"

Now, when running read_resource on this chapter, peregrin blows up. If I remove '#col01' from src, peregrin succeeds.

I could patch peregrin to strip out fragments. But is there anything more intelligent that we should be doing, to maintain the semantics of the link within the epub?
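A small sketch of the "strip but remember" option: split the src into the file part and the fragment, read the resource using the file part only, and keep the fragment so it can be written back into links later. `split_src` is a hypothetical helper name.

```ruby
# Hypothetical helper, not part of Peregrin.
# Returns [path, fragment]; fragment is nil when the src has none.
def split_src(src)
  path, fragment = src.split("#", 2)
  [path, fragment]
end

path, fragment = split_src("9781416595267_col01.html#col01")
```

Here `path` is what would be passed to read_resource, while `fragment` could be stored alongside the chapter so the link semantics are not lost.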

Generated components should have guide data attached [Zhook]

The intermediate Book object doesn't have much meta info about components, which is useful when going from Zhook -> EPUB.

Zhook will create the Cover and Table of Contents components automatically, but there's no way to communicate to the destination format that these components are non-linear and have a particular "guide" type.

If we can say what the "guide" attribute of a component is, EPUB will be able to derive the linearity of the component (in the spine) from that. For example, if cover.html is 'cover' in the EPUB guide, then it has a linear value of "no" in the spine.

In general, for both components and metadata, the intermediate book object should perhaps work with Component and Metadata objects, rather than arrays of filenames or hashes of key/values.
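As a rough illustration of what such a Component object might look like, here is a sketch in which the spine's linear value is derived from the guide type. The class, its fields and the list of non-linear guide types are all assumptions made for the example; none of this exists in Peregrin as written.

```ruby
# Guide types treated as non-linear in this sketch (an assumption).
NON_LINEAR_GUIDE_TYPES = %w[cover toc].freeze

# Hypothetical Component object carrying guide data with the src.
Component = Struct.new(:src, :guide_type) do
  def linear?
    !NON_LINEAR_GUIDE_TYPES.include?(guide_type)
  end

  # Value for the linear attribute of the spine itemref.
  def spine_linear_value
    linear? ? "yes" : "no"
  end
end

cover   = Component.new("cover.html", "cover")
chapter = Component.new("part001.html")
```

The Zhook reader would set guide_type on the components it generates, and the EPUB writer would only need to ask each component for its spine value.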

Add command line option to export as JSON

Internal hrefs should be resolved after componentization

If a single file is split up into many HTML files by the componentizer, any internal URLs (such as "#part6") will no longer work. When saving the components, we should comb through them for "a[href=/^(index.html)?#/]" and find the component to which the href is referring, prepending this to the href.

We don't have to worry about this for other hrefs, like img#src or link#href, since the things they point to will already work. Unless we're worried about JavaScript (and I'm not), a#href is the only attribute I think we need to update.
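The rewrite described above could look roughly like the following, assuming the componentizer can supply a map from element id to the file name of the component that now contains it. `resolve_internal_hrefs` and the map are hypothetical, and the regular expression only handles double-quoted href attributes; working on the parsed document would be more robust.

```ruby
# Hypothetical helper, not part of Peregrin.
# id_to_component: { "part6" => "part006.html", ... }
def resolve_internal_hrefs(markup, id_to_component)
  pattern = /(<a\b[^>]*?\shref=")(?:index\.html)?#([^"]+)"/
  markup.gsub(pattern) do
    prefix = Regexp.last_match(1)
    id     = Regexp.last_match(2)
    # Unknown ids fall back to a bare fragment.
    target = id_to_component.fetch(id, "")
    prefix + target + "#" + id + '"'
  end
end

ids = { "part6" => "part006.html", "intro" => "part001.html" }
html = '<a href="#part6">Part 6</a> <a class="x" href="index.html#intro">Intro</a>'
resolved = resolve_internal_hrefs(html, ids)
```

External links and links to other files are not matched by the pattern, so they pass through unchanged.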

Duplicate componentization

For the snippet below, peregrin does some weird things with componentization when converting to EPUB. It adds to the first component (i.e., index.html, after the cover) parts for which it also creates a separate component, hence duplicating content once all components are put together.

<hgroup class="title-page">
    <h1 class="title-title">My ebook</h1>
    <h2 class="title-author">John Doe</h2>
</hgroup>

<article class="dedication">
    <h2>Dedication</h2>
    <p>To someone I care for</p>
</article>

<article class="text">
    <article>
        <h3>First text chapter</h3>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. In accumsan massa non nulla lacinia eu feugiat diam imperdiet. Vestibulum non felis mauris. In auctor est nec quam eleifend luctus imperdiet ut massa. Duis et tellus non felis viverra euismod. Praesent gravida ornare arcu, non fringilla sem pharetra ac.</p>
    </article>

    <article>
        <h3>Second text chapter</h3>
        <p>Proin imperdiet mi tempor nisl ullamcorper rhoncus. Curabitur luctus posuere neque, ac consequat quam volutpat nec. Fusce at est sem. Vivamus ante diam, ullamcorper at scelerisque in, auctor at est. Proin diam tortor, sollicitudin vitae tempus sed, tincidunt vitae augue. Quisque id est turpis. Phasellus non magna metus, in bibendum magna.</p>
    </article>
</article>

Peregrin will do it right if I delete the hgroup at the top or add an h2 inside the article class="text", which has none.

EPUB: XHTML documents use .html extension instead of .xhtml

Peregrin gives the HTML documents inside an EPUB the extension .html, which makes debugging and tweaking the EPUBs difficult, because browsers will not render those files as an EPUB reader would.

.xhtml documents, at least on the Mac, render differently in Chrome 12 and Safari 5.0.5 than .html documents do. For example, .html documents ignore CSS namespaces. I stumbled upon this when trying to style elements with the epub:type attribute, which requires a namespace declaration in both the XHTML and CSS files.
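The renaming itself is simple; a sketch, with `xhtml_name` as a hypothetical helper, might be:

```ruby
# Hypothetical helper, not part of Peregrin.
# Swaps a trailing .html or .htm extension for .xhtml and leaves
# every other file name untouched.
def xhtml_name(src)
  src.sub(/\.html?\z/i, ".xhtml")
end
```

The harder part is that every reference to a renamed file (manifest, spine, NCX, and links between components) would have to go through the same mapping.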

Blueprint src paths are not relative to opf_root

Blueprint src paths are relative to the root of the EPUB zip file, rather than to the location of the OPF file. This is not terribly problematic, as blueprint files shouldn't be referenced in other blueprints or in components, etc, but it is a logical disparity.