danigm / epub-rs

Library to support the reading of epub files.

License: GNU General Public License v3.0

Rust 100.00%

epub-rs's Introduction

epub-rs

Rust library to support the reading of epub files.

Documentation: https://docs.rs/epub
Crate: https://crates.io/crates/epub

Install

Add this to your Cargo.toml:

[dependencies]
epub = "1.2.2"

MSRV

The minimum supported Rust version is 1.42.0.

epub-rs's Issues

Reading metadata attributes

Sorry if this is not the right place to ask. I would like to read some attributes like file-as from the creator element:
<dc:creator xmlns:ns0="http://www.idpf.org/2007/opf" ns0:role="aut" ns0:file-as="Deaver, Jeffery">Jeffery Deaver</dc:creator>. Is this somehow possible?

Not all epubs use cover-image

Some epubs use a custom string when assigning their cover image. My current workaround is below; some context is missing, but I'm sure you can figure it out.

        //Look for keys in the hashmap containing the word cover
        let pattern = r"(?i)cover";
        let regex = Regex::new(pattern).unwrap();

        let epub_resources = doc.resources.clone();
        println!("Resources {:?}", epub_resources);
        let cover_id = epub_resources.keys().find(|key| regex.is_match(key));

        if cover_id.is_some() {
            let cover = doc.get_resource(cover_id.unwrap());
            let cover_data = cover.unwrap().0;
            let mut f = fs::File
                ::create(&cover_path)
                .map_err(|err| format!("Error creating cover file: {}", err))?;
            f.write_all(&cover_data).map_err(|err| format!("Error writing cover data: {}", err))?;
        } else {
            //Return our error thumbnail placeholder
            return Ok(format!("{}/{}", get_home_dir(), "error.jpg"));
        }
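
A regex-free variant of the same idea, as a minimal sketch: it assumes the resources are available as a HashMap keyed by id (as in the snippet above) and picks the first id containing "cover", ignoring case. The function name find_cover_id is hypothetical and not part of epub-rs. Sorting the ids first makes the choice deterministic when several ids match, which HashMap iteration order alone does not guarantee.

```rust
use std::collections::HashMap;

/// Returns the first resource id whose name contains "cover", ignoring case.
/// Ids are visited in sorted order so the result is deterministic.
fn find_cover_id<V>(resources: &HashMap<String, V>) -> Option<&str> {
    let mut ids: Vec<&str> = resources.keys().map(String::as_str).collect();
    ids.sort_unstable();
    ids.into_iter()
        .find(|id| id.to_lowercase().contains("cover"))
}
```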

bug: doesn't read all tags

Hello,

I think there might be a small bug with the metadata parsing.

According to the OPF spec, an epub can have multiple subject elements. The subject element is displayed as "tags", e.g. in calibre; it is not unusual for an epub to have half a dozen tags associated with it.

However, as you know, the metadata in epub-rs is dumped into a HashMap keyed on the element name. So, if there are multiple subject elements present, it appears to just silently overwrite each time, so that programmatically it is only possible to extract one value, whichever one was written last.

This means that if a book has multiple tags, epub-rs will only let you extract one of them.

Thanks much (and thanks for the lib)
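
One way to avoid the overwrite, sketched under the assumption that metadata is collected element by element: key the map on the element name but store a Vec of values. The Metadata type and its methods are hypothetical names for illustration, not the crate's API.

```rust
use std::collections::HashMap;

/// Metadata store that keeps every value seen for an element name.
#[derive(Debug, Default)]
struct Metadata {
    values: HashMap<String, Vec<String>>,
}

impl Metadata {
    /// Appends a value instead of overwriting the previous one.
    fn add(&mut self, name: &str, value: &str) {
        self.values
            .entry(name.to_string())
            .or_default()
            .push(value.to_string());
    }

    /// Returns all values for `name` in insertion order (empty if absent).
    fn all(&self, name: &str) -> &[String] {
        self.values.get(name).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

With this shape, calling add("subject", ...) once per subject element keeps every tag, and all("subject") returns them in the order they were read.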

bug: Assumes that meta tag always has name and content attributes

Hello,

I think I found a small bug that causes epub-rs to be unable to parse a small percentage of epubs.

When parsing the epub metadata, it assumes that all meta tags have both a name and a content attribute. However, the spec seems to allow a meta tag without these attributes (and a small percentage of epubs contain one), in which case the doc new() fn returns an Error.

Here is an example snippet from an epub:

    <dc:publisher>Penguin Publishing Group</dc:publisher>

    <dc:identifier opf:scheme="ISBN">978-0-698-18756-6</dc:identifier>

    <meta property="dcterms:modified">2015-08-10T18:12:03Z</meta>

    <meta name="cover" content="cover_img" />

I made a local fork of epub-rs which simply ignores any meta tags that lack the name and content attrs, that seems to resolve the issue.

Thanks much
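
The "ignore such tags" approach could look roughly like this. It is only a sketch: the attributes are assumed to arrive as (name, value) pairs, and meta_name_content is a hypothetical helper, not the code from the fork mentioned above.

```rust
/// Extracts the (name, content) pair from a <meta> element's attributes.
/// Returns None when either attribute is missing, so the caller can skip
/// the element instead of failing the whole parse.
fn meta_name_content(attrs: &[(&str, &str)]) -> Option<(String, String)> {
    let find = |key: &str| {
        attrs
            .iter()
            .find(|(k, _)| *k == key)
            .map(|(_, v)| v.to_string())
    };
    Some((find("name")?, find("content")?))
}
```

For the snippet above, the cover meta yields a pair, while the dcterms:modified meta (which only has a property attribute) yields None and is skipped.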

Writing/modifying of epubs

I was thinking about writing a small library for editing the opf file in epubs but felt like it should instead be a part of this crate which I use for reading the metadata anyways. It would be nice if this library had APIs for not only reading but writing epubs, although that may be out of scope for this project.

The fallback attribute of the item in the manifest

Some EPUB OPF files are structured as follows

...
  <manifest>
    <item id="id1" href="cover.svg" media-type="image/svg+xml" properties="cover-image"/>
    <item id="id2" href="a_1_1.svg" media-type="image/svg+xml" fallback="id3"/>
    <item id="id3" href="a_1_1.xhtml" media-type="application/xhtml+xml"/>
    <item id="id4" href="a_1_2.svg" media-type="image/svg+xml" fallback="id5"/>
    <item id="id5" href="a_1_2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx" page-progression-direction="rtl">
    <itemref idref="id2"/>
    <itemref idref="id4"/>
  </spine>
...

The spine references "id2" and "id4". However, in the manifest, items id2 and id4 point to items id3 and id5 via the fallback attribute. I'm not sure why the file is structured like this, or whether it is in accordance with the EPUB standard. Could you ensure that the "spine" refers to the correct "item"?
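
If a reader wanted to follow the fallback chain, it could be done along these lines. This is a sketch only: the Item struct and resolve_fallback are hypothetical, and whether the XHTML fallback should be preferred over the SVG item is a policy decision that this issue leaves open.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified manifest entry; the field names are illustrative.
struct Item {
    media_type: String,
    fallback: Option<String>,
}

/// Follows `fallback` links from `id` until an item with the wanted media
/// type is found. Returns None on a missing id, a dead end, or a cycle.
fn resolve_fallback<'a>(
    manifest: &'a HashMap<String, Item>,
    id: &'a str,
    wanted: &str,
) -> Option<&'a str> {
    let mut seen = HashSet::new();
    let mut current = id;
    loop {
        if !seen.insert(current) {
            return None; // the fallback chain loops back on itself
        }
        let item = manifest.get(current)?;
        if item.media_type == wanted {
            return Some(current);
        }
        current = item.fallback.as_deref()?;
    }
}
```

With the manifest above, resolving "id2" for "application/xhtml+xml" would land on "id3".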

Find the collection and group position of an epub

When displaying several epubs, it can be useful to group together those belonging to the same collection (for example all the books of the Wayfarer series by Becky Chambers) and show their position in that collection.

The epub spec says this is defined by the meta property belongs-to-collection. A real life example from one of my epubs is:

  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title id="en_title" xml:lang="en">The Galaxy, and the Ground Within</dc:title>
    <dc:creator id="id">Becky Chambers</dc:creator>
[...]
    <opf:meta property="belongs-to-collection" id="id-2">Wayfarers</opf:meta>
    <opf:meta refines="#id-2" property="collection-type">series</opf:meta>
    <opf:meta refines="#id-2" property="group-position">4</opf:meta>
  </metadata>

I don't think it's possible for me to find metadata by property instead of by id at the moment, which makes it impossible for me to retrieve the collection a book belongs to and its position in such a collection.
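
To make the request concrete, here is a sketch of a lookup by property plus refines. The Meta struct and the collection function are hypothetical, and parsing group-position as an integer is an assumption based on the example above.

```rust
/// Simplified <meta> element; the field names are illustrative.
struct Meta {
    id: Option<String>,
    property: String,
    refines: Option<String>,
    value: String,
}

/// Returns (collection name, group position) when the metadata declares one.
fn collection(metas: &[Meta]) -> Option<(String, Option<u32>)> {
    let coll = metas
        .iter()
        .find(|m| m.property == "belongs-to-collection" && m.refines.is_none())?;
    // Refinements point at the collection element as "#<id>".
    let target = coll.id.as_ref().map(|id| format!("#{}", id));
    let position = metas
        .iter()
        .filter(|m| m.property == "group-position")
        .find(|m| target.is_some() && m.refines == target)
        .and_then(|m| m.value.trim().parse::<u32>().ok());
    Some((coll.value.clone(), position))
}
```

Applied to the metadata above, this would give the name "Wayfarers" and position 4.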

Relicensing to more permissive license?

Hi, I was wondering if you would consider re-licensing this crate to something more permissive, such as MIT/Apache. I noticed that this crate uses a GPL license, which seems unnecessarily restrictive, and it might prevent use of this crate in other open source projects (since it forces any dependent crates to also use the GPL license), which isn't necessarily desirable. I was wondering if there was a particular reason for using this license, since none of this crate's dependencies use the GPL license, and I don't see any other benefits for using the GPL license, as this isn't a particularly complex crate.

Consider replacing PathBuf with Utf8PathBuf

… in places such as https://github.com/danigm/epub-rs/blob/master/src/doc.rs#L74.

PathBuf isn't really very portable. serde happens to support it, but only by failing (de)serialization on non-utf8 data. camino offers UTF-8 validated path types. It seems like all the paths this crate deals with are read from XML that has to be valid UTF-8 (or UTF-16, which can be converted) anyways, so this should only improve usability and not break any weird niche use cases.

bug: epubs with percent encoded files in the manifest are not parsed correctly

Hi,

I think I found a bug when reading epubs where files in the manifest use percent-encoded filenames or paths, while the matching files and directories in the archive are "normal".

For example, if I try to read an epub containing the file "a file (with % encoding).html" and the following manifest:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid" version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
...
  <manifest>
    <item href="a%20file%20%28with%20%25%20encoding%29.html" id="SomeId1" media-type="application/xhtml+xml" />
    ...
  </manifest>
...
</package>

I cannot access the corresponding page but get a zip::result::ZipError::FileNotFound error.

If epub-rs tried to percent-decode a path when a FileNotFound error occurred, it could read more files.
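
A small std-only decoder is enough to illustrate the retry idea. This is a sketch, not the crate's code: percent_decode is a hypothetical helper, and the intent is to call it only after the lookup with the raw href has failed, so that archives whose entry names contain a literal "%" keep working.

```rust
/// Decodes %XX escapes in a manifest href. Malformed escapes are copied
/// through unchanged; returns None if the result is not valid UTF-8.
fn percent_decode(href: &str) -> Option<String> {
    let bytes = href.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // A valid escape needs two more bytes after the '%'.
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8(out).ok()
}
```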

Add toc navigation

Epub documents include a toc.ncx file that epub-rs does not parse. We should provide a way to parse it and store that information.

I've just implemented that in the libgepub project:

https://gitlab.gnome.org/GNOME/libgepub/issues/6
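
As a starting point for discussion, the parsed table of contents could be stored as a tree. The NavPoint shape below is a guess at what a navPoint entry needs (label, target, play order, nested entries) and is not taken from libgepub; flatten is a hypothetical helper showing how a reader might walk it.

```rust
/// One entry of the table of contents; the shape is a guess at what a
/// parsed toc.ncx navPoint could look like.
#[derive(Debug, Clone, PartialEq)]
struct NavPoint {
    label: String,
    content: String,
    play_order: usize,
    children: Vec<NavPoint>,
}

/// Flattens the tree into (depth, label) pairs in reading order.
fn flatten(points: &[NavPoint], depth: usize, out: &mut Vec<(usize, String)>) {
    for p in points {
        out.push((depth, p.label.clone()));
        flatten(&p.children, depth + 1, out);
    }
}
```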

File not Found by EpubDoc on Windows

I opened an epub file using epub-rs but was unable to access the files within it, so I could not read the chapters or print them. I then ran the same code below on one of my Linux machines and found that epub-rs works as expected there. I suspect that something in the way the file path is constructed prevents the proper file from being retrieved from the ZipArchive. I have created pull request #23 with a fix that works on Windows and Linux. Feel free to change anything as you see fit.

What I Did
Code:

pub fn main() {
    let epubfile = "test.epub";
    let mut file = File::open(epubfile).unwrap();
    let mut buffer = Vec::new();

    file.read_to_end(&mut buffer).unwrap();

    let cursor = Cursor::new(buffer);

    let doc = EpubDoc::from_reader(cursor);

    let mut doc = doc.unwrap_or_else(|err| {
        eprintln!("An error occurred while processing the epub file: {}", err);
        process::exit(1);
    });

    let current = doc.get_current_str();
    println!("{:?}", current);
}

What I Expected
The output of the first chapter:

Ok("<?xml version=\"1.0\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n  <head>\n    <title>\"Cover\"</title>\n  </head>\n  <body>\n    <div style=\"text-align: center\">\n      <img src=\"@public@vhost@g@gutenberg@html@files@28885@28885-h@[email protected]\" alt=\"Cover\" style=\"max-width: 100%; \" />\n      \n    </div>\n  </body>\n</html>")

What I Received
The output:

Err(specified file not found in archive)

System Specs
Operating System: Windows 10 Pro
Version: 20H2
OS Build: 19042.685
Processor: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz 3.50 GHz
Installed RAM: 48.0 GB
System Type: 64-bit operating system, x64-based processor
Graphics Card: NVIDIA GeForce GTX 1080 Ti
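
The report above does not pin down the root cause. Assuming it is the platform path separator (paths joined with "\" on Windows, while zip entry names use "/"), a separator-independent way to build the entry name might look like this. zip_entry_name is a hypothetical helper and not necessarily what pull request #23 does.

```rust
use std::path::{Component, Path};

/// Builds a zip entry name from a path, always joining with '/'.
/// Returns None if a component is not valid UTF-8.
fn zip_entry_name(path: &Path) -> Option<String> {
    let mut parts: Vec<&str> = Vec::new();
    for component in path.components() {
        match component {
            Component::Normal(part) => parts.push(part.to_str()?),
            // ".." removes the previous component, if there is one.
            Component::ParentDir => {
                parts.pop();
            }
            // Root, prefix and "." carry no meaning inside an archive.
            _ => {}
        }
    }
    Some(parts.join("/"))
}
```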

bug: epubs with a Byte Order Mark (BOM) are not parsed

Another small bug:

It is possible for utf-8 encoded XML to contain a Byte Order Mark (BOM) in the first few bytes of the file. Typically in the case of epubs, it is bytes: ef, bb, bf.

This seems to confuse whatever XML parser epub-rs uses, though: it results in an Error while creating the epub. This affects only very few epubs, but they appear to be a valid case.

I worked around this locally by putting in this ugly check in XMLReader, but presumably there is some nicer way! :

        if content[0]==0xefu8 && content[1]==0xbbu8 && content[2]==0xbfu8 {
            XMLReader {
                reader: ParserConfig::new()
                    .add_entity("nbsp", " ")
                    .add_entity("copy", "©")
                    .add_entity("reg", "®")
                    .create_reader(&content[3..])
            }
        } else {
            XMLReader {
                reader: ParserConfig::new()
                    .add_entity("nbsp", " ")
                    .add_entity("copy", "©")
                    .add_entity("reg", "®")
                    .create_reader(content)
            }
        }

It does appear to deal with the error at least - and the other BOMs (eg UTF-16 BOMs) don't seem to occur in practice for epubs.
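
A more compact form of the same check, as a sketch: strip the BOM once, then build the reader from the remaining bytes so the ParserConfig setup is not duplicated. strip_utf8_bom is a hypothetical helper; unlike indexing content[0..3] directly, starts_with also copes with input shorter than three bytes.

```rust
/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Returns `content` without a leading UTF-8 BOM, if one is present.
fn strip_utf8_bom(content: &[u8]) -> &[u8] {
    if content.starts_with(&UTF8_BOM) {
        &content[UTF8_BOM.len()..]
    } else {
        content
    }
}
```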

I have attached an example container.xml from an epub with a BOM (from META-INF/container.xml).

container.zip

Thanks!

Cover isn't necessarily specified in metadata

The path to the book cover is not necessarily specified in metadata, as src/doc.rs#L213-216 seems to assume.

According to https://www.w3.org/publishing/epub3/epub-packages.html#sec-cover-image, the cover-image can be specified via the cover-image property of the manifest.

An example from an epub of mine:

  <manifest>
    [...]
    <item id="cover" href="images/cover.jpg" media-type="image/jpeg" properties="cover-image"/>
    [...]
  </manifest>

I'm not entirely sure what "coverimagestandard".into() does in src/doc.rs#L215 though?

A naive solution I can think of would be to try to retrieve the cover metadata, and if it fails to look up in the manifest for the item containing the cover-image property, and return its id.

Does it make sense to you too?
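
The naive solution described above can be sketched as follows. The types are simplified stand-ins (a flat metadata map and a ManifestItem with an optional properties string), and cover_id is a hypothetical name rather than the crate's method.

```rust
use std::collections::HashMap;

/// Simplified manifest entry; the field names are illustrative.
struct ManifestItem {
    id: String,
    properties: Option<String>,
}

/// Looks for the cover id in the metadata first, then falls back to the
/// manifest item whose `properties` list contains "cover-image".
fn cover_id<'a>(
    metadata: &'a HashMap<String, String>,
    manifest: &'a [ManifestItem],
) -> Option<&'a str> {
    metadata.get("cover").map(String::as_str).or_else(|| {
        manifest
            .iter()
            .find(|item| {
                item.properties
                    .as_deref()
                    .map_or(false, |p| p.split_whitespace().any(|t| t == "cover-image"))
            })
            .map(|item| item.id.as_str())
    })
}
```

The properties attribute is a space-separated list, so the check splits on whitespace instead of comparing the whole string.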

Update on crates.io

Hi, I was seeing if I could use this library in a wasm project. However, the published version cannot be compiled with the wasm32-unknown-unknown target because it relies on some C dependencies in the zip library. This has already been fixed in #28, but the change hasn't been pushed yet.
I'd appreciate if you could update it. :)

EPubDoc::get_release_identifier documentation link leads to 404

Error with EPubDoc::get_release_identifier documentation.

https://www.w3.org/publishing/epub3/epub-packages.html#sec-metadata-elem-identifiers-pid leads to a page which has since been moved.

I believe it's been moved to https://www.w3.org/TR/epub/

Note linearity of spine items

Currently, as far as I can tell from reading the docs, there's no way, for a given spine entry read out of an EpubDoc's spine, to discern the value of its linear attribute. This is inconvenient; knowing a spine-entry's linearity is often useful and relevant to proper rendering of a book. Is there some method I'm missing by which the information can be accessed? If not, I'd recommend adding one.
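
One possible shape for such an addition, as a sketch: keep the flag next to the idref in each spine entry. SpineItem and spine_item are hypothetical names; treating a missing attribute as linear follows the EPUB default of linear="yes".

```rust
/// Spine entry that keeps the `linear` flag next to the idref.
#[derive(Debug, PartialEq)]
struct SpineItem {
    idref: String,
    linear: bool,
}

/// Builds a spine entry from the raw attribute values of an <itemref>.
/// Anything other than an explicit "no" is treated as linear, which is
/// the default when the attribute is absent.
fn spine_item(idref: &str, linear_attr: Option<&str>) -> SpineItem {
    SpineItem {
        idref: idref.to_string(),
        linear: linear_attr != Some("no"),
    }
}
```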

Reading string values keeps firing error

I'm trying to suss out an issue I am having. Every time I attempt to read a string value from an epub, I get the following error: Specified file not found in archive

    let doc = EpubDoc::new(input_file);
    assert!(doc.is_ok());
    let mut doc = doc.unwrap();
    {
        let title = doc.metadata.get("Title");
        println!("Book title {:?}", title);
    }
    let len_pages = doc.resources.len();
    println!("Num Pages: {}", len_pages);

    let len = doc.spine.len();
    for i in 1..len {
        let n = doc.go_next();
        match n {
            Ok(v) => {
                println!("ID: {}", doc.get_current_id().unwrap());
                let current = doc.get_current_str();
                match current {
                    Ok(v) => println!("Value {:?}", v), 
                    Err(e) => println!("Text Err {:?}", e.description()) /*Specified file not found in archive*/
                }
            },
            Err(e) => println!("General Error: {:?}", e),
        }
    }

I pulled a sample book from here to make sure the book I had locally wasn't corrupted.