vers-one / epubreader
.NET library for reading EPUB files
Home Page: https://os.vers.one/EpubReader/
License: The Unlicense
EpubReader currently lacks nullable reference type annotations, mostly because it needs to target .NET Framework, which locks the C# compiler version to C# 7.3, while nullable reference types require at least C# 8.0. However, there is a way to specify an explicit C# compiler version in the csproj file via the <LangVersion>x.x</LangVersion> project property. This should work even for .NET Framework and .NET Standard 1.0, as long as the code doesn't use any runtime features of the newer C# compiler. The only downside of this approach is the lack of nullable annotation attributes, which are only available when the project has no targets other than .NET Core >= 3, .NET >= 5, or .NET Standard 2.1. This leads to two main consequences:
void Assert([DoesNotReturnIf(false)] bool condition, string? message = null)
{
    if (!condition)
    {
        throw ...
    }
}
After a String.IsNullOrEmpty(test) check, the C# compiler still treats test as potentially null. The only workaround is to add an explicit if (test != null) { ... } check.
However, these downsides seem like reasonable tradeoffs for having nullable reference type annotations in EpubReader and, most importantly, they don't affect the consumers of the library in any negative way.
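The explicit null check workaround could look roughly like the sketch below. NullCheckExample and TrimmedLength are hypothetical names used only for illustration. The sketch assumes a target framework where String.IsNullOrEmpty carries no nullable annotation attributes.

```csharp
#nullable enable

public static class NullCheckExample
{
    // Returns the length of the trimmed text, or 0 when the input is null or empty.
    public static int TrimmedLength(string? test)
    {
        // Assumption: String.IsNullOrEmpty on the target framework is not annotated,
        // so the compiler cannot infer non-null from it. The null check is spelled out.
        if (test != null && test.Length > 0)
        {
            return test.Trim().Length;
        }

        return 0;
    }
}
```

The explicit check is slightly more verbose but compiles without nullable warnings on every target.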
Switch to C# 10.0 compiler and add nullable reference type annotations for VersOne.Epub assembly.
Documentation: https://learn.microsoft.com/en-us/dotnet/csharp/nullable-references
Something is different about the image asset path: file OEBPS/assets/zr0ggkC.png was not found in archive.
I am getting this exception, which seems to be related to a bad spine. After looking at your source code, I see that this exception is thrown when STRICTEPUB is set.
Is there a way to set this or do I need to compile the library myself?
Spine for reference:
<manifest>
<item href="page-template.xpgt" id="pt" media-type="application/vnd.adobe.page-template+xml"/>
<item href="stei_9780140177381_oeb_css_r1.css" id="style" media-type="text/css"/>
<item href="stei_9780140177381_msr_cvi_r1.jpg" id="coverimagestandard" media-type="image/jpeg"/>
<item href="stei_9780140177381_msr_cvt_r1.jpg" id="thumbimagestandard" media-type="image/jpeg"/>
<item href="stei_9780140177381_msr_ppl_r1.jpg" id="PPCthumbnailimage" media-type="image/jpeg"/>
<item href="stei_9780140177381_oeb_cover_r1.html" id="cover" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_toc_r1.html" id="toc" media-type="application/xhtml+xml"/>
<item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
<item href="stei_9780140177381_oeb_fm1_r1.html" id="fm1" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_fm2_r1.html" id="fm2" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_tp_r1.html" id="tp" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_ded_r1.html" id="ded" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_fm3_r1.html" id="fm3" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c01_r1.html" id="c01" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c02_r1.html" id="c02" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c03_r1.html" id="c03" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c04_r1.html" id="c04" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c05_r1.html" id="c05" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c06_r1.html" id="c06" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c07_r1.html" id="c07" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c08_r1.html" id="c08" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c09_r1.html" id="c09" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c10_r1.html" id="c10" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c11_r1.html" id="c11" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c12_r1.html" id="c12" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c13_r1.html" id="c13" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c14_r1.html" id="c14" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c15_r1.html" id="c15" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c16_r1.html" id="c16" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c17_r1.html" id="c17" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c18_r1.html" id="c18" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c19_r1.html" id="c19" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c20_r1.html" id="c20" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c21_r1.html" id="c21" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c22_r1.html" id="c22" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c23_r1.html" id="c23" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c24_r1.html" id="c24" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c25_r1.html" id="c25" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c26_r1.html" id="c26" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c27_r1.html" id="c27" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c28_r1.html" id="c28" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c29_r1.html" id="c29" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c30_r1.html" id="c30" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c31_r1.html" id="c31" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_c32_r1.html" id="c32" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_bm1_r1.html" id="bm1" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_ftn_r1.html" id="ftn" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_cop_r1.html" id="cop" media-type="application/xhtml+xml"/>
<item href="stei_9780140177381_oeb_001_r1.jpg" id="stei_9780140177381_oeb_001_r1" media-type="image/jpeg"/>
<item href="stei_9780140177381_oeb_002_r1.jpg" id="stei_9780140177381_oeb_002_r1" media-type="image/jpeg"/>
<item href="stei_9780140177381_oeb_003_r1.jpg" id="stei_9780140177381_oeb_003_r1" media-type="image/jpeg"/>
<item href="stei_9780140177381_oeb_004_r1.jpg" id="stei_9780140177381_oeb_004_r1" media-type="image/jpeg"/>
</manifest>
<spine>
<itemref idref="cover"/>
<itemref idref="toc"/>
<itemref idref="fm1"/>
<itemref idref="fm2"/>
<itemref idref="tp"/>
<itemref idref="cop"/>
<itemref idref="ded"/>
<itemref idref="fm3"/>
<itemref idref="c01"/>
<itemref idref="c02"/>
<itemref idref="c03"/>
<itemref idref="c04"/>
<itemref idref="c05"/>
<itemref idref="c06"/>
<itemref idref="c07"/>
<itemref idref="c08"/>
<itemref idref="c09"/>
<itemref idref="c10"/>
<itemref idref="c11"/>
<itemref idref="c12"/>
<itemref idref="c13"/>
<itemref idref="c14"/>
<itemref idref="c15"/>
<itemref idref="c16"/>
<itemref idref="c17"/>
<itemref idref="c18"/>
<itemref idref="c19"/>
<itemref idref="c20"/>
<itemref idref="c21"/>
<itemref idref="c22"/>
<itemref idref="c23"/>
<itemref idref="c24"/>
<itemref idref="c25"/>
<itemref idref="c26"/>
<itemref idref="c27"/>
<itemref idref="c28"/>
<itemref idref="c29"/>
<itemref idref="c30"/>
<itemref idref="c31"/>
<itemref idref="c32"/>
<itemref idref="bm1"/>
<itemref idref="ftn"/>
</spine>
Hello @vers-one,
I think the newest version of the NuGet package doesn't contain the latest version of the assemblies.
When analyzing the assembly in your latest NuGet package I can see that the assembly version is still 2.0.4, despite your adjustments in the project files:
[assembly: TargetFramework(".NETStandard,Version=v1.3", FrameworkDisplayName = "")]
[assembly: AssemblyCompany("vers")]
[assembly: AssemblyConfiguration("Release")]
[assembly: AssemblyCopyright("vers, 2015-2018")]
[assembly: AssemblyFileVersion("2.0.4.0")]
[assembly: AssemblyInformationalVersion("2.0.4")]
[assembly: AssemblyProduct("VersOne.Epub")]
[assembly: AssemblyTitle("VersOne.Epub")]
[assembly: AssemblyVersion("2.0.4.0")]
This leads to my pull request not being included:
using System;
using System.Collections.Generic;
using System.Linq;
using VersOne.Epub;
using VersOne.Epub.Schema;

internal class Program
{
    static void Main(string[] args)
    {
        var ePub = EpubReader.ReadBook(@"C:\Users\Jann Flepp\Downloads\Tom Christiansen - Perl Cookbook.epub");
        // Flatten the whole navigation tree and check whether any point lacks a play order.
        var points = GetNavigationPoints(ePub.Schema.Navigation.NavMap).ToArray();
        Console.WriteLine("Any playorder null: " + (points.Any(p => p.PlayOrder == null) ? "true" : "false"));
    }

    // Returns every navigation point in the map, depth-first, including nested children.
    private static IEnumerable<EpubNavigationPoint> GetNavigationPoints(IEnumerable<EpubNavigationPoint> map)
    {
        foreach (var point in map)
        {
            yield return point;
            foreach (var subPoint in GetNavigationPoints(point.ChildNavigationPoints))
            {
                yield return subPoint;
            }
        }
    }
}
With NuGet Package
<PackageReference Include="VersOne.Epub" Version="2.0.5" />
Any playorder null: true
With Reference to master branch project
<ProjectReference Include="..\VersOne.Epub\VersOne.Epub.csproj" />
Any playorder null: false
Could you verify my assumptions?
Thanks for your help!
epubBook.Schema.Package.Metadata.MetaItems does not seem to be showing "title-type" or "display-seq", which limits the ability to use dc:title tags for grouping books into collections/reading lists.
See the following:
<dc:title id="t3">The New French Cuisine Masters</dc:title>
<meta refines="#t3" property="title-type">collection</meta>
<meta refines="#t3" property="display-seq">3</meta>
https://www.w3.org/publishing/epub3/epub-packages.html#sec-title-type
When I run:
epubBook.Schema.Package.Metadata.MetaItems.Select(item => item.Property)
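As a possible stopgap, the refinements can be read straight from the raw OPF XML with LINQ to XML. This is only a sketch. RefinesExample and FindRefinement are hypothetical names, not part of EpubReader, and the caller is assumed to already have the OPF file content as a string.

```csharp
#nullable enable
using System.Linq;
using System.Xml.Linq;

public static class RefinesExample
{
    private static readonly XNamespace Opf = "http://www.idpf.org/2007/opf";

    // Returns the value of the <meta> element that refines the element with the
    // given id and carries the given property name, or null if there is none.
    public static string? FindRefinement(string opfXml, string elementId, string property)
    {
        return XDocument.Parse(opfXml)
            .Descendants(Opf + "meta")
            .Where(meta => (string?)meta.Attribute("refines") == "#" + elementId
                        && (string?)meta.Attribute("property") == property)
            .Select(meta => meta.Value)
            .FirstOrDefault();
    }
}
```

For the dc:title sample above, the call FindRefinement(opfXml, "t3", "title-type") would look up the meta element whose refines attribute is "#t3".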
Thanks for this library! Very useful in my project.
My question is: Should I build an HTML parser to display the chapter contents once I have parsed the .epub and have the HTML? The platform I am building for is not one with a built-in HTML/web parser. Any suggestions? Or is there a generally used HTML parsing library? Should this be built into the package?
The library is returning the wrong text for ContentFileName. For this epub, it should return "Text/chapter01.xhtml", but it returns "Text/../Text/chapter01.xhtml". I'm not sure where the extra relative path is coming from, given it's not in the XML.
Code:
var navItems = await book.GetNavigationAsync();
foreach (var navigationItem in navItems)
{
    if (navigationItem.NestedItems.Count > 0)
    {
        var nestedChapters = new List<BookChapterItem>();
        foreach (var nestedChapter in navigationItem.NestedItems)
        {
            // Skip navigation entries that do not point to any content file.
            if (nestedChapter.Link == null) continue;
            // BUG: nestedChapter.Link.ContentFileName -> Is returning "/Text/../Text/chapter01.xhtml" when it should be "Text/chapter01.xhtml"
            var key = BookService.CleanContentKeys(nestedChapter.Link.ContentFileName);
            if (mappings.ContainsKey(key))
            {
                nestedChapters.Add(new BookChapterItem()
                {
                    Title = nestedChapter.Title,
                    Page = mappings[key],
                    Part = nestedChapter.Link.Anchor ?? string.Empty,
                    Children = new List<BookChapterItem>()
                });
            }
        }
        CreateToCChapter(navigationItem, nestedChapters, chaptersList, mappings);
    }
}
Toc.ncx:
<navPoint id="navPoint5">
<navLabel>
<text>Day 0: Backstory and the Bridal Wars</text>
</navLabel>
<content src="Text/chapter1.xhtml"/>
</navPoint>
<navPoint id="navPoint6">
<navLabel>
<text>Day 1, Morning: The Start of a Slow Life</text>
</navLabel>
<content src="Text/chapter2.xhtml"/>
</navPoint>
Manifest:
<manifest>
<item id="cover" href="Text/cover.xhtml" media-type="application/xhtml+xml"/>
<item id="frontmatter1.xhtml" href="Text/frontmatter1.xhtml" media-type="application/xhtml+xml"/>
<item id="frontmatter2.xhtml" href="Text/frontmatter2.xhtml" media-type="application/xhtml+xml"/>
<item id="toc.xhtml" href="Text/toc.xhtml" media-type="application/xhtml+xml" properties="nav"/>
<item id="prologue.xhtml" href="Text/prologue.xhtml" media-type="application/xhtml+xml"/>
<item id="prologue2.xhtml" href="Text/prologue2.xhtml" media-type="application/xhtml+xml"/>
<item id="insert1.xhtml" href="Text/insert1.xhtml" media-type="application/xhtml+xml"/>
<item id="prologue2_1.xhtml" href="Text/prologue2_1.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter1.xhtml" href="Text/chapter1.xhtml" media-type="application/xhtml+xml"/>
...
The file is under copyright.
This is an EPUB 2 document. I have tested v3.1.1 and v3.1.0, and neither works.
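Until the library normalizes such paths itself, a caller could collapse the ".." segments before using the value as a lookup key. The following is a minimal sketch. ContentPathExample and Normalize are hypothetical helpers, not part of EpubReader, and the sketch assumes forward slashes as separators, as used inside EPUB archives.

```csharp
using System.Collections.Generic;

public static class ContentPathExample
{
    // Collapses "." and ".." segments, e.g. "Text/../Text/chapter01.xhtml"
    // becomes "Text/chapter01.xhtml". A leading slash is dropped as well.
    public static string Normalize(string path)
    {
        var segments = new List<string>();
        foreach (string segment in path.Split('/'))
        {
            if (segment.Length == 0 || segment == ".")
            {
                continue; // empty and current-directory segments carry no information
            }

            if (segment == "..")
            {
                // Step back one directory if there is one to step back from.
                if (segments.Count > 0)
                {
                    segments.RemoveAt(segments.Count - 1);
                }
                continue;
            }

            segments.Add(segment);
        }

        return string.Join("/", segments);
    }
}
```

A ".." segment at the very start of the path is silently ignored here. Whether that is acceptable depends on how the keys are used.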
This happens when I run WpfDemo. It cannot register fonts. The error is in BookHtmlContent.cs, line 148:
Uri packageUri = new Uri(fontFile.Key + ":");
fontFile.Key is fonts/00001.ttf
Now that .NET 7 is out, it's time to upgrade.
Migrate:
.NET 7 announcement: https://devblogs.microsoft.com/dotnet/announcing-dotnet-7/
When I try to read an EPUB that has no cover image and no cover tag in content.opf, the library throws an exception and cannot read the EPUB. If there is an image and it is declared in content.opf, reading works, but if there is no cover tag in content.opf, an exception is thrown.
This exception was originally thrown at this call stack:
VersOne.Epub.Internal.BookCoverReader.ReadEpub2CoverFromGuide(VersOne.Epub.EpubSchema, System.Collections.Generic.Dictionary<string, VersOne.Epub.EpubByteContentFileRef>)
VersOne.Epub.Internal.BookCoverReader.ReadEpub2Cover(VersOne.Epub.EpubSchema, System.Collections.Generic.Dictionary<string, VersOne.Epub.EpubByteContentFileRef>)
VersOne.Epub.Internal.BookCoverReader.ReadBookCover(VersOne.Epub.EpubSchema, System.Collections.Generic.Dictionary<string, VersOne.Epub.EpubByteContentFileRef>)
VersOne.Epub.Internal.ContentReader.ParseContentMap(VersOne.Epub.EpubBookRef, VersOne.Epub.Options.ContentReaderOptions)
VersOne.Epub.EpubReader.OpenBookAsync.AnonymousMethod__1()
System.Threading.Tasks.Task<TResult>.InnerInvoke() in Future.cs
System.Threading.Tasks.Task..cctor.AnonymousMethod__272_0(object) in Task.cs
System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, object) in ExecutionContext.cs
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() in ExceptionDispatchInfo.cs
I try to read an EPUB in UWP with the following code.
But I get this exception:
System.InvalidOperationException: 'Synchronous operations should not be performed on the UI thread. Consider wrapping this method in Task.Run.'
public async void test()
{
    // Verbatim string, so that "\t" is not interpreted as a tab character.
    EpubBook epubBook = await EpubReader.ReadBookAsync(@"C:\test.epub");
}
Please add Stream support in the methods to open epub.
Creator roles are missing. In the example we have two creators, but when inspected, EpubReader has a null Role value for each, whereas the values should be "aut" and "ill".
<dc:creator id="creator01">Ameko Kaeruda</dc:creator>
<meta property="alternate-script" refines="#creator01" xml:lang="ja">蛙田 アメコ</meta>
<meta property="file-as" refines="#creator01">Kaeruda, Ameko</meta>
<meta property="role" refines="#creator01" scheme="marc:relators">aut</meta>
<dc:creator id="creator02">Sencha</dc:creator>
<meta property="alternate-script" refines="#creator02" xml:lang="ja">せんちゃ</meta>
<meta property="file-as" refines="#creator02">Sencha</meta>
<meta property="role" refines="#creator02" scheme="marc:relators">ill</meta>
## EPUB specification link
https://www.w3.org/publishing/epub3/epub-packages.html#sec-role
There are 6 schema classes derived from List<T>:
Epub2NcxHead : List<Epub2NcxHeadMeta>
Epub2NcxNavigationMap : List<Epub2NcxNavigationPoint>
Epub2NcxPageList : List<Epub2NcxPageTarget>
EpubGuide : List<EpubGuideReference>
EpubManifest : List<EpubManifestItem>
EpubSpine : List<EpubSpineItemRef>
The inheritance (rather than composition) was chosen to match the XML schema. For example, the <spine> section of the OPF package may look like this:
<spine toc="ncx">
    <itemref id="itemref-1" idref="item-1" />
    <itemref id="itemref-2" idref="item-2" />
</spine>
The EpubSpine class lets the consumer access the child nodes in an intuitive way: spine[0].Id. In case of composition, it would look like this: spine.Items[0].Id, which would not match the XML schema (since there is no <items> element in <spine>).
However, this also prevents the consumer from using both object and collection initializers together. C# syntax allows using either an object initializer:
EpubSpine spine = new EpubSpine()
{
    Toc = "ncx"
};
or a collection initializer:
EpubSpine spine = new EpubSpine()
{
    new EpubSpineItemRef()
    {
        Id = "itemref-1",
        IdRef = "item-1"
    },
    new EpubSpineItemRef()
    {
        Id = "itemref-2",
        IdRef = "item-2"
    }
};
but not both.
I think switching from inheritance to composition and adding an intermediate Items property, which doesn't exist in the XML schema, is a reasonable price to pay to get the in-place initialization support for those classes.
This is going to be a breaking change but hopefully a minor one, since only a small set of consumers of this library use the raw schema classes and the fix is relatively simple (replacing spine[0].Id with spine.Items[0].Id).
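With composition, both kinds of initializers can be combined in a single expression. The sketch below uses stand-in classes (SpineSketch and SpineItemRefSketch are hypothetical names) because the final shape of the real schema classes is not settled here.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical stand-ins for EpubSpineItemRef and a composition-based EpubSpine.
public class SpineItemRefSketch
{
    public string? Id { get; set; }
    public string? IdRef { get; set; }
}

public class SpineSketch
{
    public string? Toc { get; set; }

    // The intermediate collection that has no counterpart in the XML schema.
    public List<SpineItemRefSketch> Items { get; } = new List<SpineItemRefSketch>();
}

public static class SpineSketchDemo
{
    // One expression sets the Toc property and fills the Items collection.
    public static SpineSketch Create() => new SpineSketch
    {
        Toc = "ncx",
        Items =
        {
            new SpineItemRefSketch { Id = "itemref-1", IdRef = "item-1" },
            new SpineItemRefSketch { Id = "itemref-2", IdRef = "item-2" }
        }
    };
}
```

The nested collection initializer adds to the existing list, so the Items property can stay get-only.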
It seems that something goes wrong when reading the cover tag of this particular EPUB (I cannot share it as it is copyrighted, but the publisher is Packt Publishing). It seems to me that in this case the library should just log a warning and not read or use the cover image, i.e. treat the book as if it has no cover (similar to #75).
I still have not found the root cause for this particular EPUB (the cover file Images/default_cover.jpeg does exist), but it seems like a non-critical error that should not prevent the whole book from loading.
There was an exception when opening epub book: /books/MyBook.epub
VersOne.Epub.EpubPackageException: Incorrect EPUB manifest: item with ID = "Images/default_cover.jpeg" is missing.
at VersOne.Epub.Utils.TaskExtensionMethods.ExecuteAndUnwrapAggregateException[T](Task`1 task)
Currently, the EpubContentFileRef class (and the classes derived from it) requires an instance of the EpubBookRef to be passed to its constructor as an argument. At the same time, the EpubBookRef instance contains the Content property which in turn contains collections of EpubContentFileRef instances, thus creating a circular dependency.
This approach was chosen a few years ago because a book (EpubBookRef) needs to contain references to its content files (EpubContentFileRef) while a content file needs to have access to the physical EPUB file, which is only available through the EpubBookRef.EpubFile property. It also needs the content directory path, which is available through the EpubBookRef.Schema.ContentDirectoryPath property.
In order to create an instance of the EpubContentFileRef class, the caller has to pass a partially initialized EpubBookRef instance to the constructor of the EpubContentFileRef class and then use the collection of EpubContentFileRef items to complete the construction of the EpubBookRef class. However, with the addition of nullable reference type annotations (#65) partially initialized instances are no longer possible.
Replace the EpubBookRef argument in the constructor of the EpubContentFileRef class with IZipFile epubFile and string contentDirectoryPath arguments.
Add a bool IsDisposed property to the IZipFile interface and implement it in the ZipFile class.
Check the epubFile.IsDisposed property in the ReadXX / GetContentStream methods and throw ObjectDisposedException if the file was already disposed.
IZipFile epubFile still belongs to the EpubBookRef class and destroying an instance of the EpubContentFileRef class doesn't dispose the file.
Having .NET Standard 1.3 in the list of target frameworks lets the library be used in projects targeting some older frameworks (e.g. .NET Core 1.0). However, this also leads to an excessive list of unnecessary package dependencies when the library is imported in a project targeting a newer framework (e.g. .NET 6):
Microsoft.NETCore.Platforms.1.1.0
Microsoft.NETCore.Targets.1.1.0
runtime.native.System.4.3.0
runtime.native.System.IO.Compression.4.3.0
System.Buffers.4.3.0
System.Collections.4.3.0
System.Diagnostics.Debug.4.3.0
System.Diagnostics.Tracing.4.3.0
System.Globalization.4.3.0
System.IO.4.3.0
System.IO.Compression.4.3.0
System.Reflection.4.3.0
System.Reflection.Primitives.4.3.0
System.Resources.ResourceManager.4.3.0
System.Runtime.4.3.0
System.Runtime.Extensions.4.3.0
System.Runtime.Handles.4.3.0
System.Runtime.InteropServices.4.3.0
System.Text.Encoding.4.3.0
System.Threading.4.3.0
System.Threading.Tasks.4.3.0
Adding an explicit .NET Standard 2.0 support should prevent unnecessary package dependencies.
If you need to support .NET Standard 1.x, we recommend that you also target .NET Standard 2.0. .NET Standard 1.x is distributed as a granular set of NuGet packages, which creates a large package dependency graph and results in developers downloading a lot of packages when building.
Hello
I'm trying to install this package and I get this error:
Severity Code Description Project File Line Suppression State
Error Could not install package 'VersOne.Epub 2.0.1'. You are trying to install this package into a project that targets '.NETFramework,Version=v4.6.1', but the package does not contain any assembly references or content files that are compatible with that framework. For more information, contact the package author. 0
My project is targeting net 4.6.1
Thanks for the help
I really like your epub reader library; it works like a charm. I think you should feature it in this repo by opening a pull request there: https://github.com/thangchung/awesome-dotnet-core#serialization. It could go under serialization or misc.
I am converting my epub file URL to stream and saving to local DB as bytes like below:
Stream stream;
HttpClient client = new HttpClient();
var response = await client.GetAsync(fileUrl);
stream = await response.Content.ReadAsStreamAsync();
epubBook = EpubReader.ReadBook(stream);
//saving to folder
byte[] bytes = await response.Content.ReadAsByteArrayAsync();
string filename = Path.GetFileName(fileUrl);
var folderPath = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
var filePath = Path.Combine(folderPath, filename);
File.WriteAllBytes(filePath, bytes);
This is working fine for most of the files, but some file URLs produce a System.AggregateException.
Exception Details
System.AggregateException: One or more errors occurred. (Version number '1.1' is invalid. Line 1, position 16.) ---> System.Xml.XmlException: Version number '1.1' is invalid. Line 1, position 16.
at System.Xml.XmlTextReaderImpl.Throw (System.Exception e) [0x00027] in <0757e7484a1349cca3b4558c721885b2>:0
at System.Xml.XmlTextReaderImpl.Throw (System.String res, System.String arg) [0x00029] in <0757e7484a1349cca3b4558c721885b2>:0
at System.Xml.XmlTextReaderImpl.ParseXmlDeclaration (System.Boolean isTextDecl) [0x0061f] in <0757e7484a1349cca3b4558c721885b2>:0
at System.Xml.XmlTextReaderImpl.Read () [0x000c6] in <0757e7484a1349cca3b4558c721885b2>:0
at System.Xml.Linq.XDocument.Load (System.Xml.XmlReader reader, System.Xml.Linq.LoadOptions options) [0x00016] in <89374192b20a41739cf7c5bb822846fe>:0
at System.Xml.Linq.XDocument.Load (System.IO.Stream stream, System.Xml.Linq.LoadOptions options) [0x0000f] in <89374192b20a41739cf7c5bb822846fe>:0
at System.Xml.Linq.XDocument.Load (System.IO.Stream stream) [0x00000] in <89374192b20a41739cf7c5bb822846fe>:0
at VersOne.Epub.Internal.XmlUtils+<>c__DisplayClass0_0.b__0 () [0x00000] in <7c46dbfe3ebf403389304a938822832e>:0
at System.Threading.Tasks.Task`1[TResult].InnerInvoke () [0x0000f] in <46c2fa109b574c7ea6739f9fe2350976>:0
at System.Threading.Tasks.Task.Execute () [0x00000] in <46c2fa109b574c7ea6739f9fe2350976>:0
--- End of stack trace from previous location where exception was thrown ---
Sample file URLs having this issue:
I am using EpubReader.Cross Nuget for parsing the epub file.
I have uploaded a sample project for the easy reference.
How do we get the filename of the coverimage? I noticed that in pages that reference it you can find it from the images by the filename. In the epubbook there is the coverimage byte, but there is no filename or mimetype of this.
Thanks for the library, usually... it works pretty well. I have several efiles which use identifiers that are longer than 50 characters. This makes the library return a null.
Am I doing something funky, that would make this break? It usually works.
Thanks,
Adam
public string ReadEpubFile(string sPnF)
{
    Epub book = null;
    try
    {
        if (File.Exists(sPnF))
        {
            book = new Epub(sPnF);
        }
    }
    catch (Exception e)
    {
        throw new Exception(" public string ReadEpubFile( "+sPnF+" ) :: " +e.Source+" :: "+ e.Message);
    }
    if (book != null)
    {
        return book.GetContentAsPlainText();
    }
    else
    {
        return "";
    }
}
The EPUB 3.2 specification lets EPUB books declare media overlays, which are essentially embedded audio narrations synchronized with the text content. Reading software can use this information to highlight the words in the text as a narrator speaks.
This was first implemented by Apple as a non-standard extension for EPUB 2 books (under the name of Read Aloud). Later this feature was added to the EPUB 3 standard.
Only a small number of EPUB 3 books contain media overlays, and very few reader apps actually support them. Nevertheless, it might be useful to have such support in this library.
Note that this enhancement only parses SMIL files but doesn't perform any post-processing for the parsed data. There is a subsequent enhancement #84 which will expose the parsed data as easy to consume narration objects on the EpubBook level.
EPUB Media Overlays 3.2 specification: https://www.w3.org/publishing/epub32/epub-mediaoverlays.html
I have used the Daisy Pipeline2 to convert daisy2.02 and daisy3 books to epub. When using the library to read the converted books the library crashes (Even with all ignore and suppress options activated) and reports the following exceptions -
System.AggregateException: 'One or more errors occurred. (Object reference not set to an instance of an object.)'
Inner Exception
NullReferenceException: Object reference not set to an instance of an object.
The exceptions that my software reports don't really provide much information. To get more information about what's happening and what's causing the library to crash, you would have to run the epub file in the source code. I couldn't run the source code project myself, because I can't find the framework 4.6 installer for my computer.
There are two issues with parsing EPUB 2 NCX navigation lists:
NullReferenceException to be thrown.
navTarget elements inside navigation lists are always ignored.
Here's a minimal NCX file to reproduce both of those issues:
<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/">
<head />
<docTitle />
<navMap />
<navList id="navlist-1">
<navLabel>
<text>Test label</text>
</navLabel>
<navTarget id="navtarget-1">
<navLabel>
<text>Test label</text>
</navLabel>
</navTarget>
</navList>
</ncx>
Those issues have not been caught earlier because navigation lists are very rare in EPUB books. (They are used for secondary tables of contents, e.g. a list of illustrations, a list of tables, etc.)
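A null-tolerant way to read navigation lists, including their navTarget children, could look like the sketch below. It works on the raw NCX XML with LINQ to XML and is not the library's actual parser. NcxNavListExample and ReadNavLists are hypothetical names.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NcxNavListExample
{
    private static readonly XNamespace Ncx = "http://www.daisy.org/z3986/2005/ncx/";

    // Returns, for every <navList>, its id, its label text and the ids of its <navTarget> children.
    // Missing attributes or elements yield null values instead of exceptions.
    public static List<(string? ListId, string? Label, List<string?> TargetIds)> ReadNavLists(string ncxXml)
    {
        var result = new List<(string? ListId, string? Label, List<string?> TargetIds)>();
        XElement? root = XDocument.Parse(ncxXml).Root;
        if (root == null)
        {
            return result;
        }

        foreach (XElement navList in root.Elements(Ncx + "navList"))
        {
            // Null-conditional access keeps optional nodes from causing a NullReferenceException.
            string? label = navList.Element(Ncx + "navLabel")?.Element(Ncx + "text")?.Value;
            List<string?> targetIds = navList.Elements(Ncx + "navTarget")
                .Select(target => (string?)target.Attribute("id"))
                .ToList();
            result.Add(((string?)navList.Attribute("id"), label, targetIds));
        }

        return result;
    }
}
```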
EPUB 2 specification doesn't contain explicit requirements on how a book cover should be represented in the OPF schema file. Instead it provides only a vague recommendation to use a <guide>/<reference type="cover"> element, mentioning the Chicago Manual of Style as the source of the list of applicable <reference> element types.
Most EPUB 2 books use a <meta name="cover" content="..." /> element to define the cover, where the value of the content attribute points to a <manifest>/<item> element of the actual cover image. However, there are some books that don't follow this pattern, hence all the hacks and heuristics currently present in the BookCoverReader:
EpubReader/Source/VersOne.Epub/Readers/BookCoverReader.cs
Lines 37 to 67 in d160caf
EPUB 3, on the other hand, does define an explicit requirement for cover images by requesting to specify them via <manifest>/<item properties="cover-image"> elements. EpubReader parses these <manifest>/<item> elements along with their properties attributes correctly but does not currently use this information to obtain a cover image of an EPUB 3 book.
BookCoverReader.
I am using the epubreader NuGet package for parsing .epub files.
My Code:
string fileName = "SampleEPUB.epub";
var assembly = typeof(MainPage).GetTypeInfo().Assembly;
Stream stream = assembly.GetManifestResourceStream($"{assembly.GetName().Name}.{fileName}");
EpubBook epubBook = EpubReader.ReadBook(stream);
foreach (EpubNavigationItem chapter in epubBook.Navigation)
{
chapterDetails.Add(new ChapterDetails() { title = chapter.Title, htmlData = chapter.HtmlContentFile?.Content, subChapters = chapter.NestedItems });
}
When parsing the epub file as above, I get only one chapter, and clicking that chapter shows no data.
When we open those epub files using Adobe Digital Editions 4.5.11, there are a lot of chapters and contents. I need to parse all the chapters and the TOC in the epub file. Please help me find the cause of this issue.
I have added a sample project here having .epub files for the reference.
@versfx Hello! I have a problem with displaying a book's cover image. It looks like the EpubBookRef entity doesn't have the necessary MetaItem, as you can see in the picture below. This is strange, because the cover image is displayed when I use my Pocket Book reader. Are there other ways to display it?
Unrecognized ePub file
System.AggregateException (in System.Private.CoreLib.dll)
File Download Url:
https://drive.google.com/file/d/1zLEv2IL9C8FiLg6T0hm34a5kA-j3IQM3/view?usp=sharing
opf:event and opf:scheme attributes in the following example are always skipped during the parsing:
<package xmlns="http://www.idpf.org/2007/opf"
xmlns:opf="http://www.idpf.org/2007/opf"
xmlns:dc="http://purl.org/dc/elements/1.1/" ...>
<metadata>
<dc:date opf:event="...">...</dc:date>
<dc:identifier ... opf:scheme="...">...</dc:identifier>
...
This is due to the fact that those attributes are not part of the DC (Dublin Core Metadata Element Set) XML namespace. Instead, they are extra attributes added by the EPUB 2 standard, which is why they appear in the opf XML namespace in the example above. EpubReader didn't account for this fact, so the values of those attributes were always null after parsing.
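The namespace mismatch is easy to reproduce with LINQ to XML, where a prefixed attribute has to be looked up with its namespace. This is an illustrative sketch rather than the library's parser code. OpfAttributeExample and ReadDateEvent are hypothetical names.

```csharp
#nullable enable
using System.Linq;
using System.Xml.Linq;

public static class OpfAttributeExample
{
    private static readonly XNamespace Opf = "http://www.idpf.org/2007/opf";
    private static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Returns the opf:event attribute of the first <dc:date> element, or null if absent.
    public static string? ReadDateEvent(string opfXml)
    {
        XElement? date = XDocument.Parse(opfXml).Descendants(Dc + "date").FirstOrDefault();

        // A prefixed attribute lives in the namespace bound to its prefix, so it must be
        // requested as Opf + "event". A lookup by the bare name "event" would return null.
        return (string?)date?.Attribute(Opf + "event");
    }
}
```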
date metadata element: https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.2.7
identifier metadata element: https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.2.10
I have attached 3 ePub files that fail to be parsed by ePubReader.
I found these files in the wild, by Google searching by file type to build up an ePub test dataset to test ePubReader against.
I have other files that fail too, but for the same reasons as the ones attached (TOC error, etc.).
Good job so far.
Hello,
Is it possible to make EpubReader to be used in .NetCore applications?
Thanks for your efforts.
Br,
Sergey Sypalo | Blog at http:\sypalo.com
EpubReader throws NullReferenceException while trying to extract a cover for an EPUB 2 book if the following conditions are met:
1. There is no metadata/meta element defining a cover.
2. guide section is not present in the OPF package.
metadata section: https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.2
guide section is optional in the OPF package: https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#AppendixA
BookCoverReader class uses the following algorithm to extract covers from EPUB 2 books:
1. Check whether the metadata section has the following meta item: <meta name="cover" content="..." />. If there is an item with the name set to cover, then its content attribute will point to an item in the manifest section representing the cover image.
2. If there is no such meta item, then look for <reference type="cover" href="..." /> element in the guide section. BookCoverReader throws a NullReferenceException here if the guide section is not present.
It is possible for an EPUB 2 book to have neither a cover image nor a guide section. This issue was not caught before because most EPUB 2 books have a cover image, and those that don't usually include a guide section even though it is not required by the EPUB 2 specification.
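The two-step lookup above can be made tolerant of a missing guide section with null-conditional access. The following is only a sketch of that idea using LINQ to XML; FindEpub2Cover and the sample packages are hypothetical and do not reflect the actual BookCoverReader code.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace opf = "http://www.idpf.org/2007/opf";

// Hypothetical helper, not the actual BookCoverReader code.
// Returns a description of where the cover was found, or null if the book has no cover.
string? FindEpub2Cover(XElement package)
{
    // Step 1: look for <meta name="cover" content="..." /> in the metadata section.
    string? coverItemId = package.Element(opf + "metadata")
        ?.Elements(opf + "meta")
        .FirstOrDefault(meta => (string?)meta.Attribute("name") == "cover")
        ?.Attribute("content")?.Value;
    if (coverItemId != null)
    {
        return "manifest item " + coverItemId;
    }

    // Step 2: fall back to <reference type="cover" href="..." /> in the guide section.
    // The guide section is optional, so every step of the lookup is null-conditional.
    string? coverHref = package.Element(opf + "guide")
        ?.Elements(opf + "reference")
        .FirstOrDefault(reference => (string?)reference.Attribute("type") == "cover")
        ?.Attribute("href")?.Value;
    return coverHref != null ? "guide reference " + coverHref : null;
}

// A package with neither a cover meta item nor a guide section (hypothetical sample).
XElement noCover = XElement.Parse(
    @"<package xmlns=""http://www.idpf.org/2007/opf"" version=""2.0""><metadata /></package>");

// A package whose cover is only declared in the guide section (hypothetical sample).
XElement guideCover = XElement.Parse(
    @"<package xmlns=""http://www.idpf.org/2007/opf"" version=""2.0"">
        <metadata />
        <guide><reference type=""cover"" href=""cover.xhtml"" /></guide>
      </package>");

Console.WriteLine(FindEpub2Cover(noCover) ?? "no cover");    // prints "no cover"
Console.WriteLine(FindEpub2Cover(guideCover) ?? "no cover"); // prints "guide reference cover.xhtml"
```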
Most VersOne.Epub.Schema.* types in EpubReader were designed to conform to EPUB 2, before the EPUB 3 specification was officially released. EPUB 3 extended some of the schema XML elements with new attributes in a way that makes it impossible to add them to EpubReader's schema types without breaking backwards compatibility.
For example, the <dc:title> XML element didn't have any attributes in EPUB 2, so the EpubBook.Schema.Package.Metadata.Titles property was typed as List<string>. However, EPUB 3 added id, dir, and xml:lang attributes, which requires replacing List<string> with something like List<EpubMetadataTitle>, where EpubMetadataTitle would be a new class with properties parsed from the EPUB 3 attributes. This will obviously be a breaking change for consumers using this property. Another example is the <dc:language> element, which got an optional id attribute.
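To illustrate the kind of type change discussed above, here is a rough sketch of what a richer title type could look like. The record shape, its property names, and the sample XML are assumptions made for this example; the issue only states that a class such as EpubMetadataTitle would be needed.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XNamespace dc = "http://purl.org/dc/elements/1.1/";

// Hypothetical EPUB 3 metadata fragment.
XElement metadata = XElement.Parse(
    @"<metadata xmlns:dc=""http://purl.org/dc/elements/1.1/"">
        <dc:title id=""title"" dir=""ltr"" xml:lang=""en"">Example Book</dc:title>
      </metadata>");

// With List<string>, only element.Value could be kept; the attributes would be lost.
List<EpubMetadataTitle> titles = metadata.Elements(dc + "title")
    .Select(element => new EpubMetadataTitle(
        element.Value,
        (string?)element.Attribute("id"),
        (string?)element.Attribute("dir"),
        (string?)element.Attribute(XNamespace.Xml + "lang")))
    .ToList();

Console.WriteLine($"{titles[0].Title} [id={titles[0].Id}, lang={titles[0].Language}]");

// Hypothetical shape of the new type; property names are assumptions for this sketch.
record EpubMetadataTitle(string Title, string? Id, string? TextDirection, string? Language);
```

Consumers that previously read a title as a plain string would have to switch to the Title property of the new type, which is the breaking change mentioned above.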
The test case data for all integration tests is stored in JSON files. The VersOne.Epub.Test project uses Json.NET (Newtonsoft.Json NuGet package) for serializing and deserializing those JSON files. It turns out that there are a few limitations and inconveniences in Json.NET that affect its use in VersOne.Epub.Test:
1. The PreserveReferencesHandling option, which instructs the JSON serializer to store only a single copy of an object within the JSON file, cannot be used with classes that don't have default constructors (i.e. parameterless constructors). Nullable reference type annotations in VersOne.Epub require almost every class to have a non-default constructor, which in turn makes it impossible to use the built-in reference tracking in Json.NET.
2. Json.NET converts string values into DateTime during deserialization if the content of the string looks like a date/time.
The first issue essentially makes it impossible to use strongly-typed serialization and deserialization operations. The only workaround is to parse the JSON file into a generic JObject object and deserialize its content manually. However, such generic parsing and writing operations can be done with System.Text.Json, which also performs them more efficiently.
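As a rough illustration of the generic parsing and writing mentioned above, the following sketch uses the JsonNode API from System.Text.Json. The JSON content and property names are made up for this example and are not taken from the VersOne.Epub.Test data files.

```csharp
using System;
using System.Globalization;
using System.Text.Json.Nodes;

// Hypothetical test case data; the property names are made up for this sketch.
string json = @"{ ""name"": ""test book"", ""modified"": ""2022-09-17T00:00:30"" }";

// Parse into a generic, mutable tree (comparable in spirit to JObject in Json.NET).
JsonNode root = JsonNode.Parse(json)!;

// The value stays a JSON string; it is not turned into a DateTime behind the scenes.
string modified = root["modified"]!.GetValue<string>();
Console.WriteLine(modified); // prints "2022-09-17T00:00:30"

// Converting to DateTime is an explicit step chosen by the caller.
DateTime parsed = DateTime.Parse(modified, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
Console.WriteLine(parsed.Year); // prints "2022"

// Modify the tree and write it back out as JSON text.
root["name"] = "renamed book";
Console.WriteLine(root.ToJsonString());
```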
Replace Json.NET with System.Text.Json for integration tests.
Json.NET documentation: https://www.newtonsoft.com/json/help/
Json.NET NuGet package: https://www.nuget.org/packages/Newtonsoft.Json/
System.Text.Json documentation: https://learn.microsoft.com/en-us/dotnet/standard/serialization/system-text-json/overview
Migration guidelines: https://learn.microsoft.com/en-us/dotnet/standard/serialization/system-text-json/migrate-from-newtonsoft
Performance comparison: https://devblogs.microsoft.com/dotnet/whats-next-for-system-text-json/#performance
Since I don't know C#, I would like to know how to develop an EPUB reader using VB.NET.
Do you have a plan to provide this feature?
The EPUB 3 standard supports remote manifest items and metadata links (i.e. files referenced by absolute URLs like http://example.com/book/123/font.ttf as opposed to local files like Content/font.ttf which are packaged within the EPUB file). EpubReader doesn't support remote manifest items and treats all absolute URLs as file names within the EPUB file.
Most EPUB books don't contain references to remote resources.
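The distinction between remote and local items described above could be checked along these lines. This is a minimal sketch, assuming that only http and https URLs count as remote; IsRemoteHref is a hypothetical helper and not part of EpubReader.

```csharp
#nullable enable
using System;

// Hypothetical helper: treats http(s) URLs as remote and everything else as a path inside the EPUB file.
// The scheme check matters because some absolute paths can also parse as absolute (file) URIs.
static bool IsRemoteHref(string href) =>
    Uri.TryCreate(href, UriKind.Absolute, out Uri? uri)
    && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);

Console.WriteLine(IsRemoteHref("http://example.com/book/123/font.ttf")); // prints "True"
Console.WriteLine(IsRemoteHref("Content/font.ttf"));                     // prints "False"
```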
Replace the implementation of the EpubContentFile / EpubContentFileRef classes and the classes derived from them with the following class hierarchy:
EpubContentFile
|-EpubLocalContentFile
| |-EpubLocalTextContentFile
| |-EpubLocalByteContentFile
|-EpubRemoteContentFile
| |-EpubRemoteTextContentFile
| |-EpubRemoteByteContentFile
EpubContentFileRef
|-EpubLocalContentFileRef
| |-EpubLocalTextContentFileRef
| |-EpubLocalByteContentFileRef
|-EpubRemoteContentFileRef
| |-EpubRemoteTextContentFileRef
| |-EpubRemoteByteContentFileRef
Add a ContentLocation property to the base classes with the following type: enum EpubContentLocation { LOCAL, REMOTE }.
Add a ContentFileType property to the base classes with the following type: enum EpubContentFileType { TEXT, BYTE_ARRAY }.
Implement EpubContentCollection and EpubContentCollectionRef classes with two properties, Local and Remote, which contain local and remote files / file references respectively. Use these classes for the Html, Css, Images, Fonts, and AllFiles properties in the EpubContent / EpubContentRef classes.
Implement a content downloader for remote content files.
Add a ContentDownloaderOptions class to enable / disable downloading remote content and to let the application supply its own content downloader.
Extract the code to load local content and download remote content out of the EpubContentFileRef class into two separate classes: EpubLocalContentLoader and EpubRemoteContentLoader. Pass the reference to the content loader through the constructor parameter in the EpubContentFileRef class.
Even though EPUB specification restricts types of remote resources to just audio, video, and font files, it would be better to relax this restriction in the EpubReader to allow all types of files to be remote resources to make the overall design simpler for the consumer of the library. However, EpubReader should still check that all HTML files in the EPUB spine, as well as the cover image and the EPUB 2 NCX / EPUB 3 navigation documents are local resources since these files are essential for constructing the parsed EPUB schema of the book.
This solution introduces a breaking change: applications will have to replace book.Content.<property> (where <property> is one of the following properties: Html, Css, Images, Fonts, or AllFiles) with book.Content.<property>.Local (unless the application needs to handle remote items too).
Additionally, if the application stores references to content files, the following type replacements will be required:
EpubContentFile → EpubLocalContentFile
EpubTextContentFile → EpubLocalTextContentFile
EpubByteContentFile → EpubLocalByteContentFile
EpubContentFileRef → EpubLocalContentFileRef
EpubTextContentFileRef → EpubLocalTextContentFileRef
EpubByteContentFileRef → EpubLocalByteContentFileRef
href attribute in the EPUB 3 specification: https://www.w3.org/publishing/epub32/epub-packages.html#attrdef-href
Hi,
I couldn't figure out how to email so I hope you'll forgive the question here. I've been trying to get the plain text of each chapter from epubs using your library. I've been "foreaching" through the chapters and then using chapter id values (current and next) to find ranges of relevant elements for each chapter but keep getting stuck in the particulars of trimming html.
If you can think of a way to do this or there is already some function that might achieve that result would you be willing to message me or respond here?
Happy to make a small contribution to your favorite charity or paypal account for your time.
Thanks,
Dave Gerding
When loading an EPUB produced (converted from AZW3) using Calibre 3.3, I get the following exception:
Unmanaged exception: System.AggregateException: There were one of more errors. ---> System.Exception: Incorrect EPUB manifest: item with href = "My%20converted%20epub%20from%20Calibre_split_001.html" is missing.
in VersFx.Formats.Text.Epub.Readers.ChapterReader.GetChapters(EpubBookRef bookRef, List`1 navigationPoints)
in VersFx.Formats.Text.Epub.Readers.ChapterReader.GetChapters(EpubBookRef bookRef)
in VersFx.Formats.Text.Epub.EpubBookRef.<GetChaptersAsync>b__33_0()
in System.Threading.Tasks.Task`1.InnerInvoke()
in System.Threading.Tasks.Task.Execute()
in VersFx.Formats.Text.Epub.EpubReader.ReadBook(String filePath)
This happens because the href is URL-encoded (look at those "%20"), while Content.Html has decoded file names.
I simply added:
contentFileName = Uri.UnescapeDataString(contentFileName);
after "if", inside VersFx.Formats.Text.Epub\Readers\ChapterReader.cs.
This solved my issue!
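The mismatch and the fix can be reproduced in isolation. This is a small sketch in which a plain dictionary stands in for Content.Html; the dictionary and its content are assumptions for illustration, while Uri.UnescapeDataString is the standard .NET method used in the fix above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Content.Html: keys are decoded file names as stored in the archive.
var htmlFiles = new Dictionary<string, string>
{
    ["My converted epub from Calibre_split_001.html"] = "<html>...</html>"
};

// The href as it appears in the navigation document, still URL-encoded.
string href = "My%20converted%20epub%20from%20Calibre_split_001.html";

// The encoded href does not match any key.
Console.WriteLine(htmlFiles.ContainsKey(href)); // prints "False"

// Decoding the href first makes the lookup succeed.
string contentFileName = Uri.UnescapeDataString(href);
Console.WriteLine(htmlFiles.ContainsKey(contentFileName)); // prints "True"
```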
We have developed an application based on this library, and now we want to read an EPUB file from a MemoryStream. Kindly provide a sample that reads a file from memory.
Hi, Thanks for the great project. I have some epub books where the ttf fonts have a mime-type of "application/x-font-truetype". I notice this mime-type is not included in the /Readers/ContentReaders.cs file. As a result, these fonts are not included in the EpubContent.Fonts list - they are being classed as EpubContentType.OTHER.
The fonts also appear to have 'ccs/' prefixed to their path (as does the .css file) even though in the epub archive manifest there is no css folder (the .css file is in the 'Styles' folder). Could you explain why this is?
Many thanks,
Will
Projects like PCL in Xamarin cannot install the NuGet package. It works better by adding a reference to the DLL.
Please add support.
I have a book that fails to open after updating this library from v3.1.2 to v3.2.
My code:
using var epubBook = EpubReader.OpenBook(filePath, BookReaderOptions);
BookReaderOptions = new()
{
PackageReaderOptions = new PackageReaderOptions()
{
IgnoreMissingToc = true
}
};
<?xml version="1.0" encoding="utf-8"?>
<package unique-identifier="fanficfare-uid" version="2.0" xmlns="http://www.idpf.org/2007/opf">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:identifier id="fanficfare-uid">fanficfare-uid:www.royalroad.com-u87585-s26294</dc:identifier>
<dc:title id="id">He Who Fights With Monsters</dc:title>
<dc:creator opf:role="aut">Shirtaloon (Travis Deverell)</dc:creator>
<dc:contributor id="id-2">FanFicFare [https://github.com/JimmXinu/FanFicFare]</dc:contributor>
<dc:language>en</dc:language>
<dc:date opf:event="publication">2019-07-28</dc:date>
<dc:date opf:event="creation">2022-09-17</dc:date>
<dc:date opf:event="modification">2022-09-17</dc:date>
<meta content="2022-09-17T00:00:30" name="calibre:timestamp" />
<dc:description>Some description here</dc:description>
<dc:subject>High Fantasy</dc:subject>
<dc:subject>Last Update: 2022/09/17</dc:subject>
<dc:subject>LitRPG</dc:subject>
<dc:subject>Adventure</dc:subject>
<dc:publisher>www.royalroad.com</dc:publisher>
<dc:identifier opf:scheme="URL">https://www.royalroad.com/fiction/26294</dc:identifier>
<dc:source>https://www.royalroad.com/fiction/26294</dc:source>
</metadata>
<manifest>
<item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml" />
<item href="OEBPS/stylesheet.css" id="style" media-type="text/css" />
<item href="OEBPS/title_page.xhtml" id="title_page" media-type="application/xhtml+xml" />
<item href="OEBPS/file0001.xhtml" id="file0001" media-type="application/xhtml+xml" />
...
<item href="OEBPS/file0153.xhtml" id="file0153" media-type="application/xhtml+xml" />
</manifest>
<spine toc="ncx">
<itemref idref="title_page" linear="yes" />
<itemref idref="file0001" linear="yes" />
...
<itemref idref="file0153" linear="yes" />
</spine>
</package>
I have validated all items exist and table of contents looks fine.
Stack Trace:
[18:47:11 WRN] [BookService] There was an exception when opening epub book: E:\Books\Wont Open\He_Who_Fights_With_Monsters_-_Shirtaloon.epub
System.AggregateException: One or more errors occurred. (Object reference not set to an instance of an object.)
---> System.NullReferenceException: Object reference not set to an instance of an object.
at VersOne.Epub.Internal.BookCoverReader.ReadEpub2CoverFromGuide(EpubSchema epubSchema, Dictionary`2 imageContentRefs)
at VersOne.Epub.Internal.BookCoverReader.ReadEpub2Cover(EpubSchema epubSchema, Dictionary`2 imageContentRefs)
at VersOne.Epub.Internal.BookCoverReader.ReadBookCover(EpubSchema epubSchema, Dictionary`2 imageContentRefs)
at VersOne.Epub.Internal.ContentReader.ParseContentMap(EpubBookRef bookRef, ContentReaderOptions contentReaderOptions)
at VersOne.Epub.EpubReader.<>c__DisplayClass10_0.<OpenBookAsync>b__1()
at System.Threading.Tasks.Task`1.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
at VersOne.Epub.EpubReader.OpenBookAsync(IZipFile zipFile, String filePath, EpubReaderOptions epubReaderOptions)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at System.Threading.Tasks.Task`1.get_Result()
at VersOne.Epub.EpubReader.OpenBook(String filePath, EpubReaderOptions epubReaderOptions)
This is an issue to keep track of the work on the improvements to support EPUB 3 features that are currently not supported.
I have a valid epub, but the content.OPF file contains an item with a blank media type:
Unfortunately, I have to be able to process files that pass epub check, and this one does. Below is the stack trace from the failure. Ideally, this would just gracefully ignore the file. I cannot provide the file as it contains copyrighted material, but I believe that adding a file with a blank media-type to any EPUB should produce the same issue.
Thank you!
at VersOne.Epub.Internal.PackageReader.ReadManifest(XElement manifestNode)
at VersOne.Epub.Internal.PackageReader.d__0.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at VersOne.Epub.Internal.SchemaReader.<ReadSchemaAsync>d__0.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at VersOne.Epub.EpubReader.d__10.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at VersOne.Epub.EpubReader.d__9.MoveNext()
EpubBook.Content.AllFiles;
EpubBook.Content.AllFiles will change accordingly;
Some pictures in the last chapter of the file
I want to export an epub file after I edit, help me please!