
epub recognized as xls (fido, closed)

openpreserve avatar openpreserve commented on September 26, 2024
epub recognized as xls

from fido.

Comments (24)

techmaurice avatar techmaurice commented on September 26, 2024

This probably has to do with the fact that there is no signature available yet for ePub.

Judging by the container signature for fmt/61, the misidentification probably happens because that signature expects the same bytes at certain positions that also occur in your epub files.

Could you please send or attach a few epub files so I can take a look at them and possibly create a signature for them?


anjackson avatar anjackson commented on September 26, 2024

There's an ePub signature here:

  <mime-type type="application/epub+zip">
    <acronym>EPUB</acronym>
    <_comment>Electronic Publication</_comment>
    <magic priority="50">
      <match value="PK\003\004" type="string" offset="0">
        <match value="mimetypeapplication/epub+zip" type="string" offset="30"/>
      </match>
    </magic>
    <glob pattern="*.epub"/>
  </mime-type>
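
As a rough illustration of what this magic rule checks, here is a minimal Python sketch. It is not Tika's or fido's implementation, and the names `looks_like_epub` and `make_zip` are made up for the example. It tests for the ZIP local file header at offset 0 and for the literal string at offset 30, which only lines up when `mimetype` is the first entry, is stored uncompressed, and has no extra field.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header
EPUB_MARKER = b"mimetypeapplication/epub+zip"    # entry name followed by its content


def looks_like_epub(data: bytes) -> bool:
    """Apply the two checks from the magic rule: offset 0 and offset 30."""
    return (
        data.startswith(ZIP_MAGIC)
        and data[30:30 + len(EPUB_MARKER)] == EPUB_MARKER
    )


def make_zip(first_name: str, first_body: bytes) -> bytes:
    """Hypothetical helper: build an in-memory ZIP with one stored entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(zipfile.ZipInfo(first_name), first_body,
                    compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


if __name__ == "__main__":
    print(looks_like_epub(make_zip("mimetype", b"application/epub+zip")))
    print(looks_like_epub(make_zip("[Content_Types].xml", b"<Types/>")))
```

The fixed offset works because the local file header is 30 bytes long, so the first entry's name starts at byte 30 and, with no extra field, its uncompressed content follows directly.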


techmaurice avatar techmaurice commented on September 26, 2024

Thanks, will add this to the extension xml file.


anjackson avatar anjackson commented on September 26, 2024

I guess you may have to set it up so that this takes precedence over the ZIP signature.

Note that the above signature is consistent with the proposed 'file magic' given in this section of the ePub spec: http://www.idpf.org/epub/30/spec/epub30-ocf.html#app-media-type


techmaurice avatar techmaurice commented on September 26, 2024

When added to extensions.xml it has precedence over PRONOM signatures.

Thanks for the link to the spec.


adamfarquhar avatar adamfarquhar commented on September 26, 2024

I guess that the file magic in the epub spec is just too weak to be that useful for identification in a broader context. The test for epub should be strengthened, similar to the tests for OOXML, ODF, JAR, or any of the many other formats that are also based on ZIP.

Cheers,

Adam.


anjackson avatar anjackson commented on September 26, 2024

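@adamfarquhar It's not that the ePub sig is not sensitive enough - there is no ePub signature in PRONOM (http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1270&strPageToDisplay=signatures).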


adamfarquhar avatar adamfarquhar commented on September 26, 2024

Andy – Yes; I see that the Tika signature is precise enough. I had scanned the XML too quickly. Perhaps the easiest fix then would be to get it added to PRONOM. Can you goose that along? It seems useful and not very controversial to add.

Cheers,

Adam.


anjackson avatar anjackson commented on September 26, 2024

I'll suggest it to David Clipsham. (done)


vladox avatar vladox commented on September 26, 2024

I think the problem is that fido uses DROID 4. With DROID 6.1, ePub is correctly recognized as "fmt/483".


anjackson avatar anjackson commented on September 26, 2024

Fido does not use DROID 4 - it doesn't use DROID at all. It uses the PRONOM database, which has this entry for ePub. That PRONOM entry only contains a file extension, which is how it identified your ePub file. PRONOM contains no internal 'magic number' signature for ePub, and so cannot identify ePub bytestreams without such contextual hints.


Dclipsham avatar Dclipsham commented on September 26, 2024

Hi All,

I added a PRONOM container signature as of 18/12/12, but container signatures will not work with DROID 4 (DROID 6 is the minimum). I'll add a binary variant in the next release for backward compatibility, which we aim to produce in the week commencing 22 July in conjunction with the next DROID release (probably 6.1.3).


vladox avatar vladox commented on September 26, 2024

I have actually found this link: http://www.nationalarchives.gov.uk/PRONOM/fmt/483

The "container" method is used to recognize it, so it seems that fido has to be extended to read the container signature.

From the Source description in that page:

"This format can be identified via a container signature in DROID version 6 or later. The PRONOM database cannot currently represent container signatures."


anjackson avatar anjackson commented on September 26, 2024

Ah, my apologies, I missed the fact that there was a container signature. Fido only partially implements container signature support at present, which is why it doesn't work at the moment.
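
For readers unfamiliar with the idea, a container-style check looks inside the ZIP rather than at fixed byte offsets. The following Python sketch is only an illustration under that assumption; it is neither DROID's nor fido's implementation, the function name `epub_mimetype_entry_matches` is hypothetical, and the entry name and value are taken from the signature quoted earlier in this thread.

```python
import io
import zipfile

EPUB_MIME = b"application/epub+zip"


def epub_mimetype_entry_matches(data: bytes) -> bool:
    """Open the bytes as a ZIP and compare the content of the 'mimetype' entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.read("mimetype").strip() == EPUB_MIME
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP at all, or a ZIP without a 'mimetype' entry.
        return False
```

Unlike a fixed-offset test, this also matches when the `mimetype` entry is not the first one in the archive, at the cost of having to parse the ZIP directory.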


Kris-LIBIS avatar Kris-LIBIS commented on September 26, 2024

Hi,

We need this badly. The latest DROID does not do the trick either, so I worked around this by creating an extension:

  <format>
    <puid>fmt/483</puid>
    <name>ePub format</name>
    <version>1.0</version>
    <alias>EPUB</alias>
    <mime>application/epub+zip</mime>
    <extension>epub</extension>
    <has_priority_over>x-fmt/263</has_priority_over>
    <has_priority_over>fmt/61</has_priority_over>
    <signature>
      <name>EPUB file</name>
      <pattern>
        <position>BOF</position>
        <regex>(?s)\APK\x03\x04</regex>
      </pattern>
      <pattern>
        <position>BOF</position>
        <regex>(?s)\A.{30}mimetypeapplication/epub\+zip</regex>
      </pattern>
    </signature>
    <details/>
  </format>

Maybe this could be added to the fido_extensions.xml until the container signatures work properly in fido?


techmaurice avatar techmaurice commented on September 26, 2024

Hi All, thanks for the comments and suggestions.

@Kris-LIBIS: I will publish an update of fido_extensions.xml ASAP, for the time being you could add this ePUB sig to fido_extensions.xml.

And I will investigate why the container signature does not work properly.


techmaurice avatar techmaurice commented on September 26, 2024

The ePub signature has been added to fido_extensions.xml; the update has been pushed with the 1.1.6 release.

It seems like the container signature is alright but the precedence in the container signature file is set wrong. The addition of the format information to the extension file fixes this.

Please note FIDO will still report it is a match from the container signature file. Will investigate what is wrong with the container signature file and send this information to PRONOM.

Not closing this issue yet...


techmaurice avatar techmaurice commented on September 26, 2024

The bug submitted by @atomotic has been fixed, FIDO now correctly matches ePub files as container-type using the PRONOM container file. The fixed version is tagged and committed as version 1.1.8.

The bug of multiple matches was caused by the read_container() function matching only the first regex where it should have matched all regexes (applicable when the signature consists of more than one regex).
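
The difference between matching only the first regex and matching all of them can be sketched in Python as follows. This is an illustration only, not FIDO's read_container() code: the function names and the two-part signature are hypothetical and do not reproduce any real PRONOM signature.

```python
import re
from typing import Sequence


def match_first_only(patterns: Sequence[bytes], data: bytes) -> bool:
    """Buggy behaviour: only the first regex of the signature is checked."""
    return bool(patterns) and re.search(patterns[0], data) is not None


def match_all(patterns: Sequence[bytes], data: bytes) -> bool:
    """Fixed behaviour: every regex of the signature must match."""
    return bool(patterns) and all(
        re.search(p, data) is not None for p in patterns
    )


# Hypothetical two-part signature: a ZIP header plus a format-specific marker.
SIGNATURE = [rb"\APK\x03\x04", rb"Workbook"]

# Start of an ePub-like byte stream: ZIP header, then the mimetype entry.
sample = b"PK\x03\x04" + b"\x00" * 26 + b"mimetypeapplication/epub+zip"

if __name__ == "__main__":
    print(match_first_only(SIGNATURE, sample))  # first regex alone matches any ZIP
    print(match_all(SIGNATURE, sample))         # second regex rules the sample out
```

With first-only matching, any signature whose first regex is a generic ZIP header would match every ZIP-based file, which is the kind of multiple match described above.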

This fix affects matches for all signatures in the PRONOM container signature file; please check this if you rely on FIDO in a production environment.

The addition of the ePub signature to the extension file has been commented out for the time being, as this fix seems to tackle the issue.

Please report back if this fixes the issue for you.

Note that the read_container() function is not yet fully compatible with the container signature file and it does not handle them the way DROID does. It is still lacking matching on byte positions and is not yet able to parse OLE2 files the way it should be done.


Dclipsham avatar Dclipsham commented on September 26, 2024

Backward-compatible versions of the signatures for ePub and Apple's iBooks were included in signature release v69, which became available on 19th July. This should assist users tied to older versions of DROID.

David


Kris-LIBIS avatar Kris-LIBIS commented on September 26, 2024

Hi Maurice,

Fido now correctly recognises the epubs. This did the trick.

Thanks.

Unfortunately a mime type is not included, but that's another problem.


techmaurice avatar techmaurice commented on September 26, 2024

Hi Kris,

Thanks for reporting back.

The mime type is not included because the entry is missing in PUID fmt/483.
@Dclipsham might want to pick this up?


Dclipsham avatar Dclipsham commented on September 26, 2024

Will do. Next release will be mid-late September, but I'll ensure this is included.

David


techmaurice avatar techmaurice commented on September 26, 2024

Thanks!


techmaurice avatar techmaurice commented on September 26, 2024

I stated earlier that the precedence for ePub was set wrong, but it turned out that was not the case.

Bug is confirmed to be fixed, closing this issue.
